Pods don't automatically recover after a failing instance

Martijnomoda · February 11, 2021, 9:37pm

Hello everyone,

I have a question about how Rancher works technically. The main advantage we expected from Rancher and Kubernetes was that whenever an instance of an application fails, for whatever reason, Rancher automatically deploys another one so that there is no downtime.

In our case, however, we regularly see the application stop without Rancher automatically deploying a new pod and removing the ‘broken’ one.

In some situations (the reason is unknown), the whole service suddenly stops working because one pod is failing. That is the first thing we don’t understand: why is the whole service unavailable while one pod is still active and running? See the screenshot below for the error situation:

The other question we cannot resolve is: why is Rancher not automatically redeploying the failing pod? When I click the ‘redeploy’ button in the service, the whole service works again within a few seconds. In my understanding, this should happen as soon as Rancher detects that the service is unstable, but right now it only happens when I manually click ‘redeploy’.

Hopefully somebody understands my problem and knows how to fix this. Thanks in advance!

Martijnomoda · February 20, 2021, 7:24pm

Is there somebody who can help me?

vincent · February 20, 2021, 7:47pm

The picture shows one pod in a red/error state and another in blue/transitioning, which means neither one is healthy. They each have 2 containers in that pod, and one is healthy while the other is transitioning (spinner icon). Click on the workload to get more detail.

Replacement is done when a container outright dies. It is not done automatically if your app is still running but not responding correctly for whatever reason. To detect that, you need a health check configured. If you have one, make sure you’re using the right kind (liveness vs readiness), that it actually fails when the problem is occurring, that it passes on the replacement, and that you don’t have a circular dependency between the health checks in multiple services.
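As a rough illustration of the liveness vs readiness distinction, here is a minimal sketch that builds the container section of a workload spec with both probes and prints it as JSON. The container name, image, paths (`/healthz`, `/ready`), port, and timing values are all assumptions made up for this example, not taken from the workload in this thread; `livenessProbe`, `readinessProbe`, `httpGet`, `periodSeconds`, and `failureThreshold` are standard Kubernetes fields.

```python
import json


def http_probe(path, port, period_seconds, failure_threshold):
    """Build an HTTP GET probe definition using standard Kubernetes field names."""
    return {
        "httpGet": {"path": path, "port": port},
        "periodSeconds": period_seconds,
        "failureThreshold": failure_threshold,
    }


# Hypothetical container; replace name, image, paths and port with your own.
container = {
    "name": "app",
    "image": "example/app:1.0",
    # Liveness: when this keeps failing, the kubelet restarts the container.
    "livenessProbe": http_probe("/healthz", 8080, period_seconds=10, failure_threshold=3),
    # Readiness: when this fails, the pod stops receiving service traffic
    # but the container is not restarted.
    "readinessProbe": http_probe("/ready", 8080, period_seconds=5, failure_threshold=2),
}

print(json.dumps(container, indent=2))
```

If the app hangs while its process stays alive, only a failing liveness probe triggers a restart; a readiness probe alone just takes the pod out of rotation.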

The scale in your picture says one, so there’s only supposed to be one running. If you don’t want the services to be down when something goes wrong they each need to have at least a scale of 2.
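For reference, the scale corresponds to the `spec.replicas` field of the workload. A small sketch, assuming the workload is a Deployment, that produces a merge patch setting it to 2 (the helper name `scale_patch` is made up for this example):

```python
import json


def scale_patch(replicas):
    """Return a JSON merge patch that sets spec.replicas on a Deployment."""
    if replicas < 0:
        raise ValueError("replicas must not be negative")
    return json.dumps({"spec": {"replicas": replicas}})


print(scale_patch(2))
```

The printed JSON could be passed to `kubectl patch deployment <name> --type merge -p '<json>'`, where `<name>` stands for your own workload; changing the scale in the Rancher UI has the same effect.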

Probably the red one has failed and is being removed, and the blue one is starting but not running/passing its health check yet. If there is storage attached, it is also possible that the blue one has to wait for the red one to terminate before it can bind to the volume.

Martijnomoda · February 23, 2021, 12:04pm

Thanks Vincent, I will go look into this!
