Pods don't automatically recover after a failing instance

Hello everyone,

I have a question about how Rancher works technically. The main advantage we hoped to gain from Rancher and Kubernetes was that whenever an instance of an application fails, for whatever reason, Rancher automatically deploys another one so that there is no downtime.

In our situation, however, we regularly run into a state where the application stops, yet Rancher does not automatically deploy a new pod and remove the 'broken' one.

But in some situations (the reason is unknown), the whole service suddenly stops working because one pod is failing. That is the first thing we don't understand: why is the whole service unavailable while one pod is still active and running? See the screenshot below for the error situation:

The other question we cannot resolve is: why is Rancher not automatically redeploying the failing pod? When I click the 'Redeploy' button on the service, the whole service works again within a few seconds. In my understanding, this should happen as soon as Rancher detects that the service is unstable, but so far it only happens when I manually click 'Redeploy'.

Hopefully somebody understands my problem and knows how to fix this. Thanks in advance!

Is there somebody who can help me?

The picture shows one pod in a red/error state and another in blue/transitioning, which means neither one is healthy. Each pod has two containers; one is healthy while the other is transitioning (spinner icon). Click on the workload to get more detail.

Replacement is done when a container outright dies. It is not done automatically if your app is still running but not responding correctly for whatever reason. To detect that you need a health check configured. If you have one, make sure you're using the right kind (liveness vs. readiness), that it actually fails when the problem is occurring, that it passes on the replacement, and that you don't have a circular dependency between the health checks of multiple services.
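To illustrate the liveness vs. readiness distinction, here is a sketch of a container spec with both probe types (the container name, image, port, and `/healthz` path are assumptions, not taken from your setup). A failing liveness probe makes the kubelet restart the container; a failing readiness probe only removes the pod from the service's endpoints without restarting it:

```yaml
# Hypothetical container spec fragment; names, image, and endpoint are assumed.
containers:
  - name: myapp
    image: registry.example.com/myapp:1.0
    ports:
      - containerPort: 8080
    # Liveness: if this check fails failureThreshold times in a row,
    # the kubelet restarts the container.
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 15
      periodSeconds: 10
      failureThreshold: 3
    # Readiness: if this check fails, the pod is taken out of the
    # service's endpoints (no traffic), but it is NOT restarted.
    readinessProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 5
```

If only a readiness probe is configured, an unresponsive app is quietly removed from the load balancer but never restarted, which matches the "stuck until I click Redeploy" symptom.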

The scale in your picture says 1, so there's only supposed to be one pod running. If you don't want the services to be down when something goes wrong, each needs a scale of at least 2.
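In plain Kubernetes terms, the Rancher "scale" corresponds to the Deployment's `replicas` field. A minimal sketch, assuming a hypothetical workload named `myapp`:

```yaml
# Hypothetical Deployment; name, labels, and image are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 2        # at least two pods, so one failure doesn't cause downtime
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: registry.example.com/myapp:1.0
```

With two replicas behind the service, the healthy pod keeps serving traffic while the failed one is being replaced.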

Probably the red one has failed and is being removed, and the blue one is starting but not yet running/passing its health check. If there is storage attached, it's also possible that the blue one has to wait for the red one to terminate before it can bind to the volume.
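The volume-binding wait typically happens when the workload mounts a PersistentVolumeClaim with the `ReadWriteOnce` access mode, which allows only a single node to mount the volume at a time. A sketch of such a claim (the name and size are assumptions):

```yaml
# Hypothetical PVC; with ReadWriteOnce, the volume can be mounted
# read-write by a single node only, so a replacement pod may stay
# pending until the old pod terminates and releases the volume.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: myapp-data
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
```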

Thanks Vincent, I will go look into this!