When a node crashes, why are workloads not moved to healthy nodes?


I have a Rancher 2.2.2 cluster with three workload nodes. When I woke up, I was alerted that one node had lost its connection to the rest of the cluster. Yet after many hours, many workloads were still down and had not been moved to the healthy nodes.

Why were the workloads not re-launched on the healthy nodes?


I’m wondering the same thing. I don’t think it’s limited to 2.2.2; it has never moved services for us when nodes died.


IIRC this is basic Kubernetes. The local kubelet on each worker node manages the pods, and when the node dies, Rancher/K8s reports the last known information from the kubelet, which in this case lists 16 pods. You can rest assured those 16 pods are not actually running since the host is down; Rancher is just reading stale information from the Kubernetes cluster.

However, since those pods are inaccessible and thus their liveness probes (and readiness probes) intrinsically fail, the higher-level controller wrapping these pods (the Deployment’s ReplicaSet or similar) will have scheduled their replacements on the surviving nodes, and any Services backed by them will have removed the dead pods’ internal IP addresses from their endpoints. Once the missing node comes back online, those counters should update, I think.

I guess, reading into your first post though: did Kubernetes not schedule new replacements for those pods on the surviving nodes? These aren’t one-off pods you created outside of the Deployment construct, right?
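
For reference, here is a rough sketch of how the wait before replacement can be tuned per workload. As far as I know, pods on a node that stops reporting are only evicted after a grace period (five minutes by default), either through the controller manager's `--pod-eviction-timeout` flag or, where taint-based evictions are enabled, through the `not-ready` and `unreachable` tolerations that are added to every pod with `tolerationSeconds: 300`. The Deployment name, labels, image, and the 60-second value below are placeholders I made up for illustration, not anything from this cluster.

```yaml
# Hypothetical Deployment: names, image, and timings are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-worker
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-worker
  template:
    metadata:
      labels:
        app: example-worker
    spec:
      containers:
        - name: worker
          image: example/worker:latest
      tolerations:
        # Evict after 60s (instead of the default 300s) once the node is NotReady.
        - key: node.kubernetes.io/not-ready
          operator: Exists
          effect: NoExecute
          tolerationSeconds: 60
        # Same for a node the control plane can no longer reach.
        - key: node.kubernetes.io/unreachable
          operator: Exists
          effect: NoExecute
          tolerationSeconds: 60
```

This only has an effect if taint-based evictions are active on your Kubernetes version. The old pods may also keep showing as Unknown or Terminating on the dead node until it comes back, since its kubelet cannot confirm they are gone.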


In our clusters, when the kubelet/node dies, more often than not the services on that node also silently die, or go into an unknown state and just hang forever until we manually replace the node. This happens several times per week. Sometimes our nodes die and don’t report any problems; we just notice services not working at all. It’s like playing whack-a-mole.


I’m wondering if your pods (the ones that are not rescheduled to other nodes when their host node dies) have a liveness probe enabled.


@Dmitry_Shultz Ahh yes, these pods are workers with no HTTP server, and we never got around to making a liveness probe for them. This makes sense now.
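
For workers like these with no HTTP endpoint, an `exec` liveness probe is one option. Below is a minimal sketch, assuming the worker process touches a heartbeat file on every loop iteration and that the image ships `sh` and a `find` that supports `-mmin`. The pod name, image, file path, and timings are all hypothetical.

```yaml
# Hypothetical Pod: the heartbeat file and all names are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: example-worker
spec:
  containers:
    - name: worker
      image: example/worker:latest
      livenessProbe:
        exec:
          command:
            - sh
            - -c
            # Succeeds only if /tmp/heartbeat exists and was modified within the last minute.
            - test -n "$(find /tmp/heartbeat -mmin -1)"
        initialDelaySeconds: 30   # give the worker time to start
        periodSeconds: 20         # run the check every 20 seconds
        failureThreshold: 3       # restart the container after 3 consecutive failures
```

Worth keeping in mind that the kubelet on the node runs this probe, so it catches a hung worker on a node that is still alive. A node that is gone entirely is handled by the eviction timeout rather than by the probe.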


@greg_keys1 Can you elaborate on “This happens several times per week”? Is your Kubernetes cluster that unstable? Is this an RKE cluster? What version of Rancher are you running? Do you have the “one click” Prometheus enabled?