When a node crashes, why are workloads not moved to healthy nodes?

I have a Rancher 2.2.2 cluster with three workload nodes. When I woke up, I was alerted that one node had lost connection with the rest of the cluster. Yet, after many hours, lots of workloads were still down and had not been moved to the healthy nodes.

Why were the workloads not re-launched on the healthy nodes?

I'm wondering the same thing. I don't think it's limited to 2.2.2; it has never moved services when nodes died for us.


IIRC this is basic Kubernetes. The local kubelet on each worker node manages the pods, and when the node dies, Rancher/K8s will report the last known information from the kubelet, which in this case lists 16 pods. You can rest assured those 16 pods are not actually running since the host is down; Rancher is just reading stale information from the Kubernetes cluster.

However, since those pods are unreachable and their liveness (and readiness) probes effectively fail, the Deployment wrapping them (via its ReplicaSet, or a similar controller) should have scheduled replacements on the surviving nodes, and any Services backed by them should have removed the dead pods' internal IP addresses from their endpoints. Once the missing node comes back online, those counts should update, I think.

Reading into your first post, though: did Kubernetes not schedule new replacements for those pods on the surviving nodes? These aren't one-off pods you created outside of a Deployment, right?
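For example, a bare pod created by hand just stays gone when its node dies, while anything owned by a Deployment gets recreated by its ReplicaSet on a surviving node. A minimal sketch of what I mean (names and image are placeholders):

```yaml
# Minimal Deployment sketch (names/image are placeholders).
# The ReplicaSet this creates is what re-schedules replacement pods
# onto surviving nodes; a bare pod has no controller to do that.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-worker
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example-worker
  template:
    metadata:
      labels:
        app: example-worker
    spec:
      containers:
        - name: worker
          image: example/worker:latest
```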

In our clusters, when the kubelet/node dies, more often than not the services on that node also silently die, or go into an unknown state and just hang forever until we manually replace the node. This happens several times per week. Sometimes our nodes die without reporting any problems; we just notice services not working at all. It's like playing whack-a-mole.

I'm wondering if your pods (the ones that are not re-scheduled to other nodes when their host node dies) have a liveness probe enabled.

@Dmitry_Shultz Ahh yes, these pods are workers with no HTTP server and we never got around to making a liveness probe for them. This makes sense now.
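For anyone else in the same spot: for workers without an HTTP endpoint, an exec-based liveness probe is one option. A rough sketch (the file-based check is just a placeholder; the worker would have to create and refresh the file itself):

```yaml
# Rough sketch of an exec liveness probe for a non-HTTP worker.
# /tmp/healthy is a placeholder; the worker process would need to
# create it, and remove it (or stop refreshing it) when unhealthy.
containers:
  - name: worker
    image: example/worker:latest
    livenessProbe:
      exec:
        command:
          - cat
          - /tmp/healthy
      initialDelaySeconds: 30
      periodSeconds: 10
```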


@greg_keys1 Can you elaborate on "This happens several times per week"? Is your Kubernetes cluster that unstable? Is this an RKE cluster? What version of Rancher are you running? Do you have the "one click" Prometheus enabled?

I’ve seen this same scenario where a node is Unavailable (cannot even SSH into the node) and yet the workloads do not get redeployed to other workers.

Are you using a small EBS volume on your EC2 instances by any chance?

EBS volumes under 1 TB have a Burst Balance that is drawn down when disk I/O bursts above the number of IOPS that were provisioned. If you check the volume's monitoring you can see whether the Burst Balance has ever dropped to 0. If it has, that is often why nodes seem to "crash": disk I/O becomes very slow. Provisioning a larger EBS volume can work around the issue, because the larger the volume, the higher its baseline IOPS.

Hi,
do you by any chance have StatefulSets?
I think (persistent storage aside) StatefulSets will not be rescheduled after a node failure (NotReady) in Rancher 2.2.x, at least from my testing. Deployments are.

EDIT: regarding StatefulSets, this is by design.
If you want a StatefulSet to be rescheduled after a node failure, you need terminationGracePeriodSeconds: 0.
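Roughly, that key sits in the pod template of the StatefulSet; a trimmed sketch (names and image are placeholders):

```yaml
# Trimmed StatefulSet sketch (names/image are placeholders).
# terminationGracePeriodSeconds goes in the pod template spec.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: example-db
spec:
  serviceName: example-db
  replicas: 1
  selector:
    matchLabels:
      app: example-db
  template:
    metadata:
      labels:
        app: example-db
    spec:
      terminationGracePeriodSeconds: 0
      containers:
        - name: db
          image: example/db:latest
```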

I've experienced the same issue, although my nodes did not crash. I was testing by shutting down nodes to see what happens to the pods that were running on them. Basically, what I am looking to achieve is: in case of a node failure, I want the pods to migrate to another node.

I had stumbled upon the "--pod-eviction-timeout" setting; by default it is 5 minutes, and I want to reduce that. But I cannot get it to work; I am not sure if I am setting it correctly.
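As far as I can tell, on an RKE-provisioned cluster that flag is supposed to be passed to the kube-controller-manager through extra_args in the cluster YAML (Edit Cluster > Edit as YAML), roughly like the sketch below; the values are only examples and I have not verified them:

```yaml
# Unverified sketch for an RKE-provisioned cluster; values are examples only.
services:
  kube-controller:
    extra_args:
      # how long to wait before evicting pods from a NotReady node (default 5m0s)
      pod-eviction-timeout: "60s"
      # how long a node may stop reporting before being marked NotReady (default 40s)
      node-monitor-grace-period: "30s"
```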

I will look at "terminationGracePeriodSeconds". But does anyone have an idea of where to set this in the Rancher UI? Or do I need to set it in a config file?

Thank you!

@dc.901 "terminationGracePeriodSeconds" will be in the deployment YAML file.
You can edit it from the UI by clicking the … next to the deployment and choosing "View/edit YAML".


Then search (Ctrl-F or Cmd-F) for it. The first few occurrences at the top are Rancher-specific metadata and you don't want to touch those; you need to find the YAML key itself, as in the trimmed example below:
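The key should end up under the pod template spec, roughly like this (container name and image are placeholders; the Rancher metadata at the top is left out):

```yaml
# Trimmed view of a deployment's YAML.
# terminationGracePeriodSeconds belongs in the pod template spec.
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 0
      containers:
        - name: worker
          image: example/worker:latest
```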

I shut off a node ("kubelet stopped posting node status") and all the workloads were unavailable, but they weren't rescheduled onto another node. Any ideas why?