Longhorn PVC fails to move to a different node once the pod's instance dies

I was trying to check Longhorn HA by running an experiment on my cluster. Here are the details of what I am trying to do:
I start a workload and, in the volume section, attach a Longhorn PVC (with 3 replicas) as the storage volume. Whenever I turn off the instance (with IP 10.x.x.A) on which that workload's Pod is running, one of two things happens. Either the Pod is not able to switch to a different node (because detaching and reattaching the PVC storage fails), even though other nodes in the cluster, i.e. 10.x.x.B and 10.x.x.C, are available, or

if I turn the instance (with IP 10.x.x.A) back on, that Pod is not able to run successfully until I manually redeploy it; it tries to switch from A to B but fails to attach the PVC storage.
It looks something like this:
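
For reference, here is a minimal sketch of the kind of manifest I'm using. All names (`demo-pvc`, `demo-workload`, the `longhorn` StorageClass) are placeholders, and I'm assuming the replica count is configured on the Longhorn StorageClass:

```yaml
# Illustrative only: a PVC bound to a Longhorn StorageClass (assumed to be
# named "longhorn" and configured with numberOfReplicas: "3"), mounted by a
# single-replica Deployment.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-pvc            # hypothetical name
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn
  resources:
    requests:
      storage: 2Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-workload       # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: demo
  template:
    metadata:
      labels:
        app: demo
    spec:
      containers:
        - name: app
          image: nginx
          volumeMounts:
            - name: data
              mountPath: /data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: demo-pvc
```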

The above scenario happens most of the time; only occasionally does everything work correctly. Any help or information is highly appreciated.
Longhorn Version: 0.5.0
Rancher Version : 2.2.4

Thanks and Regards
Chetan Gupta

Hi @chaets

Currently the Kubernetes CSI driver has issues handling Kubernetes node failures. You can refer to https://github.com/longhorn/longhorn/blob/master/docs/node-failure.md . We're working with Kubernetes upstream to try to solve it.


Hi, I read this post and just want to add a couple of comments.
I'm running Rancher/RKE/Longhorn, and I've set lower values for:
default-not-ready-toleration-seconds: 30
default-unreachable-toleration-seconds: 30
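
For context, these are kube-apiserver flags (consumed by the DefaultTolerationSeconds admission plugin); in an RKE `cluster.yml` they can be passed via `extra_args`. A sketch, assuming RKE's `services` section:

```yaml
# cluster.yml (RKE) — sketch. Passes the default-toleration flags to the
# API server so pods on a not-ready/unreachable node are evicted after
# 30 seconds instead of the default 300.
services:
  kube-api:
    extra_args:
      default-not-ready-toleration-seconds: "30"
      default-unreachable-toleration-seconds: "30"
```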

To me it seems that a StatefulSet fails over fine on node failure when setting terminationGracePeriodSeconds: 0
(in my scenario that's OK).
Unsure whether it's possible to have Longhorn respect this config? (It still takes ~5 min to mount the volume on the new node.)
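
For anyone trying the same thing, a sketch of where `terminationGracePeriodSeconds` goes in the StatefulSet pod spec (all names here are placeholders):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: demo-sts            # hypothetical name
spec:
  serviceName: demo
  replicas: 1
  selector:
    matchLabels:
      app: demo
  template:
    metadata:
      labels:
        app: demo
    spec:
      terminationGracePeriodSeconds: 0   # pod is deleted with no grace period
      containers:
        - name: app
          image: nginx
```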



Thanks for the comment @hwaastad .

You mean that by setting those values and terminationGracePeriodSeconds: 0, the failed StatefulSet pod gets deleted automatically right after the node goes down? The 5 minutes you mentioned probably comes from the Kubernetes CSI side; as long as the pod has been deleted automatically and the new pod was created, we may be able to improve on that.