Longhorn PVC fails to move to a different node once the pod's instance dies

I was trying to check Longhorn HA by running an experiment on my cluster. Here are the details of what I am trying to do:
I start a workload and, in the volume section, attach a Longhorn PVC (with 3 replicas) as the storage volume. Whenever I turn off the instance (with IP 10.x.x.A) on which that workload's Pod is running, one of two things happens. Either the Pod is not able to switch to a different node (because detaching and reattaching the PVC storage fails), even though other nodes in the cluster, i.e. 10.x.x.B and 10.x.x.C, are available, or

if I turn the instance (with IP 10.x.x.A) back on, that Pod is not able to run successfully until I manually redeploy it; it tries to switch from A to B but fails to attach the PVC storage.
It looks something like this:
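
For reference, here is a minimal sketch of the kind of manifest I'm using. All names (`demo-pvc`, `demo-workload`, the `longhorn` StorageClass) are placeholders, and I'm assuming the replica count is configured on the Longhorn StorageClass:

```yaml
# Illustrative only: a PVC bound to a Longhorn StorageClass (assumed to be
# named "longhorn" and configured with numberOfReplicas: "3"), mounted by a
# single-replica Deployment.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-pvc            # hypothetical name
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn
  resources:
    requests:
      storage: 2Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-workload       # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: demo
  template:
    metadata:
      labels:
        app: demo
    spec:
      containers:
        - name: app
          image: nginx
          volumeMounts:
            - name: data
              mountPath: /data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: demo-pvc
```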

The above scenario happens most of the time; only occasionally does everything work correctly. Any help or information is highly appreciated.
Longhorn Version: 0.5.0
Rancher Version : 2.2.4

Thanks and Regards
Chetan Gupta

Hi @chaets

Currently the Kubernetes CSI driver has issues handling Kubernetes node failures. You can refer to https://github.com/longhorn/longhorn/blob/master/docs/node-failure.md . We're working with Kubernetes upstream to try to solve it.


Hi, I read this post and just want to add a couple of comments.
I'm running Rancher/RKE/Longhorn, and I've set lower values for:
default-not-ready-toleration-seconds: 30
default-unreachable-toleration-seconds: 30
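
For context, these are kube-apiserver flags (consumed by the DefaultTolerationSeconds admission plugin); in an RKE `cluster.yml` they can be passed via `extra_args`. A sketch, assuming RKE's `services` section:

```yaml
# cluster.yml (RKE) — sketch. Passes the default-toleration flags to the
# API server so pods on a not-ready/unreachable node are evicted after
# 30 seconds instead of the default 300.
services:
  kube-api:
    extra_args:
      default-not-ready-toleration-seconds: "30"
      default-unreachable-toleration-seconds: "30"
```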

To me it seems that a StatefulSet fails over fine on node failure when setting terminationGracePeriodSeconds: 0
(in my scenario that's OK).
Unsure whether it's possible to have Longhorn respect this config? (It still takes ~5 min to mount the volume on the new node.)
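
For anyone trying the same thing, a sketch of where `terminationGracePeriodSeconds` goes in the StatefulSet pod spec (all names here are placeholders):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: demo-sts            # hypothetical name
spec:
  serviceName: demo
  replicas: 1
  selector:
    matchLabels:
      app: demo
  template:
    metadata:
      labels:
        app: demo
    spec:
      terminationGracePeriodSeconds: 0   # pod is deleted with no grace period
      containers:
        - name: app
          image: nginx
```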



Thanks for the comment @hwaastad .

You mean that by setting those values and terminationGracePeriodSeconds: 0, the failed StatefulSet pod gets deleted automatically right after the node goes down? The 5 minutes you mentioned probably comes from the Kubernetes CSI side; as long as the pod has been deleted automatically and the new pod was created, we may be able to improve on that.