Volumes constantly degraded, replicas "flapping"

scubbo · July 21, 2023, 7:15am

I have tried installing Longhorn on a 4-node k3s cluster - one x86 PowerEdge (control-plane), 3 Raspberry Pi 4s. Longhorn storage for the PowerEdge is on an SSD, and for the Pis is on attached powered HDDs (one per Pi). All Longhorn pods are running with no errors/restarts, but all volumes report as degraded and any replicas that aren’t on the control-plane node quickly error out, are recreated, and error out again:

$ kubectl get replicas.longhorn.io -n longhorn-system
NAME                                                  STATE      NODE       DISK                                   INSTANCEMANAGER                                     IMAGE                               AGE
pvc-58e326c8-3777-42ad-89cf-9eacf0ba0fb7-r-6bb2c2a1   running    epsilon    16a8c5dc-4b44-4100-94f6-0995c0a85b8b   instance-manager-c8e888869e140a5618feabb01783baaa   longhornio/longhorn-engine:v1.5.0   14h
pvc-f7fda1f2-897b-4310-8548-c1ad64040ea1-r-adad8378   running    epsilon    16a8c5dc-4b44-4100-94f6-0995c0a85b8b   instance-manager-c8e888869e140a5618feabb01783baaa   longhornio/longhorn-engine:v1.5.0   4d3h
pvc-acd10539-0474-49b6-837b-b985c65b8925-r-08d9fd13   running    epsilon    16a8c5dc-4b44-4100-94f6-0995c0a85b8b   instance-manager-c8e888869e140a5618feabb01783baaa   longhornio/longhorn-engine:v1.5.0   4d3h
pvc-47c0f6df-3b36-456c-a650-991d7131fa82-r-047c7b26   running    epsilon    16a8c5dc-4b44-4100-94f6-0995c0a85b8b   instance-manager-c8e888869e140a5618feabb01783baaa   longhornio/longhorn-engine:v1.5.0   4d4h
pvc-747fcde6-1e35-48de-9b30-74d601870c43-r-57e62375   stopped                                                                                                                                              8m9s
pvc-747fcde6-1e35-48de-9b30-74d601870c43-r-f4cbbf91   stopped    rasnu1     085c5398-ff0d-4aae-a3ee-2a9133e1c564                                                                                           8m9s
pvc-747fcde6-1e35-48de-9b30-74d601870c43-r-6b053823   stopped    rassigma   c5f7d455-1486-4015-8e3f-2f6d5112d11b                                                                                           8m9s
pvc-f7fda1f2-897b-4310-8548-c1ad64040ea1-r-538a3bb8   stopped    rassigma   c5f7d455-1486-4015-8e3f-2f6d5112d11b                                                                                           3s
pvc-58e326c8-3777-42ad-89cf-9eacf0ba0fb7-r-4e1c8d50   stopped    rassigma   c5f7d455-1486-4015-8e3f-2f6d5112d11b                                                                                           2s
pvc-acd10539-0474-49b6-837b-b985c65b8925-r-9f12f6ee   stopped    rassigma   c5f7d455-1486-4015-8e3f-2f6d5112d11b                                                                                           2s
pvc-47c0f6df-3b36-456c-a650-991d7131fa82-r-ab2b9023   stopped    rassigma   c5f7d455-1486-4015-8e3f-2f6d5112d11b                                                                                           2s
pvc-f7fda1f2-897b-4310-8548-c1ad64040ea1-r-fea692f8   stopped    rasnu1     085c5398-ff0d-4aae-a3ee-2a9133e1c564                                                                                           2s
pvc-58e326c8-3777-42ad-89cf-9eacf0ba0fb7-r-18bd95b9   starting   rasnu1     085c5398-ff0d-4aae-a3ee-2a9133e1c564   instance-manager-754ee9c224195750355753939287ef17                                       2s
pvc-acd10539-0474-49b6-837b-b985c65b8925-r-1a9ef538   starting   rasnu1     085c5398-ff0d-4aae-a3ee-2a9133e1c564   instance-manager-754ee9c224195750355753939287ef17                                       3s
pvc-47c0f6df-3b36-456c-a650-991d7131fa82-r-44a185a6   error      rasnu1     085c5398-ff0d-4aae-a3ee-2a9133e1c564   instance-manager-754ee9c224195750355753939287ef17                                       2s

[...wait a few seconds...]

$ kubectl get replicas.longhorn.io -n longhorn-system
NAME                                                  STATE     NODE       DISK                                   INSTANCEMANAGER                                     IMAGE                               AGE
pvc-58e326c8-3777-42ad-89cf-9eacf0ba0fb7-r-6bb2c2a1   running   epsilon    16a8c5dc-4b44-4100-94f6-0995c0a85b8b   instance-manager-c8e888869e140a5618feabb01783baaa   longhornio/longhorn-engine:v1.5.0   14h
pvc-f7fda1f2-897b-4310-8548-c1ad64040ea1-r-adad8378   running   epsilon    16a8c5dc-4b44-4100-94f6-0995c0a85b8b   instance-manager-c8e888869e140a5618feabb01783baaa   longhornio/longhorn-engine:v1.5.0   4d3h
pvc-acd10539-0474-49b6-837b-b985c65b8925-r-08d9fd13   running   epsilon    16a8c5dc-4b44-4100-94f6-0995c0a85b8b   instance-manager-c8e888869e140a5618feabb01783baaa   longhornio/longhorn-engine:v1.5.0   4d3h
pvc-47c0f6df-3b36-456c-a650-991d7131fa82-r-047c7b26   running   epsilon    16a8c5dc-4b44-4100-94f6-0995c0a85b8b   instance-manager-c8e888869e140a5618feabb01783baaa   longhornio/longhorn-engine:v1.5.0   4d4h
pvc-747fcde6-1e35-48de-9b30-74d601870c43-r-57e62375   stopped                                                                                                                                             8m51s
pvc-747fcde6-1e35-48de-9b30-74d601870c43-r-f4cbbf91   stopped   rasnu1     085c5398-ff0d-4aae-a3ee-2a9133e1c564                                                                                           8m51s
pvc-747fcde6-1e35-48de-9b30-74d601870c43-r-6b053823   stopped   rassigma   c5f7d455-1486-4015-8e3f-2f6d5112d11b                                                                                           8m51s
pvc-58e326c8-3777-42ad-89cf-9eacf0ba0fb7-r-64b001c3   stopped                                                                                                                                             1s
pvc-47c0f6df-3b36-456c-a650-991d7131fa82-r-4755af1d   error     rasnu1     085c5398-ff0d-4aae-a3ee-2a9133e1c564   instance-manager-754ee9c224195750355753939287ef17                                       2s
pvc-acd10539-0474-49b6-837b-b985c65b8925-r-f4e63118   stopped   rasnu1     085c5398-ff0d-4aae-a3ee-2a9133e1c564                                                                                           2s
pvc-f7fda1f2-897b-4310-8548-c1ad64040ea1-r-48f0d951   stopped                                                                                                                                             1s
pvc-58e326c8-3777-42ad-89cf-9eacf0ba0fb7-r-b9e06832   error     rassigma   c5f7d455-1486-4015-8e3f-2f6d5112d11b   instance-manager-c0b889266e751abfd03088e8529cacb0                                       3s
pvc-f7fda1f2-897b-4310-8548-c1ad64040ea1-r-734c1ea0   error     rassigma   c5f7d455-1486-4015-8e3f-2f6d5112d11b   instance-manager-c0b889266e751abfd03088e8529cacb0                                       3s
pvc-47c0f6df-3b36-456c-a650-991d7131fa82-r-f0cc49c8   error     rassigma   c5f7d455-1486-4015-8e3f-2f6d5112d11b   instance-manager-c0b889266e751abfd03088e8529cacb0                                       3s
pvc-acd10539-0474-49b6-837b-b985c65b8925-r-887183c1   stopped   rassigma   c5f7d455-1486-4015-8e3f-2f6d5112d11b                                                                                           1s

One of the Pi’s disks is nearly full (with non-Longhorn data) so I’ve disabled it, but there’s plenty of room on the others:

Logs from longhorn-manager pod do show some recurring errors, but I’m not sure how to respond to either:

time="2023-07-21T07:07:22Z" level=debug msg="Requeue volume due to error <nil> or Operation cannot be fulfilled on replicas.longhorn.io \"pvc-58e326c8-3777-42ad-89cf-9eacf0ba0fb7-r-e645e973\": the object has been modified; please apply your changes to the latest version and try again" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=epsilon owner=epsilon state=attached volume=pvc-58e326c8-3777-42ad-89cf-9eacf0ba0fb7
...
time="2023-07-21T07:07:22Z" level=error msg="There's no available disk for replica pvc-747fcde6-1e35-48de-9b30-74d601870c43-r-57e62375, size 536870912000"
time="2023-07-21T07:07:22Z" level=warning msg="Failed to schedule replica" accessMode=rwx controller=longhorn-volume frontend=blockdev migratable=false node=epsilon owner=epsilon replica=pvc-747fcde6-1e35-48de-9b30-74d601870c43-r-57e62375 shareEndpoint= shareState=stopped state=detached volume=pvc-747fcde6-1e35-48de-9b30-74d601870c43

Similar lines reoccur for replica IDs.

I’ll generate a Support Bundle now.

EDIT: I tried uploading the Support Bundle (~600M) to Git LFS, but that was apparently too large. Any suggestions on alternative upload locations welcome.
EDIT2: GitHub issue opened here

scubbo · July 25, 2023, 12:21am

Resolved. Softlinks were used on the (bare-metal) hosts to create a unified filesystem interface to (otherwise disparately-named) mountpoints, and (according to this issue) softlinks do not play nice with Longhorn’s Instance Managers. I replaced the softlinks with mount --rbind, and Volumes are now healthy.

Topic		Replies	Views
Longhorn volume is degraded, while all replicas are running Longhorn	3	9412	July 29, 2019
Longhorn dropping pods Longhorn	3	3424	November 4, 2021
Longhorn Volume keep faulted on AWS but no issue on baremetal Longhorn	0	779	September 20, 2022
Faulted Longhorn Volumes After Kubernetes Upgrade on DigitalOcean Longhorn	0	20	December 5, 2024
Longhorn volume degraded, Replica Scheduling Failure, Error Message: precheck new replica failed Rancher	1	152	December 10, 2024

Volumes constantly degraded, replicas "flapping"

Related topics