Error after migration

Hello, some time ago, after one of the upgrade migrations (I’m currently using 1.2.2), the Backup tab stopped working, showing an error:

error listing backup volume names: Failed to execute: /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-v1.2.2/longhorn [backup ls --volume-only nfs://omv:/backup/longhorn], output Failed to execute: ls [-1 /var/lib/longhorn-backupstore-mounts/omv/backup/longhorn/backupstore/volumes], output ls: cannot access '/var/lib/longhorn-backupstore-mounts/omv/backup/longhorn/backupstore/volumes': Permission denied , error exit status 2 , stderr, time="2021-10-14T16:29:34Z" level=warning msg="failed to list first level dirs for path: backupstore/volumes reason: Failed to execute: ls [-1 /var/lib/longhorn-backupstore-mounts/omv/backup/longhorn/backupstore/volumes], output ls: cannot access '/var/lib/longhorn-backupstore-mounts/omv/backup/longhorn/backupstore/volumes': Permission denied\n, error exit status 2" pkg=backupstore time="2021-10-14T16:29:34Z" level=error msg="Failed to execute: ls [-1 /var/lib/longhorn-backupstore-mounts/omv/backup/longhorn/backupstore/volumes], output ls: cannot access '/var/lib/longhorn-backupstore-mounts/omv/backup/longhorn/backupstore/volumes': Permission denied\n, error exit status 2" , error exit status 1

Besides that, everything works as expected.

Here’s my deployment: https://github.com/Marx2/homelab/blob/main/cluster/core/longhorn/helm-release.yaml

Can you help me fix this problem?

Edit: I’m using 3 nodes.
The folder /var/lib/longhorn-backupstore-mounts/omv/backup/longhorn/backupstore/volumes exists on only one of them, and even as root I can’t access it:

root@longhorn-manager-chksw:/# ls -al /var/lib/longhorn-backupstore-mounts/omv/backup/longhorn/backupstore/
ls: cannot open directory '/var/lib/longhorn-backupstore-mounts/omv/backup/longhorn/backupstore/': Permission denied
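
A common cause of `Permission denied` for root over NFS is `root_squash` on the export, which maps the client’s root to `nobody`. A hedged way to check this (the directory path is from the error above; the `/etc/exports` location on the server is an assumption, since OMV normally manages exports through its UI):

```shell
# On the node where the mount exists: check ownership and mode as the NFS
# client sees them. With root_squash active, root is mapped to nobody and
# is denied on directories that are not world-readable.
stat -c '%U:%G %a %n' /var/lib/longhorn-backupstore-mounts/omv/backup/longhorn/backupstore

# On the NFS server: look for root_squash vs no_root_squash on the export
# (file location is an assumption; adjust for your server).
grep longhorn /etc/exports
```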

Could you please go to the node and try to unmount the NFS mount point manually? Then Longhorn will remount it when it needs to access the remote backup target.
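
For reference, that step might look like this on the affected node (the mount path is taken from the error output in this thread):

```shell
# Find the Longhorn backupstore mount(s) on this node.
mount | grep longhorn-backupstore-mounts

# Unmount it; Longhorn remounts on the next backup-target access.
umount /var/lib/longhorn-backupstore-mounts/omv/backup/longhorn

# If the mount point is busy, a lazy unmount detaches it once idle:
# umount -l /var/lib/longhorn-backupstore-mounts/omv/backup/longhorn
```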

Hi, I did it.
The mount is:

/dev/mapper/pve-root on /var/lib/longhorn-setting type ext4 (ro,relatime,errors=remount-ro)
omv:/backup/longhorn on /var/lib/longhorn-backupstore-mounts/omv/backup/longhorn type nfs4 (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.42.1.61,local_lock=none,addr=192.168.1.230)

After unmounting, I went to the Dashboard and it was remounted automatically like this:

/dev/mapper/pve-root on /var/lib/longhorn-setting type ext4 (ro,relatime,errors=remount-ro)
omv:/backup/longhorn on /var/lib/longhorn-backupstore-mounts/omv/backup/longhorn type nfs4 (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,acregmin=1,acregmax=1,acdirmin=1,acdirmax=1,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.42.1.61,local_lock=none,addr=192.168.1.230)

Unfortunately, going into the Backup tab still shows the same error. It’s also visible in the manager’s logs:

time="2021-10-15T05:59:34Z" level=error msg="Error listing backup volumes from backup target" controller=longhorn-backup-target cred= error="error listing backup volume names: Failed to execute: /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-v1.2.2/longhorn [backup ls --volume-only nfs://omv:/backup/longhorn], output Failed to execute: ls [-1 /var/lib/longhorn-backupstore-mounts/omv/backup/longhorn/backupstore/volumes], output ls: cannot access '/var/lib/longhorn-backupstore-mounts/omv/backup/longhorn/backupstore/volumes': Permission denied\n, error exit status 2\n, stderr, time=\"2021-10-15T05:59:34Z\" level=warning msg=\"failed to list first level dirs for path: backupstore/volumes reason: Failed to execute: ls [-1 /var/lib/longhorn-backupstore-mounts/omv/backup/longhorn/backupstore/volumes], output ls: cannot access '/var/lib/longhorn-backupstore-mounts/omv/backup/longhorn/backupstore/volumes': Permission denied\\n, error exit status 2\" pkg=backupstore\ntime=\"2021-10-15T05:59:34Z\" level=error msg=\"Failed to execute: ls [-1 /var/lib/longhorn-backupstore-mounts/omv/backup/longhorn/backupstore/volumes], output ls: cannot access '/var/lib/longhorn-backupstore-mounts/omv/backup/longhorn/backupstore/volumes': Permission denied\\n, error exit status 2\"\n, error exit status 1" interval=5m0s node=wezyr url="nfs://omv:/backup/longhorn"

Could this somehow be connected with this bug?

I can’t test it, because I installed Longhorn from the Helm chart, and I don’t know how to pass the flag:

    - --default-fstype=ext4

I’ve also made this change in the Helm chart, but it doesn’t seem to work:

    values:
      persistence:
        defaultFsType: ext4
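
If the chart were installed with plain Helm rather than through a Flux HelmRelease, the value could be set directly; a minimal sketch, assuming the release is named `longhorn` in the `longhorn-system` namespace (both names are assumptions):

```shell
# Re-apply the chart with only this value changed, keeping existing values.
helm upgrade longhorn longhorn/longhorn \
  --namespace longhorn-system \
  --reuse-values \
  --set persistence.defaultFsType=ext4

# Verify what the deployed release actually received:
helm get values longhorn --namespace longhorn-system
```

With a Flux HelmRelease, the equivalent is to put the same keys under `spec.values` and let the controller reconcile.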

I guess the NFS server is inside the Kubernetes cluster.

  • Can you check how your NFS was deployed?
  • Was the NFS server accessible by other in-cluster Pods after the upgrade?
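
One way to check both points is a quick reachability test; the commands below are illustrative (`omv` is the server name from this thread, and the pod name and image are arbitrary):

```shell
# From a cluster node (showmount ships with the nfs-common package):
# confirm the export is visible and lists the expected path.
showmount -e omv

# From inside the cluster: confirm the NFS hostname resolves for pods too.
kubectl run dns-check --rm -it --restart=Never --image=busybox -- nslookup omv
```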

The bug [BUG] Longhorn 1.2.0 - wrong volume permissions inside container / broken fsGroup · Issue #2964 · longhorn/longhorn · GitHub is not related to this issue.
That bug is fixed in Longhorn v1.2.2, which is the version you are using.

Maybe it’s not fixed? Or maybe it is, but I need to do something myself to make it work (e.g. reset, restart, or reconfigure something)? How can I check?
I suppose reinstalling would fix it, but since I have no working backups, I can’t do that.

Sorry for the late response. The fix only works for newly provisioned PVs.
