Readiness probe failed: sh: /data/longhorn: Permission denied - DaemonSet/engine-image-ei

I have set up a K3s + Longhorn cluster with 4 nodes and two 1 TB Samsung SSDs, attached to 2 of the nodes. However, when I plug an SSD into one of the nodes, the longhornio/longhorn-engine:v1.4.0 pod on that node does not have permission to write to this SSD.

On the two nodes with an SSD attached, I can see that the longhornio/longhorn-engine:v1.4.0 pods are failing to start and are stuck in CrashLoopBackOff due to failed Liveness/Readiness probes. The nodes without SSDs are working correctly.

$> kubectl get pods | grep engine
engine-image-ei-fc06c6fb-q427z                        1/1     Running            3 (45h ago)      20d
engine-image-ei-fc06c6fb-dkkhm                        1/1     Running            6 (38h ago)      20d
engine-image-ei-fc06c6fb-9hpl4                        0/1     Terminating        2229 (17h ago)   20d
engine-image-ei-fc06c6fb-lfv58                        0/1     CrashLoopBackOff   275 (68s ago)    15h
$> kubectl describe pod engine-image-ei-fc06c6fb-lfv58

...
Events:
  Type     Reason     Age                   From     Message
  ----     ------     ----                  ----     -------
  Normal   Pulled     46m (x253 over 15h)   kubelet  Container image "longhornio/longhorn-engine:v1.4.0" already present on machine
  Warning  Unhealthy  31m (x1820 over 15h)  kubelet  Readiness probe failed: ls: cannot access '/data/longhorn': No such file or directory
  Warning  Unhealthy  11m (x803 over 15h)   kubelet  Readiness probe failed: /data/longhorn sh: /data/longhorn: Permission denied
  Warning  BackOff  77s (x3163 over 15h)  kubelet  Back-off restarting failed container

Question
What do I need to do to the SSD to allow the pod to write to it?


– MORE INFO –

Cluster Set Up:
I set up the cluster using the k3s-io/k3s-ansible repo on GitHub.
I have 4 nodes: 1 master and 3 workers. 2 of the nodes have a 1 TB SSD attached.

Longhorn Set Up:

kubectl apply -f https://raw.githubusercontent.com/longhorn/longhorn/master/deploy/longhorn.yaml

All containers are up and running when I run:

$ kubectl -n longhorn-system get pod
NAME                                        READY     STATUS    RESTARTS   AGE
csi-attacher-6fdc77c485-8wlpg               1/1       Running   0          9d
csi-attacher-6fdc77c485-psqlr               1/1       Running   0          9d
engine-image-ei-6e2b0e32-wgkj5              1/1       Running   0          9d
longhorn-csi-plugin-g8r4b                   1/1       Running   0          9d
....

When I set up the ingress controller, I can access the Longhorn Web UI and see all of my nodes.

SSD Set-Up
I have set up the SSDs to auto-mount to /var/lib/longhorn using fstab:

$> sudo nano /etc/fstab

UUID=4eab5a89-40f4-40c4-9bd4-2324c257ba6e /var/lib/longhorn/ ext4 defaults,auto,users,rw,nofail,noatime 0 0

This seems to be working, since when I restart the nodes the storage space jumps from 30GB (SD cards) to ~1.8 Ti. So I know that the drives have mounted correctly and are showing the right amount of storage space. But I spotted that longhornio/longhorn-engine:v1.4.0 will not deploy when the SSD is attached.

$> kubectl get pods | grep engine
engine-image-ei-fc06c6fb-q427z                        1/1     Running            3 (45h ago)      20d
engine-image-ei-fc06c6fb-dkkhm                        1/1     Running            6 (38h ago)      20d
engine-image-ei-fc06c6fb-9hpl4                        0/1     Terminating        2229 (17h ago)   20d
engine-image-ei-fc06c6fb-lfv58                        0/1     CrashLoopBackOff   275 (68s ago)    15h
$> kubectl describe pod engine-image-ei-fc06c6fb-lfv58

Name:         engine-image-ei-fc06c6fb-lfv58
Namespace:    longhorn-system
...
Containers:
  engine-image-ei-fc06c6fb:
    Container ID:  containerd://2a246b55250f91fea65cdd7ae3904258dcb32fe2d6cec2bde87b62e5e1e4f326
    Image:         longhornio/longhorn-engine:v1.4.0
    Image ID:      docker.io/longhornio/longhorn-engine@sha256:8356a9f5900f31d0f771ca10a479dfa10c8a88dd9a1760bbb137c7279db9815a
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
    Args:
      -c
      diff /usr/local/bin/longhorn /data/longhorn > /dev/null 2>&1; if [ $? -ne 0 ]; then cp -p /usr/local/bin/longhorn /data/ && echo installed; fi && trap 'rm /data/longhorn* && echo cleaned up' EXIT && sleep infinity
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Mon, 30 Jan 2023 08:17:13 -0500
      Finished:     Mon, 30 Jan 2023 08:17:58 -0500
    Ready:          False
    Restart Count:  275
    Liveness:       exec [sh -c /data/longhorn version --client-only] delay=3s timeout=4s period=5s #success=1 #failure=3
    Readiness:      exec [sh -c ls /data/longhorn && /data/longhorn version --client-only] delay=3s timeout=4s period=5s #success=1 #failure=3
    Environment:    <none>
    Mounts:
      /data/ from data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-lg68l (ro)
Volumes:
  data:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-v1.4.0
    HostPathType:  
  kube-api-access-lg68l:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true


Events:
  Type     Reason     Age                  From     Message
  ----     ------     ----                 ----     -------
  Warning  Unhealthy  43m (x803 over 15h)  kubelet  Readiness probe failed: /data/longhorn
sh: /data/longhorn: Permission denied
  Warning  Unhealthy  23m (x1898 over 15h)    kubelet  Readiness probe failed: ls: cannot access '/data/longhorn': No such file or directory
  Warning  BackOff    8m31s (x3253 over 15h)  kubelet  Back-off restarting failed container
  Warning  Unhealthy  3m41s (x824 over 15h)   kubelet  Liveness probe failed: sh: /data/longhorn: Permission denied

Solution 1: vfat/ext4
I have tried formatting the drive as both vfat and ext4, and neither has worked.

Solution 2: dmask/fmask

I have tried messing with the fstab by adding uid=1000,gid=100,dmask=000,fmask=111, but I’m not 100% sure what I’m doing with these config settings, so it was a bit of a long shot. It didn’t resolve my issue anyway.

https://help.ubuntu.com/community/Fstab

UUID=4eab5a89-40f4-40c4-9bd4-2324c257ba6e /var/lib/longhorn/ ext4 defaults,auto,users,rw,nofail,noatime,dmask=000,fmask=111 0 0

DEBUG INFO: ls -ls HOSTPATH

ls -ls  /var /var/lib /var/lib/longhorn /var/lib/longhorn/engine-binaries  /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-v1.4.0
/var:
total 102436
     4 drwxr-xr-x  2 root root       4096 Jan 30 06:25 backups
     4 drwxr-xr-x 12 root root       4096 Sep 21 22:17 cache
     4 drwxr-xr-x 50 root root       4096 Jan 11 11:56 lib
     4 drwxrwsr-x  2 root staff      4096 Sep  3 07:10 local
     0 lrwxrwxrwx  1 root root          9 Sep 21 21:49 lock -> /run/lock
     4 drwxr-xr-x 12 root root       4096 Jan 30 00:00 log
     4 drwxrwsr-x  2 root mail       4096 Sep 21 21:49 mail
     4 drwxr-xr-x  2 root root       4096 Sep 21 21:49 opt
     0 lrwxrwxrwx  1 root root          4 Sep 21 21:49 run -> /run
     4 drwxr-xr-x  5 root root       4096 Sep 21 21:58 spool
102400 -rw-------  1 root root  104857600 Sep 21 22:17 swap
     4 drwxrwxrwt  5 root root       4096 Jan 30 00:00 tmp

/var/lib:
total 192
.... 
4 drwxr-x---  7 root    root    4096 Jan  9 16:36 kubelet
4 drwxr-x---  5 lightdm lightdm 4096 Jan  9 16:03 lightdm
4 drwxr-xr-x  2 root    root    4096 Jan 30 00:00 logrotate
4 drwxr-xr-x  4 root    root    4096 Jan 30 07:23 longhorn
4 drwxr-xr-x  2 root    root    4096 Sep 21 21:53 man-db
4 drwxr-xr-x  2 root    root    4096 Sep  3 07:10 misc
4 drwx------  2 root    root    4096 Mar 21  2022 NetworkManager
4 drwxr-xr-x  4 root    root    4096 Sep 21 21:52 nfs
....

/var/lib/longhorn:
total 8
4 drwxr-xr-x 3 root root 4096 Jan 29 15:26 engine-binaries
4 drwxr-xr-x 2 root root 4096 Jan 29 15:31 unix-domain-socket

/var/lib/longhorn/engine-binaries:
total 4
4 drwxr-xr-x 2 root root 4096 Jan 30 07:23 longhornio-longhorn-engine-v1.4.0

/var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-v1.4.0:
total 24864
24864 -rwxr-xr-x 1 root root 25457888 Dec 29 16:56 longhorn

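This is stale, but for future reference in case somebody else struggles with this: check your mount options in /etc/fstab. My Longhorn failed because the security folks insisted on the noexec mount flag. With noexec set, the engine binary under /var/lib/longhorn cannot be executed, which shows up as this Permission denied error in the probes.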
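
A minimal sketch of how to check for this on an affected node. One thing worth noting about the fstab line in the question: according to mount(8), the `users` option implies noexec, nosuid and nodev unless a later option overrides it, so it may be what sets the flag here even though noexec is never written out. The helper names `has_noexec` and `check_mount` are made up for illustration; `findmnt` comes from util-linux and is assumed to be installed.

```shell
#!/bin/sh
# has_noexec OPTIONS
# Succeeds if the comma-separated mount option list contains "noexec"
# as a whole option (so "exec" alone does not match).
has_noexec() {
    case ",$1," in
        *,noexec,*) return 0 ;;
        *) return 1 ;;
    esac
}

# check_mount PATH
# Prints whether the filesystem that holds PATH is mounted noexec,
# based on the effective options reported by findmnt.
check_mount() {
    opts=$(findmnt -no OPTIONS --target "$1") || return 2
    if has_noexec "$opts"; then
        echo "$1: mounted noexec ($opts)"
    else
        echo "$1: exec allowed ($opts)"
    fi
}

# Example, run on the node with the SSD attached:
#   check_mount /var/lib/longhorn
#
# Temporary fix that lasts until the next reboot or remount:
#   sudo mount -o remount,exec /var/lib/longhorn
```

If the check reports noexec, a permanent fix would be to change the options in /etc/fstab, for example by removing `users` or by adding `exec` after it, and then remounting. Whether that is acceptable depends on why the flag was set in the first place.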