I have set up a K3s + Longhorn cluster with 4 nodes and two 1 TB Samsung SSDs attached to 2 of the nodes. However, when I plug an SSD into one of those nodes, longhornio/longhorn-engine:v1.4.0 does not have permission to write to it.
On the two nodes with an SSD attached, the longhornio/longhorn-engine:v1.4.0 pods fail to start and are stuck in CrashLoopBackOff because their liveness/readiness probes fail. The nodes without SSDs are working correctly.
$> kubectl get pods | grep engine
engine-image-ei-fc06c6fb-q427z 1/1 Running 3 (45h ago) 20d
engine-image-ei-fc06c6fb-dkkhm 1/1 Running 6 (38h ago) 20d
engine-image-ei-fc06c6fb-9hpl4 0/1 Terminating 2229 (17h ago) 20d
engine-image-ei-fc06c6fb-lfv58 0/1 CrashLoopBackOff 275 (68s ago) 15h
$> kubectl describe pod engine-image-ei-fc06c6fb-lfv58
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Pulled 46m (x253 over 15h) kubelet Container image "longhornio/longhorn-engine:v1.4.0" already present on machine
Warning Unhealthy 31m (x1820 over 15h) kubelet Readiness probe failed: ls: cannot access '/data/longhorn': No such file or directory
Warning Unhealthy 11m (x803 over 15h) kubelet Readiness probe failed: /data/longhorn sh: /data/longhorn: Permission denied
Warning BackOff 77s (x3163 over 15h) kubelet Back-off restarting failed container
Question
What do I need to do to the SSD to allow the pod to write to it?
– MORE INFO –
Cluster Set Up:
I set up the cluster using the k3s-io/k3s-ansible repo: https://github.com/k3s-io/k3s-ansible
I have 4 nodes: 1 master and 3 workers, and 2 of the workers have a 1 TB SSD attached.
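For reference, the Ansible inventory was roughly the following (the IPs are placeholders, and the exact file layout depends on the version of the repo):

[master]
192.168.1.10

[node]
192.168.1.11
192.168.1.12
192.168.1.13

[k3s_cluster:children]
master
node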
Longhorn Set Up:
kubectl apply -f https://raw.githubusercontent.com/longhorn/longhorn/master/deploy/longhorn.yaml
All containers are up and running when I run:
$ kubectl -n longhorn-system get pod
NAME READY STATUS RESTARTS AGE
csi-attacher-6fdc77c485-8wlpg 1/1 Running 0 9d
csi-attacher-6fdc77c485-psqlr 1/1 Running 0 9d
engine-image-ei-6e2b0e32-wgkj5 1/1 Running 0 9d
longhorn-csi-plugin-g8r4b 1/1 Running 0 9d
....
After setting up the ingress controller I can access the Longhorn web UI and see all of my nodes.
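The ingress is roughly the following, pointing at the longhorn-frontend service that the Longhorn manifest creates (the hostname is a placeholder):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: longhorn-ingress
  namespace: longhorn-system
spec:
  rules:
    - host: longhorn.local
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: longhorn-frontend
                port:
                  number: 80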
SSD Set Up:
I have set up the SSDs to auto-mount to /var/lib/longhorn using fstab:
$> sudo nano /etc/fstab
UUID=4eab5a89-40f4-40c4-9bd4-2324c257ba6e /var/lib/longhorn/ ext4 defaults,auto,users,rw,nofail,noatime 0 0
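After a reboot the mount can be confirmed on each node with, e.g.:

$> findmnt /var/lib/longhorn
$> df -h /var/lib/longhorn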
This seems to be working: when I restart the nodes, the reported storage space jumps from 30 GB (SD cards) to ~1.8 TiB, so I know the drives have mounted correctly and are showing the right amount of storage. But I spotted that longhornio/longhorn-engine:v1.4.0 will not deploy when the SSD is attached (the same failing pods as in the kubectl get pods output above).
$> kubectl describe pod engine-image-ei-fc06c6fb-lfv58
Name: engine-image-ei-fc06c6fb-lfv58
Namespace: longhorn-system
...
Containers:
engine-image-ei-fc06c6fb:
Container ID: containerd://2a246b55250f91fea65cdd7ae3904258dcb32fe2d6cec2bde87b62e5e1e4f326
Image: longhornio/longhorn-engine:v1.4.0
Image ID: docker.io/longhornio/longhorn-engine@sha256:8356a9f5900f31d0f771ca10a479dfa10c8a88dd9a1760bbb137c7279db9815a
Port: <none>
Host Port: <none>
Command:
/bin/bash
Args:
-c
diff /usr/local/bin/longhorn /data/longhorn > /dev/null 2>&1; if [ $? -ne 0 ]; then cp -p /usr/local/bin/longhorn /data/ && echo installed; fi && trap 'rm /data/longhorn* && echo cleaned up' EXIT && sleep infinity
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 137
Started: Mon, 30 Jan 2023 08:17:13 -0500
Finished: Mon, 30 Jan 2023 08:17:58 -0500
Ready: False
Restart Count: 275
Liveness: exec [sh -c /data/longhorn version --client-only] delay=3s timeout=4s period=5s #success=1 #failure=3
Readiness: exec [sh -c ls /data/longhorn && /data/longhorn version --client-only] delay=3s timeout=4s period=5s #success=1 #failure=3
Environment: <none>
Mounts:
/data/ from data (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-lg68l (ro)
Volumes:
data:
Type: HostPath (bare host directory volume)
Path: /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-v1.4.0
HostPathType:
kube-api-access-lg68l:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unhealthy 43m (x803 over 15h) kubelet Readiness probe failed: /data/longhorn
sh: /data/longhorn: Permission denied
Warning Unhealthy 23m (x1898 over 15h) kubelet Readiness probe failed: ls: cannot access '/data/longhorn': No such file or directory
Warning BackOff 8m31s (x3253 over 15h) kubelet Back-off restarting failed container
Warning Unhealthy 3m41s (x824 over 15h) kubelet Liveness probe failed: sh: /data/longhorn: Permission denied
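From the Command/Args above, the container just copies /usr/local/bin/longhorn into the hostPath directory, and the probes then execute it as /data/longhorn. If I understand that correctly, the same failure should be reproducible directly on the host:

$> /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-v1.4.0/longhorn version --client-only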
Attempted solution 1: vfat/ext4
I have tried formatting the drive as both vfat and ext4, and neither has worked.
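For each attempt I reformatted and remounted the drive, roughly as follows (the device name is a placeholder, and the UUID in /etc/fstab has to be updated after each mkfs):

$> sudo umount /var/lib/longhorn
$> sudo mkfs.ext4 /dev/sda1
$> blkid /dev/sda1   # get the new UUID for /etc/fstab
$> sudo mount -a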
Attempted solution 2: dmask/fmask
I have tried tweaking the fstab entry by adding uid=1000,gid=100,dmask=000,fmask=111, but I'm not 100% sure what I'm doing with these options (as far as I can tell they are vfat options and don't apply to ext4), so it was a bit of a long shot, and it didn't resolve my issue anyway.
https://help.ubuntu.com/community/Fstab
UUID=4eab5a89-40f4-40c4-9bd4-2324c257ba6e /var/lib/longhorn/ ext4 defaults,auto,users,rw,nofail,noatime,dmask=000,fmask=111 0 0
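One thing that may be worth checking is the set of options the mount actually ends up with, since the users option in fstab implies noexec, nosuid, and nodev unless overridden:

$> findmnt -o TARGET,SOURCE,FSTYPE,OPTIONS /var/lib/longhorn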
DEBUG INFO: ls -ls of the hostPath
ls -ls /var /var/lib /var/lib/longhorn /var/lib/longhorn/engine-binaries /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-v1.4.0
/var:
total 102436
4 drwxr-xr-x 2 root root 4096 Jan 30 06:25 backups
4 drwxr-xr-x 12 root root 4096 Sep 21 22:17 cache
4 drwxr-xr-x 50 root root 4096 Jan 11 11:56 lib
4 drwxrwsr-x 2 root staff 4096 Sep 3 07:10 local
0 lrwxrwxrwx 1 root root 9 Sep 21 21:49 lock -> /run/lock
4 drwxr-xr-x 12 root root 4096 Jan 30 00:00 log
4 drwxrwsr-x 2 root mail 4096 Sep 21 21:49 mail
4 drwxr-xr-x 2 root root 4096 Sep 21 21:49 opt
0 lrwxrwxrwx 1 root root 4 Sep 21 21:49 run -> /run
4 drwxr-xr-x 5 root root 4096 Sep 21 21:58 spool
102400 -rw------- 1 root root 104857600 Sep 21 22:17 swap
4 drwxrwxrwt 5 root root 4096 Jan 30 00:00 tmp
/var/lib:
total 192
....
4 drwxr-x--- 7 root root 4096 Jan 9 16:36 kubelet
4 drwxr-x--- 5 lightdm lightdm 4096 Jan 9 16:03 lightdm
4 drwxr-xr-x 2 root root 4096 Jan 30 00:00 logrotate
4 drwxr-xr-x 4 root root 4096 Jan 30 07:23 longhorn
4 drwxr-xr-x 2 root root 4096 Sep 21 21:53 man-db
4 drwxr-xr-x 2 root root 4096 Sep 3 07:10 misc
4 drwx------ 2 root root 4096 Mar 21 2022 NetworkManager
4 drwxr-xr-x 4 root root 4096 Sep 21 21:52 nfs
....
/var/lib/longhorn:
total 8
4 drwxr-xr-x 3 root root 4096 Jan 29 15:26 engine-binaries
4 drwxr-xr-x 2 root root 4096 Jan 29 15:31 unix-domain-socket
/var/lib/longhorn/engine-binaries:
total 4
4 drwxr-xr-x 2 root root 4096 Jan 30 07:23 longhornio-longhorn-engine-v1.4.0
/var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-v1.4.0:
total 24864
24864 -rwxr-xr-x 1 root root 25457888 Dec 29 16:56 longhorn