Hi folks,
Any help would be really appreciated.
I have the following environment:
kubectl version
Client Version: v1.28.6
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.5+k3s1
4 nodes: 2 dedicated to generic pods, 2 dedicated to Longhorn volumes.
I have a Longhorn volume created as ReadWriteMany.
I have one deployment with pods running across the 2 generic-pod nodes, all accessing the ReadWriteMany volume. Everything works as intended.
The installation was based on the default configuration.
Now the tricky part that makes everything so confusing: as soon as I add a new node, no pod running on that node is able to mount the volume (see the error message below). All other pods running on the existing nodes still work fine.
Mounting arguments: mount -t nfs -o vers=4.1,noresvport,intr,hard 10.43.40.29:/longhorn-data /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/edf8a47b178e53d854331a01716c090cf6bf6f2cc5ce2fc0e99d057589d1b542/globalmount
Output: mount.nfs: Protocol not supported
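To take the pod/CSI layer out of the picture, I plan to repeat the same mount by hand from the new node, with the same endpoint and options as above (/mnt/test is just a scratch directory I would create for the test, and the vers=4.0 attempt is only to see whether the error changes, since the share-manager may export 4.1 only):

# on the new node, mirroring the failing CSI mount
mkdir -p /mnt/test
mount -t nfs -o vers=4.1,noresvport,intr,hard 10.43.40.29:/longhorn-data /mnt/test
# and, for comparison, with an older protocol version
mount -t nfs -o vers=4.0 10.43.40.29:/longhorn-data /mnt/test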
Each node reports the following:
node01 - nfsstat -s
Server rpc stats:
calls badcalls badfmt badauth badclnt
0 0 0 0 0
Client rpc stats:
calls retrans authrefrsh
131095257 2 131113216
Client nfs v4:
null read write commit open
10 0% 3549325 2% 17609021 13% 9141 0% 3933223 3%
node02 - nfsstat -s
Server rpc stats:
calls badcalls badfmt badauth badclnt
0 0 0 0 0
Client rpc stats:
calls retrans authrefrsh
18727523 0 18729915
Client nfs v4:
null read write commit open
2 0% 11 0% 3244833 17% 240 0% 954546 5%
node03 - nfsstat -s (the problem node)
Client rpc stats:
calls retrans authrefrsh
49 0 49
Client nfs v4:
null read write commit open
7 14% 0 0% 0 0% 0 0% 0 0%
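Since node03 shows almost no NFS client activity at all, my working assumption (not confirmed) is that something in its NFS client stack differs from node01/node02, so I intend to compare them roughly like this:

# run on node03 and on a working node, then diff the results
lsmod | grep nfs                                          # which NFS modules are loaded
modprobe nfs && modprobe nfsv4                            # can the v4 client module be loaded at all?
grep -E 'CONFIG_NFS_V4(_1)?=' /boot/config-$(uname -r)    # kernel built with the NFSv4.1 client? (config path assumes a Debian/Ubuntu-style kernel package)
dpkg -l nfs-common 2>/dev/null || rpm -q nfs-utils        # userspace mount.nfs helper installed?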
Node00 - systemctl status iscsid.service
● iscsid.service - iSCSI initiator daemon (iscsid)
Loaded: loaded (/lib/systemd/system/iscsid.service; disabled; vendor preset: enabled)
Active: active (running) since Thu 2024-02-15 18:42:58 PST; 11s ago
TriggeredBy: ● iscsid.socket
Docs: man:iscsid(8)
Process: 3302780 ExecStartPre=/lib/open-iscsi/startup-checks.sh (code=exited, status=0/SUCCESS)
Process: 3302793 ExecStart=/sbin/iscsid (code=exited, status=0/SUCCESS)
Main PID: 3302795 (iscsid)
Tasks: 2 (limit: 9269)
Memory: 3.2M
CGroup: /system.slice/iscsid.service
├─3302794 /sbin/iscsid
└─3302795 /sbin/iscsid
Feb 15 18:42:58 chromebox00 systemd[1]: Starting iSCSI initiator daemon (iscsid)...
Feb 15 18:42:58 chromebox00 iscsid[3302793]: iSCSI logger with pid=3302794 started!
Feb 15 18:42:58 chromebox00 systemd[1]: iscsid.service: Failed to parse PID from file /run/iscsid.pid: Invalid argument
Feb 15 18:42:58 chromebox00 systemd[1]: Started iSCSI initiator daemon (iscsid).
Feb 15 18:42:59 chromebox00 iscsid[3302794]: iSCSI daemon with pid=3302795 started!
Node01 - systemctl status iscsid.service
● iscsid.service - iSCSI initiator daemon (iscsid)
Loaded: loaded (/lib/systemd/system/iscsid.service; disabled; vendor preset: enabled)
Active: active (running) since Thu 2024-02-08 10:13:38 PST; 1 weeks 0 days ago
TriggeredBy: ● iscsid.socket
Docs: man:iscsid(8)
Main PID: 4161 (iscsid)
Tasks: 2 (limit: 9269)
Memory: 3.6M
CGroup: /system.slice/iscsid.service
├─4160 /sbin/iscsid
└─4161 /sbin/iscsid
Node02 - systemctl status iscsid.service
● iscsid.service - iSCSI initiator daemon (iscsid)
Loaded: loaded (/lib/systemd/system/iscsid.service; disabled; vendor preset: enabled)
Active: active (running) since Fri 2024-02-16 02:44:06 UTC; 2s ago
TriggeredBy: ● iscsid.socket
Docs: man:iscsid(8)
Process: 23691 ExecStartPre=/lib/open-iscsi/startup-checks.sh (code=exited, status=0/SUCCESS)
Process: 23694 ExecStart=/sbin/iscsid (code=exited, status=0/SUCCESS)
Main PID: 23696 (iscsid)
Tasks: 2 (limit: 37645)
Memory: 2.8M
CPU: 8ms
CGroup: /system.slice/iscsid.service
├─23695 /sbin/iscsid
└─23696 /sbin/iscsid
Feb 16 02:44:06 chromebox02 systemd[1]: Starting iSCSI initiator daemon (iscsid)...
Feb 16 02:44:06 chromebox02 iscsid[23694]: iSCSI logger with pid=23695 started!
Feb 16 02:44:06 chromebox02 systemd[1]: Started iSCSI initiator daemon (iscsid).
Feb 16 02:44:07 chromebox02 iscsid[23695]: iSCSI daemon with pid=23696 started!
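iscsid looks healthy on every node, but as far as I understand a ReadWriteMany Longhorn volume is exported over NFS by a share-manager pod rather than over iSCSI, so I also want to confirm which pod and service sit behind the 10.43.40.29 endpoint from the error above. This is just how I plan to look it up, not output I already have:

# identify the share-manager pod and the service that owns 10.43.40.29
kubectl -n longhorn-system get pods -o wide | grep share-manager
kubectl -n longhorn-system get svc | grep 10.43.40.29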
My PersistentVolume:
kind: PersistentVolume
apiVersion: v1
metadata:
  name: longhorn-data
  uid: 21be760f-f2f3-4dde-ae09-a179568c7bd4
  resourceVersion: '246513'
  creationTimestamp: '2024-02-04T20:12:19Z'
  annotations:
    longhorn.io/volume-scheduling-error: ''
    pv.kubernetes.io/bound-by-controller: 'yes'
  finalizers:
    - kubernetes.io/pv-protection
    - external-attacher/driver-longhorn-io
  managedFields:
    - manager: k3s
      operation: Update
      apiVersion: v1
      time: '2024-02-04T20:12:20Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:annotations:
            f:pv.kubernetes.io/bound-by-controller: {}
        f:spec:
          f:claimRef:
            .: {}
            f:apiVersion: {}
            f:kind: {}
            f:name: {}
            f:namespace: {}
            f:resourceVersion: {}
            f:uid: {}
    - manager: k3s
      operation: Update
      apiVersion: v1
      time: '2024-02-04T20:12:20Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:status:
          f:phase: {}
      subresource: status
    - manager: csi-attacher
      operation: Update
      apiVersion: v1
      time: '2024-02-04T20:23:40Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:finalizers:
            v:"external-attacher/driver-longhorn-io": {}
    - manager: longhorn-manager
      operation: Update
      apiVersion: v1
      time: '2024-02-05T00:13:49Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:annotations:
            .: {}
            f:longhorn.io/volume-scheduling-error: {}
        f:spec:
          f:accessModes: {}
          f:capacity:
            .: {}
            f:storage: {}
          f:csi:
            .: {}
            f:driver: {}
            f:fsType: {}
            f:volumeAttributes:
              .: {}
              f:diskSelector: {}
              f:migratable: {}
              f:nodeSelector: {}
              f:numberOfReplicas: {}
              f:staleReplicaTimeout: {}
            f:volumeHandle: {}
          f:persistentVolumeReclaimPolicy: {}
          f:storageClassName: {}
          f:volumeMode: {}
spec:
  capacity:
    storage: 400Gi
  csi:
    driver: driver.longhorn.io
    volumeHandle: longhorn-data
    fsType: ext4
    volumeAttributes:
      diskSelector: ''
      migratable: 'false'
      nodeSelector: ''
      numberOfReplicas: '3'
      staleReplicaTimeout: '20'
  accessModes:
    - ReadWriteMany
  claimRef:
    kind: PersistentVolumeClaim
    namespace: default
    name: longhorn-data
    uid: af5bfd17-b1b1-437a-a3b1-a9c90ba724f7
    apiVersion: v1
    resourceVersion: '221539'
  persistentVolumeReclaimPolicy: Retain
  storageClassName: longhorn-static
  volumeMode: Filesystem
status:
  phase: Bound
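The PV itself looks the same as it always did, so the other thing I want to rule out is that the Longhorn daemonset pods (CSI plugin, instance managers) never landed on the new node. Roughly like this, where node03 is a placeholder for the new node's actual name:

# confirm the Longhorn daemonsets cover the new node
kubectl -n longhorn-system get daemonset
kubectl -n longhorn-system get pods -o wide --field-selector spec.nodeName=node03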
My volume specs (from the Longhorn UI):
State: Attached
Health: Healthy
Ready for workload: Ready
Conditions: scheduled
Frontend: Block Device
Backend Data Engine: v1
Offline Replica Rebuilding: disabled
Attached Node & Endpoint: diskserver00, /dev/longhorn/longhorn-data
Size: 400 Gi
Actual Size: 262 Gi
Data Locality: disabled
Access Mode: ReadWriteMany
Engine Image: longhornio/longhorn-engine:v1.5.1
Created: 11 days ago
Encrypted: False
Node Tags:
Disk Tags:
Last Backup:
Last Backup At:
Replicas Auto Balance: ignored
Snapshot Data Integrity: ignored
Instance Manager: instance-manager-6fcf05c270e6a8475ad7300181f1b8cb
Allow snapshots removal during trim: ignored
Replica Soft Anti Affinity: ignored
Replica Zone Soft Anti Affinity: ignored
Namespace: default
PVC Name: longhorn-data
PV Name: longhorn-data
PV Status: Bound
Revision Counter Disabled: False
Pod Name: crawler-master-6bbd6776f5-bdqwz
Pod Status: Running
Workload Name: crawler-master-6bbd6776f5
Workload Type: ReplicaSet
Pod Name: crawler-worker-7b9f6b4785-2g697
Pod Status: Pending
Workload Name: crawler-worker-7b9f6b4785
Workload Type: ReplicaSet
Pod Name: crawler-worker-7b9f6b4785-7xgjz
Pod Status: Running
Workload Name: crawler-worker-7b9f6b4785
Workload Type: ReplicaSet
Pod Name: crawler-worker-7b9f6b4785-bs26x
Pod Status: Running
Workload Name: crawler-worker-7b9f6b4785
Workload Type: ReplicaSet
Pod Name: crawler-worker-7b9f6b4785-vd6tc
Pod Status: Running
Workload Name: crawler-worker-7b9f6b4785
Workload Type: ReplicaSet
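For reference, the Pending worker above (crawler-worker-7b9f6b4785-2g697) is, I assume, the one scheduled on the new node; the mount error at the top of this post shows up in its events, which I'm pulling with:

kubectl -n default describe pod crawler-worker-7b9f6b4785-2g697
kubectl -n default get events --field-selector involvedObject.name=crawler-worker-7b9f6b4785-2g697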