Presentation:
Container(s) won’t start and emit various errors pertaining to ‘stale’ NFS mounts. Executing df -h within the rancher-nfs container exec shell shows mount points reporting ‘stale NFS file handle.’
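For illustration, the stale entries typically surface in the df -h output as errors of the form below (the path is the placeholder mount used later in this post, and depending on kernel/libc version the message may read ‘Stale file handle’ instead):
df: /var/lib/rancher/volumes/rancher-nfs/DemoVolume: Stale NFS file handle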
Cause:
In our case, we did a live migration of NFS mounts on the NFS server that resulted in a 1-3 second export interruption. Any circumstance that interrupts communication between rancher-nfs and your network storage could have the same effect. No other services or applications within our platform (outside of rancher/docker) were affected by this specific transition.
Explanation Definitions:
Volume - This is the name of the volume that represents the NFS mount point in Rancher. For the example below, we’ll call this ‘DemoVolume’
NFS Target - This is the remote NFS server export that the driver and volume combination points to. For the example below, we’ll call this ‘nfsServer01.localdomain.com:/RaidVol1/Docker/Rancher/DemoConfig’
Local Mount Path - This is the local filesystem path inside the container where the remote NFS Target is mounted. For the example below, we’ll call this ‘/home/user3/config’
Rancher-NFS Mount Path - This is the path on the rancher-nfs container where the NFS Target is mounted and the Rancher Volume is created. For the example below, we’ll call this ‘/var/lib/rancher/volumes/rancher-nfs/DemoVolume’
Volume Mount - This is the combination of Volume and Local Mount Path that you provide to Rancher during container creation.
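To make these definitions concrete, here is a minimal sketch of how a Volume Mount is expressed when starting a container from the Docker CLI on a host where the rancher-nfs driver is available (the container name, image, and sleep command are arbitrary placeholders of mine, not part of the original setup; in practice you would normally do this through the Rancher GUI):
# The Volume 'DemoVolume' is assumed to already exist in Rancher with the rancher-nfs
# driver, pointing at the NFS Target nfsServer01.localdomain.com:/RaidVol1/Docker/Rancher/DemoConfig.
# The Volume Mount 'DemoVolume:/home/user3/config' mounts that Volume at the
# Local Mount Path /home/user3/config inside the new container.
docker run -d --name demo-container -v DemoVolume:/home/user3/config alpine sleep 86400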
Resolution:
- From the rancher-nfs container exec shell, run
umount -l /var/lib/rancher/volumes/rancher-nfs/DemoVolume
where the path is the Rancher-NFS Mount Path of the affected Volume.
- From the Rancher management GUI (or CLI), create a new dummy container that mounts the affected Volume at any Local Mount Path, e.g. ‘DemoVolume:/home/user3/config’.
- Start the container.
- From the rancher-nfs container exec shell, run
df -h
and note that the NFS mount appears again and is healthy (a consolidated sketch of these shell steps follows this list).
- Restart real/production containers leveraging the affected volume(s) if needed; we saw some containers pick up the restored volume(s) automatically and others that did not.
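For reference, the shell portion of the steps above, run from the rancher-nfs container exec shell, looks roughly like the following (volume name and paths are the placeholders from the definitions; step 2 is done from the Rancher GUI/CLI, not from this shell):
# 1. Confirm the stale mount, then lazy-unmount it at the Rancher-NFS Mount Path.
df -h
umount -l /var/lib/rancher/volumes/rancher-nfs/DemoVolume

# 2. (From the Rancher GUI/CLI) create and start a dummy container with the
#    Volume Mount 'DemoVolume:/home/user3/config'.

# 3. Back in this shell, verify the NFS Target is mounted again and healthy.
df -h
df -h | grep DemoVolume   # optionally filter for just the affected volume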
If you have any questions, leave them here. I couldn’t find any guidance on the forums for fixing this issue, so this workaround is better than nothing for the time being. Hopefully, as it did in our case, it will save you from rebooting every host in your environment.