Broken/frozen volume after nfs-storagepool service crash


#1

We’re still using Rancher 1.1.4, and are working on upgrading to Rancher 1.3.x (with Rancher-NFS) so this may not be a huge issue if it cannot be solved, or if it’s due to a known (and fixed) bug in 1.1.4.

We have a Convoy-NFS service, and at some point the convoy-nfs-storagepool service crashed (it showed 0 containers when looking at system stacks). We noticed the problem because a newly created volume hung in the “activating” state. After restarting the convoy-nfs-storagepool service, the volume went from “activating” to “active”, but it has no API actions associated with it – the volume cannot be deleted or purged (which is what I need to do now).

How do I go about forcibly removing the volume? I presume that if the API won’t let me deactivate or delete the volume, I’ll have to remove it directly from the MySQL database – that’s OK, I just need to know which row(s) to remove to make it go away.
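For anyone in the same spot, a minimal sketch of the database route. This assumes the “cattle” schema used by Rancher 1.x, where volumes live in a volume table with state and removed columns – the table and column names here are assumptions, so verify them against your own database before changing anything, and take a backup first. The volume name is a placeholder.

```shell
# Inspect first: confirm the stuck volume's row before touching anything.
# Table/column names (volume, state, removed) are assumed from the
# Rancher 1.x "cattle" schema -- verify against your own database.
mysql -u cattle -p cattle <<'SQL'
SELECT id, name, state, removed FROM volume WHERE name = 'my-stuck-volume';
SQL

# Rather than DELETEing the row, mark it removed the way Rancher itself
# would, so foreign-key references stay intact:
mysql -u cattle -p cattle <<'SQL'
UPDATE volume
   SET state = 'removed', removed = NOW()
 WHERE name = 'my-stuck-volume';
SQL
```

Marking the row removed (instead of deleting it) is the safer choice if other tables reference the volume's id.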

The volume itself never actually initialized, so there’s no data at risk here. No containers ever used the volume. It’s just lingering in the volume list and I can’t get rid of the entry.


#2

So docker volume rm volume_name did not work and still hung?

I also found a couple other volume cleanup commands:

List all “dangling” volumes:
docker volume ls -f dangling=true

Remove all “dangling” volumes:
docker volume rm $(docker volume ls -f dangling=true -q)


#3

The volume in this case is not a docker volume, it’s a Convoy-NFS volume. So docker is unaware of its existence. I can only see the volume in the Rancher API and UI.
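One way to see exactly what Rancher thinks of the volume is to query the v1 API directly rather than the UI. This is a sketch: the endpoint shape (/v1/volumes) matches Rancher 1.x, but the host, port, volume name, and API keys below are all placeholders.

```shell
# Placeholders: substitute your Rancher server URL and an API key pair.
RANCHER=http://rancher.example.com:8080
ACCESS_KEY=xxxx
SECRET_KEY=yyyy

# Look the volume up by name and pretty-print the JSON response.
curl -s -u "$ACCESS_KEY:$SECRET_KEY" \
  "$RANCHER/v1/volumes?name=my-stuck-volume" | python -m json.tool
```

The "actions" object in the returned resource shows which transitions (deactivate, remove, purge) the API will currently allow for that volume – in this case, presumably none, which matches what the UI shows.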


#4

You did say that, my mistake.

If this service is running as a container, aren’t there container logs? Perhaps there’s more info there about why this is happening.


#5

I don’t believe the problem is with the Convoy-NFS service directly. I think the volume simply got marked “active” in the database, and the only way for it to be marked “inactive” is for the scheduler to stop the last container using it. But since no containers are using the volume, it’s in an undefined area of the state diagram: a volume isn’t supposed to be both “active” and in use by zero containers. Because this volume happens to be in that state, I can’t delete it.
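Given that diagnosis, one thing worth trying before editing the database: POST the deactivate action straight to the volume’s API endpoint, bypassing the UI. This follows the Rancher 1.x convention of POSTing ?action=NAME to a resource; the server URL, keys, and volume id are placeholders, and the request may well be rejected if “deactivate” isn’t listed in the volume’s available actions.

```shell
# Placeholders: substitute your Rancher server URL, an API key pair,
# and the stuck volume's id (from /v1/volumes).
RANCHER=http://rancher.example.com:8080
ACCESS_KEY=xxxx
SECRET_KEY=yyyy
VOLUME_ID=1v42

# Ask the API to deactivate the volume directly; if that succeeds,
# a normal ?action=remove should become available afterwards.
curl -s -u "$ACCESS_KEY:$SECRET_KEY" -X POST \
  "$RANCHER/v1/volumes/$VOLUME_ID/?action=deactivate"
```

If the API refuses the transition (as the missing actions suggest it will), the database route from post #1 is the remaining option.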