Broken/frozen volume after nfs-storagepool service crash


#1

We’re still using Rancher 1.1.4, and are working on upgrading to Rancher 1.3.x (with Rancher-NFS) so this may not be a huge issue if it cannot be solved, or if it’s due to a known (and fixed) bug in 1.1.4.

We have a Convoy-NFS service, and at some point the convoy-nfs-storagepool service crashed (it showed 0 containers when looking at system stacks). We noticed the problem because a newly created volume hung in the “activating” state. After restarting the convoy-nfs-storagepool service, the volume went from “activating” to “active”, but it has no API actions associated with it – the volume cannot be deleted or purged (which is what I need to do now).

How do I go about forcibly removing the volume? I presume that if the API won’t let me deactivate or delete the volume, I’ll have to remove it directly from the MySQL database – that’s OK, I just need to know which row(s) to remove to make it go away.
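For anyone in the same spot, a minimal sketch of the database route. This assumes the “cattle” schema used by Rancher 1.x, where volumes live in a volume table with state and removed columns – the table and column names here are assumptions, so verify them against your own database before changing anything, and take a backup first. The volume name is a placeholder.

```shell
# Inspect first: confirm the stuck volume's row before touching anything.
# Table/column names (volume, state, removed) are assumed from the
# Rancher 1.x "cattle" schema -- verify against your own database.
mysql -u cattle -p cattle <<'SQL'
SELECT id, name, state, removed FROM volume WHERE name = 'my-stuck-volume';
SQL

# Rather than DELETEing the row, mark it removed the way Rancher itself
# would, so foreign-key references stay intact:
mysql -u cattle -p cattle <<'SQL'
UPDATE volume
   SET state = 'removed', removed = NOW()
 WHERE name = 'my-stuck-volume';
SQL
```

Marking the row removed (instead of deleting it) is the safer choice if other tables reference the volume's id.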

The volume itself never actually initialized, so there’s no data at risk here. No containers ever used the volume. It’s just lingering in the volume list and I can’t get rid of the entry.


#2

So docker volume rm volume_name did not work and still hung?

I also found a couple other volume cleanup commands:

List all “dangling” volumes:
docker volume ls -f dangling=true

Remove all “dangling” volumes:
docker volume rm $(docker volume ls -f dangling=true -q)


#3

The volume in this case is not a docker volume, it’s a Convoy-NFS volume. So docker is unaware of its existence. I can only see the volume in the Rancher API and UI.
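One way to see exactly what Rancher thinks of the volume is to query the v1 API directly rather than the UI. This is a sketch: the endpoint shape (/v1/volumes) matches Rancher 1.x, but the host, port, volume name, and API keys below are all placeholders.

```shell
# Placeholders: substitute your Rancher server URL and an API key pair.
RANCHER=http://rancher.example.com:8080
ACCESS_KEY=xxxx
SECRET_KEY=yyyy

# Look the volume up by name and pretty-print the JSON response.
curl -s -u "$ACCESS_KEY:$SECRET_KEY" \
  "$RANCHER/v1/volumes?name=my-stuck-volume" | python -m json.tool
```

The "actions" object in the returned resource shows which transitions (deactivate, remove, purge) the API will currently allow for that volume – in this case, presumably none, which matches what the UI shows.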


#4

You did say that, my mistake.

If this service is running as a container, aren’t there container logs? Perhaps there’s more info there about why this is happening.


#5

I don’t believe the problem is with the Convoy-NFS service directly. I think the volume simply got marked “active” in the database, and the only way for it to be marked “inactive” is for the scheduler to stop the last container using it. But since no containers are using the volume, it’s in an undefined area of the state diagram: a volume isn’t supposed to be both “active” and in use by zero containers. Because this volume happens to be in that state, I can’t delete it.
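Given that diagnosis, one thing worth trying before editing the database: POST the deactivate action straight to the volume’s API endpoint, bypassing the UI. This follows the Rancher 1.x convention of POSTing ?action=NAME to a resource; the server URL, keys, and volume id are placeholders, and the request may well be rejected if “deactivate” isn’t listed in the volume’s available actions.

```shell
# Placeholders: substitute your Rancher server URL, an API key pair,
# and the stuck volume's id (from /v1/volumes).
RANCHER=http://rancher.example.com:8080
ACCESS_KEY=xxxx
SECRET_KEY=yyyy
VOLUME_ID=1v42

# Ask the API to deactivate the volume directly; if that succeeds,
# a normal ?action=remove should become available afterwards.
curl -s -u "$ACCESS_KEY:$SECRET_KEY" -X POST \
  "$RANCHER/v1/volumes/$VOLUME_ID/?action=deactivate"
```

If the API refuses the transition (as the missing actions suggest it will), the database route from post #1 is the remaining option.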