Issues recovering Convoy NFS volumes after unclean host shutdowns

This is the second time, in two independent Rancher environments, that I’ve had major issues when a host shuts down uncleanly – leaving me unable to start any new instances of the containers. The logs don’t provide much useful information, but it seems I’m not alone in this (links shared as plain text because apparently I can’t put links in posts as a new user):

http://forums.rancher.com/t/containers-stuck-at-scheduling/2441
http://forums.rancher.com/t/convoy-nfs-volumes-stuck-in-deactivating-stage/2547

What happens is that any new containers Rancher tries to create that share the same NFS volume (set up by the convoy-nfs service) get stuck in a Scheduling state. I noticed that if I upgrade the stack to use no volumes from convoy-nfs, everything starts fine, so I finally went and looked at the storage and found that the volume itself is stuck in a Deactivating state.
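If you want to poke at the storage from the API rather than the UI, here’s a minimal sketch in Python using requests – the server URL, API key pair, and project ID below are placeholders you’d substitute for your own:

import requests

# Placeholder values; substitute your own Rancher server URL,
# environment API key pair, and project ID.
RANCHER_URL = "http://rancher.example.com:8080"
PROJECT_ID = "1a144"
AUTH = ("ACCESS_KEY", "SECRET_KEY")

# List every volume in the project with its current state.
resp = requests.get(
    "%s/v1/projects/%s/volumes" % (RANCHER_URL, PROJECT_ID),
    auth=AUTH,
)
resp.raise_for_status()
for vol in resp.json()["data"]:
    print(vol["name"], vol["state"], vol.get("transitioning"))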

This is the view from the API:

{
    "id": "1v450",
    "type": "volume",
    "links": {
        "self": "…/v1/projects/1a144/volumes/1v450",
        "account": "…/v1/projects/1a144/volumes/1v450/account",
        "backups": "…/v1/projects/1a144/volumes/1v450/backups",
        "mounts": "…/v1/projects/1a144/volumes/1v450/mounts",
        "snapshots": "…/v1/projects/1a144/volumes/1v450/snapshots",
        "storagePools": "…/v1/projects/1a144/volumes/1v450/storagepools"
    },
    "actions": { },
    "name": "consul-server",
    "state": "deactivating",
    "accessMode": null,
    "accountId": "1a144",
    "created": "2016-09-21T00:51:18Z",
    "createdTS": 1474419078000,
    "description": "Consul Server storage",
    "driver": "convoy-nfs",
    "driverOpts": { },
    "externalId": "consul-server",
    "imageId": null,
    "instanceId": null,
    "isHostPath": false,
    "kind": "volume",
    "removed": null,
    "transitioning": "yes",
    "transitioningMessage": "In Progress",
    "transitioningProgress": null,
    "uri": "convoy-nfs:///consul-server",
    "uuid": "5669dcbc-bee1-4f79-a2e4-33bce197c894"
}

Sure enough, stuck in a transitioning state.
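(Easy enough to confirm with a quick poll – same placeholder setup as the sketch above; in my case "transitioning" stays "yes" for as long as I care to watch:)

import time
import requests

RANCHER_URL = "http://rancher.example.com:8080"
AUTH = ("ACCESS_KEY", "SECRET_KEY")
# URL of the stuck volume, as reported in its "self" link above.
VOLUME_URL = RANCHER_URL + "/v1/projects/1a144/volumes/1v450"

# Poll the volume; "transitioning" never leaves "yes" here.
for _ in range(30):
    vol = requests.get(VOLUME_URL, auth=AUTH).json()
    print(vol["state"], vol["transitioning"], vol["transitioningMessage"])
    if vol["transitioning"] != "yes":
        break
    time.sleep(10)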

The only way I’ve found to resolve this is to create a new volume, SSH into the NFS server, copy my data over, then upgrade the container to use the newly created volume. And even then the “transitioning” volume lingers. Blah, that’s not a very desirable state.
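For what it’s worth, the volume-creation half of that workaround can be scripted. A sketch with the same placeholder values – whether creating volumes by POSTing to /v1/projects/<id>/volumes works may depend on your Rancher version, the replacement name here is made up, and the data copy over SSH is still manual:

import requests

RANCHER_URL = "http://rancher.example.com:8080"
PROJECT_ID = "1a144"
AUTH = ("ACCESS_KEY", "SECRET_KEY")

# Create a replacement convoy-nfs volume; "consul-server-2" is just an
# example name. Copying the data from the old NFS directory to the new
# one still has to happen by hand on the NFS server.
resp = requests.post(
    "%s/v1/projects/%s/volumes" % (RANCHER_URL, PROJECT_ID),
    auth=AUTH,
    json={"name": "consul-server-2", "driver": "convoy-nfs"},
)
resp.raise_for_status()
new_vol = resp.json()
print(new_vol["id"], new_vol["state"])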

Anyone have ideas on how to kick volumes out of a transitioning state and stop them from getting stuck when a host shuts down uncleanly?

Cheers!