Issues recovering Convoy NFS volumes after unclean host shutdowns

This is the second time, in two independent Rancher environments, that I’ve had major issues when a host shuts down uncleanly – leaving me unable to start any new instances of the containers. The logs don’t provide much useful information, but it seems I’m not alone in this (links shared as plain text because apparently I can’t put links in posts as a new user):

http://forums.rancher.com/t/containers-stuck-at-scheduling/2441
http://forums.rancher.com/t/convoy-nfs-volumes-stuck-in-deactivating-stage/2547

What happens is that any new containers Rancher tries to create that share the same NFS volume (set up by the convoy-nfs service) get stuck in a Scheduling state. I noticed that if I upgrade the stack to use no volumes from convoy-nfs, everything starts fine, so I finally went and looked at the storage and found that the volume itself is stuck in a Deactivating state.
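If you want to poke at the storage from the API rather than the UI, here’s a minimal sketch in Python using requests – the server URL, API key pair, and project ID below are placeholders you’d substitute for your own:

import requests

# Placeholder values; substitute your own Rancher server URL,
# environment API key pair, and project ID.
RANCHER_URL = "http://rancher.example.com:8080"
PROJECT_ID = "1a144"
AUTH = ("ACCESS_KEY", "SECRET_KEY")

# List every volume in the project with its current state.
resp = requests.get(
    "%s/v1/projects/%s/volumes" % (RANCHER_URL, PROJECT_ID),
    auth=AUTH,
)
resp.raise_for_status()
for vol in resp.json()["data"]:
    print(vol["name"], vol["state"], vol.get("transitioning"))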

This is the view from the API:

{
    "id": "1v450",
    "type": "volume",
    "links": {
        "self": "…/v1/projects/1a144/volumes/1v450",
        "account": "…/v1/projects/1a144/volumes/1v450/account",
        "backups": "…/v1/projects/1a144/volumes/1v450/backups",
        "mounts": "…/v1/projects/1a144/volumes/1v450/mounts",
        "snapshots": "…/v1/projects/1a144/volumes/1v450/snapshots",
        "storagePools": "…/v1/projects/1a144/volumes/1v450/storagepools"
    },
    "actions": { },
    "name": "consul-server",
    "state": "deactivating",
    "accessMode": null,
    "accountId": "1a144",
    "created": "2016-09-21T00:51:18Z",
    "createdTS": 1474419078000,
    "description": "Consul Server storage",
    "driver": "convoy-nfs",
    "driverOpts": { },
    "externalId": "consul-server",
    "imageId": null,
    "instanceId": null,
    "isHostPath": false,
    "kind": "volume",
    "removed": null,
    "transitioning": "yes",
    "transitioningMessage": "In Progress",
    "transitioningProgress": null,
    "uri": "convoy-nfs:///consul-server",
    "uuid": "5669dcbc-bee1-4f79-a2e4-33bce197c894"
}

Sure enough, stuck in a transitioning state.
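(Easy enough to confirm with a quick poll – same placeholder setup as the sketch above; in my case "transitioning" stays "yes" for as long as I care to watch:)

import time
import requests

RANCHER_URL = "http://rancher.example.com:8080"
AUTH = ("ACCESS_KEY", "SECRET_KEY")
# URL of the stuck volume, as reported in its "self" link above.
VOLUME_URL = RANCHER_URL + "/v1/projects/1a144/volumes/1v450"

# Poll the volume; "transitioning" never leaves "yes" here.
for _ in range(30):
    vol = requests.get(VOLUME_URL, auth=AUTH).json()
    print(vol["state"], vol["transitioning"], vol["transitioningMessage"])
    if vol["transitioning"] != "yes":
        break
    time.sleep(10)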

The only way I’ve found to resolve this is to create a new volume, SSH into the NFS server, copy my data over, then upgrade the container to use the newly created volume. And even then the “transitioning” volume lingers. Blah, that’s not a very desirable state.
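For what it’s worth, the volume-creation half of that workaround can be scripted. A sketch with the same placeholder values – whether creating volumes by POSTing to /v1/projects/<id>/volumes works may depend on your Rancher version, the replacement name here is made up, and the data copy over SSH is still manual:

import requests

RANCHER_URL = "http://rancher.example.com:8080"
PROJECT_ID = "1a144"
AUTH = ("ACCESS_KEY", "SECRET_KEY")

# Create a replacement convoy-nfs volume; "consul-server-2" is just an
# example name. Copying the data from the old NFS directory to the new
# one still has to happen by hand on the NFS server.
resp = requests.post(
    "%s/v1/projects/%s/volumes" % (RANCHER_URL, PROJECT_ID),
    auth=AUTH,
    json={"name": "consul-server-2", "driver": "convoy-nfs"},
)
resp.raise_for_status()
new_vol = resp.json()
print(new_vol["id"], new_vol["state"])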

Anyone have ideas on how to kick volumes out of a transitioning state and stop them from getting stuck when a host shuts down uncleanly?

Cheers!