No way to recover from EBS-related errors

We have had some problems with containers using EBS-backed volumes. It seems to me that there’s no way to force Rancher to “stop trying” after a volume mount error or similar. Scaling the service down to 0 has no effect: everything stays stuck in the “starting” state, even if I go to the AWS console and force-detach the volume.

For EBS-backed volumes to be production-ready, we need more break-glass options in Rancher, such as:

  • Force stop container
  • Force check if the volume is actually mounted
  • Force detach volume

We’ve had multiple situations where the only thing we could do was bring down the entire Rancher environment, detach all volumes in AWS and then start it up again. That is “not fun” in dev/test, and completely unacceptable in prod.
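
For reference, the AWS half of that workaround can be scripted instead of clicked through in the console. This is only a rough sketch, assuming a boto3 EC2 client (`boto3.client("ec2")`) and a hand-collected list of volume IDs. The function name `force_detach` is made up for illustration, and it does nothing on the Rancher side. Note that a forced detach can lose data if anything is still writing to the volume.

```python
from typing import Dict, Iterable


def force_detach(ec2_client, volume_ids: Iterable[str]) -> Dict[str, str]:
    """Request a forced detach for each volume and report the outcome per volume.

    `ec2_client` is assumed to be a boto3 EC2 client. Rancher is not informed
    about the detach, so its own view of the volume may remain stale.
    """
    outcomes: Dict[str, str] = {}
    for volume_id in volume_ids:
        try:
            # Force=True asks AWS to detach even if the instance does not cooperate.
            ec2_client.detach_volume(VolumeId=volume_id, Force=True)
            outcomes[volume_id] = "detach requested"
        except Exception as error:  # with boto3 this is typically a botocore ClientError
            outcomes[volume_id] = f"failed: {error}"
    return outcomes


# Example (requires AWS credentials; the volume ID is a placeholder):
# import boto3
# print(force_detach(boto3.client("ec2"), ["vol-0123456789abcdef0"]))
```

The call only requests the detach, so the volume may remain in the `detaching` state for a while afterwards.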

Just to add to this: there can also be a mismatch between the volumes Rancher thinks are mapped and the volumes that actually are mapped. In these cases, there’s no way to “force disconnect” a volume or “force check” its mapping status. It all seems extremely brittle to me.
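
Until something like a “force check” exists, the AWS side of the mapping can at least be checked by hand. Below is a minimal sketch that compares what Rancher is believed to have mapped against the response of the EC2 `describe_volumes` call. The `expected` dictionary is an assumption: it has to be filled in manually from what the Rancher UI shows, since I don’t know of a supported way to pull it out of Rancher. The function names are made up for illustration.

```python
from typing import Dict, List, Optional


def actual_attachments(describe_volumes_response: dict) -> Dict[str, Optional[str]]:
    """Map each volume ID to the instance it is attached to, or None if detached."""
    result: Dict[str, Optional[str]] = {}
    for volume in describe_volumes_response.get("Volumes", []):
        attached_to = None
        for attachment in volume.get("Attachments", []):
            # Any state other than "detached" still ties the volume to an instance.
            if attachment.get("State") != "detached":
                attached_to = attachment.get("InstanceId")
        result[volume["VolumeId"]] = attached_to
    return result


def find_mismatches(expected: Dict[str, Optional[str]],
                    actual: Dict[str, Optional[str]]) -> List[str]:
    """Return the volume IDs where the expected and actual instance differ."""
    return sorted(
        volume_id
        for volume_id, instance_id in expected.items()
        if actual.get(volume_id) != instance_id
    )


# Example (requires boto3 and AWS credentials; the IDs are placeholders):
# import boto3
# expected = {"vol-0123456789abcdef0": None}  # Rancher shows it as detached
# response = boto3.client("ec2").describe_volumes(VolumeIds=list(expected))
# print(find_mismatches(expected, actual_attachments(response)))
```

This only detects the mismatch. It does not fix Rancher’s view of the volume, which is the part that is actually missing.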

I’m also seeing situations where Rancher “knows” the volume is detached, but it still doesn’t seem to attempt to mount it before starting the container that depends on it.