User-docker hangs indefinitely

What I had

I am running RancherOS 1.1.0 on AWS EC2. On it run an nginx reverse proxy, Rancher 1.6.x on port 8080, and a static website on port 80.
Everything was working fine.

What I did

From the Rancher UI I installed GitLab CE (following this Rancher article), but (stupid, stupid) I forgot to modify the install to run behind the nginx proxy and used port 80 instead. So, predictably, the Gitlab + Postman containers were Unhealthy.

Then the Rancher UI became unresponsive (probably because nginx was no longer working properly, although I could still reach the website on port 80). So I SSH’ed into the AWS instance and ran docker ps, which was slow but worked and showed the unhealthy gitlab + postman containers. I tried to stop them, but this yielded:

Error response from daemon: Cannot stop container r-gitlab-gitlab-1-3e6e511a: Cannot kill container <container-id>: rpc error: code = 14 desc = grpc: the connection is unavailable

So I tried sudo system-docker restart docker, which seemed to work, but docker didn’t come up properly:

Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

Finally I restarted the EC2 instance entirely, but since then docker (version docker-17.03.2-ce) has been completely unresponsive and no container is reachable (the website is down, and so is the Rancher UI).
sudo system-docker logs docker says:

time="2017-12-08T07:44:20Z" level=info msg="Starting Docker in context: console" 
time="2017-12-08T07:44:20Z" level=error msg="non-200 http response: 404" 
time="2017-12-08T07:44:20Z" level=error msg="Failed to load service(xenhvm-vm-tools): non-200 http response: 404" 
time="2017-12-08T07:44:20Z" level=info msg="Getting PID for service: console" 
time="2017-12-08T07:44:21Z" level=info msg="console PID 1037" 
time="2017-12-08T07:44:21Z" level=info msg="[PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin HOSTNAME=<internal-ec2-hostname> HOME=/]" 
time="2017-12-08T07:44:21Z" level=info msg="Running [docker-runc exec -- <system-docker-docker-container-id> env PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin HOSTNAME=<internal-ec2-hostname> HOME=/ ros docker-init daemon --storage-driver overlay --host unix:///var/run/docker.sock --log-opt max-file=2 --log-opt max-size=25m --group docker]" 
time="2017-12-08T07:44:21Z" level=info msg="Found /usr/bin/dockerd" 
time="2018-02-20T10:36:24Z" level=info msg="Starting Docker in context: console" 
time="2018-02-20T10:36:24Z" level=error msg="non-200 http response: 404" 
time="2018-02-20T10:36:24Z" level=error msg="Failed to load service(xenhvm-vm-tools): non-200 http response: 404" 
time="2018-02-20T10:36:24Z" level=info msg="Getting PID for service: console" 
time="2018-02-20T10:36:24Z" level=info msg="console PID 900" 
time="2018-02-20T10:36:24Z" level=info msg="[PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin HOSTNAME=<internal-ec2-hostname> HOME=/]" 
time="2018-02-20T10:36:24Z" level=info msg="Running [docker-runc exec -- <system-docker-docker-container-id> env PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin HOSTNAME=ip-10-0-0-24.eu-central-1.compute.internal HOME=/ ros docker-init daemon --host unix:///var/run/docker.sock --log-opt max-file=2 --log-opt max-size=25m --storage-driver overlay --group docker]" 
time="2018-02-20T10:36:24Z" level=info msg="Found /usr/bin/dockerd" 

This looks like the issue described at https://github.com/rancher/os/issues/2244
The solution there was to clear the cache at /var/lib/rancher/cache, but my cache directory is already empty.
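
To make the error lines in that log output easier to spot, a small filter can help. This is only a sketch: `only_errors` is a hypothetical helper name, and it assumes the `level=...` format shown in the log lines above.

```shell
#!/bin/sh
# Hypothetical helper: keep only the error-level lines from log output
# read on stdin. Assumes each line carries a `level=...` field as in the
# log excerpt above.
only_errors() {
    # `|| true` keeps the function from failing when no line matches.
    grep 'level=error' || true
}

# Intended use on the host (not executed here):
#   sudo system-docker logs docker 2>&1 | only_errors
```

On the log above this would leave just the two `non-200 http response: 404` lines per boot, which makes it easier to compare the two startups.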

Where I ended up

Any docker command I give through SSH hangs indefinitely. system-docker is working fine.
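
Since a hanging command ties up the SSH session, one way to probe the daemon without getting stuck is to wrap the call in a timeout. This is a rough sketch: `probe` is a hypothetical helper name, and it assumes the GNU coreutils `timeout` command (some BusyBox builds use `timeout -t SECS` instead, so this may need adjusting on the default RancherOS console).

```shell
#!/bin/sh
# Hypothetical helper: run a command, but give up after SECS seconds.
# Usage: probe SECS COMMAND [ARGS...]
# Assumes GNU coreutils `timeout`, which exits with 124 when the time
# limit is reached.
probe() {
    secs=$1
    shift
    rc=0
    timeout "$secs" "$@" >/dev/null 2>&1 || rc=$?
    if [ "$rc" -eq 0 ]; then
        echo "ok"
    elif [ "$rc" -eq 124 ]; then
        echo "hung (no answer within ${secs}s)"
    else
        echo "failed (exit $rc)"
    fi
}

# Intended use on the host (not executed here):
#   probe 10 docker ps
#   probe 10 sudo system-docker ps
```

In the situation described here, the first call would be expected to report a hang and the second to report ok.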

Desired solution

My site is now down and Rancher is unreachable. How can I return to a working state, preferably with the gitlab and postman containers stopped or removed, but without losing the other containers?

I ended up recreating and reinstalling the AWS instance :(

well… this doesn’t help me