What I had
I am running RancherOS 1.1.0 on AWS EC2, with an nginx reverse proxy running on it, Rancher 1.6.x on port 8080, and a static website on port 80. Everything was working fine.
What I did
From the Rancher UI I installed Gitlab-CE (following this Rancher article), but (stupid, stupid) I forgot to modify the install to run behind the nginx proxy and instead used port 80. So, predictably, the Gitlab + Postman containers were Unhealthy…
Then the Rancher UI became unresponsive (probably nginx was no longer working properly, though I could still reach the website on port 80). So I SSH’ed into the AWS instance and ran docker ps, which, while slow, worked and showed the unhealthy gitlab + postman containers. I tried to stop them, but this yielded:
Error response from daemon: Cannot stop container r-gitlab-gitlab-1-3e6e511a: Cannot kill container <container-id>: rpc error: code = 14 desc = grpc: the connection is unavailable
So I tried sudo system-docker restart docker, which seemed to work, but docker didn’t come up properly:
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
Finally I decided to restart the EC2 instance entirely, but since then docker (version docker-17.03.2-ce) is completely unresponsive and no container is reachable (the website is down, and so is the Rancher UI).
sudo system-docker logs docker says:
time="2017-12-08T07:44:20Z" level=info msg="Starting Docker in context: console"
time="2017-12-08T07:44:20Z" level=error msg="non-200 http response: 404"
time="2017-12-08T07:44:20Z" level=error msg="Failed to load service(xenhvm-vm-tools): non-200 http response: 404"
time="2017-12-08T07:44:20Z" level=info msg="Getting PID for service: console"
time="2017-12-08T07:44:21Z" level=info msg="console PID 1037"
time="2017-12-08T07:44:21Z" level=info msg="[PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin HOSTNAME=<internal-ec2-hostname> HOME=/]"
time="2017-12-08T07:44:21Z" level=info msg="Running [docker-runc exec -- <system-docker-docker-container-id> env PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin HOSTNAME=<internal-ec2-hostname> HOME=/ ros docker-init daemon --storage-driver overlay --host unix:///var/run/docker.sock --log-opt max-file=2 --log-opt max-size=25m --group docker]"
time="2017-12-08T07:44:21Z" level=info msg="Found /usr/bin/dockerd"
time="2018-02-20T10:36:24Z" level=info msg="Starting Docker in context: console"
time="2018-02-20T10:36:24Z" level=error msg="non-200 http response: 404"
time="2018-02-20T10:36:24Z" level=error msg="Failed to load service(xenhvm-vm-tools): non-200 http response: 404"
time="2018-02-20T10:36:24Z" level=info msg="Getting PID for service: console"
time="2018-02-20T10:36:24Z" level=info msg="console PID 900"
time="2018-02-20T10:36:24Z" level=info msg="[PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin HOSTNAME=<internal-ec2-hostname> HOME=/]"
time="2018-02-20T10:36:24Z" level=info msg="Running [docker-runc exec -- <system-docker-docker-container-id> env PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin HOSTNAME=ip-10-0-0-24.eu-central-1.compute.internal HOME=/ ros docker-init daemon --host unix:///var/run/docker.sock --log-opt max-file=2 --log-opt max-size=25m --storage-driver overlay --group docker]"
time="2018-02-20T10:36:24Z" level=info msg="Found /usr/bin/dockerd"
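To make the two error lines easier to spot among the info lines in the log above, a small filter like the following can be used. This is only a minimal sketch: errors_only is an illustrative name I made up, and it assumes the level=error field appears exactly as it does in the output above.

```shell
# Hypothetical helper: keep only the lines marked level=error.
# Assumes the logfmt-style "level=..." field shown in the log above.
errors_only() {
  grep 'level=error'
}
```

It would be used as sudo system-docker logs docker 2>&1 | errors_only, which for the log above should leave only the "non-200 http response: 404" and "Failed to load service(xenhvm-vm-tools)" lines.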
This error looks like: https://github.com/rancher/os/issues/2244
There the solution was to clear the cache at /var/lib/rancher/cache, but my cache directory is already empty…
Where I ended up
Any docker command I give through SSH hangs indefinitely. system-docker is working fine.
Desired solution
My site is now down and Rancher is unreachable. How can I return to a working state, preferably with the gitlab and postman containers stopped or removed, but without losing the other containers?