Hi,
I’m running RancherOS bare-metal on an Intel i5 NUC and have come up against a strange issue.
Running RancherOS 0.4.3.
Things were working fine until I shut the server down to move it from one location in my house to another. It had been up for about 8-9 days to this point. The box had been shut down and rebooted many times previously without issue.
Upon powering the system back up I could no longer SSH into the box or connect to the Rancher management web UI. So I took the box back to my keyboard/monitor and found that system-docker no longer appears to start.
I’m met with continuous ‘Waiting for docker at unix:///var/run/system-docker.sock’ messages, until eventually they time out.
The following is the only thing that jumps out at me in the boot messages:
… Mounting state device /dev/sda1 to /state
… EXT4-fs (sda1): recovery complete
… EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
… Launching System Docker
Not sure where to go from here. I have no shell and there aren’t any specific error messages related to docker appearing. Nothing on the keyboard seems to work; for example, I cannot CTRL-ALT-DEL to reset the box. I also don’t think it’s getting on the network, as it’s not getting to the point where it can get a DHCP address.
Any help is greatly appreciated.
Matt
Hey all,
For the benefit of anyone else that comes across a similar issue, I did manage to solve my problem.
My server had run into this known Docker issue #17083
(the forum isn’t letting me post this as a link, it’s complaining I can only have two links because I’m a new user).
This seems to be fixed in an existing PR (forum is telling me I have more than two links in my post for some odd reason so I can’t link it here but it’s referenced in the linked discussion).
Basically, Docker shutdown may have been incomplete - perhaps I powered off the box too early or something went wrong (my NUC tends to hang sometimes when doing sudo shutdown -h now from a shell, so it’s likely I powered off at just the wrong time). If network references are left hanging in either of the Docker /var/lib/…/network folders (/var/lib/docker/network or /var/lib/system-docker/network), the related docker daemon will fail to start. In my case that was system-docker.
To anyone else having the same issue:
- Boot into a live Linux environment (I used a GParted USB stick), mount your STATE device somewhere and check the logs.
- If you see the message ‘Could not delete local endpoint…’, you may have the same issue (see Docker issue #17083 mentioned above).
- Remove the folders /var/lib/system-docker/network and /var/lib/docker/network from your STATE device. I also removed /var/lib/docker/tmp and /var/lib/system-docker/tmp, though this may not be necessary.
- Shut down and restart; the server should come back up.
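For anyone who would rather script the cleanup from the live environment, here is a rough sketch in Python of the log check and folder removal described in the steps above. It is not something I ran on my own box. The function names are made up for illustration, the log file path is left for you to supply, and it assumes the folders sit at var/lib/... directly under wherever you mounted the STATE device.

```python
import shutil
from pathlib import Path

# Folders named in the steps above, relative to the mounted STATE device.
# Assumption: they live at var/lib/... directly under the mount point.
STALE_DIRS = [
    "var/lib/system-docker/network",
    "var/lib/docker/network",
]

# Start of the log message that points at this particular problem.
ERROR_MARKER = "Could not delete local endpoint"


def log_has_endpoint_error(log_file):
    """Return True if the given log file contains the marker message."""
    text = Path(log_file).read_text(errors="replace")
    return ERROR_MARKER in text


def remove_stale_network_dirs(mount_point, dry_run=True):
    """Remove the network folders under mount_point.

    Returns the folders that were found. With dry_run=True (the default)
    nothing is deleted, so you can check the list first.
    """
    found = []
    for rel in STALE_DIRS:
        target = Path(mount_point) / rel
        if target.is_dir():
            found.append(target)
            if not dry_run:
                shutil.rmtree(target)
    return found
```

Calling remove_stale_network_dirs("/mnt/state") only lists what it would delete (the mount point here is just an example). Pass dry_run=False once you are happy with the list. The tmp folders are left alone on purpose, since I'm not sure removing them is needed.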
Hope this helps.
Matt