RancherOS bare metal no longer starts [SOLVED]

Hi,

I’m running RancherOS bare-metal on an Intel i5 NUC and have come up against a strange issue.

Running RancherOS 0.4.3.

Things were working fine until I shut the server down to move it from one location in my house to another. It had been up for about 8-9 days to this point. The box had been shut down and rebooted many times previously without issue.

Upon powering the system back up I could no longer SSH into the box or connect to the Rancher management web UI. So I took the box back to my keyboard/monitor and found that system-docker no longer appears to start.

I’m met with continuous ‘Waiting for docker at unix:///var/run/system-docker.sock’ messages, until eventually they time out.

The following is the only thing that jumps out at me in the boot messages:

… Mounting state device /dev/sda1 to /state
… EXT4-fs (sda1): recovery complete
… EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
… Launching System Docker

Not sure where to go from here. I have no shell and there aren’t any specific error messages related to docker appearing. Nothing on the keyboard seems to work; for example, I cannot CTRL-ALT-DEL to reset the box. I also don’t think it’s getting on the network, since it’s not getting to the point where it can get a DHCP address.

Any help is greatly appreciated.

Matt

Hey all,

For the benefit of anyone else that comes across a similar issue, I did manage to solve my problem.

My server had run into known Docker issue #17083.

(The forum isn’t letting me post this as a link; it’s complaining I can only have two links because I’m a new user.)

This seems to be fixed in an existing PR. (The forum is telling me I have more than two links in my post for some odd reason, so I can’t link it here, but it’s referenced in the discussion on that issue.)

Basically, the Docker shutdown may have been incomplete. Perhaps I powered off the box too early or something went wrong (my NUC tends to hang sometimes when doing sudo shutdown -h now from a shell, so it’s likely I powered off at just the wrong time). If network references are left hanging in either of the Docker /var/lib/…/network folders (/var/lib/docker/network or /var/lib/system-docker/network), the corresponding Docker daemon will fail to start. In my case that was system-docker.

To anyone else having the same issue:

  • Boot into a live Linux environment (I used a GParted USB stick), mount your STATE device somewhere and check the logs.
  • If you see the message ‘Could not delete local endpoint…’, you may have the same issue (see the Docker issue referenced above).
  • Remove the folders /var/lib/system-docker/network and /var/lib/docker/network from your STATE device. I also removed /var/lib/docker/tmp and /var/lib/system-docker/tmp, though this may not be necessary.
  • Shut down and restart. The server should come back up.
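
For anyone who prefers to script the cleanup step, here is a rough sketch of it as a shell function. It assumes the STATE device is already mounted from the live environment; the function name, the --with-tmp option and the /mnt/state mount point are made up for illustration, and it assumes the folders sit under var/lib relative to wherever the STATE device is mounted. It does not check the logs for you, so do that first.

```shell
#!/bin/sh
# Sketch only: remove stale Docker network state from a mounted STATE device.
# The function name and the --with-tmp option are hypothetical, not part of
# RancherOS or Docker.

remove_stale_network_state() {
    state_root="$1"   # where the STATE device is mounted (assumption)
    with_tmp="$2"     # pass --with-tmp to also remove the tmp folders

    # Refuse to run without a plausible mount root, so a typo cannot
    # turn this into a cleanup of the live system's own /var/lib.
    if [ -z "$state_root" ] || [ ! -d "$state_root/var/lib" ]; then
        echo "usage: remove_stale_network_state <STATE mount point> [--with-tmp]" >&2
        return 1
    fi

    subdirs="network"
    if [ "$with_tmp" = "--with-tmp" ]; then
        subdirs="network tmp"
    fi

    for daemon in docker system-docker; do
        for sub in $subdirs; do
            target="$state_root/var/lib/$daemon/$sub"
            if [ -d "$target" ]; then
                echo "removing $target"
                rm -rf -- "$target"
            fi
        done
    done
}
```

On my box the STATE device was /dev/sda1, so after something like mount /dev/sda1 /mnt/state as root, the call would be remove_stale_network_state /mnt/state --with-tmp. Leave off --with-tmp to remove only the network folders, since removing tmp may not be necessary.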

Hope this helps.

Matt
