User-docker keeps restarting

I seem to be having the same problem as aschrijver did in this thread, difference being: I cannot easily just “nuke it and start over” as the server is deployed in a remote location (I had selected RancherOS purposefully as it looked like among the least likely to “randomly break”…apparently this was not a good assumption)

a very quick summary of the problem. A few minutes after a reboot, the following commands (pay special attention to the “docker” image, which has been restarting over and over for the last 10 minutes, with completely uninformative logfiles):

[rancher@rancher ~]$ docker container ls -a
Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Get http://%2Fvar%2Frun%2Fdocker.sock/v1.38/containers/json?all=1: dial unix /var/run/docker.sock: connect: permission denied
[rancher@rancher ~]$ sudo docker container ls -a
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
[rancher@rancher ~]$ sudo system-docker container ls -a
CONTAINER ID        IMAGE                              COMMAND                  CREATED             STATUS                          PORTS               NAMES
f5b1b2a44316        rancher/os-console:v1.5.3          "/usr/bin/ros entr..."   10 minutes ago      Up 10 minutes                                       console
c531010aeef9        rancher/os-base:v1.5.3             "/usr/bin/ros entr..."   10 minutes ago      Created                                             all-volumes
a58cd29a685e        rancher/os-console:v1.5.3          "/usr/bin/ros entr..."   10 minutes ago      Exited (255) 10 minutes ago                         reboot
683c7cd8bbcf        rancher/os-docker:18.06.3          "ros user-docker"        20 minutes ago      Restarting (1) 26 seconds ago                       docker
b0c20bc40821        rancher/os:v1.5.3                  "/bin/ros install ..."   21 minutes ago      Exited (255) 21 minutes ago                         once_os-upgrade_1
99471dfe162e        rancher/os-acpid:v1.5.3            "/usr/bin/ros entr..."   3 weeks ago         Exited (255) 3 weeks ago                            poweroff
e2c152223480        rancher/os-base:v1.5.3             "/usr/bin/ros entr..."   4 weeks ago         Exited (0) 9 minutes ago                            preload-user-images
3553495a024a        rancher/os-base:v1.5.3             "/usr/bin/ros entr..."   4 weeks ago         Exited (1) 10 minutes ago                           subscriber
9a302ec5e077        rancher/os-base:v1.5.3             "/usr/bin/ros entr..."   4 weeks ago         Exited (0) 10 minutes ago                           cloud-init-execute
a18f9bf8b9da        rancher/os-base:v1.5.3             "/usr/bin/ros entr..."   4 weeks ago         Up 10 minutes                                       ntp
9657d19916b2        rancher/os-base:v1.5.3             "/usr/bin/ros entr..."   4 weeks ago         Up 10 minutes                                       network
1d3e662bcea8        rancher/os-base:v1.5.3             "/usr/bin/ros entr..."   4 weeks ago         Up 10 minutes                                       udev
1fb30afb259d        rancher/container-crontab:v0.4.0   "container-crontab"      4 weeks ago         Up 10 minutes                                       system-cron
e4a7cd582059        rancher/os-logrotate:v1.5.3        "/usr/bin/entrypoi..."   4 weeks ago         Exited (0) About an hour ago                        logrotate
2c9cc048d474        rancher/os-syslog:v1.5.3           "/usr/bin/entrypoi..."   4 weeks ago         Up 9 minutes                                        syslog
6197f00d9655        rancher/os-base:v1.5.3             "/usr/bin/ros entr..."   4 weeks ago         Exited (0) 10 minutes ago                           udev-cold
1398048da474        rancher/os-acpid:v1.5.3            "/usr/bin/ros entr..."   4 weeks ago         Up 10 minutes                                       acpid
fb53ba4934f4        rancher/os-base:v1.5.3             "/usr/bin/ros entr..."   4 weeks ago         Created                                             user-volumes
9ff391db54d5        rancher/os-base:v1.5.3             "/usr/bin/ros entr..."   4 weeks ago         Created                                             system-volumes
00c89d888a7e        rancher/os-base:v1.5.3             "/usr/bin/ros entr..."   4 weeks ago         Created                                             media-volumes
66150408f9a3        rancher/os-base:v1.5.3             "/usr/bin/ros entr..."   4 weeks ago         Created                                             container-data-volumes
ce43b7535809        rancher/os-base:v1.5.3             "/usr/bin/ros entr..."   4 weeks ago         Created                                             command-volumes
[rancher@rancher ~]$

I tried using ros engine to toggle Docker to docker-18.09.8 and back to docker-18.06.3-ce, but that didn’t fix it.

How can I debug the user-docker service and try to, if nothing else, reset it to factory settings? (sudo rm -rfv /var/lib/rancher/cache deleted some files, but did not solve the problem.)

for completeness’ sake, I’ll include this:

[rancher@rancher ~]$ sudo system-docker logs docker
time="2019-08-12T17:30:35Z" level=info msg="Starting Docker in context: console"
time="2019-08-12T17:30:35Z" level=info msg="Getting PID for service: console"
time="2019-08-12T17:30:36Z" level=info msg="console PID 949"
time="2019-08-12T17:30:36Z" level=info msg="[PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin HOSTNAME=rancher HOME=/]"
time="2019-08-12T17:30:36Z" level=info msg="Running [system-docker-runc exec -- fbfb897138e67697ee33727d190c34455aed71fb5517ac8dcf3d695d4f0cd68f env PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin HOSTNAME=rancher HOME=/ ros docker-init --log-opt max-size=25m --log-opt max-file=2 --group docker --host unix:///var/run/docker.sock]"
time="2019-08-12T17:30:36Z" level=info msg="Found /usr/bin/dockerd"
time="2019-08-12T17:30:48Z" level=info msg="Starting Docker in context: console"
time="2019-08-12T17:30:48Z" level=info msg="Getting PID for service: console"
time="2019-08-12T17:30:48Z" level=info msg="console PID 949"
time="2019-08-12T17:30:48Z" level=info msg="[PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin HOSTNAME=rancher HOME=/]"
time="2019-08-12T17:30:48Z" level=info msg="Running [system-docker-runc exec -- fbfb897138e67697ee33727d190c34455aed71fb5517ac8dcf3d695d4f0cd68f env PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin HOSTNAME=rancher HOME=/ ros docker-init --group docker --host unix:///var/run/docker.sock --log-opt max-size=25m --log-opt max-file=2]"
time="2019-08-12T17:30:48Z" level=info msg="Found /usr/bin/dockerd"
[… 47 essentially-identical iterations of the preceding/following
6-line chunks omitted for brevity …]
time="2019-08-12T17:57:53Z" level=info msg="Starting Docker in context: console"
time="2019-08-12T17:57:53Z" level=info msg="Getting PID for service: console"
time="2019-08-12T17:57:53Z" level=info msg="console PID 939"
time="2019-08-12T17:57:53Z" level=info msg="[PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin HOSTNAME=rancher HOME=/]"
time="2019-08-12T17:57:53Z" level=info msg="Running [system-docker-runc exec -- f5b1b2a44316a9de4832653a2a48f290b91239c1b0fd2d0dc3aa9efecae0bb8c env PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin HOSTNAME=rancher HOME=/ ros docker-init --log-opt max-file=2 --log-opt max-size=25m --group docker --host unix:///var/run/docker.sock]"
time="2019-08-12T17:57:53Z" level=info msg="Found /usr/bin/dockerd"
time="2019-08-12T17:58:56Z" level=info msg="Starting Docker in context: console"
time="2019-08-12T17:58:56Z" level=info msg="Getting PID for service: console"
time="2019-08-12T17:58:56Z" level=info msg="console PID 939"
time="2019-08-12T17:58:56Z" level=info msg="[PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin HOSTNAME=rancher HOME=/]"
time="2019-08-12T17:58:56Z" level=info msg="Running [system-docker-runc exec -- f5b1b2a44316a9de4832653a2a48f290b91239c1b0fd2d0dc3aa9efecae0bb8c env PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin HOSTNAME=rancher HOME=/ ros docker-init --host unix:///var/run/docker.sock --group docker --log-opt max-file=2 --log-opt max-size=25m]"
time="2019-08-12T17:58:56Z" level=info msg="Found /usr/bin/dockerd"
[rancher@rancher ~]$

How can I get better / more-informative logs out of this?

I have even tried pruning my user_config.yml file (the only one I have modified) down to literally just the headerline and SSH key entry, and yet the User Docker still seems to be in “some kind of state” even after rebooting the machine.

Or, even if there’s just a safe+reliable way to nuke /dev/sda and reinstall RancherOS over SSH (it runs in RAM…right?), that would be a good stopgap.

OK, the following seems to have worked (all as root):

warning: this is just how to nuke the user-docker if you don’t have physical access to boot into a live OS or such. It doesn’t actually “fix” the problem that occurred, it’s just a stopgap to get the system running again without physical access.

  1. Edit /usr/share/ros/os-config.yml and add a new volume /:/ROOOT underneath console (its volumes were around line 235 on mine), then reboot
  2. ros s stop docker
  3. system-docker container stop docker
  4. mv -v /ROOOT/var/lib/user-docker /ROOOT/var/lib/user-docker.BAK, then reboot
  5. docker container is (hopefully) no longer fubared, feel free to delete or salvage items out of the .BAK if you so choose

the problem is that the “actual baremetal root” doesn’t seem to be exposed, so there’s no way to “for sure” nuke {whatever it was} that’s causing the user-docker to choke and die without so much as an error message. That still seems like a bug which absolutely must be fixed, and I’ll be able and ready to provide any diagnostic information needed the instant it happens again.

This works to completly reset the user docker. Is there a way to get back all the docker volumes which I had before?