Rancher causing Docker lock-up?

Hi,
We’ve installed Rancher on a RHEL cluster, with the Rancher server on a machine separate from all the K8S nodes. We’ve successfully set up the cluster, deployed test workloads using nginx, and got the ingress controller working. The K8S nodes have all been running happily for several days. However, the machine running the Rancher server develops problems after a couple of days: the Rancher server UI isn’t reachable, and sshd stops responding to attempts to reach the machine. Later, sshd mysteriously started working again, but any Docker command I run hangs and never returns.

We’ve seen this behavior on two different RHEL VMs running the Rancher server. I want to think that nothing happening inside a Docker container could possibly impact OS-level resources like sshd, but this is looking like a mighty strong coincidence.

Has anyone else seen this?

Yes. Today. And for me this is the second time I’ve had such a severe crash. Last time I ended up completely recreating everything, as the rancher server was on one of my nodes. Now we have rancher-server separated onto its own hardware. Last week the crash, which looks similar, happened again. We noticed the crash because our rancher web interface was unresponsive. After a reboot of the server, the docker process hung in the ‘activating’ phase. When I manually remove the container directory from /var/lib/docker/containers/, docker will start, but without rancher-server.
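
In case it helps anyone trying to pin down when Docker stops answering: below is a minimal sketch of a probe that runs a command with a timeout and reports whether it returned, failed, or hung. The function name `probe` and the 10-second timeout are my own choices for illustration, and I'm assuming `docker info` is a reasonable stand-in for "any Docker command", since it has to talk to the daemon. It doesn't fix anything, it only tells you whether the daemon is responding.

```python
import subprocess


def probe(cmd, timeout):
    """Run cmd and classify the result as 'ok', 'failed', 'hung' or 'missing'.

    'hung' means the command did not finish within `timeout` seconds,
    which is the symptom described in this thread.
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,  # subprocess.run kills the child on timeout
        )
    except subprocess.TimeoutExpired:
        return "hung"
    except FileNotFoundError:
        return "missing"  # the executable is not on PATH
    return "ok" if result.returncode == 0 else "failed"


if __name__ == "__main__":
    # Assumption: `docker info` needs a reply from the daemon, so a hung
    # daemon should show up as a timeout here.
    print("docker info:", probe(["docker", "info"], timeout=10))
```

Running this from cron every minute or so and logging the result would at least give a timestamp for when the hang begins, which could then be lined up against the system logs.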