Hello all,
Our rancher-server crashed. We noticed because the web interface was no longer available. The active nodes and load balancers kept working, and we could still reach the various web applications running in the containers.
When we rebooted the server, Docker no longer managed to come back up. After a couple of hours of debugging I finally discovered that removing the faulty container's folder from /var/lib/docker/containers/ let Docker start again.
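For anyone hitting the same issue, the cleanup looked roughly like this (a sketch; <container-id> is a placeholder for the faulty container's directory, and it assumes Docker runs under systemd). Moving the directory aside instead of deleting it keeps the logs around in case you need them:

# See why dockerd refuses to start (assumes systemd)
sudo journalctl -u docker.service --no-pager | tail -n 50

# Stop Docker before touching its state directory
sudo systemctl stop docker

# Move the faulty container's directory out of the way
# (<container-id> is a placeholder for the ID you find in the logs)
sudo mv /var/lib/docker/containers/<container-id> /root/container-backup

# Docker should now come back up cleanly
sudo systemctl start docker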
Our machine running Rancher server has only two containers: one for rancher-data and one for rancher-server.
I actually managed to solve it by removing the entire directory under /var/lib/docker/containers/ for the rancher-server container, then reissuing the command to start Rancher server using the volumes from the rancher-data container (which was in a stopped state and became available again):
docker run -d \
--volumes-from rancher-data \
--name rancher-server-v2.2.4 \
--restart=unless-stopped -p 80:80 -p 443:443 \
-v /home/certs/yourdomain.pem:/etc/rancher/ssl/cert.pem \
-v /home/certs/certificate-yourdomain.key:/etc/rancher/ssl/key.pem \
rancher/rancher:v2.2.4 --no-cacerts
And within a couple of seconds, everything was up and running again.
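A quick way to verify the recovery (a sketch; it assumes the container name from the command above, and Rancher 2.x's /ping health endpoint):

# Confirm the container is up
docker ps --filter name=rancher-server-v2.2.4

# Watch the startup logs for errors
docker logs -f rancher-server-v2.2.4

# Rancher 2.x answers "pong" here when healthy
curl -k https://localhost/ping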
Ok, but that's a pretty disruptive fix if you want to avoid a service outage; you may want to consider an HA setup. On the other hand, as you pointed out, access to your business applications was unaffected by losing the Rancher server, but you do then lose the ability to manage your cluster and workloads via Rancher (although you can still do so using the K8s API directly, which is honestly what we do, mostly for portability). Another gotcha: if you drive all authentication through Rancher, losing the Rancher server will be impactful. It's possible to avoid that in the version you are using.
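For illustration, working against the cluster API directly looks something like this (a sketch; it assumes you keep a kubeconfig that points straight at the cluster's API server rather than going through the Rancher proxy, and the path and names below are placeholders):

# Point kubectl straight at the cluster, bypassing Rancher
export KUBECONFIG=/path/to/direct-cluster.yaml

# Cluster and workload management keeps working while Rancher is down
kubectl get nodes
kubectl -n my-namespace get deployments
kubectl -n my-namespace scale deployment/my-app --replicas=3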