We recently found that our Rancher UI was down. The browser said there was a proxy/firewall issue, but there was no 504 or similar error, and the subdomain is fine. So a proxy/firewall problem is not actually the case; that message only comes from the browser.
Our first hunch was that the Let’s Encrypt certificate had expired and caused this, but we don’t think that is it either.
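One way to rule the certificate in or out without relying on the browser is to do the TLS handshake yourself. This is a minimal Python sketch (3.7+), assuming the Rancher hostname is reachable on port 443; `rancher.example.com` is a placeholder, not your real subdomain. With default verification on, an expired certificate should fail the handshake and report the reason, and a valid one reports how many days it has left.

```python
import socket
import ssl
import time
from typing import Optional

# Placeholder hostname (assumption); replace with your actual Rancher subdomain.
RANCHER_HOST = "rancher.example.com"


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days left before a certificate's notAfter date (negative means already expired)."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return (expires_at - now) / 86400


def check_certificate(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Handshake with the host and report either days remaining or the verification error."""
    context = ssl.create_default_context()  # verifies chain and hostname
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                not_after = tls.getpeercert()["notAfter"]
    except ssl.SSLCertVerificationError as err:
        # An expired certificate typically lands here with "certificate has expired".
        return f"verification failed: {err.verify_message}"
    return f"valid, {days_until_expiry(not_after):.1f} days left (notAfter={not_after})"
```

Usage: `print(check_certificate(RANCHER_HOST))`. Network errors such as a refused connection are deliberately not caught, so they surface as ordinary exceptions and are not mistaken for a certificate problem.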
The Docker container is there, but when we go in, we get booted out after 10 or so seconds. It looks like etcd is restarting constantly, in a boot loop or something. The hosts have plenty of resources.
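To confirm a restart loop with numbers rather than by watching it, Docker records a restart count and the last exit state for every container. A minimal Python sketch, assuming the etcd container is named `etcd` (an assumption; check `docker ps -a` on each etcd node) and that the `docker` CLI is available on the node:

```python
import json
import subprocess


def restart_summary(inspect_json: str) -> dict:
    """Pull restart-related fields out of `docker inspect` output (a JSON array)."""
    info = json.loads(inspect_json)[0]
    state = info["State"]
    return {
        "name": info["Name"].lstrip("/"),
        "restart_count": info["RestartCount"],  # restarts performed by the restart policy
        "status": state["Status"],              # e.g. "running", "restarting", "exited"
        "exit_code": state["ExitCode"],         # non-zero usually means etcd crashed
        "started_at": state["StartedAt"],
        "oom_killed": state["OOMKilled"],       # True would point at memory after all
    }


def inspect_container(name: str = "etcd") -> dict:
    """Run `docker inspect` for one container and summarise its restart state."""
    result = subprocess.run(
        ["docker", "inspect", name],
        capture_output=True, text=True, check=True,
    )
    return restart_summary(result.stdout)
```

Running `print(inspect_container())` a minute apart on each node shows whether the count is climbing, and a non-zero `exit_code` with `oom_killed` false suggests etcd itself is exiting rather than being killed for resources.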
We tried powering down all the component droplets, but that didn’t help either.
Since we can’t connect remotely and use the CLI, does anyone have ideas about what the cause might be, or what solutions we should try?
The UI is just static HTML/JS files, so not being able to get to it is just a symptom and not your actual problem.
Etcd restarting constantly basically means you have no cluster. Rancher is supposed to be running in that cluster, Rancher provides the API, and the API serves up the UI assets.
But there’s not much anyone can tell you in detail given just that “a boot loop or something” is happening.
The only visible symptom is etcd restarting.
Could you suggest the next couple of steps to troubleshoot when you have no cluster? “etcd is restarting constantly” is a more technically accurate way to describe it, but we feel rather directionless on both fixing it and finding the cause. Here is a video of the logs; I couldn’t capture them as text because they scroll by so fast, due to the restarting I think. We see all 3 VMs that form the cluster. Maybe the VMs are unable to “talk” to each other, which prevents them from forming a cluster?
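The “can the VMs talk to each other” question can be tested directly. etcd listens on port 2379 for client traffic and 2380 for peer (member-to-member) traffic, so each node should be able to open TCP connections to the other nodes on those ports. A minimal Python sketch to run on each of the 3 VMs; the IP addresses are placeholders (assumption) for the droplets’ private addresses:

```python
import socket
from typing import Dict, Iterable, Tuple

# Standard etcd ports: 2379 for clients, 2380 for peer traffic between members.
ETCD_PORTS = (2379, 2380)

# Placeholder addresses (assumption); replace with the private IPs of your three VMs.
NODES = ("10.0.0.1", "10.0.0.2", "10.0.0.3")


def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify one TCP connection attempt as open, refused, timeout or error."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"   # host answered, but nothing is listening on that port
    except socket.timeout:
        return "timeout"   # packets silently dropped, e.g. by a firewall
    except OSError as err:
        return f"error: {err}"


def check_peers(nodes: Iterable[str] = NODES,
                ports: Iterable[int] = ETCD_PORTS,
                timeout: float = 3.0) -> Dict[Tuple[str, int], str]:
    """Probe every (node, port) pair and return the classification for each."""
    ports = tuple(ports)
    return {(h, p): probe(h, p, timeout) for h in nodes for p in ports}
```

Usage: `for key, status in check_peers().items(): print(key, status)`. Because etcd is itself restarting, a “refused” may just mean it was down at that instant; a consistent “timeout” is the stronger hint that something between the VMs is dropping traffic.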
Hopefully I caught enough of this; it’s hard to tell where it loops.
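Rather than filming the logs, they can be dumped to text and filtered, which also makes it easier to see where each restart begins. A minimal Python sketch, assuming the container is named `etcd` (assumption) and using the real `docker logs --timestamps --tail` flags; stderr is merged into stdout because etcd writes its log to stderr. The keyword list is a rough heuristic, not anything etcd defines.

```python
import re
import subprocess
from typing import Iterable, List

# Heuristic keywords that usually mark the lines worth reading first.
PATTERN = re.compile(
    r"\b(error|fatal|panic|warn(ing)?|failed|refused|timeout)\b", re.IGNORECASE
)


def fetch_logs(container: str = "etcd", tail: int = 500) -> List[str]:
    """Return the last `tail` log lines of a container, stdout and stderr merged."""
    result = subprocess.run(
        ["docker", "logs", "--timestamps", "--tail", str(tail), container],
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True, check=True,
    )
    return result.stdout.splitlines()


def interesting_lines(lines: Iterable[str]) -> List[str]:
    """Keep only lines that match one of the heuristic keywords."""
    return [line for line in lines if PATTERN.search(line)]


def save_logs(lines: Iterable[str], path: str) -> None:
    """Write lines to a text file so they can be pasted instead of recorded."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("\n".join(lines) + "\n")
```

Usage: `logs = fetch_logs(); save_logs(logs, "etcd.log"); print("\n".join(interesting_lines(logs)))`. The last few lines before each restart are usually the ones that explain it.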