It happened again! Our Rancher 2 deployment (Local cluster and a user workload cluster) went down, and the root cause was that our three etcd nodes couldn’t talk to each other due to a network configuration issue. Unfortunately, since etcd was down, Rancher itself was down and thus the Rancher alerting system was not functional.
What strategies have you employed to keep an eye on your etcd cluster in the event that Rancher itself is down? Do you use etcdctl?
That’s always the danger of coupling yourself to the Rancher API, CLI or UI. You don’t have to do that (we certainly don’t, and mostly for this type of reason).
Hi Stefan, usually the idea is to have external monitoring that doesn’t rely on the system it is monitoring.
If you have any resources external to Rancher, you can deploy a Prometheus instance that monitors the critical components (master hosts/etcd/network) from outside the cluster.
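A quick way to sanity-check that approach (a rough sketch, assuming your etcd members serve metrics on the default client port with client-cert auth; the hostnames and cert paths are just placeholders) is to curl the /metrics endpoint etcd already exposes, which is exactly what an external Prometheus would scrape:

# curl --cacert /etc/ssl/etcd/ca.pem --cert /etc/ssl/etcd/client.pem --key /etc/ssl/etcd/client-key.pem https://etcd1:2379/metrics | grep '^etcd_server_has_leader'

Each healthy member should report etcd_server_has_leader 1, so a scrape job against etcd1/etcd2/etcd3 plus an alert on that metric going to 0 (or on the target disappearing entirely) keeps firing even when Rancher itself is down.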
We use a number of tools that operate at the infrastructure, platform and application levels. I work for a large corporate, so much of this is predefined and I often don’t have latitude to choose alternatives (there are support wrap and technical approval considerations), but that said, we use Prometheus/Grafana along with Splunk, AppDynamics and CloudWatch. Applications share some of these and also use others like Seq.
# etcdctl --endpoints https://etcd1:2379/ cluster-health
member asd123asd123as is healthy: got healthy result from https://etcd2:2379
member 0978asd098asd9 is healthy: got healthy result from https://cntest07:2379
member 345kjh345jkh34 is healthy: got healthy result from https://etcd3:2379
cluster is healthy
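If your etcd is new enough that the v2 cluster-health subcommand is gone, the rough v3 equivalent is the following (cert paths are placeholders for whatever your cluster uses); running it from a host outside the cluster keeps the check independent of Rancher:

# ETCDCTL_API=3 etcdctl --endpoints https://etcd1:2379,https://etcd2:2379,https://etcd3:2379 --cacert /etc/ssl/etcd/ca.pem --cert /etc/ssl/etcd/client.pem --key /etc/ssl/etcd/client-key.pem endpoint health

ETCDCTL_API=3 etcdctl endpoint status --write-out=table against the same endpoints also shows which member is the leader and the current raft term, which is handy when the members can’t agree.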