Monitoring etcd health when Rancher is down?

It happened again! Our Rancher 2 deployment (Local cluster and a user workload cluster) went down, and the root cause was that our three etcd nodes couldn’t talk to each other due to a network configuration issue. Unfortunately, since etcd was down, Rancher itself was down and thus the Rancher alerting system was not functional.

What strategies have you employed to keep an eye on your etcd cluster in the event that Rancher itself is down? Do you use etcdctl?

That’s always the danger of coupling yourself to the Rancher API, CLI or UI. You don’t have to do that (we certainly don’t, and mostly for this type of reason).

What’s your strategy for monitoring etcd and other core parts of Rancher 2 then?

Hi Stefan, usually the idea is to have external monitoring that doesn’t rely on the system it is monitoring.

If you have any resources external to Rancher, you can deploy a Prometheus instance that monitors the critical components (master hosts/etcd/network) from outside the cluster.
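A rough sketch of what that can look like from a host outside the cluster (the hostnames, certificate paths and the Docker setup are just placeholders for your environment): write a prometheus.yml that scrapes the etcd members directly, then run Prometheus against it.

cat > prometheus.yml <<'EOF'
# Scrape the etcd members over their client port (2379)
scrape_configs:
  - job_name: etcd
    scheme: https
    metrics_path: /metrics
    static_configs:
      - targets:
          - etcd1.example.org:2379
          - etcd2.example.org:2379
          - etcd3.example.org:2379
    tls_config:
      ca_file: /etc/prometheus/certs/ca.pem
      cert_file: /etc/prometheus/certs/client.pem
      key_file: /etc/prometheus/certs/client-key.pem
EOF

docker run -d --name prometheus-external -p 9090:9090 \
  -v "$PWD/prometheus.yml:/etc/prometheus/prometheus.yml" \
  -v "$PWD/certs:/etc/prometheus/certs" \
  prom/prometheus

Point an Alertmanager or Grafana that also lives outside the cluster at it, so an etcd outage doesn’t take the alerting down with it.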

We use a number of tools that operate at the infrastructure, platform and application levels. I work for a large corporation, so much of this is predefined and I often don’t have latitude to choose alternatives (there are support wrap and technical approval considerations), but that said, we use Prometheus/Grafana along with Splunk, AppDynamics and CloudWatch. Applications share some of these and also use others like Seq.

Sorry if I wasn’t clear. I was trying to figure out what low-level tooling I could use to plug into our existing monitoring systems.

At a low level, etcd can be monitored over HTTP, as in this example using curl:

mgmt01 # curl https://etcd1.example.org:2379/health
{"health":"true"}
#
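One caveat: if etcd has client certificate authentication enabled (which RKE-provisioned etcd nodes do by default, as far as I know), a plain curl gets rejected and you have to pass the certificates as well. Roughly like this, with the paths being an assumption (on RKE nodes they usually live under /etc/kubernetes/ssl/):

curl --cacert /etc/kubernetes/ssl/kube-ca.pem \
     --cert /etc/kubernetes/ssl/kube-etcd-etcd1-example-org.pem \
     --key /etc/kubernetes/ssl/kube-etcd-etcd1-example-org-key.pem \
     https://etcd1.example.org:2379/health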

And also via etcdctl cluster-health:

# etcdctl --endpoints https://etcd1:2379/ cluster-health
member asd123asd123as is healthy: got healthy result from https://etcd2:2379
member 0978asd098asd9 is healthy: got healthy result from https://cntest07:2379
member 345kjh345jkh34 is healthy: got healthy result from https://etcd3:2379
cluster is healthy
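Note that cluster-health is an etcd v2 API command; with a v3 etcdctl (or ETCDCTL_API=3 exported) the equivalents are endpoint health and endpoint status. A sketch, with the endpoints and certificate paths as assumptions:

# ETCDCTL_API=3 etcdctl \
    --endpoints https://etcd1:2379,https://etcd2:2379,https://etcd3:2379 \
    --cacert /etc/kubernetes/ssl/kube-ca.pem \
    --cert /etc/kubernetes/ssl/kube-etcd-etcd1.pem \
    --key /etc/kubernetes/ssl/kube-etcd-etcd1-key.pem \
    endpoint health

Running endpoint status --write-out=table instead also shows the leader, raft term and DB size per member, which helps when the members are up but disagree.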

And generally speaking, it sounds like it might be wise to have a Grafana/Prometheus cluster outside of Rancher.
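In the meantime, even a dumb cron job on a box outside the cluster would close the gap. Something along these lines, where the hostnames, certificate paths and alert recipient are all placeholders:

#!/bin/sh
# Probe each etcd member from outside the cluster and mail on failure.
for ep in etcd1.example.org etcd2.example.org etcd3.example.org; do
  if ! curl -fsS --max-time 5 \
      --cacert /etc/etcd-certs/ca.pem \
      --cert /etc/etcd-certs/client.pem \
      --key /etc/etcd-certs/client-key.pem \
      "https://${ep}:2379/health" | grep -q '"health":"true"'; then
    echo "etcd member ${ep} failed its health check" \
      | mail -s "etcd health alert: ${ep}" ops@example.org
  fi
done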

You can inspect and troubleshoot the Rancher local cluster’s etcd as described here, basically by passing etcdctl commands into the etcd containers:

https://rancher.com/docs/rancher/v2.x/en/troubleshooting/kubernetes-components/etcd/
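For the record, on RKE-provisioned nodes that boils down to exec’ing into the container, which is simply named etcd there; roughly (exact subcommands depend on your Rancher/etcd version):

# docker exec etcd etcdctl endpoint health
# docker exec etcd etcdctl member list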
