Monitoring etcd health when Rancher is down?

It happened again! Our Rancher 2 deployment (Local cluster and a user workload cluster) went down, and the root cause was that our three etcd nodes couldn’t talk to each other due to a network configuration issue. Unfortunately, since etcd was down, Rancher itself was down and thus the Rancher alerting system was not functional.

What strategies have you employed to keep an eye on your etcd cluster in the event that Rancher itself is down? Do you use etcdctl?

That’s always the danger of coupling yourself to the Rancher API, CLI or UI. You don’t have to do that (we certainly don’t, and mostly for this type of reason).

What’s your strategy for monitoring etcd and other core parts of Rancher 2 then?

Hi Stefan, usually the idea is to have external monitoring that doesn’t rely on the system it is monitoring.

If you have any resources external to Rancher, you can deploy a Prometheus instance that monitors the critical components (master hosts/etcd/network) from outside the cluster.
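A rough sketch of what that can look like from a host outside the cluster (the hostnames, certificate paths and the Docker setup are just placeholders for your environment): write a prometheus.yml that scrapes the etcd members directly, then run Prometheus against it.

cat > prometheus.yml <<'EOF'
# Scrape the etcd members over their client port (2379)
scrape_configs:
  - job_name: etcd
    scheme: https
    metrics_path: /metrics
    static_configs:
      - targets:
          - etcd1.example.org:2379
          - etcd2.example.org:2379
          - etcd3.example.org:2379
    tls_config:
      ca_file: /etc/prometheus/certs/ca.pem
      cert_file: /etc/prometheus/certs/client.pem
      key_file: /etc/prometheus/certs/client-key.pem
EOF

docker run -d --name prometheus-external -p 9090:9090 \
  -v "$PWD/prometheus.yml:/etc/prometheus/prometheus.yml" \
  -v "$PWD/certs:/etc/prometheus/certs" \
  prom/prometheus

Point an Alertmanager or Grafana that also lives outside the cluster at it, so an etcd outage doesn’t take the alerting down with it.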

We use a number of tools that operate at the infrastructure, platform and application levels. I work for a large corporation, so much of this is predefined and I often don’t have latitude to choose alternatives (there are support wrap and technical approval considerations), but that said, we use Prometheus/Grafana along with Splunk, AppDynamics and CloudWatch. Applications share some of these and also use others like Seq.

Sorry if I wasn’t clear. I was trying to figure out what low-level tooling I could use to plug into our existing monitoring systems.

At a low level, etcd can be monitored over HTTP, as in this example using curl:

mgmt01 # curl https://etcd1.example.org:2379/health
{"health":"true"}
#
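One caveat: if etcd has client certificate authentication enabled (which RKE-provisioned etcd nodes do by default, as far as I know), a plain curl gets rejected and you have to pass the certificates as well. Roughly like this, with the paths being an assumption (on RKE nodes they usually live under /etc/kubernetes/ssl/):

curl --cacert /etc/kubernetes/ssl/kube-ca.pem \
     --cert /etc/kubernetes/ssl/kube-etcd-etcd1-example-org.pem \
     --key /etc/kubernetes/ssl/kube-etcd-etcd1-example-org-key.pem \
     https://etcd1.example.org:2379/health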

And also via etcdctl cluster-health:

# etcdctl --endpoints https://etcd1:2379/ cluster-health
member asd123asd123as is healthy: got healthy result from https://etcd2:2379
member 0978asd098asd9 is healthy: got healthy result from https://cntest07:2379
member 345kjh345jkh34 is healthy: got healthy result from https://etcd3:2379
cluster is healthy
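Note that cluster-health is an etcd v2 API command; with a v3 etcdctl (or ETCDCTL_API=3 exported) the equivalents are endpoint health and endpoint status. A sketch, with the endpoints and certificate paths as assumptions:

# ETCDCTL_API=3 etcdctl \
    --endpoints https://etcd1:2379,https://etcd2:2379,https://etcd3:2379 \
    --cacert /etc/kubernetes/ssl/kube-ca.pem \
    --cert /etc/kubernetes/ssl/kube-etcd-etcd1.pem \
    --key /etc/kubernetes/ssl/kube-etcd-etcd1-key.pem \
    endpoint health

Running endpoint status --write-out=table instead also shows the leader, raft term and DB size per member, which helps when the members are up but disagree.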

And generally speaking, it sounds like it might be wise to have a Grafana/Prometheus cluster outside of Rancher.
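In the meantime, even a dumb cron job on a box outside the cluster would close the gap. Something along these lines, where the hostnames, certificate paths and alert recipient are all placeholders:

#!/bin/sh
# Probe each etcd member from outside the cluster and mail on failure.
for ep in etcd1.example.org etcd2.example.org etcd3.example.org; do
  if ! curl -fsS --max-time 5 \
      --cacert /etc/etcd-certs/ca.pem \
      --cert /etc/etcd-certs/client.pem \
      --key /etc/etcd-certs/client-key.pem \
      "https://${ep}:2379/health" | grep -q '"health":"true"'; then
    echo "etcd member ${ep} failed its health check" \
      | mail -s "etcd health alert: ${ep}" ops@example.org
  fi
done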

You can inspect and troubleshoot the Rancher local cluster’s etcd as described here, basically by passing etcdctl commands into the etcd containers:

https://rancher.com/docs/rancher/v2.x/en/troubleshooting/kubernetes-components/etcd/
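For the record, on RKE-provisioned nodes that boils down to exec’ing into the container, which is simply named etcd there; roughly (exact subcommands depend on your Rancher/etcd version):

# docker exec etcd etcdctl endpoint health
# docker exec etcd etcdctl member list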
