Can't remove cluster monitoring, several questions/issues

randyrue · July 27, 2020, 5:16pm

I have a specific problem turning off cluster monitoring, and I would also welcome feedback on the pros and cons of Rancher’s built-in metrics and whether people use them reliably in the wild.

We’re running Rancher 2.3.6 with two clusters at Kubernetes v1.17.4 on bare metal. Each cluster has three controller nodes; test has five worker nodes and prod has eight. All nodes are running Ubuntu 18.04.4 LTS and Docker 19.03.6.

Our test cluster is largely idle while prod is running a data transformation pipeline that submits a queue of one-shot jobs/pods, crunching data that sits on a shared NFS file system.

The prod system is also fairly lightly loaded: while there are sometimes a few thousand jobs submitted in a short time, what we’re doing is pretty small compared to what I see other shops doing with K8S clusters.

While Rancher’s fancy monitoring runs just fine on the test cluster, if I turn it on in prod it runs for a few hours/days/weeks and then the dashboard reports “Monitoring API is not ready.” At a few points over the last few months I’ve spent some time digging into the pod(s) that get stuck in a crash/backoff loop. I don’t recall the exact events I found in the logs and in the kubectl describe output, but I recall they were OOM related, even though the nodes have plenty of RAM available. So far my answer has been to turn off monitoring and tell myself I’d get back to it later. Then I’d eventually turn it back on and start the cycle again.
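For anyone chasing the same symptom: a container can be OOM-killed for exceeding its own memory limit even when the node has free RAM, so it may be worth confirming which containers were actually killed that way. Below is a minimal sketch, not anything Rancher ships. It assumes you have the parsed JSON from `kubectl -n cattle-prometheus get pods -o json`, and the function name `find_oom_killed` is made up for illustration.

```python
def find_oom_killed(pod_list):
    """Return (pod, container, restart_count) tuples for containers whose
    current or previous termination reason is OOMKilled.

    pod_list is the parsed JSON of a Kubernetes PodList, e.g. the output of
    `kubectl -n cattle-prometheus get pods -o json` (assumed input).
    """
    hits = []
    for pod in pod_list.get("items", []):
        pod_name = pod.get("metadata", {}).get("name", "<unknown>")
        statuses = pod.get("status", {}).get("containerStatuses", [])
        for cs in statuses:
            # A crash-looping container usually shows the kill in lastState,
            # while state itself is "waiting" (CrashLoopBackOff).
            for key in ("lastState", "state"):
                terminated = cs.get(key, {}).get("terminated")
                if terminated and terminated.get("reason") == "OOMKilled":
                    hits.append((pod_name, cs.get("name"), cs.get("restartCount", 0)))
                    break  # report each container once
    return hits
```

To use it, load the kubectl output with `json.load` and pass the result in. If a container shows up here, comparing its memory limit with what it actually uses is probably a better next step than looking at node-level RAM.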

Specific Question: right now monitoring is broken. If I try to disable it in Rancher, I can click Disable and Save, but monitoring stays enabled and broken. I’ve also tried kubectl, deleting the deployment, replicaset, services/pods and even the entire cattle-prometheus namespace. In every case it all pops back up (still broken). I’m thinking that as long as Rancher wants it running, it’ll keep replacing anything I remove.

How do I turn off monitoring?

General Question: do people use this monitoring? The performance disclaimer at the top of the monitoring page in the webUI is a little discouraging:

"When enabling monitoring, you need to ensure your worker nodes and Prometheus pod have enough resources. Please visit the Rancher docs for suggested resource limits."

randyrue · July 28, 2020, 2:37pm

Oh hell. Figured out the disabling, anyway.

When I click Disable, the Save button pops up with a red “Are you sure?” box next to it. I just realized that the red box is the “yes” button.

Am I clueless or could that UI be better laid out? Maybe replace that Save button with a Cancel button?
