I have a specific problem turning off cluster monitoring, and I’d also welcome feedback on the pros and cons of Rancher’s built-in metrics and whether people use them reliably in the wild.
We’re running Rancher 2.3.6 with two clusters at Kubernetes v1.17.4 on bare metal, each with three controller nodes; test has five worker nodes and prod has eight. All nodes run Ubuntu 18.04.4 LTS and Docker 19.03.6.
Our test cluster is largely idle while prod is running a data transformation pipeline that submits a queue of one-shot jobs/pods, crunching data that sits on a shared NFS file system.
The prod system is also fairly lightly loaded: there are sometimes a few thousand jobs submitted in a short window, but what we’re doing is pretty small compared to what I see other shops doing with K8s clusters.
While Rancher’s fancy monitoring runs just fine on the test cluster, when I turn it on in prod it runs for a few hours/days/weeks and then the dashboard reports “Monitoring API is not ready.” A few times over the last few months I’ve dug into the pod(s) that get stuck in a crash/backoff loop. I don’t recall the exact events I found in the logs and `kubectl describe` output, but they were OOM-related even though the nodes had plenty of RAM available. So far my answer has been to turn monitoring off and tell myself I’d get back to it later, then eventually turn it back on and start the cycle again.
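For reference, this is the sort of digging I was doing (the pod name below is from memory, so treat it as approximate):

```
# List the monitoring pods and find the one in CrashLoopBackOff
kubectl -n cattle-prometheus get pods

# Check events and resource limits on the crashing pod
kubectl -n cattle-prometheus describe pod prometheus-cluster-monitoring-0

# Look at the logs of the previous (crashed) container
kubectl -n cattle-prometheus logs prometheus-cluster-monitoring-0 -c prometheus --previous
```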
Specific Question: right now monitoring is broken. If I disable it in Rancher, I can click Disable and Save, but monitoring stays enabled and broken. I’ve also tried kubectl, deleting the deployment, replicaset, services/pods, and even the entire cattle-prometheus namespace (roughly as shown below). In every case it all pops back up (still broken). I’m guessing that as long as Rancher wants it running, it’ll keep replacing anything I remove.
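The cleanup attempts looked something like this (run against the prod cluster; within a few minutes everything is recreated, still broken):

```
# Delete the monitoring workloads individually
kubectl -n cattle-prometheus delete deployment,replicaset,service,pod --all

# Or remove the whole namespace
kubectl delete namespace cattle-prometheus
```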
How do I turn off monitoring?
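Is the authoritative switch the enableClusterMonitoring field on the cluster object in Rancher’s local (management) cluster? I’m assuming it’s something like the following, where c-xxxxx is a placeholder for the real cluster ID:

```
# Run with a kubeconfig for the Rancher local cluster, not the downstream cluster
kubectl edit clusters.management.cattle.io c-xxxxx
# ...then set spec.enableClusterMonitoring: false and save
```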
General Question: do people use this monitoring? The performance disclaimer at the top of the monitoring page in the web UI is a little discouraging:
" When enabling monitoring, you need to ensure your worker nodes and Prometheus pod have enough resources. Please visit [the Rancher docs for suggested resource limits."