AlertManager shows KubeletDown, but the cluster appears to be up and running OK

In a Rancher cluster I have installed the Prometheus/AlertManager Helm charts.
Rancher: v2.5.5
rancher-monitoring: 9.4.202
k8s: v1.19.7

For some unknown reason, AlertManager has been showing the following alerts for several consecutive days now:

KubeSchedulerDown, https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeschedulerdown
KubeControllerManagerDown, https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecontrollermanagerdown
KubeAPIDown, https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapidown
KubeletDown, https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeletdown
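
As far as I can tell, all four alerts are based on the `up` metric of the corresponding scrape jobs (the kubernetes-mixin fires them with expressions like `absent(up{job="kubelet"} == 1)`), so Prometheus itself can be asked which targets it considers down. The service name and namespace below are assumptions for a default rancher-monitoring install; adjust them to your setup:

```sh
# Reach the Prometheus HTTP API locally (service name/namespace are
# assumptions for a default rancher-monitoring install).
kubectl -n cattle-monitoring-system port-forward svc/rancher-monitoring-prometheus 9090:9090 &

# List the scrape targets behind the firing alerts and their up/down state.
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=up{job=~"apiserver|kubelet|kube-scheduler|kube-controller-manager"}'
```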

Unfortunately, those runbook links don’t give very helpful insights into what the problem is or how it could be solved.
I don’t even know what is actually broken, as the cluster behaves normally (at least it looks OK): Pods can be scheduled and restarted, external traffic is handled as usual, and I can access the cluster from outside using kubectl or k9s.
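
For completeness, the control plane can also be checked directly, bypassing the monitoring stack (on k8s v1.19 componentstatuses is deprecated but still answers):

```sh
kubectl get componentstatuses         # deprecated since v1.19, but still reports scheduler/controller-manager health
kubectl get --raw='/healthz?verbose'  # API server health checks
kubectl get nodes                     # all kubelets should report Ready
```

If these come back healthy while the alerts keep firing, the problem is more likely on the monitoring side than in the cluster itself.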

So my question is: what is broken, and how can I fix it?

I notice that Prometheus/Grafana appear broken: they don’t show any data for the cluster, and all dashboards only show “N/A” or “no data”.

And Prometheus shows the following logs (which I have seen in the past every time the persistent volume for Prometheus went down or had problems because of Longhorn issues):

prometheus level=warn ts=2021-05-25T07:48:36.616Z caller=manager.go:595 component="rule manager" group=kube-scheduler.rules msg="Rule sample appending failed" err="write to WAL: log samples: write /prometheus/wal/00007531: read-only file system"
prometheus level=warn ts=2021-05-25T07:48:36.617Z caller=manager.go:595 component="rule manager" group=kube-scheduler.rules msg="Rule sample appending failed" err="write to WAL: log samples: write /prometheus/wal/00007531: read-only file system"
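
That error says the filesystem behind /prometheus has been remounted read-only. One way to confirm it, with Pod and container names that are assumptions for a default rancher-monitoring install:

```sh
# Try to write inside the Prometheus data mount (Pod and container names
# are assumptions; adjust to your install).
kubectl -n cattle-monitoring-system exec prometheus-rancher-monitoring-prometheus-0 \
  -c prometheus -- touch /prometheus/.rw-test \
  && echo "volume is writable" || echo "volume is read-only"

# Inspect the backing PVC and the Longhorn volume behind it.
kubectl -n cattle-monitoring-system get pvc
kubectl -n longhorn-system get volumes.longhorn.io
```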

Somehow Prometheus neither restarts nor reports an unhealthy status in such a scenario.
In the past I used to “fix” such problems by manually restarting the Prometheus Pod.
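
Concretely, deleting the Pod lets its StatefulSet recreate it, and the fresh Pod remounts the volume and reopens the WAL read-write (the Pod name is again an assumption for a default rancher-monitoring install):

```sh
# Bounce the Prometheus Pod; the StatefulSet controller recreates it.
kubectl -n cattle-monitoring-system delete pod prometheus-rancher-monitoring-prometheus-0
```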

Again, after restarting the Prometheus Pod, those “N/A” and “no data” placeholders in Grafana disappeared and real data was shown.

And now those strange AlertManager alerts (KubeSchedulerDown, KubeControllerManagerDown, KubeAPIDown & KubeletDown) have resolved too :slight_smile:

Looks like the problem was the Prometheus Pod not being able to write to its underlying database: with the WAL on a read-only volume, no new samples were stored, so the “down” alerts were about missing metrics rather than actually broken components.