AlertManager shows KubeletDown, but the cluster appears to be up and running OK

In a Rancher cluster I have installed the Prometheus/AlertManager Helm charts.
Rancher: v2.5.5
rancher-monitoring: 9.4.202
k8s: v1.19.7

For some unknown reason, AlertManager has been showing the following alerts for several consecutive days now:

KubeSchedulerDown, https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeschedulerdown
KubeControllerManagerDown, https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecontrollermanagerdown
KubeAPIDown, https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapidown
KubeletDown, https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeletdown
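
As far as I can tell, all four alerts are based on the `up` metric of the corresponding scrape jobs (the kubernetes-mixin fires them with expressions like `absent(up{job="kubelet"} == 1)`), so Prometheus itself can be asked which targets it considers down. The service name and namespace below are assumptions for a default rancher-monitoring install; adjust them to your setup:

```sh
# Reach the Prometheus HTTP API locally (service name/namespace are
# assumptions for a default rancher-monitoring install).
kubectl -n cattle-monitoring-system port-forward svc/rancher-monitoring-prometheus 9090:9090 &

# List the scrape targets behind the firing alerts and their up/down state.
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=up{job=~"apiserver|kubelet|kube-scheduler|kube-controller-manager"}'
```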

Unfortunately, those runbook links don’t give very helpful insights into what the problem is or how it could be solved.
I don’t even know what is actually broken, as the cluster behaves normally (at least it looks OK): Pods can be scheduled and restarted, external traffic is handled as usual, and I can access the cluster from outside using kubectl or k9s.
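
For completeness, the control plane can also be checked directly, bypassing the monitoring stack (on k8s v1.19 componentstatuses is deprecated but still answers):

```sh
kubectl get componentstatuses         # deprecated since v1.19, but still reports scheduler/controller-manager health
kubectl get --raw='/healthz?verbose'  # API server health checks
kubectl get nodes                     # all kubelets should report Ready
```

If these come back healthy while the alerts keep firing, the problem is more likely on the monitoring side than in the cluster itself.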

So my question is: what is broken, and how can I fix it?

I notice that Prometheus/Grafana appear broken: they don’t show any data for the cluster, and all dashboards only show “N/A” or “no data”.

And Prometheus shows the following logs (which I have seen in the past every time the persistent volume for Prometheus went down or had problems because of Longhorn issues):

prometheus level=warn ts=2021-05-25T07:48:36.616Z caller=manager.go:595 component="rule manager" group=kube-scheduler.rules msg="Rule sample appending failed" err="write to WAL: log samples: write /prometheus/wal/00007531: read-only file system"
prometheus level=warn ts=2021-05-25T07:48:36.617Z caller=manager.go:595 component="rule manager" group=kube-scheduler.rules msg="Rule sample appending failed" err="write to WAL: log samples: write /prometheus/wal/00007531: read-only file system"
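
That error says the filesystem behind /prometheus has been remounted read-only. One way to confirm it, with Pod and container names that are assumptions for a default rancher-monitoring install:

```sh
# Try to write inside the Prometheus data mount (Pod and container names
# are assumptions; adjust to your install).
kubectl -n cattle-monitoring-system exec prometheus-rancher-monitoring-prometheus-0 \
  -c prometheus -- touch /prometheus/.rw-test \
  && echo "volume is writable" || echo "volume is read-only"

# Inspect the backing PVC and the Longhorn volume behind it.
kubectl -n cattle-monitoring-system get pvc
kubectl -n longhorn-system get volumes.longhorn.io
```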

Somehow Prometheus neither restarts nor reports an unhealthy status in such a scenario.
In the past I used to “fix” such problems by manually restarting the Prometheus Pod.
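
Concretely, deleting the Pod lets its StatefulSet recreate it, and the fresh Pod remounts the volume and reopens the WAL read-write (the Pod name is again an assumption for a default rancher-monitoring install):

```sh
# Bounce the Prometheus Pod; the StatefulSet controller recreates it.
kubectl -n cattle-monitoring-system delete pod prometheus-rancher-monitoring-prometheus-0
```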

Again, after restarting the Prometheus Pod, those “N/A” and “no data” placeholders in Grafana disappeared and real data was shown.

And now those strange AlertManager alerts (KubeSchedulerDown, KubeControllerManagerDown, KubeAPIDown & KubeletDown) have resolved too :slight_smile:

Looks like the problem was the Prometheus Pod not being able to write to its underlying database: with the WAL on a read-only volume, no new samples were stored, so the “down” alerts were about missing metrics rather than actually broken components.