KubeAPIErrorBudgetBurn on a close-to-empty cluster

Hello,

I have a k3s cluster with embedded etcd running on Ubuntu VMs. It has 3 server (master) nodes with 6 GB of RAM each and another 3 agents with 6 GB of RAM each.
It’s pretty much freshly installed; I added Logging, Longhorn, Monitoring and Rancher Backups from the Rancher Helm charts.

After this, and after configuring the alerting system, I’m getting KubeAPIErrorBudgetBurn (“The API server is burning too much error budget”).
To be honest, this is the first time I’m seeing this alert and I’m not sure what to look for.
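
After some digging, my understanding is that this alert comes from the multi-window burn-rate rules in kubernetes-mixin: it fires when the ratio of failed or too-slow API requests is eating the 99% availability budget much faster than allowed, over both a long and a short window at the same time. The critical variant looks roughly like this (rule names can differ between mixin versions):

```
# 1h AND 5m burn rates both above 14.4x the allowed 1% error rate
sum(apiserver_request:burnrate1h) > (14.40 * 0.01000)
and
sum(apiserver_request:burnrate5m) > (14.40 * 0.01000)
```

The burnrate recording rules count both 5xx responses and requests slower than the latency thresholds, so the alert can fire even when nothing is visibly failing.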

These are my API server graphs; if I understand them correctly, the error budget is somehow terrible?

From the home page everything seems pretty decent:

I found this online as well: Understanding the KubeAPIErrorBudgetBurn Alert Reason · Issue #464 · kubernetes-monitoring/kubernetes-mixin · GitHub
Looking into it, they recommend increasing etcd resources, but I’m not sure how to do that since this is embedded etcd :thinking: And since this is an empty cluster, shouldn’t a fresh install be fine?
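
In case it helps anyone else: even with embedded etcd you can still check whether etcd is the bottleneck. Something along these lines (the metric names are the standard etcd ones; as far as I can tell k3s only exposes etcd metrics for scraping if the servers are started with --etcd-expose-metrics, so whether the PromQL part works depends on your scrape setup):

```
# Is etcd healthy from the API server's point of view?
kubectl get --raw /readyz/etcd

# PromQL: p99 etcd disk latency. etcd's own guidance is roughly <10ms for
# WAL fsync and <25ms for backend commit; sustained values above that
# usually mean the disks are too slow for etcd.
histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))
histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m]))
```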

I’m really not sure what to look for so any help here would be appreciated.

Thanks

I found out that one of my nodes’ exporters was not reporting correctly for some reason. I don’t quite understand how, but I went into that node and checked the firewall, and it looked correct, so I just disabled the firewall to confirm, and that node’s exporters went green in Prometheus…
I then re-enabled the firewall with no further changes and it somehow stayed green…
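
For reference, this is roughly how I checked the exporter (9100 is the default node-exporter port and 10.42.0.0/16 is the default k3s pod CIDR; adjust both to your setup):

```
# Can the exporter be scraped from another machine?
curl -s http://<node-ip>:9100/metrics | head

# Ubuntu/ufw: inspect the rules, then allow scrapes from the cluster network
sudo ufw status verbose
sudo ufw allow from 10.42.0.0/16 to any port 9100 proto tcp
```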

Since then the graphs have gradually been getting a bit better, but the error budget is still weirdly negative:

I’m not sure what I should or shouldn’t be seeing in these graphs, but any help to at least understand whether there’s an actual issue here would be greatly appreciated.
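
In case it helps narrow things down, these are the kinds of queries I’ve been running to see whether it’s real 5xx errors or just slow requests burning the budget (my own drill-down attempt, not something taken from the mixin):

```
# Which resources/verbs are actually returning 5xx?
sum by (resource, verb, code) (rate(apiserver_request_total{code=~"5.."}[10m]))

# p99 read latency per verb; the SLO also counts slow requests as budget burn
histogram_quantile(0.99, sum by (verb, le) (
  rate(apiserver_request_duration_seconds_bucket{verb=~"LIST|GET"}[10m])
))
```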

Thanks

Does anyone know why this is happening, or how I could check it?

Does anyone know anything?

Did you find a solution? We are having the same issue and don’t know what to do…

I did not; for now I just stopped using monitoring altogether.

We ran into the same issues. In our case it was network performance related.

@zacanbot Can you elaborate a bit more? I’m struggling with something similar and looking for solutions.

@aszmyd We moved the control plane nodes to faster machines (better disk and network), and the problem went away. It also goes without saying that running other workloads on the same node makes the issue more likely to occur: we had a cluster with a single control plane node that also ran the ingress and exhibited the same behaviour. Moving the ingress to a worker node solved the issue there.
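
If anyone wants to confirm whether disk or network is the problem before moving hardware: the usual checks are fio’s fsync test (the one etcd’s docs recommend) and ping/iperf3 between the control-plane nodes. The directory below is just where k3s keeps the embedded etcd data by default; point fio at any directory on the same disk, and note that it writes a ~22 MB test file:

```
# Disk: fdatasync latency on the disk holding the embedded etcd data
fio --rw=write --ioengine=sync --fdatasync=1 \
    --directory=/var/lib/rancher/k3s/server/db --size=22m --bs=2300 --name=etcd-fsync-test

# Network: latency and throughput between control-plane nodes
ping -c 20 <other-server-ip>
iperf3 -s                 # on one server node
iperf3 -c <that-node-ip>  # on another
```

The number to look at in the fio output is the 99th percentile fdatasync latency, compared against etcd’s ~10 ms guideline.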