KubeAPIErrorBudgetBurn on a close-to-empty cluster

Hello,

I have a k3s cluster with embedded etcd running on Ubuntu VMs. It has 3 server (master) nodes with 6 GB of RAM each and another 3 agents with 6 GB of RAM each.
It’s pretty much freshly installed; I added Logging, Longhorn, Monitoring and Rancher Backups from the Rancher Helm charts.

After this, and after configuring the alerting system, I’m getting KubeAPIErrorBudgetBurn (“The API server is burning too much error budget”).
To be honest, this is the first time I’m seeing this alert and I’m not sure what to look for.
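
After some digging, my understanding is that this alert comes from the multi-window burn-rate rules in kubernetes-mixin: it fires when the ratio of failed or too-slow API requests is eating the 99% availability budget much faster than allowed, over both a long and a short window at the same time. The critical variant looks roughly like this (rule names can differ between mixin versions):

```
# 1h AND 5m burn rates both above 14.4x the allowed 1% error rate
sum(apiserver_request:burnrate1h) > (14.40 * 0.01000)
and
sum(apiserver_request:burnrate5m) > (14.40 * 0.01000)
```

The burnrate recording rules count both 5xx responses and requests slower than the latency thresholds, so the alert can fire even when nothing is visibly failing.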

These are my API server graphs; if I understand them correctly, the error budget is somehow terrible?

From the home page everything seems pretty decent:

I found this online as well: Understanding the KubeAPIErrorBudgetBurn Alert Reason · Issue #464 · kubernetes-monitoring/kubernetes-mixin · GitHub
Looking into it, they recommend increasing etcd resources, but I’m not sure how to do that since this is embedded etcd :thinking: And since this is an empty cluster, shouldn’t a fresh install be fine?
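
In case it helps anyone else: even with embedded etcd you can still check whether etcd is the bottleneck. Something along these lines (the metric names are the standard etcd ones; as far as I can tell k3s only exposes etcd metrics for scraping if the servers are started with --etcd-expose-metrics, so whether the PromQL part works depends on your scrape setup):

```
# Is etcd healthy from the API server's point of view?
kubectl get --raw /readyz/etcd

# PromQL: p99 etcd disk latency. etcd's own guidance is roughly <10ms for
# WAL fsync and <25ms for backend commit; sustained values above that
# usually mean the disks are too slow for etcd.
histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))
histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m]))
```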

I’m really not sure what to look for so any help here would be appreciated.

Thanks

I found out that one of my nodes’ exporters was not reporting correctly for some reason. I don’t quite understand how, but I went into that node and checked the firewall, and it looked correct, so I just disabled the firewall to confirm, and that node’s exporters went green in Prometheus…
I then re-enabled the firewall with no further changes and it somehow stayed green…
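
For reference, this is roughly how I checked the exporter (9100 is the default node-exporter port and 10.42.0.0/16 is the default k3s pod CIDR; adjust both to your setup):

```
# Can the exporter be scraped from another machine?
curl -s http://<node-ip>:9100/metrics | head

# Ubuntu/ufw: inspect the rules, then allow scrapes from the cluster network
sudo ufw status verbose
sudo ufw allow from 10.42.0.0/16 to any port 9100 proto tcp
```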

Since then the graphs have gradually been getting a bit better, but the error budget is still weirdly negative:

I’m not sure what I should or shouldn’t be seeing in these graphs, but any help to at least understand whether there’s an actual issue here would be greatly appreciated.
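
In case it helps narrow things down, these are the kinds of queries I’ve been running to see whether it’s real 5xx errors or just slow requests burning the budget (my own drill-down attempt, not something taken from the mixin):

```
# Which resources/verbs are actually returning 5xx?
sum by (resource, verb, code) (rate(apiserver_request_total{code=~"5.."}[10m]))

# p99 read latency per verb; the SLO also counts slow requests as budget burn
histogram_quantile(0.99, sum by (verb, le) (
  rate(apiserver_request_duration_seconds_bucket{verb=~"LIST|GET"}[10m])
))
```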

Thanks

Does anyone know why this is happening, or how I could check it?

Does anyone know anything?

Did you find a solution? We are having the same issue and don’t know what to do…

I did not; for now I just stopped using monitoring altogether.

We ran into the same issues. In our case it was network performance related.

@zacanbot Can you elaborate a bit more? I’m struggling with something similar and looking for solutions.

@aszmyd We moved the control plane nodes to faster machines (better disk and network), and the problem went away. It also goes without saying that running other workloads on the same node makes the issue more likely to occur: we had a cluster with a single control plane node that also ran the ingress and exhibited the same behaviour. Moving the ingress to a worker node solved the issue there.
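
If anyone wants to confirm whether disk or network is the problem before moving hardware: the usual checks are fio’s fsync test (the one etcd’s docs recommend) and ping/iperf3 between the control-plane nodes. The directory below is just where k3s keeps the embedded etcd data by default; point fio at any directory on the same disk, and note that it writes a ~22 MB test file:

```
# Disk: fdatasync latency on the disk holding the embedded etcd data
fio --rw=write --ioengine=sync --fdatasync=1 \
    --directory=/var/lib/rancher/k3s/server/db --size=22m --bs=2300 --name=etcd-fsync-test

# Network: latency and throughput between control-plane nodes
ping -c 20 <other-server-ip>
iperf3 -s                 # on one server node
iperf3 -c <that-node-ip>  # on another
```

The number to look at in the fio output is the 99th percentile fdatasync latency, compared against etcd’s ~10 ms guideline.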