It all started a few days ago: for some reason my deployments weren’t available any more, so I had a look at the Rancher dashboard and it told me that Controller Manager and Scheduler were both unhealthy. So I first connected to the VM where both are running, and I noticed very high CPU usage: the docker process took about 300–400% of CPU. I decided to restart the VM, and a few minutes later everything was fine again.
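For anyone seeing the same thing, this is roughly how I spot it on the VM itself (generic Linux/procps commands, nothing Rancher-specific assumed):

```shell
# Show the CPU share of the docker daemon. The exact process name is an
# assumption -- on some installs it may be "docker" instead of "dockerd".
PID=$(pgrep -x dockerd | head -n1)
if [ -n "$PID" ]; then
  ps -o pid,pcpu,etime,comm -p "$PID"
else
  echo "dockerd not running"
fi
```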
But since that day, I experience almost daily problems with the cluster. After a few hours, the VM on which Controller Manager and Etcd are running gets high on CPU: the docker process again takes up 200–300% of CPU, and over time more and more runc processes are started. If I don’t restart the VM, a few hours later it becomes unusable because of the high CPU load. Furthermore, deployments are nearly impossible, because pods get stuck on a Creating container message and eventually all pods turn into Unknown status. The cluster overview in Rancher shows problems with one or both VMs most of the time.
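When it starts, I can watch the runc processes pile up with plain procps commands like the ones below (the kubectl lines are what I run from my workstation and are shown as comments only):

```shell
# Commands I use to see how far the problem has progressed:
#   kubectl get pods --all-namespaces | grep -v Running   # list stuck pods
#   kubectl get componentstatuses                         # scheduler / controller-manager health
RUNC_COUNT=$(ps -eo comm= | grep -c '^runc$' || true)
echo "runc processes: ${RUNC_COUNT}"
# Top five CPU consumers; dockerd usually sits at the top when it happens.
ps -eo pid,pcpu,comm --sort=-pcpu | head -n 6
```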
Both VMs have 16 GB of RAM and 4-core CPUs. I didn’t change anything in the cluster configuration over the last months, and the cluster has been running fine for almost two years now.
Has anyone else experienced these problems? Honestly, I don’t know what to do.
Kubernetes Version: v1.16.3
Rancher Version: 2.3.3
Docker Version: 18.09.6 (main VM), 19.03.8 (worker VM)
This is the recent event log: