It all started a few days ago: for some reason my deployments weren’t available any more, so I had a look at the Rancher dashboard and it told me that Controller Manager and Scheduler were both unhealthy. So I first connected to the VM where both are running, and I noticed very high CPU usage: the docker process took about 300–400% of CPU. I decided to restart the VM, and a few minutes later everything was fine again.
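For anyone seeing the same thing, this is roughly how I spot it on the VM itself (generic Linux/procps commands, nothing Rancher-specific assumed):

```shell
# Show the CPU share of the docker daemon. The exact process name is an
# assumption -- on some installs it may be "docker" instead of "dockerd".
PID=$(pgrep -x dockerd | head -n1)
if [ -n "$PID" ]; then
  ps -o pid,pcpu,etime,comm -p "$PID"
else
  echo "dockerd not running"
fi
```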
But since that day, I experience almost daily problems with the cluster. After a few hours, the VM on which Controller Manager and Etcd are running gets high on CPU: the docker process again takes up 200–300% of CPU, and over time more and more runc processes are started. If I don’t restart the VM, a few hours later it becomes unusable because of the high CPU load. Furthermore, deployments are nearly impossible, because pods get stuck on a Creating container message and eventually all pods turn into Unknown status. The cluster overview in Rancher shows problems with one or both VMs most of the time.
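When it starts, I can watch the runc processes pile up with plain procps commands like the ones below (the kubectl lines are what I run from my workstation and are shown as comments only):

```shell
# Commands I use to see how far the problem has progressed:
#   kubectl get pods --all-namespaces | grep -v Running   # list stuck pods
#   kubectl get componentstatuses                         # scheduler / controller-manager health
RUNC_COUNT=$(ps -eo comm= | grep -c '^runc$' || true)
echo "runc processes: ${RUNC_COUNT}"
# Top five CPU consumers; dockerd usually sits at the top when it happens.
ps -eo pid,pcpu,comm --sort=-pcpu | head -n 6
```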
Both VMs have 16 GB of RAM and 4-core CPUs. I didn’t change anything in the cluster configuration over the last months, and the cluster has been running fine for almost two years now.
Has anyone else experienced these problems? Honestly, I don’t know what to do.
Kubernetes Version: v1.16.3
Rancher Version: 2.3.3
Docker Version: 18.09.6 (main VM), 19.03.8 (worker VM)
This is the recent event log: