Restarting Controller Manager and Scheduler

ralic · February 28, 2019, 7:54am

After a single node failure in an HA deployment, the Scheduler and Controller Manager are not working properly. When I check the Docker logs on a node, the last messages are:

 1 leaderelection.go:213] failed to renew lease kube-system/kube-controller-manager: timed out waiting for the condition
 1 controllermanager.go:215] leaderelection lost

When I try to restart the kube-controller-manager Docker container, nothing happens.

Any suggestions on what I should do?

superseb · February 28, 2019, 11:40am

If restarting the container does nothing, there is something else going on. docker restart kube-controller-manager should restart the container and you should get logging from the startup.
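
As a rough sketch of the restart-and-check step described above: the script below restarts a container and prints its most recent log lines. The helper names run and restart_and_tail, the DRY_RUN switch, the 50-line tail, and the kube-scheduler container name are assumptions made for illustration; only kube-controller-manager is named in this thread. With DRY_RUN=1 (the default here) the commands are only printed, not executed.

```shell
#!/bin/sh
# Sketch: restart a control plane container and show its recent logs.
# DRY_RUN=1 (default, an assumption for safety) prints commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: restart one container, then show its last 50 log lines.
restart_and_tail() {
    container="$1"
    run docker restart "$container"
    run docker logs --tail 50 "$container"
}

restart_and_tail kube-controller-manager
# Assumed container name for the scheduler; adjust to what `docker ps` shows.
restart_and_tail kube-scheduler
```

Run it with DRY_RUN=0 on the affected node to actually execute the commands. If docker restart hangs or produces no new log lines, the problem is likely in the Docker daemon rather than in the container.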

There is a troubleshooting section in the docs (https://rancher.com/docs/rancher/v2.x/en/troubleshooting/) where you can try to locate the root cause as well.

ralic · March 1, 2019, 8:06am

First of all, @superseb, thank you very much for your help.

It was a Docker bug, and I finally managed to restart everything. Production is working properly.

I actually needed to restart the whole Docker service to make it work, with:

 sudo systemctl restart docker.socket docker.service

The only problem I caused (and the reason I am writing this reply) is that I did not drain the compromised node before restarting Docker, which is a silly but serious mistake.

So if someone has the same problem in the future:

1. Drain the compromised node (Rancher - Draining a node)
2. Restart the Docker service (SO - Cannot stop or restart a docker container)
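
The two steps above could be scripted roughly as follows. This is only a sketch: it uses kubectl drain instead of the Rancher UI, and the final uncordon step, the helper names run and recover_node, the DRY_RUN switch, and the placeholder node name node-1 are assumptions that do not come from this thread. With DRY_RUN=1 (the default here) the commands are only printed.

```shell
#!/bin/sh
# Sketch: drain a node, restart Docker on it, then make it schedulable again.
# DRY_RUN=1 (default, an assumption for safety) prints commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"
# Placeholder node name; replace with a name from `kubectl get nodes`.
NODE="${NODE:-node-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical helper wrapping the whole recovery sequence for one node.
recover_node() {
    node="$1"
    # 1. Evict workloads so they are rescheduled before Docker goes down.
    run kubectl drain "$node" --ignore-daemonsets
    # 2. Restart the Docker daemon (this command must run on the affected node).
    run sudo systemctl restart docker.socket docker.service
    # 3. Assumed follow-up: drain leaves the node cordoned, so allow scheduling again.
    run kubectl uncordon "$node"
}

recover_node "$NODE"
```

In practice the kubectl commands are usually run from a machine with cluster access, while the systemctl command is run on the node itself, so the sequence may need to be split across two shells.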
