Restarting Controller Manager and Scheduler

After a single node failure in an HA deployment, the Scheduler and Controller Manager are no longer working properly. When checking the docker logs on a node, the last messages are

 1 leaderelection.go:213] failed to renew lease kube-system/kube-controller-manager: timed out waiting for the condition
 1 controllermanager.go:215] leaderelection lost
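
For anyone checking the same thing, the tail of the container logs can be pulled with something like this (the container name kube-controller-manager assumes an RKE-provisioned control plane node):

 # show the last messages from the controller manager container
 docker logs --tail 20 kube-controller-manager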

When I try to restart the docker container kube-controller-manager, nothing happens.

Does anyone have suggestions for what I should do?

If restarting the container does nothing, there is something else going on. docker restart kube-controller-manager should restart the container and you should get logging from the startup.
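
For example, something along these lines should show whether the process comes back up (assuming the standard RKE container name):

 # restart the container and follow its startup output
 docker restart kube-controller-manager
 docker logs -f --tail 50 kube-controller-manager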

There is a troubleshooting section in the docs (https://rancher.com/docs/rancher/v2.x/en/troubleshooting/) where you can try to locate the root cause as well.


First of all, @superseb, thank you very much for your help.

It was a docker bug; I finally managed to restart everything, and production is working properly again.

I actually needed to restart the whole docker service to make it work:

 sudo systemctl restart docker.socket docker.service
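
After the restart, something like this can confirm that the control plane containers came back up and the controller manager re-acquired its lease (container names again assume an RKE-provisioned node):

 # confirm the control plane containers are running again
 docker ps --filter name=kube-controller-manager --filter name=kube-scheduler
 # check the recent logs for a successful leader election
 docker logs --since 5m kube-controller-manager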

The only problem I caused (and the reason I am writing this reply) is that I didn't drain the compromised node before reloading, which is a silly but serious mistake.

So if someone has the same problem in the future:

  1. Drain the compromised node (Rancher - Draining a node)
  2. Reload the docker service (SO - Cannot stop or restart a docker container), as sketched below
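
A rough sketch of the full sequence, assuming you drain with kubectl rather than through the Rancher UI (<node-name> is a placeholder, and the drain flags are the ones from kubectl of that era; adjust for your setup):

 # 1. move workloads off the node before touching docker
 kubectl drain <node-name> --ignore-daemonsets --delete-local-data
 # 2. restart the docker service on the node itself
 sudo systemctl restart docker.socket docker.service
 # 3. allow pods to be scheduled on the node again
 kubectl uncordon <node-name>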