Hi team,
After the RKE cluster was deployed using the YAML file, it was up and running for 3-4 days. Then we found that the scheduler and controller are restarting frequently because they fail to renew their lease: "failed to renew lease kube-system/kube-scheduler: failed to tryAcquireOrRenew context deadline exceeded".
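For reference, this is roughly how we see the restarts and the error on the affected node (a quick sketch, assuming RKE's default container names kube-scheduler and kube-controller-manager):

```
# list the control plane containers and how often they exited/restarted
docker ps -a --filter "name=kube-scheduler" --filter "name=kube-controller-manager"

# tail the logs to capture the lease renewal error
docker logs --tail 100 kube-scheduler
docker logs --tail 100 kube-controller-manager
```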
Troubleshooting done so far:
restarted the exited master components and the Docker service, and redeployed the RKE cluster
tried several other steps as well, but no change in the scheduler and controller behaviour
The nodes are healthy and have enough resources.
@rancher_admin @superseb @mathieu-gilloots
Please look into this bug and why the scheduler and controller are restarting so often, because this impacts the production business.
Please share more info about the setup; this is usually caused by the nodes running out of resources. Are the nodes from the screenshot the only nodes in the cluster? It is recommended to have at least 3 etcd nodes and 2 controlplane nodes to make sure the cluster remains available when one of them goes down. Please share the specifications of the nodes (host OS/Docker version/CPU/memory/disk type+IOPS) and the exact logging from when it happens. In this case, please also share the logging of the etcd container.
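Something along these lines, run on each node, should cover most of the requested info (a sketch, assuming RKE's default container names; adjust as needed):

```
# host OS and Docker version
cat /etc/os-release
docker version

# CPU, memory and disks
nproc
free -h
lsblk
df -h

# recent logging from the control plane and etcd containers
docker logs --tail 200 kube-scheduler
docker logs --tail 200 kube-controller-manager
docker logs --tail 200 etcd
```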
Added 2 more controlplane nodes, but the same issue occurs on the other nodes as well: the containers of the scheduler and controller restart every 1-5 minutes.
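For reference, the extra nodes were added the usual way (a sketch, actual hostnames omitted): the new hosts were appended to the nodes section of cluster.yml with the controlplane role and the change was applied with:

```
rke up --config cluster.yml
```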
Regarding resources, all nodes are healthy in terms of RAM, CPU cores and disk.
OS: Red Hat Enterprise Linux Server release 7.5 (Maipo)
disk: lvm
Please supply the requested info so we can diagnose the issue; there is something wrong, and saying everything is fine will not help in diagnosing it. "An LVM disk" also does not say anything about the requested info: etcd has an IO requirement, and issues there will show up in its logging. That's why we need the specifications of the machines and the logging, so they can be checked.
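For example, the etcd logging can be checked for slow-disk warnings and the disk itself can be benchmarked with fio (a sketch; point --directory at the disk that holds the etcd data, usually /var/lib/etcd on RKE etcd nodes):

```
# look for etcd complaining about slow fsync/apply, which points at disk IO problems
docker logs etcd 2>&1 | grep -iE "took too long|slow|sync duration"

# measure fdatasync latency on the etcd disk; a high 99th percentile latency
# here usually explains leases timing out
fio --name=etcd-bench --directory=/var/lib/etcd --rw=write --ioengine=sync \
    --fdatasync=1 --size=22m --bs=2300
```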
There also seems to be an EL8 node in the cluster, while EL8 is only supported starting with k8s 1.19.
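The OS image and Kubernetes version of each node can be verified quickly with:

```
# shows OS image, kernel and kubelet version per node
kubectl get nodes -o wide
```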