I have a 3-node HA cluster deployed with RKE 0.3.1; all 3 nodes have the etcd, controlplane, and worker roles (cluster.yml: network plugin flannel, dns not set explicitly). "In front" of the cluster I have HAProxy doing TLS offloading. Everything has been working fine so far.
Now I'm doing failover tests by simply disconnecting a single node from the network and watching how workloads are redistributed. I discovered the following problems:
- If I disconnect the node that the coredns pod is running on, other pods can't resolve names anymore. The coredns pod does not get rescheduled; it just stays in state unavailable/"running". I discovered that the coredns deployment has NoExecute and NoSchedule tolerations. Once I remove the NoExecute toleration, the pod gets rescheduled and other pods/workloads can use name resolution inside the cluster again. Now my questions:
- Why are these tolerations present? What is the rationale behind them?
- Can I safely remove them?
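For reference, here is roughly what I'm talking about. This is a sketch of the tolerations block as it appears in a typical coredns Deployment, not a verbatim dump from my cluster; the exact keys and operators may differ (check with `kubectl -n kube-system get deploy coredns -o yaml`):

```yaml
# Excerpt from the coredns Deployment pod spec (illustrative, not exact)
spec:
  template:
    spec:
      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
      # A blanket NoExecute toleration like this also tolerates the
      # node.kubernetes.io/unreachable:NoExecute taint, so the pod is
      # never evicted from a disconnected node:
      - effect: NoExecute
        operator: Exists
      - effect: NoSchedule
        operator: Exists
```

My understanding is that without the NoExecute toleration, the unreachable taint evicts the pod after the default toleration grace period, which matches the behavior I saw after removing it.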
- In the coredns-autoscaler deployment I see no "preventSinglePointFailure" anywhere, but according to this issue it should automagically be present (where?): https://github.com/rancher/rke/issues/1625
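If I understand the cluster-proportional-autoscaler correctly, that setting would live in a ConfigMap rather than in the Deployment spec itself, which may be why I can't find it there. Something along these lines (the ConfigMap name and parameter values here are assumptions on my part; on my cluster I'd verify with `kubectl -n kube-system get cm`):

```yaml
# Hypothetical params ConfigMap read by the coredns-autoscaler
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns-autoscaler    # assumed name, check your cluster
  namespace: kube-system
data:
  # preventSinglePointFailure forces >=2 replicas when there is more than one node
  linear: '{"coresPerReplica":128,"nodesPerReplica":4,"preventSinglePointFailure":true,"min":1}'
```

Is this where RKE is supposed to inject the flag, and should I see such a ConfigMap on a 0.3.1 cluster?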
- After reconnecting the node, many (Rancher?) workloads just keep showing "Updating" in the Rancher UI (ingress-nginx-controller, cattle-node-agent, exporter-node-cluster-monitoring, kube-flannel, metrics-server), and many of them show 2/3 pods in state "unavailable" — BUT when I look at the pods themselves, the containers are shown as "running". Does anyone know what is wrong here, or what I did wrong in my deployment?
Thanks in advance for your time!