NoExecute tolerations on workloads in an rke-deployed HA cluster

Hi!

I have a 3-node HA cluster deployed with rke 0.3.1; all 3 nodes have the etcd, controlplane and worker roles (cluster.yml: network plugin flannel, dns not set explicitly). “In front” of the cluster I have HAProxy doing TLS offloading. Everything has been working fine so far.
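For completeness, a minimal cluster.yml sketch of this setup (node addresses and SSH user are placeholders, everything else as described above):

```yaml
# cluster.yml (sketch; addresses/user are placeholders)
nodes:
  - address: 192.168.1.11
    user: rancher
    role: [etcd, controlplane, worker]
  - address: 192.168.1.12
    user: rancher
    role: [etcd, controlplane, worker]
  - address: 192.168.1.13
    user: rancher
    role: [etcd, controlplane, worker]

network:
  plugin: flannel

# no dns section, so rke deploys its default DNS provider
```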

Now I’m doing failover tests by simply disconnecting a single node from the network and watching how workloads are redistributed. I discovered the following problems:

  • If I disconnect the node that the coredns pod is running on, other pods can no longer resolve names. The coredns pod does not get rescheduled; it just stays in the unavailable/“running” state. I discovered that the coredns deployment has NoExecute & NoSchedule tolerations (see the sketch below this list). Once I remove the NoExecute toleration, the pod gets rescheduled and other pods/workloads can use name resolution inside the cluster again. Now my questions:
  • Why are these tolerations present? What is the rationale behind them?
  • Can I safely remove them?
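For reference, this is roughly what the tolerations block on the coredns deployment looks like (reconstructed from what I see in my cluster, so field order and extra entries may differ). If I understand the taint mechanics correctly, a blanket NoExecute toleration with operator: Exists and no tolerationSeconds means the pod tolerates the node.kubernetes.io/unreachable:NoExecute taint forever, which would explain why it is never evicted from the disconnected node:

```yaml
# tolerations on the coredns deployment (sketch, reconstructed):
tolerations:
  - effect: NoExecute
    operator: Exists   # no tolerationSeconds: tolerated forever, so the pod
                       # is never evicted from a NotReady/unreachable node
  - effect: NoSchedule
    operator: Exists

# a possible middle ground instead of deleting the toleration entirely:
# tolerate the unreachable taint only for a limited time, then evict
#  - key: node.kubernetes.io/unreachable
#    operator: Exists
#    effect: NoExecute
#    tolerationSeconds: 60
```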

Notes:

  • In the coredns-autoscaler deployment I don’t see “preventSinglePointFailure” anywhere, but according to this issue it should be present automatically (where? see the ConfigMap sketch after these notes): https://github.com/rancher/rke/issues/1625

  • After reconnecting the node, many (Rancher?) workloads just keep showing “Updating” in the Rancher UI (ingress-nginx-controller, cattle-node-agent, exporter-node-cluster-monitoring, kube-flannel, metrics-server), and many of them show 2/3 pods in the “unavailable” state, BUT when looking at the pod, the container is shown as “running”. Does anyone know what is wrong here or what I did wrong in my deployment?
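Regarding the preventSinglePointFailure note above: if the coredns-autoscaler follows the upstream cluster-proportional-autoscaler convention, the scaling parameters don’t live in the Deployment spec at all but in a ConfigMap that the autoscaler creates with defaults on first run, roughly like this (ConfigMap name and values are my assumption based on the upstream defaults, not verified against rke):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns-autoscaler   # assumed name, matching the autoscaler's --configmap flag
  namespace: kube-system
data:
  linear: |-
    {
      "coresPerReplica": 128,
      "nodesPerReplica": 4,
      "min": 1,
      "preventSinglePointFailure": true
    }
```

If preventSinglePointFailure is true there, the autoscaler should keep at least 2 coredns replicas as soon as the cluster has more than one schedulable node.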

Thanks in advance for your time!