I have a 3-node HA cluster deployed with RKE 0.3.1; all 3 nodes have the etcd, controlplane, and worker roles (cluster.yml: network plugin flannel, dns not set explicitly). "In front" of the cluster I have HAProxy doing TLS offloading. Everything has been working fine so far.
Now I'm doing failover tests by simply disconnecting a single node from the network and watching how workloads are redistributed. I discovered the following problems:
- If I disconnect the node that the coredns pod is running on, other pods can't resolve names anymore. The coredns pod does not get rescheduled; it just stays in state unavailable/"running". I discovered that the coredns deployment has NoExecute and NoSchedule tolerations. Once I remove the NoExecute toleration, the pod gets rescheduled and other pods/workloads can use name resolution inside the cluster again. Now my questions:
- Why are these tolerations present? What is the rationale behind them?
- Can I safely remove them?
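For reference, here is roughly what I'm talking about. This is a sketch of the tolerations block as it appears in a typical coredns Deployment, not a verbatim dump from my cluster; the exact keys and operators may differ (check with `kubectl -n kube-system get deploy coredns -o yaml`):

```yaml
# Excerpt from the coredns Deployment pod spec (illustrative, not exact)
spec:
  template:
    spec:
      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
      # A blanket NoExecute toleration like this also tolerates the
      # node.kubernetes.io/unreachable:NoExecute taint, so the pod is
      # never evicted from a disconnected node:
      - effect: NoExecute
        operator: Exists
      - effect: NoSchedule
        operator: Exists
```

My understanding is that without the NoExecute toleration, the unreachable taint evicts the pod after the default toleration grace period, which matches the behavior I saw after removing it.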
- In the coredns-autoscaler deployment I see no "preventSinglePointFailure" anywhere, but according to this issue it should automagically be present (where?): https://github.com/rancher/rke/issues/1625
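If I understand the cluster-proportional-autoscaler correctly, that setting would live in a ConfigMap rather than in the Deployment spec itself, which may be why I can't find it there. Something along these lines (the ConfigMap name and parameter values here are assumptions on my part; on my cluster I'd verify with `kubectl -n kube-system get cm`):

```yaml
# Hypothetical params ConfigMap read by the coredns-autoscaler
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns-autoscaler    # assumed name, check your cluster
  namespace: kube-system
data:
  # preventSinglePointFailure forces >=2 replicas when there is more than one node
  linear: '{"coresPerReplica":128,"nodesPerReplica":4,"preventSinglePointFailure":true,"min":1}'
```

Is this where RKE is supposed to inject the flag, and should I see such a ConfigMap on a 0.3.1 cluster?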
- After reconnecting the node, many (Rancher?) workloads just keep showing "Updating" in the Rancher UI (ingress-nginx-controller, cattle-node-agent, exporter-node-cluster-monitoring, kube-flannel, metrics-server), and many of them show 2/3 pods in state "unavailable" — BUT when I look at the pods themselves, the containers are shown as "running". Does anyone know what is wrong here, or what I did wrong in my deployment?
Thanks in advance for your time!