HA - UI failure with single etcd node reboot

I’m testing Rancher v2.3.1 HA and having issues when a single node goes down unexpectedly. During an unexpected outage or reboot, the entire cluster seems to stop responding until the node is healthy and communicating with the cluster again. However, all seems well during planned maintenance, such as when I drain/cordon the node first. Is this the expected behavior of the cluster during a single fault, or is it a configuration issue?

I also get this error during the single-node downtime/reboot. In this state I would expect kubectl to return the status of the two operational nodes and the one failed node, rather than failing altogether:
kubectl get nodes
Unable to connect to the server: dial tcp 172.17.141.57:6443: connect: no route to host

I’m using a single load balancer (nginx) and three controlplane/etcd/worker nodes in the cluster, as shown in the kubectl output below.

My deployment matches the recommendations/steps here: https://rancher.com/docs/rancher/v2.x/en/installation/ha/
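
For reference, the load balancer follows the layer-4 (stream) nginx example in those docs; a minimal sketch with my node IPs substituted (the exact options in my file may differ):

# sketch only - based on the layer-4 (stream) nginx example in the Rancher HA docs
worker_processes 4;

events {
    worker_connections 8192;
}

stream {
    upstream rancher_servers_http {
        least_conn;
        server 172.17.141.55:80 max_fails=3 fail_timeout=5s;
        server 172.17.141.56:80 max_fails=3 fail_timeout=5s;
        server 172.17.141.57:80 max_fails=3 fail_timeout=5s;
    }
    server {
        listen 80;
        proxy_pass rancher_servers_http;
    }

    upstream rancher_servers_https {
        least_conn;
        server 172.17.141.55:443 max_fails=3 fail_timeout=5s;
        server 172.17.141.56:443 max_fails=3 fail_timeout=5s;
        server 172.17.141.57:443 max_fails=3 fail_timeout=5s;
    }
    server {
        listen 443;
        proxy_pass rancher_servers_https;
    }
}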

Kubectl output…
kubectl get nodes
NAME STATUS ROLES AGE VERSION
172.17.141.55 Ready controlplane,etcd,worker 30d v1.14.6
172.17.141.56 Ready controlplane,etcd,worker 30d v1.14.6
172.17.141.57 Ready controlplane,etcd,worker 30d v1.14.6

kubectl describe node
Name: 172.17.141.55
Roles: controlplane,etcd,worker
<…>
CreationTimestamp: Sat, 19 Oct 2019 10:33:07 -0500
Taints:
Unschedulable: false
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
MemoryPressure False Mon, 18 Nov 2019 16:00:42 -0600 Wed, 30 Oct 2019 13:22:46 -0500 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Mon, 18 Nov 2019 16:00:42 -0600 Wed, 30 Oct 2019 13:22:46 -0500 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Mon, 18 Nov 2019 16:00:42 -0600 Wed, 30 Oct 2019 13:22:46 -0500 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Mon, 18 Nov 2019 16:00:42 -0600 Wed, 30 Oct 2019 13:22:46 -0500 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 172.17.141.55
Hostname: 172.17.141.55
Capacity:
cpu: 4
ephemeral-storage: 25304868Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 7442188Ki
pods: 110
Allocatable:
cpu: 4
ephemeral-storage: 23320966311
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 7339788Ki
pods: 110
System Info:
Machine ID: 311936e2303b02235b8cb
System UUID: 4F35B903-B671-F9403BC87F36
Boot ID: 5b6cf311-9c5b-7b8e53396778
Kernel Version: 3.10.0-957.27.2.el7.x86_64
OS Image: CentOS Linux 7 (Core)
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://18.9.2
Kubelet Version: v1.14.6
Kube-Proxy Version: v1.14.6
PodCIDR: 10.42.0.0/24
Non-terminated Pods: (11 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
cattle-system cattle-cluster-agent-584654fb76-529bv 0 (0%) 0 (0%) 0 (0%) 0 (0%) 102m
cattle-system cattle-node-agent-l5phk 0 (0%) 0 (0%) 0 (0%) 0 (0%) 30d
cattle-system rancher-8667558444-5ftlt 0 (0%) 0 (0%) 0 (0%) 0 (0%) 100m
cert-manager cert-manager-5b9ff77b7-dfvm8 0 (0%) 0 (0%) 0 (0%) 0 (0%) 98m
cert-manager cert-manager-webhook-cfd6587ff-2j79v 0 (0%) 0 (0%) 0 (0%) 0 (0%) 100m
default hello-world-76b9c5976f-7rwr8 0 (0%) 0 (0%) 0 (0%) 0 (0%) 100m
ingress-nginx nginx-ingress-controller-zt2xd 0 (0%) 0 (0%) 0 (0%) 0 (0%) 30d
kube-system canal-f729d 250m (6%) 0 (0%) 0 (0%) 0 (0%) 30d
kube-system metrics-server-7f6bd4c888-8smnl 0 (0%) 0 (0%) 0 (0%) 0 (0%) 100m
kube-system tiller-deploy-7f4d76c4b6-cmxn8 0 (0%) 0 (0%) 0 (0%) 0 (0%) 100m
streamer-namespace hello-world-585b857466-2xkfn 0 (0%) 0 (0%) 0 (0%) 0 (0%) 100m
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
cpu 250m (6%) 0 (0%)
memory 0 (0%) 0 (0%)
ephemeral-storage 0 (0%) 0 (0%)
Events:

Name: 172.17.141.56
Roles: controlplane,etcd,worker
<…>
CreationTimestamp: Sat, 19 Oct 2019 10:33:15 -0500
Taints:
Unschedulable: false
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
MemoryPressure False Mon, 18 Nov 2019 15:55:04 -0600 Sat, 19 Oct 2019 10:33:15 -0500 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Mon, 18 Nov 2019 15:55:04 -0600 Sat, 19 Oct 2019 10:33:15 -0500 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Mon, 18 Nov 2019 15:55:04 -0600 Sat, 19 Oct 2019 10:33:15 -0500 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Mon, 18 Nov 2019 15:55:04 -0600 Tue, 22 Oct 2019 15:06:33 -0500 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 172.17.141.56
Hostname: 172.17.141.56
Capacity:
cpu: 4
ephemeral-storage: 25304868Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 7442188Ki
pods: 110
Allocatable:
cpu: 4
ephemeral-storage: 23320966311
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 7339788Ki
pods: 110
System Info:
Machine ID: 311936e23ef70182235b8cb
System UUID: 5427A503-9E5B-560219FC0477
Boot ID: 42103394-b8e7-895bb105770b
Kernel Version: 3.10.0-957.27.2.el7.x86_64
OS Image: CentOS Linux 7 (Core)
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://18.9.2
Kubelet Version: v1.14.6
Kube-Proxy Version: v1.14.6
PodCIDR: 10.42.1.0/24
Non-terminated Pods: (12 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
cattle-system cattle-node-agent-w6lk7 0 (0%) 0 (0%) 0 (0%) 0 (0%) 30d
cattle-system rancher-8667558444-m76bl 0 (0%) 0 (0%) 0 (0%) 0 (0%) 98m
cattle-system rancher-8667558444-tdjwt 0 (0%) 0 (0%) 0 (0%) 0 (0%) 98m
cert-manager cert-manager-cainjector-59d69b9b-tht57 0 (0%) 0 (0%) 0 (0%) 0 (0%) 98m
ingress-nginx default-http-backend-5954bd5d8c-sc4z8 10m (0%) 10m (0%) 20Mi (0%) 20Mi (0%) 98m
ingress-nginx nginx-ingress-controller-grf22 0 (0%) 0 (0%) 0 (0%) 0 (0%) 30d
kube-system canal-n6tdg 250m (6%) 0 (0%) 0 (0%) 0 (0%) 30d
kube-system coredns-autoscaler-5d5d49b8ff-n9fbh 20m (0%) 0 (0%) 10Mi (0%) 0 (0%) 98m
kube-system coredns-bdffbc666-9x4vt 100m (2%) 0 (0%) 70Mi (0%) 170Mi (2%) 98m
messages some-rabbit-547c66b49b-glmr7 0 (0%) 0 (0%) 0 (0%) 0 (0%) 98m
registry registry-v2-8c6d9f568-9kfj4 0 (0%) 0 (0%) 0 (0%) 0 (0%) 98m
streamer-namespace streamer-app3-869b64645b-dffgx 0 (0%) 0 (0%) 0 (0%) 0 (0%) 98m
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
cpu 380m (9%) 10m (0%)
memory 100Mi (1%) 190Mi (2%)
ephemeral-storage 0 (0%) 0 (0%)
Events:

Name: 172.17.141.57
Roles: controlplane,etcd,worker
<…>
CreationTimestamp: Sat, 19 Oct 2019 10:33:32 -0500
Taints:
Unschedulable: false
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
MemoryPressure False Mon, 18 Nov 2019 15:55:06 -0600 Sat, 19 Oct 2019 10:33:32 -0500 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Mon, 18 Nov 2019 15:55:06 -0600 Sat, 19 Oct 2019 10:33:32 -0500 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Mon, 18 Nov 2019 15:55:06 -0600 Sat, 19 Oct 2019 10:33:32 -0500 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Mon, 18 Nov 2019 15:55:06 -0600 Sat, 19 Oct 2019 10:34:43 -0500 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 172.17.141.57
Hostname: 172.17.141.57
Capacity:
cpu: 4
ephemeral-storage: 25304868Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 7442192Ki
pods: 110
Allocatable:
cpu: 4
ephemeral-storage: 23320966311
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 7339792Ki
pods: 110
System Info:
Machine ID: 311936e23ef70182235b8cb
System UUID: CC822DD7-8CBA-9657415336DD
Boot ID: 85eb08d7-ae1a-526f15a5777f
Kernel Version: 3.10.0-957.27.2.el7.x86_64
OS Image: CentOS Linux 7 (Core)
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://18.9.2
Kubelet Version: v1.14.6
Kube-Proxy Version: v1.14.6
PodCIDR: 10.42.2.0/24
Non-terminated Pods: (3 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
cattle-system cattle-node-agent-mz2s5 0 (0%) 0 (0%) 0 (0%) 0 (0%) 30d
ingress-nginx nginx-ingress-controller-kxq9l 0 (0%) 0 (0%) 0 (0%) 0 (0%) 30d
kube-system canal-qqx8d 250m (6%) 0 (0%) 0 (0%) 0 (0%) 30d
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
cpu 250m (6%) 0 (0%)
memory 0 (0%) 0 (0%)
ephemeral-storage 0 (0%) 0 (0%)
Events:

Is 172.17.141.57 the host that is down?

Yes, jpeake. In this test scenario, the *.141.57 host was being rebooted. The other two HA nodes with the controlplane/etcd/worker roles are still active and should be responding through kubectl. During this downtime, my kubectl requests for the entire cluster fail, which would be why the UI also stops responding in this scenario.

The cluster details during normal operation are the same ‘kubectl get nodes’ and ‘kubectl describe node’ output posted above. The entire cluster stops responding when a single node faults unexpectedly.


Rancher Forums doesn’t allow me to post the full ‘kubectl describe nodes’ output here. It would be helpful if I could do that so you could see what the cluster configuration and nodes look like.

I am looking at the kubeconfig files that were generated for my Rancher HA installs, and it does appear they point at a single node of the trio. This is in contrast to the kubeconfig generated for RKE-launched clusters, which creates a “cluster” entry for each node.

So for the Rancher HA cluster itself, you can change the “server: https://x.x.x.x:6443” setting to point it at one of the nodes that is up, or create a kubeconfig cluster entry for each node.
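
A minimal sketch of what a per-node kubeconfig could look like. The entry names (rancher-55, node-55, kube-admin) are hypothetical, and the angle-bracket placeholders stand in for the certificate data already present in the generated kube_config_rancher-cluster.yml:

apiVersion: v1
kind: Config
clusters:
- name: rancher-55              # hypothetical names; one cluster entry per node
  cluster:
    server: https://172.17.141.55:6443
    certificate-authority-data: <ca-data-from-generated-config>
- name: rancher-56
  cluster:
    server: https://172.17.141.56:6443
    certificate-authority-data: <ca-data-from-generated-config>
- name: rancher-57
  cluster:
    server: https://172.17.141.57:6443
    certificate-authority-data: <ca-data-from-generated-config>
users:
- name: kube-admin              # reuse the user/credentials from the generated config
  user:
    client-certificate-data: <client-cert-data>
    client-key-data: <client-key-data>
contexts:
- name: node-55                 # one context per node, all sharing the same user
  context:
    cluster: rancher-55
    user: kube-admin
- name: node-56
  context:
    cluster: rancher-56
    user: kube-admin
- name: node-57
  context:
    cluster: rancher-57
    user: kube-admin
current-context: node-55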

Yes, this looks to be the issue. The Rancher HA auto-generated config file ‘kube_config_rancher-cluster.yml’ only had the *.141.57 server node listed. Thus kubectl calls fail when this single server fails. I’ll add all three cluster/server entries to the ~/.kube/config file and retest. Thanks
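
For anyone following along, with entries like those in the sketch above merged into ~/.kube/config, pointing kubectl at a healthy node during an outage would look something like this (context names are the hypothetical ones from the sketch):

# switch kubectl to a node that is still up
kubectl config use-context node-55
kubectl get nodes

# or target another node for a single command without switching contexts
kubectl --context node-56 get nodes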