How to remove a broken etcd node?

I'm currently having a problem with a cluster node that was restored from a file backup (actually, the whole cluster was restored from a file/VM backup).

Now node3 is unable to start etcd: the container restarts every 60 seconds.

$ kubectl get cs
NAME                 STATUS      MESSAGE                                                                                             ERROR
etcd-2               Unhealthy   Get https://192.168.253.14:2379/health: dial tcp 192.168.253.14:2379: connect: connection refused
scheduler            Healthy     ok
controller-manager   Healthy     ok
etcd-0               Healthy     {"health": "true"}
etcd-1               Healthy     {"health": "true"}

As you can guess, 192.168.253.14 is node3.

I already used rke remove and re-added this node to the cluster with rke up, following the documentation (Adding and Removing Nodes | RKE1), roughly as sketched below. But still: etcd just won’t come up.
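For reference, the documented way to remove a single node boils down to taking it out of cluster.yml and running rke up again, then re-adding it the same way. This is roughly what I did (the .12/.13 addresses below are placeholders for node1/node2; only .14 is the real node3):

# cluster.yml, with node3 commented out to remove it
nodes:
  - address: 192.168.253.12    # placeholder address for node1
    role: [controlplane, worker, etcd]
  - address: 192.168.253.13    # placeholder address for node2
    role: [controlplane, worker, etcd]
#  - address: 192.168.253.14   # node3, the broken member
#    role: [controlplane, worker, etcd]

$ rke up --config cluster.yml

Afterwards I uncommented node3 and ran rke up again to re-add it.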

The container logs look clean right up to the last line of each restart cycle, where etcd dies with a critical (C) error:

2020-06-22 15:09:36.201889 I | etcdmain: etcd Version: 3.2.24
2020-06-22 15:09:36.201955 I | etcdmain: Git SHA: 420a45226
2020-06-22 15:09:36.201959 I | etcdmain: Go Version: go1.8.7
2020-06-22 15:09:36.201962 I | etcdmain: Go OS/Arch: linux/amd64
2020-06-22 15:09:36.201967 I | etcdmain: setting maximum number of CPUs to 2, total number of available CPUs is 2
2020-06-22 15:09:36.202397 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2020-06-22 15:09:36.204186 I | embed: peerTLS: cert = /etc/kubernetes/ssl/kube-etcd-192-168-253-14.pem, key = /etc/kubernetes/ssl/kube-etcd-192-168-253-14-key.pem, ca = , trusted-ca = /etc/kubernetes/ssl/kube-ca.pem, client-cert-auth = true
2020-06-22 15:09:36.204920 I | embed: listening for peers on https://0.0.0.0:2380
2020-06-22 15:09:36.204976 I | embed: listening for client requests on 0.0.0.0:2379
2020-06-22 15:09:36.222554 C | etcdmain: member abe2877743032652 has already been bootstrapped
[the same lines repeat on every restart, roughly every 60 seconds]

In the etcd troubleshooting guide (https://rancher.com/docs/rancher/v2.5/en/troubleshooting/kubernetes-components/etcd/), there is this final note on the page:

When a node in your etcd cluster becomes unhealthy, the recommended approach is to fix or remove the failed or unhealthy node before adding a new etcd node to the cluster.

But how? rke remove still keeps that particular node in the cluster (it still shows up in kubectl and in the Rancher UI). Is there an etcd-internal command to run to remove the failed member?
Any other ideas on how to “fix” this constantly restarting etcd container on node3?
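From the etcd side, I would expect it to be something along these lines, run from one of the healthy members (assuming the container RKE creates is really named etcd and has etcdctl configured, and using the member ID from my logs above), but I don’t know whether RKE tolerates this being done by hand:

# run on one of the healthy etcd nodes; "etcd" is the RKE container name
$ docker exec etcd etcdctl member list
$ docker exec etcd etcdctl member remove abe2877743032652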


We are having a similar issue. Any suggestions? Thanks.

We are having the same issue. Any resolution?

We have the same problem: a hardware failure on one of the etcd master nodes, and we cannot recover. Does this mean that if you are using RKE you are not protected against hardware failures of a master node? kops and kubeadm have methods to remove dead etcd/master nodes, but RKE doesn’t seem to have one.

If you use the rke CLI, you should be able to comment out the ‘bad’ node in cluster.yml and rerun rke up. Make sure to either clean up or power off the offending node afterwards; see the sketch below.
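The “member abe2877743032652 has already been bootstrapped” fatal in the logs above is etcd refusing to start because /var/lib/etcd on node3 still contains member state from before the restore, so cleaning the node is the important part. A minimal cleanup sketch for the removed node (assuming a dedicated, Docker-based RKE node; the directory list follows Rancher’s node cleanup docs, trim it to what actually exists on your host):

# on the offending node, after removing it from cluster.yml and running rke up
$ docker rm -f $(docker ps -qa)    # removes all containers; fine on a dedicated RKE node
$ sudo rm -rf /etc/kubernetes /etc/cni /opt/rke \
    /var/lib/etcd /var/lib/cni /var/lib/kubelet /var/lib/rancher

Once the node is clean, add it back to cluster.yml and run rke up once more.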