How to remove a broken etcd node?

Napsty · June 22, 2020, 3:15pm

I'm currently having a problem with a cluster node that was restored from a file backup (basically, the whole cluster was restored from a file/VM backup).

Now node3 is unable to start etcd: the container restarts every 60 seconds.

$ kubectl get cs
NAME                 STATUS      MESSAGE                                                                                             ERROR
etcd-2               Unhealthy   Get https://192.168.253.14:2379/health: dial tcp 192.168.253.14:2379: connect: connection refused
scheduler            Healthy     ok
controller-manager   Healthy     ok
etcd-0               Healthy     {"health": "true"}
etcd-1               Healthy     {"health": "true"}

As you can guess, 192.168.253.14 is node3.

I already used rke remove and added this node to the cluster again using rke up (following the "Adding and Removing Nodes" page of the RKE1 documentation). But still: etcd just won’t come up.

From the container logs, I can’t seem to find an error though:

2020-06-22 15:09:36.201889 I | etcdmain: etcd Version: 3.2.24
2020-06-22 15:09:36.201955 I | etcdmain: Git SHA: 420a45226
2020-06-22 15:09:36.201959 I | etcdmain: Go Version: go1.8.7
2020-06-22 15:09:36.201962 I | etcdmain: Go OS/Arch: linux/amd64
2020-06-22 15:09:36.201967 I | etcdmain: setting maximum number of CPUs to 2, total number of available CPUs is 2
2020-06-22 15:09:36.202397 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2020-06-22 15:09:36.204186 I | embed: peerTLS: cert = /etc/kubernetes/ssl/kube-etcd-192-168-253-14.pem, key = /etc/kubernetes/ssl/kube-etcd-192-168-253-14-key.pem, ca = , trusted-ca = /etc/kubernetes/ssl/kube-ca.pem, client-cert-auth = true
2020-06-22 15:09:36.204920 I | embed: listening for peers on https://0.0.0.0:2380
2020-06-22 15:09:36.204976 I | embed: listening for client requests on 0.0.0.0:2379
2020-06-22 15:09:36.222554 C | etcdmain: member abe2877743032652 has already been bootstrapped
2020-06-22 15:10:36.757726 I | etcdmain: etcd Version: 3.2.24
2020-06-22 15:10:36.757797 I | etcdmain: Git SHA: 420a45226
2020-06-22 15:10:36.757801 I | etcdmain: Go Version: go1.8.7
2020-06-22 15:10:36.757805 I | etcdmain: Go OS/Arch: linux/amd64
2020-06-22 15:10:36.757814 I | etcdmain: setting maximum number of CPUs to 2, total number of available CPUs is 2
2020-06-22 15:10:36.757865 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2020-06-22 15:10:36.757888 I | embed: peerTLS: cert = /etc/kubernetes/ssl/kube-etcd-192-168-253-14.pem, key = /etc/kubernetes/ssl/kube-etcd-192-168-253-14-key.pem, ca = , trusted-ca = /etc/kubernetes/ssl/kube-ca.pem, client-cert-auth = true
2020-06-22 15:10:36.758650 I | embed: listening for peers on https://0.0.0.0:2380
2020-06-22 15:10:36.758695 I | embed: listening for client requests on 0.0.0.0:2379
2020-06-22 15:10:36.772650 C | etcdmain: member abe2877743032652 has already been bootstrapped
2020-06-22 15:11:37.292363 I | etcdmain: etcd Version: 3.2.24
2020-06-22 15:11:37.292424 I | etcdmain: Git SHA: 420a45226
2020-06-22 15:11:37.292429 I | etcdmain: Go Version: go1.8.7
2020-06-22 15:11:37.292434 I | etcdmain: Go OS/Arch: linux/amd64
2020-06-22 15:11:37.292438 I | etcdmain: setting maximum number of CPUs to 2, total number of available CPUs is 2
2020-06-22 15:11:37.292621 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2020-06-22 15:11:37.292689 I | embed: peerTLS: cert = /etc/kubernetes/ssl/kube-etcd-192-168-253-14.pem, key = /etc/kubernetes/ssl/kube-etcd-192-168-253-14-key.pem, ca = , trusted-ca = /etc/kubernetes/ssl/kube-ca.pem, client-cert-auth = true
2020-06-22 15:11:37.293395 I | embed: listening for peers on https://0.0.0.0:2380
2020-06-22 15:11:37.293457 I | embed: listening for client requests on 0.0.0.0:2379
2020-06-22 15:11:37.310058 C | etcdmain: member abe2877743032652 has already been bootstrapped

In the etcd troubleshooting guide (https://rancher.com/docs/rancher/v2.5/en/troubleshooting/kubernetes-components/etcd/), there is this final note on the page:

When a node in your etcd cluster becomes unhealthy, the recommended approach is to fix or remove the failed or unhealthy node before adding a new etcd node to the cluster.

But how? rke remove still keeps that particular node in the cluster (it still shows up in kubectl and in Rancher UI). Is there a particular etcd-internal command to execute to remove the failed node?
Any other ideas on how to “fix” this constantly restarting etcd container on node3?

Paul1 · July 13, 2021, 4:41pm

We are having a similar issue, any suggestions? Thanks.

RAll · November 22, 2021, 7:26pm

We are having the same issue. Is there any resolution?

MrAmbiG · June 16, 2022, 1:04am

We have the same problem. We had a hardware failure on one of the etcd master nodes and cannot recover. Does this mean that if you are using the RKE Kubernetes distribution, you are not protected against hardware failures of a master node? kops and kubeadm have methods to remove dead etcd/master nodes, but RKE doesn’t seem to have one.

aemneina · June 16, 2022, 2:55am

If you use the rke CLI, you should be able to comment out the ‘bad’ node in the cluster config file (cluster.yml by default) and rerun rke up. Make sure to either clean up or power off the offending node afterwards.
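
For anyone who wants to script the "comment out the node" step, here is a minimal sketch. It is not an official RKE tool: `comment_out_node` is a hypothetical helper name, and it assumes the node entry in the config starts with an unquoted `- address: <ip>` line, with `address` as the first key of the list item. It only prints the modified config, so nothing is changed until you review the result yourself.

```shell
# Hypothetical helper (not part of rke): print an RKE cluster config with the
# node entry for one address commented out.
# Assumes the entry starts with an unquoted "- address: <ip>" line.
comment_out_node() {
  # $1 = node address, $2 = path to the cluster config file
  awk -v addr="$1" '
    # Measure the indentation (leading spaces) of the current line.
    { match($0, /^ */); indent = RLENGTH }
    # A non-blank line at the same or lower indentation ends the node entry.
    in_node && NF > 0 && indent <= node_indent { in_node = 0 }
    # The entry to disable starts with "- address: <addr>".
    !in_node && $1 == "-" && $2 == "address:" && $3 == addr {
      in_node = 1
      node_indent = indent
    }
    # Comment out every line that belongs to the matched entry.
    in_node { print "#" $0; next }
    { print }
  ' "$2"
}
```

With the address from the original post, usage could look like `comment_out_node 192.168.253.14 cluster.yml > cluster.yml.new`. After checking the diff and replacing cluster.yml with the new file, rerun `rke up --config cluster.yml` as suggested above.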
