I'm currently having a problem with a cluster node that was restored from a file backup (basically the whole cluster was restored from a file/VM backup).
Now node3 is unable to start etcd; the container restarts every 60 seconds.
$ kubectl get cs
NAME                 STATUS      MESSAGE                                                                                              ERROR
etcd-2               Unhealthy   Get https://192.168.253.14:2379/health: dial tcp 192.168.253.14:2379: connect: connection refused
scheduler            Healthy     ok
controller-manager   Healthy     ok
etcd-0               Healthy     {"health": "true"}
etcd-1               Healthy     {"health": "true"}
As you can guess, 192.168.253.14 is node3.
I already used rke remove and added this node again to the cluster using rke up (following the documentation "Adding and Removing Nodes | RKE1"). But still: etcd just won't come up.
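For reference, the re-add was essentially just putting node3 back into cluster.yml and re-running RKE, roughly like this (the node entry is simplified here; SSH user, keys and node3's exact roles are omitted):

# excerpt from cluster.yml (simplified)
nodes:
  - address: 192.168.253.14   # node3
    role: [etcd]

$ rke up --config cluster.yml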
From the container logs, I can't seem to find an obvious error; the only line that stands out is the last one of each restart cycle:
2020-06-22 15:09:36.201889 I | etcdmain: etcd Version: 3.2.24
2020-06-22 15:09:36.201955 I | etcdmain: Git SHA: 420a45226
2020-06-22 15:09:36.201959 I | etcdmain: Go Version: go1.8.7
2020-06-22 15:09:36.201962 I | etcdmain: Go OS/Arch: linux/amd64
2020-06-22 15:09:36.201967 I | etcdmain: setting maximum number of CPUs to 2, total number of available CPUs is 2
2020-06-22 15:09:36.202397 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2020-06-22 15:09:36.204186 I | embed: peerTLS: cert = /etc/kubernetes/ssl/kube-etcd-192-168-253-14.pem, key = /etc/kubernetes/ssl/kube-etcd-192-168-253-14-key.pem, ca = , trusted-ca = /etc/kubernetes/ssl/kube-ca.pem, client-cert-auth = true
2020-06-22 15:09:36.204920 I | embed: listening for peers on https://0.0.0.0:2380
2020-06-22 15:09:36.204976 I | embed: listening for client requests on 0.0.0.0:2379
2020-06-22 15:09:36.222554 C | etcdmain: member abe2877743032652 has already been bootstrapped
(the same lines repeat on every restart, roughly once per minute)
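That last line ("member abe2877743032652 has already been bootstrapped") is logged at C (critical) level right before the container exits, so I assume it is what kills etcd on every restart. This is roughly how I have been poking around on node3 so far (the data dir path is the RKE default on the host, so it may differ in other setups):

$ docker ps -a --filter name=etcd   # the etcd container keeps restarting
$ docker logs --tail 20 etcd        # same lines as above on every cycle
$ ls -l /var/lib/etcd/member        # check whether the old data dir survived the remove/re-add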
In the etcd troubleshooting guide (https://rancher.com/docs/rancher/v2.5/en/troubleshooting/kubernetes-components/etcd/), there is this final note on the page:
When a node in your etcd cluster becomes unhealthy, the recommended approach is to fix or remove the failed or unhealthy node before adding a new etcd node to the cluster.
But how? rke remove still keeps that particular node in the cluster (it still shows up in kubectl and in the Rancher UI). Is there a particular etcd-internal command to execute to remove the failed node?
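If I'm reading the etcd docs correctly, member management itself would be something along these lines, run against one of the healthy etcd nodes (the member ID being the one from the logs above, and assuming the etcdctl environment inside the RKE etcd container already points at the right endpoints and certs, as the troubleshooting guide suggests):

$ docker exec etcd etcdctl member list                      # list current members and their IDs
$ docker exec etcd etcdctl member remove abe2877743032652   # drop the failed node3 member

I haven't actually run the remove yet, though, because I'm not sure whether RKE expects to manage the member list itself.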
Any other ideas on how to "fix" this constantly restarting etcd container on node3?