Etcd not running on one node

I have 3 nodes set up to run RKE. I am getting the alert: Component etcd-0 is unhealthy

I had a look at the “Troubleshooting etcd Nodes” documentation, and sure enough, the first check indicates that one of the etcd nodes is not running.

What can I do with that information?

I can’t perform the next step, which is to check for logs, because there is no etcd container.

What can I check to see why etcd is not running?

Docker allows you to view logs for a stopped or exited container. Run docker ps -a | grep etcd to find the stopped etcd container, and then look at its logs:
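A minimal sketch (on an RKE node the container is normally just named etcd; substitute the ID from docker ps -a if yours differs):

docker ps -a | grep etcd        # list all containers, including exited ones
docker logs --tail 200 etcd     # show the last lines of logs from the stopped container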

Also, for crashed containers, you may be able to find old logs under /var/lib/docker/containers/<container id>/<container id>-json.log. Look for similar files named *-json.log-1, *-json.log-2, etc.
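For example, to see which of those log files still exist across all containers (assuming Docker's default data root of /var/lib/docker):

ls -l /var/lib/docker/containers/*/*-json.log*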

It turns out all the containers have exited with no obvious cause, and there is no etcd container at all.

Thinking there must have been a setup problem with this node, I am trying to run “rke up” to set it up again, but it fails to SSH to the problem node, ranchm01, which is also the node I am running rke up from:

[root@ranchm01 ~]# rke up
INFO[0000] Running RKE version: v1.1.2
INFO[0000] Initiating Kubernetes cluster
INFO[0000] [certificates] GenerateServingCertificate is disabled, checking if there are unused kubelet certificates
INFO[0000] [certificates] Generating admin certificates and kubeconfig
INFO[0000] Successfully Deployed state file at [./cluster.rkestate]
INFO[0000] Building Kubernetes cluster
INFO[0000] [dialer] Setup tunnel for host [ranchm03]
INFO[0000] [dialer] Setup tunnel for host [ranchm02]
INFO[0000] [dialer] Setup tunnel for host [ranchm01]
WARN[0000] Failed to set up SSH tunneling for host [ranchm01]: Can't retrieve Docker Info: error during connect: Get http://%2Fvar%2Frun%2Fdocker.sock/v1.24/info: Unable to access node with address [ranchm01:22] using SSH. Please check if the configured key or specified key file is a valid SSH Private Key. Error: Error configuring SSH: ssh: no key found
WARN[0000] Removing host [ranchm01] from node lists

I have confirmed that a passwordless login from ranchm01 to ranchm01 as the rancher user works:
[root@ranchm01 ~]# ssh -i /home/rancher/.ssh/id_rsa rancher@ranchm01 docker ps

The config for all nodes is identical:

nodes:
- address: ranchm01
  port: "22"
  internal_address: ""
  role:
  - controlplane
  - worker
  - etcd
  hostname_override: ""
  user: rancher
  docker_socket: /var/run/docker.sock
  ssh_key: /home/rancher/.ssh/id_rsa
  ssh_key_path: ""
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
  taints: []
- address: ranchm02
  port: "22"
  internal_address: ""
  role:
  - controlplane
  - worker
  - etcd
  hostname_override: ""
  user: rancher
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: /home/rancher/.ssh/id_rsa
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
  taints: []
- address: ranchm03
  port: "22"
  internal_address: ""
  role:
  - controlplane
  - worker
  - etcd
  hostname_override: ""
  user: rancher
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: /home/rancher/.ssh/id_rsa
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
  taints: []

What could be causing this ssh failure?
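Given the "ssh: no key found" part of the error, one quick check is whether the file being handed to RKE actually parses as a private key at all (a generic OpenSSH check, not RKE-specific):

ssh-keygen -y -f /home/rancher/.ssh/id_rsa    # prints the matching public key only if the file is a valid private key

If that succeeds, the key itself is fine and the problem is more likely in how cluster.yml references it.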

I realised my mistake: the ranchm01 entry sets ssh_key (which takes the key contents themselves) instead of ssh_key_path (which takes the path to the key file), so RKE could not parse a private key for that host.
It is up and running now.
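For reference, the corrected ranchm01 entry looks roughly like this (only the ssh_key / ssh_key_path fields change, matching the other two nodes):

- address: ranchm01
  port: "22"
  internal_address: ""
  role:
  - controlplane
  - worker
  - etcd
  hostname_override: ""
  user: rancher
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: /home/rancher/.ssh/id_rsa
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
  taints: []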