Etcd not running on one node

I have 3 nodes set up to run RKE. I am getting the alert: Component etcd-0 is unhealthy

I had a look at the “Troubleshooting etcd Nodes” documentation, and sure enough, the first check indicates that one of the etcd nodes is not running.

What can I do with that information?

I can’t perform the next step, which is to check for logs, because there is no etcd container.

What can I check to see why etcd is not running?

Docker allows you to view logs for a stopped or exited container. Run docker ps -a | grep etcd to find the stopped etcd container, and then look at its logs:
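A minimal sketch (on an RKE node the container is normally just named etcd; substitute the ID from docker ps -a if yours differs):

docker ps -a | grep etcd        # list all containers, including exited ones
docker logs --tail 200 etcd     # show the last lines of logs from the stopped container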

Also, for crashed containers, you may be able to find old logs under /var/lib/docker/containers/<container id>/<container id>-json.log. Look for similar files named *-json.log-1, *-json.log-2, etc.
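For example, to see which of those log files still exist across all containers (assuming Docker's default data root of /var/lib/docker):

ls -l /var/lib/docker/containers/*/*-json.log*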

It turns out all the containers have exited with no obvious cause, and there is no etcd container at all.

Thinking there must have been a setup problem with this node, I am trying to run “rke up” to set it up again, but it fails to SSH to the problem node, ranchm01, which is also the node I am running rke up from:

[root@ranchm01 ~]# rke up
INFO[0000] Running RKE version: v1.1.2
INFO[0000] Initiating Kubernetes cluster
INFO[0000] [certificates] GenerateServingCertificate is disabled, checking if there are unused kubelet certificates
INFO[0000] [certificates] Generating admin certificates and kubeconfig
INFO[0000] Successfully Deployed state file at [./cluster.rkestate]
INFO[0000] Building Kubernetes cluster
INFO[0000] [dialer] Setup tunnel for host [ranchm03]
INFO[0000] [dialer] Setup tunnel for host [ranchm02]
INFO[0000] [dialer] Setup tunnel for host [ranchm01]
WARN[0000] Failed to set up SSH tunneling for host [ranchm01]: Can't retrieve Docker Info: error during connect: Get http://%2Fvar%2Frun%2Fdocker.sock/v1.24/info: Unable to access node with address [ranchm01:22] using SSH. Please check if the configured key or specified key file is a valid SSH Private Key. Error: Error configuring SSH: ssh: no key found
WARN[0000] Removing host [ranchm01] from node lists

I have confirmed that a passwordless login from ranchm01 to ranchm01 as the rancher user works:
[root@ranchm01 ~]# ssh -i /home/rancher/.ssh/id_rsa rancher@ranchm01 docker ps

The config for all nodes is identical:

nodes:
- address: ranchm01
  port: "22"
  internal_address: ""
  role:
  - controlplane
  - worker
  - etcd
  hostname_override: ""
  user: rancher
  docker_socket: /var/run/docker.sock
  ssh_key: /home/rancher/.ssh/id_rsa
  ssh_key_path: ""
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
  taints: []
- address: ranchm02
  port: "22"
  internal_address: ""
  role:
  - controlplane
  - worker
  - etcd
  hostname_override: ""
  user: rancher
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: /home/rancher/.ssh/id_rsa
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
  taints: []
- address: ranchm03
  port: "22"
  internal_address: ""
  role:
  - controlplane
  - worker
  - etcd
  hostname_override: ""
  user: rancher
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: /home/rancher/.ssh/id_rsa
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
  taints: []

What could be causing this ssh failure?
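Given the "ssh: no key found" part of the error, one quick check is whether the file being handed to RKE actually parses as a private key at all (a generic OpenSSH check, not RKE-specific):

ssh-keygen -y -f /home/rancher/.ssh/id_rsa    # prints the matching public key only if the file is a valid private key

If that succeeds, the key itself is fine and the problem is more likely in how cluster.yml references it.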

I realised my mistake: the ranchm01 entry sets ssh_key (which takes the key contents themselves) instead of ssh_key_path (which takes the path to the key file), so RKE could not parse a private key for that host.
It is up and running now.
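For reference, the corrected ranchm01 entry looks roughly like this (only the ssh_key / ssh_key_path fields change, matching the other two nodes):

- address: ranchm01
  port: "22"
  internal_address: ""
  role:
  - controlplane
  - worker
  - etcd
  hostname_override: ""
  user: rancher
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: /home/rancher/.ssh/id_rsa
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
  taints: []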