Rancher Fails Provisioning - Etcd Plane Unhealthy - Cert Signed by Unknown Authority

jolive · February 10, 2022, 11:41pm

I’ve been attempting to run Rancher in a single-node Docker deployment. Everything seems fine until I create the first cluster; at that point, the UI remains in the “provisioning” state with the following error: “[etcd] Failed to bring up Etcd Plane: etcd cluster is unhealthy: hosts [192.168.2.254] failed to report healthy…”.

I have done some research on this issue, which is commonly caused when a node is re-used and the certificates are not properly cleaned up; however, for me this occurs on a fresh install of the OS, including deleting and repartitioning the file systems. I assume there is something else in my environment that is causing this, but I’ve been at this on and off for a month with the exact same results. Every attempt to create a cluster has failed.
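
To rule out the re-used node theory, a rough check like the one below can list state left behind by an earlier install. This is only a sketch: `leftover_state` is a hypothetical helper name, and the directory list is an assumption based on commonly documented Kubernetes/RKE cleanup paths, not a complete inventory.

```shell
# Sketch: list state directories left over from an earlier Kubernetes/RKE install.
# Assumption: the paths below are the usual suspects; the list is not exhaustive.
leftover_state() {
  root="${1:-}"   # optional prefix, e.g. a mounted image or a test tree
  for d in /etc/kubernetes /etc/cni /opt/cni /var/lib/etcd /var/lib/cni \
           /var/lib/kubelet /var/lib/rancher; do
    if [ -e "${root}${d}" ]; then
      echo "$d"   # print the path without the prefix
    fi
  done
}
```

Running `leftover_state` with no argument before starting the agent should print nothing on a truly clean host; any path it prints is a candidate for stale certificates or etcd data.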

Etcd container is logging:

2022-02-10 17:54:49.448020 I | embed: rejected connection from "192.168.2.254:37626" (error "tls: failed to verify client's certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"kube-ca\")", ServerName "")

Rancher server is logging:

2022/02/10 17:56:14 [WARNING] [etcd] host [192.168.2.254] failed to check etcd health: failed to get /health for host [192.168.2.254]: Get https://192.168.2.254:2379/health: remote error: tls: bad certificate
2022/02/10 17:56:14 [ERROR] cluster [c-t45b7] provisioning: [etcd] Failed to bring up Etcd Plane: etcd cluster is unhealthy: hosts [192.168.2.254] failed to report healthy. Check etcd container logs on each host for more information

Environment:

Hardware: Intel i7-6700K, 4 CPU (8 VCPU), 32 GB RAM, 2 TB HDD, 220 GB SSD (bare metal).
CentOS Linux release 7.9.2009, minimal install, NTP (chronyd), UTC timezone
OS tuning: firewalld disabled, SELinux disabled, swap disabled, br_netfilter loaded, net.bridge.bridge-nf-call-iptables=1
Host name: pc-mpi00482 (no domain, no DNS entry; I did try a DNS server in a previous attempt, but it did not help)
Docker 20.10.7 installed via https://releases.rancher.com/install-docker/20.10.sh (with current user added to the docker group)
Rancher: v2.6.3, installed via docker run -d --restart=unless-stopped -p 80:80 -p 443:443 --privileged rancher/rancher:v2.6.3
Kubernetes v1.21.9-rancher-1-1 cluster, created via the Rancher UI…

    docker run -d --privileged --restart=unless-stopped --net=host \
    -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run  \
    rancher/rancher-agent:v2.6.3 --server https://192.168.2.254 \
    --token svrmr76bx5lwdkm6654gsdsjqngq6mcbsq7nnsvqz9c85sp59kzlmt \
    --ca-checksum a4f6526a9dc51f94ace3217b3c379ca1de12462d06693df4cac22108d7c00766 \
    --etcd --controlplane --worker
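
As a side check on the `--ca-checksum` value above, the comparison could be sketched as follows. It assumes the checksum is the SHA-256 of the CA PEM exactly as Rancher serves it, including any trailing newline; whether a copy on disk is byte-identical to what was hashed is also an assumption. `ca_checksum_matches` is a hypothetical helper name.

```shell
# Sketch: compare a file's SHA-256 with the value passed to --ca-checksum.
# Assumption: the checksum covers the CA PEM byte for byte.
ca_checksum_matches() {
  file="$1"       # path to the CA certificate in PEM form
  expected="$2"   # hex digest given to the agent
  actual=$(sha256sum "$file" | awk '{print $1}')
  [ "$actual" = "$expected" ]   # exit status 0 only when the digests agree
}
```

For example, `ca_checksum_matches /etc/kubernetes/ssl/certs/serverca <checksum>` returns success only if the digests agree. Since the agent container started and stayed up, a mismatch here seems unlikely to be the root cause.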

All certificates are…

/etc/kubernetes/ssl/:

  -rw-------. 1 root root 1675 Feb 10 17:50 kube-apiserver-key.pem
  -rw-------. 1 root root 1306 Feb 10 17:50 kube-apiserver.pem
  -rw-------. 1 root root 1675 Feb 10 17:50 kube-apiserver-proxy-client-key.pem
  -rw-------. 1 root root 1151 Feb 10 17:50 kube-apiserver-proxy-client.pem
  -rw-------. 1 root root 1675 Feb 10 17:50 kube-apiserver-requestheader-ca-key.pem
  -rw-------. 1 root root 1123 Feb 10 17:50 kube-apiserver-requestheader-ca.pem
  -rw-------. 1 root root 1679 Feb 10 17:50 kube-ca-key.pem
  -rw-------. 1 root root 1058 Feb 10 17:50 kube-ca.pem
  -rw-------. 1 root root  517 Feb 10 17:50 kubecfg-kube-apiserver-proxy-client.yaml
  -rw-------. 1 root root  533 Feb 10 17:50 kubecfg-kube-apiserver-requestheader-ca.yaml
  -rw-------. 1 root root  501 Feb 10 17:50 kubecfg-kube-controller-manager.yaml
  -rw-------. 1 root root  445 Feb 10 17:50 kubecfg-kube-node.yaml
  -rw-------. 1 root root  449 Feb 10 17:50 kubecfg-kube-proxy.yaml
  -rw-------. 1 root root  465 Feb 10 17:50 kubecfg-kube-scheduler.yaml
  -rw-------. 1 root root 1675 Feb 10 17:50 kube-controller-manager-key.pem
  -rw-------. 1 root root 1107 Feb 10 17:50 kube-controller-manager.pem
  -rw-------. 1 root root 1679 Feb 10 17:50 kube-etcd-192-168-2-254-key.pem
  -rw-------. 1 root root 1298 Feb 10 17:50 kube-etcd-192-168-2-254.pem
  -rw-------. 1 root root 1675 Feb 10 17:50 kube-node-key.pem
  -rw-------. 1 root root 1115 Feb 10 17:50 kube-node.pem
  -rw-------. 1 root root 1675 Feb 10 17:50 kube-proxy-key.pem
  -rw-------. 1 root root 1090 Feb 10 17:50 kube-proxy.pem
  -rw-------. 1 root root 1675 Feb 10 17:50 kube-scheduler-key.pem
  -rw-------. 1 root root 1094 Feb 10 17:50 kube-scheduler.pem
  -rw-------. 1 root root 1675 Feb 10 17:50 kube-service-account-token-key.pem
  -rw-------. 1 root root 1277 Feb 10 17:50 kube-service-account-token.pem

/etc/kubernetes/ssl/certs:

  -rw-------. 1 root root  635 Feb 10 17:49 serverca
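
The "crypto/rsa: verification error" for candidate authority "kube-ca" in the etcd log suggests that a CA with the right name was found but did not produce the signature, which would fit two different generations of kube-ca being in play. A sketch for checking the certificates listed above against the CA on disk, assuming the `openssl` CLI is installed (`check_certs` is a hypothetical helper name):

```shell
# Sketch: report which PEM certificates in a directory verify against a CA.
# Success is detected from the ": OK" line printed by `openssl verify`
# rather than from its exit status.
check_certs() {
  ca="$1"    # CA certificate, e.g. /etc/kubernetes/ssl/kube-ca.pem
  dir="$2"   # directory holding the issued certificates
  failed=0
  for cert in "$dir"/*.pem; do
    [ -f "$cert" ] || continue        # glob matched nothing
    case "$cert" in
      *-key.pem) continue ;;          # private keys are not certificates
    esac
    if openssl verify -CAfile "$ca" "$cert" 2>/dev/null | grep -q ': OK$'; then
      echo "OK    $cert"
    else
      echo "FAIL  $cert"
      failed=1
    fi
  done
  return "$failed"
}
```

Run as root (the files are mode 0600), for example `check_certs /etc/kubernetes/ssl/kube-ca.pem /etc/kubernetes/ssl`. The request-header CA and the proxy-client certificate are likely to show FAIL here because they appear to chain to a separate CA; a FAIL on `kube-etcd-192-168-2-254.pem` or `kube-node.pem` would be the interesting result.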

CONTAINER ID   IMAGE                                           COMMAND                  CREATED       STATUS          PORTS                                                                      NAMES
88cb87e634f1   rancher/mirrored-coreos-etcd:v3.4.16-rancher1   "/usr/local/bin/etcd…"   5 hours ago   Up 52 minutes                                                                              etcd
359d32e78ce7   rancher/rancher-agent:v2.6.3                    "run.sh --server htt…"   5 hours ago   Up 5 hours                                                                                 upbeat_chandrasekhar
8416ce813921   rancher/rancher:v2.6.3                          "entrypoint.sh"          6 hours ago   Up 5 hours      0.0.0.0:80->80/tcp, :::80->80/tcp, 0.0.0.0:443->443/tcp, :::443->443/tcp   happy_ptolemy

Note: etcd is not running in privileged mode, while the Rancher server and agent are.

Also, running https://raw.githubusercontent.com/docker/docker/master/contrib/check-config.sh identified the following issues, but it’s not clear to me if they are significant:

warning: /proc/config.gz does not exist, searching other paths for kernel config …
(RHEL7/CentOS7: User namespaces disabled; add 'user_namespace.enable=1' to boot command line)
CONFIG_RESOURCE_COUNTERS: missing
CONFIG_SECURITY_APPARMOR: missing
CONFIG_EXT3_FS: missing
CONFIG_EXT3_FS_XATTR: missing
CONFIG_EXT3_FS_POSIX_ACL: missing
CONFIG_EXT3_FS_SECURITY: missing
CONFIG_IPVLAN: missing
CONFIG_AUFS_FS: missing
/dev/zfs: missing
zfs command: missing
zpool command: missing

Any help is appreciated.

Thanks.
