etcd deployment issues on Docker

I am trying out Kubernetes and Rancher for the first time and launched them via Docker on AlmaLinux 8.4 (a CentOS variant). I can get everything else to work just fine, but the etcd node is having some issues. This is from the logs on the etcd Docker node. How do I resolve this? I've been doing some research/googling but I'm not sure I'm getting the right answers.

Thanks!!!

2021-06-28 14:45:32.863447 I | embed: rejected connection from "10.150.10.227:36827" (error "EOF", ServerName "")
2021-06-28 14:46:59.978561 I | embed: rejected connection from "10.150.10.229:33302" (error "tls: failed to verify client's certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"kube-ca\")", ServerName "")
2021-06-28 14:47:04.995763 I | embed: rejected connection from "10.150.10.229:33304" (error "tls: failed to verify client's certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"kube-ca\")", ServerName "")
2021-06-28 14:47:10.014932 I | embed: rejected connection from "10.150.10.229:33306" (error "tls: failed to verify client's certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"kube-ca\")", ServerName "")
[the same tls error repeats every ~5 seconds from 10.150.10.229, ports 33308 through 33336, up to 2021-06-28 14:48:25]

Usually happens when you re-use the node without cleaning it properly (use rke remove or remove the directories listed on Rancher Docs: Removing Kubernetes Components from Nodes), or when you have run rke up without the cluster.rkestate file being present in the same directory.
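For reference, a manual clean along the lines of that docs page looks roughly like this. This is a hedged sketch: the directory list is taken from the Rancher node-cleanup docs and should be double-checked against your Rancher/RKE version, and the `DRY_RUN` guard is my addition so the script prints instead of deletes by default.

```shell
#!/bin/sh
# Sketch of manual node cleanup per the Rancher "Removing Kubernetes
# Components from Nodes" docs; verify the directory list against your
# Rancher/RKE version. DRY_RUN=echo (default, my addition) only prints
# the commands; set DRY_RUN= (empty) to actually delete.
DRY_RUN="${DRY_RUN:-echo}"

# Remove every container and volume left behind by Rancher/RKE
if command -v docker >/dev/null 2>&1; then
  $DRY_RUN docker rm -f $(docker ps -qa)
  $DRY_RUN docker volume rm $(docker volume ls -q)
fi

# State directories holding certificates, etcd data and CNI config
for d in /etc/kubernetes /etc/cni /opt/cni /opt/rke \
         /var/lib/etcd /var/lib/cni /var/lib/rancher /var/run/calico; do
  $DRY_RUN rm -rf "$d"
done
```

The point is that `/etc/kubernetes` (which contains the old `kube-ca.pem` and etcd certificates) must be gone before the node is re-added, otherwise stale certificates get reused.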


I saw that in another post but wasn't sure. Where do I use that command?

As in, inside the Docker container, or on the actual host where etcd resides?

Getting the same results after running through all the commands to delete the Docker instances, images, and volumes.

Running rke remove will delete all the data from the nodes that needs to be deleted. Either running rke remove or using newly created instances should resolve the issue.

If you are hitting this error after running rke remove or when using newly created instances, then please share the output of ls -la /etc/kubernetes/ssl on all the nodes involved. I assume 10.150.10.229 and 10.150.10.227 are nodes that you are trying to provision? It would probably also help to include the cluster.yml here.

Built a new etcd docker host.

Got the same issue after building new host. :frowning:

It is still failing on etcd.

[ameyer@rancher-etcd02 ~]$ ls -la /etc/kubernetes/ssl
total 56
drwxr-xr-x. 3 root root 4096 Jun 28 17:17 .
drwxr-xr-x. 3 root root 4096 Jun 28 18:48 ..
drwx------. 2 root root 4096 Jun 28 17:05 certs
-rw-------. 1 root root 1017 Jun 28 17:17 kube-ca.pem
-rw-------. 1 root root  445 Jun 28 17:17 kubecfg-kube-node.yaml
-rw-------. 1 root root  449 Jun 28 17:17 kubecfg-kube-proxy.yaml
-rw-------. 1 root root 1675 Jun 28 17:17 kube-etcd-10-150-10-229-key.pem
-rw-------. 1 root root 1289 Jun 28 17:17 kube-etcd-10-150-10-229.pem
-rw-------. 1 root root 1679 Jun 28 17:17 kube-etcd-10-150-10-231-key.pem
-rw-------. 1 root root 1289 Jun 28 17:17 kube-etcd-10-150-10-231.pem
-rw-------. 1 root root 1679 Jun 28 17:17 kube-node-key.pem
-rw-------. 1 root root 1070 Jun 28 17:17 kube-node.pem
-rw-------. 1 root root 1679 Jun 28 17:17 kube-proxy-key.pem
-rw-------. 1 root root 1046 Jun 28 17:17 kube-proxy.pem
[ameyer@rancher-etcd02 ~]$
[ameyer@rancher-etcd01 ~]$ ls -la /etc/kubernetes/ssl
total 56
drwxr-xr-x. 3 root root 4096 Jun 28 17:17 .
drwxr-xr-x. 3 root root 4096 Jun 28 18:48 ..
drwx------. 2 root root 4096 Jun 28 15:42 certs
-rw-------. 1 root root 1017 Jun 25 12:15 kube-ca.pem
-rw-------. 1 root root  445 Jun 25 12:15 kubecfg-kube-node.yaml
-rw-------. 1 root root  449 Jun 25 12:15 kubecfg-kube-proxy.yaml
-rw-------. 1 root root 1675 Jun 25 12:15 kube-etcd-10-150-10-229-key.pem
-rw-------. 1 root root 1257 Jun 25 12:15 kube-etcd-10-150-10-229.pem
-rw-------. 1 root root 1679 Jun 28 17:17 kube-etcd-10-150-10-231-key.pem
-rw-------. 1 root root 1289 Jun 28 17:17 kube-etcd-10-150-10-231.pem
-rw-------. 1 root root 1675 Jun 25 12:15 kube-node-key.pem
-rw-------. 1 root root 1070 Jun 25 12:15 kube-node.pem
-rw-------. 1 root root 1675 Jun 25 12:15 kube-proxy-key.pem
-rw-------. 1 root root 1046 Jun 25 12:15 kube-proxy.pem
[ameyer@rancher-etcd01 ~]$
[ameyer@rancher-control01 ~]$ ls -la /etc/kubernetes/ssl/
total 108
drwxr-xr-x. 3 root root 4096 Jun 25 12:15 .
drwxr-xr-x. 3 root root 4096 Jun 28 18:48 ..
drwx------. 2 root root 4096 Jun 28 15:25 certs
-rw-------. 1 root root 1675 Jun 25 12:15 kube-apiserver-key.pem
-rw-------. 1 root root 1269 Jun 25 12:15 kube-apiserver.pem
-rw-------. 1 root root 1679 Jun 25 12:15 kube-apiserver-proxy-client-key.pem
-rw-------. 1 root root 1107 Jun 25 12:15 kube-apiserver-proxy-client.pem
-rw-------. 1 root root 1679 Jun 25 12:15 kube-apiserver-requestheader-ca-key.pem
-rw-------. 1 root root 1082 Jun 25 12:15 kube-apiserver-requestheader-ca.pem
-rw-------. 1 root root 1675 Jun 25 12:15 kube-ca-key.pem
-rw-------. 1 root root 1017 Jun 25 12:15 kube-ca.pem
-rw-------. 1 root root  517 Jun 25 12:15 kubecfg-kube-apiserver-proxy-client.yaml
-rw-------. 1 root root  533 Jun 25 12:15 kubecfg-kube-apiserver-requestheader-ca.yaml
-rw-------. 1 root root  501 Jun 25 12:15 kubecfg-kube-controller-manager.yaml
-rw-------. 1 root root  445 Jun 25 12:15 kubecfg-kube-node.yaml
-rw-------. 1 root root  449 Jun 25 12:15 kubecfg-kube-proxy.yaml
-rw-------. 1 root root  465 Jun 25 12:15 kubecfg-kube-scheduler.yaml
-rw-------. 1 root root 1679 Jun 25 12:15 kube-controller-manager-key.pem
-rw-------. 1 root root 1062 Jun 25 12:15 kube-controller-manager.pem
-rw-------. 1 root root 1675 Jun 25 12:15 kube-node-key.pem
-rw-------. 1 root root 1070 Jun 25 12:15 kube-node.pem
-rw-------. 1 root root 1675 Jun 25 12:15 kube-proxy-key.pem
-rw-------. 1 root root 1046 Jun 25 12:15 kube-proxy.pem
-rw-------. 1 root root 1675 Jun 25 12:15 kube-scheduler-key.pem
-rw-------. 1 root root 1050 Jun 25 12:15 kube-scheduler.pem
-rw-------. 1 root root 1675 Jun 25 12:15 kube-service-account-token-key.pem
-rw-------. 1 root root 1269 Jun 25 12:15 kube-service-account-token.pem
[ameyer@rancher-control01 ~]$
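Since the error names the candidate authority certificate "kube-ca", one quick check (my suggestion, not from the thread) is whether kube-ca.pem is byte-for-byte the same CA on every node. A minimal sketch, using the standard RKE certificate path shown in the listings above; run it on each node and compare the fingerprints, which must all be identical:

```shell
#!/bin/sh
# Print the SHA-256 fingerprint of the cluster CA certificate.
# Run on every node: mismatching fingerprints mean the nodes were
# provisioned against different CAs (e.g. leftovers from an old cluster).
CA="${CA:-/etc/kubernetes/ssl/kube-ca.pem}"
if [ -f "$CA" ]; then
  openssl x509 -noout -fingerprint -sha256 -in "$CA"
else
  echo "no $CA on this host"
fi
```

The Jun 25 vs Jun 28 timestamps in the listings are exactly the kind of mix this would expose.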

Getting this:

[ameyer@rancher-etcd02 ~]$ docker exec etcd etcdctl member list
{"level":"warn","ts":"2021-06-29T03:11:32.593Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-72d80ad0-8860-4e52-8e8b-ec0d777d03a4/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Error: context deadline exceeded
[ameyer@rancher-etcd02 ~]$ docker exec -e ETCDCTL_ENDPOINTS=$(docker exec etcd /bin/sh -c "etcdctl member list | cut -d, -f5 | sed -e 's/ //g' | paste -sd ','") etcd etcdctl endpoint status --write-out table
{"level":"warn","ts":"2021-06-29T03:12:26.422Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-382b6f24-d3ca-42aa-a89d-452cb42973d1/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Error: context deadline exceeded
2021-06-29 03:12:26.655988 W | pkg/flags: unrecognized environment variable ETCDCTL_ENDPOINTS=
{"level":"warn","ts":"2021-06-29T03:12:31.657Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"passthrough:///127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Failed to get the status of endpoint 127.0.0.1:2379 (context deadline exceeded)
+----------+----+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------+----+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
+----------+----+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
[ameyer@rancher-etcd02 ~]$ for endpoint in $(docker exec etcd /bin/sh -c "etcdctl member list | cut -d, -f5"); do
>    echo "Validating connection to ${endpoint}/health"
>    docker run --net=host -v $(docker inspect kubelet --format '{{ range .Mounts }}{{ if eq .Destination "/etc/kubernetes" }}{{ .Source }}{{ end }}{{ end }}')/ssl:/etc/kubernetes/ssl:ro appropriate/curl -s -w "\n" --cacert $(docker exec etcd printenv ETCDCTL_CACERT) --cert $(docker exec etcd printenv ETCDCTL_CERT) --key $(docker exec etcd printenv ETCDCTL_KEY) "${endpoint}/health"
> done
{"level":"warn","ts":"2021-06-29T03:14:32.546Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-3bcaea80-1036-4d87-be9b-7786afdbaab7/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Error: context deadline exceeded
[ameyer@rancher-etcd02 ~]$

There are files from Jun 25 and Jun 28 in that directory, while you say you are using new hosts. These need to be cleaned before the hosts are used so they all get the correct files deployed. If the node was cleaned, there can't be files from Jun 25 on it. We might be able to correct it, but then I need to know what steps you took to get into this state.

So I deployed the initial Docker image as described in the quick start. Then, to deploy each control/worker/etcd node, I got the docker command from the cluster admin page. I hope that is the information you are looking for.
10.150.10.225 is the main Docker instance/host.

sudo docker run -d --privileged --restart=unless-stopped --net=host -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run  rancher/rancher-agent:v2.5.8 --server https://10.150.10.225 --token 4tp96xvw9nz5vrdql9cpgf9vtn7hpt2n6fq6wbg4hthpnjqhqgc95z --ca-checksum c163983c3663629034860905476a8acd9d361b86f150321c6e6a959ba3055fd8 --etcd
sudo docker run -d --privileged --restart=unless-stopped --net=host -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run  rancher/rancher-agent:v2.5.8 --server https://10.150.10.225 --token 4tp96xvw9nz5vrdql9cpgf9vtn7hpt2n6fq6wbg4hthpnjqhqgc95z --ca-checksum c163983c3663629034860905476a8acd9d361b86f150321c6e6a959ba3055fd8 --controlplane

Hope this helps!

I will need the logs from the provisioning run on Jun 25 12:15, as that is the timestamp on the earliest certificate files. When did you add which nodes (timestamps), and what was the logging when this cluster started provisioning? If the node was cleaned or new yesterday (Jun 28), there can't be files from Jun 25 on it. The easiest way to resolve this is to use new (or properly cleaned) nodes: delete the cluster from Rancher, create a new cluster, and add the new/cleaned nodes to that cluster (that's how it's normally done).

If you want to investigate why it is in this state, we need more of a timeline + actions + logs to see what happened.


I would have posted earlier but the forum rules stopped me. :frowning: Can we please change the limited posting rule from 24 hours (or whatever it is) to 1-4 hours? I bet I could have responded with more information and fixed my issue this morning.

If I opt to clean the nodes: I followed the instructions before, but maybe I missed some steps? Did I need to delete the /etc/kubernetes/ssl folder(s)? I am more than happy to stop and remove everything on all etcd/worker/control nodes and start over. Whatever is easiest.

Just got this error after properly cleaning up the etcd node.

This cluster is currently Provisioning; areas that interact directly with it will not be available until the API is ready.

[controlPlane] Failed to bring up Control Plane: [Failed to verify healthcheck: Failed to check https://localhost:6443/healthz for service [kube-apiserver] on host [10.150.10.227]: Get "https://localhost:6443/healthz": dial tcp [::1]:6443: connect: connection refused, log: W0630 18:51:17.325480 1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://10.150.10.231:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"kube-ca\")". Reconnecting...]
[ameyer@rancher-etcd02 ~]$ sudo firewall-cmd --list-all
public (active)
  target: default
  icmp-block-inversion: no
  interfaces: ens192
  sources: 
  services: cockpit dhcpv6-client ssh
  ports: 2379/tcp 2380/tcp 2379/udp
  protocols: 
  masquerade: no
  forward-ports: 
  source-ports: 
  icmp-blocks: 
  rich rules: 
[ameyer@rancher-etcd02 ~]$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens192: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:50:56:bf:68:00 brd ff:ff:ff:ff:ff:ff
    inet 10.150.10.231/23 brd 10.150.11.255 scope global noprefixroute ens192
       valid_lft forever preferred_lft forever
    inet6 fe80::250:56ff:febf:6800/64 scope link 
       valid_lft forever preferred_lft forever
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default 
    link/ether 02:42:66:df:5e:12 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:66ff:fedf:5e12/64 scope link 
       valid_lft forever preferred_lft forever
[ameyer@rancher-etcd02 ~]$

Provisioned new servers for control/worker/etcd and am now getting the following error:

This cluster is currently Provisioning; areas that interact directly with it will not be available until the API is ready.

[workerPlane] Failed to bring up Worker Plane: [Failed to verify healthcheck: Failed to check http://localhost:10256/healthz for service [kube-proxy] on host [10.150.11.71]: Get "http://localhost:10256/healthz": dial tcp [::1]:10256: connect: connection refused, log: I0630 21:17:55.939560 29438 proxier.go:826] syncProxyRules took 129.26458ms]
This cluster is currently Provisioning; areas that interact directly with it will not be available until the API is ready.

[controlPlane] Failed to upgrade Control Plane: [[host rancher-control001 not ready]]

If you are going to clean things, you need to clean everything (delete the cluster, clean all nodes that were part of it), not just a single node.

The logs from kube-proxy could be interesting, but I just saw that you are running AlmaLinux, which is not tested, and on version 8.4 even the related distributions (CentOS/RHEL/Oracle Linux) are not tested. So my advice would be to use a tested OS to make this work; for version 8 of EL you need at least Kubernetes 1.19 and firewalld disabled. If you want to make it work, please post more info like OS details (docker info preferably) and the logs from the Docker containers created on the added nodes (docker logs $container).
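Acting on the firewalld part of that advice on an EL8 host could look like this. A sketch only: the `DRY_RUN` guard is my addition so the command prints by default; drop it to actually disable firewalld.

```shell
#!/bin/sh
# Sketch for EL8 hosts: firewalld should be off before (re)provisioning.
# DRY_RUN=echo (default, my addition) only prints the command;
# set DRY_RUN= (empty) to really apply it.
DRY_RUN="${DRY_RUN:-echo}"
if command -v systemctl >/dev/null 2>&1 && systemctl is-active --quiet firewalld; then
  # Stop firewalld now and keep it from starting on boot
  $DRY_RUN sudo systemctl disable --now firewalld
else
  echo "firewalld not running (or no systemd) on this host"
fi
```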