Unable to register new RKE cluster to Rancher 2.5.7 (k3s)

We have set up a production HA Rancher cluster on k3s v1.20.6+k3s1. It’s a 2-node setup with certs from GeoTrust. It sits behind a Citrix LB, and the certs/setup seem fine (at least to browsers).

We have not been able to import a new RKE cluster into Rancher. It sits at:

This cluster is currently Provisioning; areas that interact directly with it will not be available until the API is ready.

Waiting for etcd, controlplane and worker nodes to be registered

I have gone over the TLS setup and verified the certs we used to create the tls-rancher-ingress.

[root@rancher-pgh02 ]# openssl verify -verbose -CAfile <(cat digicertglobalroot.pem DigiCertIntCA.crt) rancher_somewhere_com.crt
rancher_somewhere_com.crt: OK

We created the setup with the hostname set to the LB DNS name.

helm install rancher rancher-stable/rancher \
  --namespace cattle-system \
  --set hostname=rancher.somewhere.com \
  --set ingress.tls.source=secret
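
For reference, the tls-rancher-ingress secret mentioned above was created roughly along these lines (a sketch; the file names are illustrative, with tls.crt holding the server certificate at this point):

kubectl -n cattle-system create secret tls tls-rancher-ingress \
  --cert=tls.crt --key=tls.key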

The process to add the node to Rancher is cycling, with this in the docker logs:

time="2021-05-25T12:29:54Z" level=info msg="node kubelet-cp-pgh01 is not registered, restarting kubelet now"
time="2021-05-25T12:29:54Z" level=info msg="Listening on /tmp/log.sock"
time="2021-05-25T12:29:54Z" level=info msg="Rancher agent version v2.5.7 is starting"
time="2021-05-25T12:29:54Z" level=info msg="Option customConfig=map[address:10.70.12.196 internalAddress: label:map[] roles:[controlplane] taints:[]]"
time="2021-05-25T12:29:54Z" level=info msg="Option etcd=false"
time="2021-05-25T12:29:54Z" level=info msg="Option controlPlane=true"
time="2021-05-25T12:29:54Z" level=info msg="Option worker=false"
time="2021-05-25T12:29:54Z" level=info msg="Option requestedHostname=kubelet-cp-pgh01"
time="2021-05-25T12:29:54Z" level=info msg="Certificate details from https : // rancher.somewhere.com"
time="2021-05-25T12:29:54Z" level=info msg="Certificate #0 (https : // rancher.somewhere.com)"
time="2021-05-25T12:29:54Z" level=info msg="Subject: CN=rancher.somewhere.com,O=Some Where Systems LLC,L=Wellesley,ST=Massachusetts,C=US"
time="2021-05-25T12:29:54Z" level=info msg="Issuer: CN=DigiCert TLS RSA SHA256 2020 CA1,O=DigiCert Inc,C=US"
time="2021-05-25T12:29:54Z" level=info msg="IsCA: false"
time="2021-05-25T12:29:54Z" level=info msg="DNS Names: [rancher.somewhere.com www.rancher.somewhere.com]"
time="2021-05-25T12:29:54Z" level=info msg="IPAddresses: <none>"
time="2021-05-25T12:29:54Z" level=info msg="NotBefore: 2021-05-04 00:00:00 +0000 UTC"
time="2021-05-25T12:29:54Z" level=info msg="NotAfter: 2022-05-09 23:59:59 +0000 UTC"
time="2021-05-25T12:29:54Z" level=info msg="SignatureAlgorithm: SHA256-RSA"
time="2021-05-25T12:29:54Z" level=info msg="PublicKeyAlgorithm: RSA"
time="2021-05-25T12:29:54Z" level=fatal msg="Certificate chain is not complete, please check if all needed intermediate certificates are included in the server certificate (in the correct order) and if the cacerts setting in Rancher either contains the correct CA certificate (in the case of using self signed certificates) or is empty (in the case of using a certificate signed by a recognized CA). Certificate information is displayed above. error: Get \"https : // rancher.somewhere.com\": x509: certificate signed by unknown authority"

I have restarted the Rancher pods, tried variations of the certs in the tls.crt file, and added both the intermediate and root certs to the LB setup.

Any ideas on how to get the RKE cluster to register? I’m willing to start over and redo the whole Rancher setup if needed.

TIA.

It doesn’t seem to send the intermediate. Is the Citrix LB doing TCP (Layer 4)? This is what the certificate you are configuring should look like (assuming the Citrix LB is doing TCP and the NGINX ingress handles HTTP/Layer 7):

-----BEGIN CERTIFICATE-----
%YOUR_CERTIFICATE%
-----END CERTIFICATE-----
-----BEGIN CERTIFICATE-----
%YOUR_INTERMEDIATE_CERTIFICATE%
-----END CERTIFICATE-----
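
If the NGINX ingress is the one serving the certificate, a minimal sketch of rebuilding the secret with the full chain would look like this (file names taken from your openssl verify output above, so adjust as needed):

# server cert first, then the intermediate (order matters)
cat rancher_somewhere_com.crt DigiCertIntCA.crt > tls.crt

kubectl -n cattle-system create secret tls tls-rancher-ingress \
  --cert=tls.crt --key=tls.key --dry-run=client -o yaml | kubectl apply -f -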

Ahh, OK, thanks. I have now figured out how to bind the intermediate cert in the Citrix and make sure it gets sent. I was able to verify that via:

openssl s_client -showcerts -connect rancher.somewhere.com:443
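
As an extra sanity check, counting the certificates the endpoint actually sends (a quick sketch):

# should now report 2 (server cert + intermediate) rather than just 1
openssl s_client -showcerts -connect rancher.somewhere.com:443 </dev/null 2>/dev/null \
  | grep -c 'BEGIN CERTIFICATE'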

So now I have a different error when registering the nodes. From the docker logs:

INFO: Arguments: --server https://rancher.somewhere.com --token REDACTED --etcd
INFO: Environment: CATTLE_ADDRESS=10.70.12.195 CATTLE_INTERNAL_ADDRESS= CATTLE_NODE_NAME=kubelet-etcd-pgh03 CATTLE_ROLE=,etcd CATTLE_SERVER=https://rancher.somewhere.com CATTLE_TOKEN=REDACTED
INFO: Using resolv.conf: domain somewhere.com search somewhere.com nameserver 10.70.12.21 nameserver 10.60.18.11
INFO: https://rancher.somewhere.com/ping is accessible
INFO: rancher.somewhere.com resolves to 10.70.12.203
time="2021-05-26T14:37:07Z" level=info msg="node kubelet-etcd-pgh03 is not registered, restarting kubelet now"
time="2021-05-26T14:37:07Z" level=info msg="Listening on /tmp/log.sock"
time="2021-05-26T14:37:07Z" level=info msg="Rancher agent version v2.5.7 is starting"
time="2021-05-26T14:37:07Z" level=info msg="Option controlPlane=false"
time="2021-05-26T14:37:07Z" level=info msg="Option worker=false"
time="2021-05-26T14:37:07Z" level=info msg="Option requestedHostname=kubelet-etcd-pgh03"
time="2021-05-26T14:37:07Z" level=info msg="Option customConfig=map[address:10.70.12.195 internalAddress: label:map[] roles:[etcd] taints:[]]"
time="2021-05-26T14:37:07Z" level=info msg="Option etcd=true"
time="2021-05-26T14:37:07Z" level=info msg="Connecting to wss://rancher.somewhere.com/v3/connect/register with token 5nsh66fbstvtwb8qhbfg49dq9t4nhscpr7dwmw7lzk2thlxp6g2l5g"
time="2021-05-26T14:37:07Z" level=info msg="Connecting to proxy" url="wss://rancher.somewhere.com/v3/connect/register"
time="2021-05-26T14:37:07Z" level=warning msg="Error while getting agent config: invalid response 500: nodes.management.cattle.io \"c-jmfzn/m-cdb2f653c9e4\" not found"
time="2021-05-26T14:37:12Z" level=info msg="Starting plan monitor, checking every 15 seconds"

And from the Rancher GUI:

[Failed to start [rke-cp-port-listener] container on host [10.70.12.196]: Error response from daemon: driver failed programming external connectivity on endpoint rke-cp-port-listener (84f539941117bea89eecb4aa64a939ddbb155e32a2ec89dc51d12673772c95a8): (iptables failed: iptables --wait -t nat -A DOCKER -p tcp -d 0/0 --dport 6443 -j DNAT --to-destination 172.17.0.2:1337 ! -i docker0: iptables: No chain/target/match by that name. (exit status 1))]

TIA.

A missing Docker chain in iptables is usually caused by someone (or a script) manually flushing iptables without restarting Docker afterwards; restarting Docker recreates the needed chains if they are missing.
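
A quick way to check and fix this on the affected host (assuming a systemd-managed Docker):

# the DOCKER chain should exist in the nat table
iptables -t nat -nL DOCKER

# restarting Docker recreates its chains if they were flushed
systemctl restart docker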

Onward: a restart moved things along. I’ve now got issues with the etcd plane. This is a new, first-time RKE v1.2.8 setup on clean VMs. Not sure why the RKE certs are a problem, since they are self-signed.

[etcd] Failed to bring up Etcd Plane: etcd cluster is unhealthy: hosts [10.70.12.193,10.70.12.194,10.70.12.195] failed to report healthy. Check etcd container logs on each host for more information

And from those logs:

2021-05-26 17:22:44.647587 I | embed: rejected connection from "10.70.12.195:32125" (error "EOF", ServerName "")
2021-05-26 17:22:44.648268 I | embed: rejected connection from "10.70.12.195:13605" (error "EOF", ServerName "")
2021-05-26 17:22:44.714344 I | embed: rejected connection from "10.70.12.194:18263" (error "EOF", ServerName "")
2021-05-26 17:22:44.714640 I | embed: rejected connection from "10.70.12.194:13647" (error "EOF", ServerName "")
2021-05-26 17:22:44.732082 I | embed: rejected connection from "10.70.12.193:3919" (error "EOF", ServerName "")
2021-05-26 17:22:44.732210 I | embed: rejected connection from "10.70.12.193:21251" (error "EOF", ServerName "")
2021-05-26 17:22:45.130331 I | embed: rejected connection from "10.70.12.196:18941" (error "EOF", ServerName "")
2021-05-26 17:22:45.141649 I | embed: rejected connection from "10.70.12.197:1311" (error "EOF", ServerName "")
2021-05-26 17:39:32.016416 I | embed: rejected connection from "10.70.12.193:55168" (error "tls: failed to verify client's certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"kube-ca\")", ServerName "")

I may have had a few variations of the cluster.yml before I had success. Is there a recommended way to clean up and restart that?

All the directories/locations to clean up are listed in the Rancher docs under "Removing Kubernetes Components from Nodes", or you can use new/recreated nodes.
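
A rough sketch of what that docs page boils down to (check the page itself for the full, current list of paths; this is not exhaustive):

# run on each node you want to reuse
docker rm -f $(docker ps -qa)
docker volume rm $(docker volume ls -q)

# unmount kubelet/secret tmpfs mounts before removing the directories
for m in $(mount | awk '/\/var\/lib\/kubelet/ {print $3}'); do umount "$m"; done

rm -rf /etc/cni /etc/kubernetes /opt/cni /opt/rke /run/calico /run/flannel \
       /run/secrets/kubernetes.io /var/lib/calico /var/lib/cni /var/lib/etcd \
       /var/lib/kubelet /var/lib/rancher/rke /var/log/containers /var/log/pods

# recreate Docker's iptables chains
systemctl restart docker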

So I started from scorched earth (clean VMs) and rebuilt an 8-node cluster: 2 cp, 3 etcd, and 3 workers. In general the cluster seems fine, but I am still running into this when registering the cluster to Rancher from the dashboard:

This cluster is currently **Provisioning**; areas that interact directly with it will not be available until the API is ready.

[[network] Host [10.70.12.195] is not able to connect to the following ports: [10.70.12.193:2379]. Please check network policies and firewall rules]

And the docker logs from the etcd hosts:
etcd node 1:

2021-05-27 18:07:30.223796 I | embed: rejected connection from "10.70.12.194:25476" (error "remote error: tls: bad certificate", ServerName "kubelet-etcd-pgh01.eagleaccess.com")
[root@kubelet-etcd-pgh01 ~]# hostname -i
10.70.12.193

etcd node 2:
2021-05-27 18:29:21.326779 I | embed: rejected connection from "10.70.12.193:34964" (error "tls: failed to verify client's certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"kube-ca\")", ServerName "kubelet-etcd-pgh02.eagleaccess.com")

etcd node 3:
2021-05-27 18:33:34.383404 I | embed: rejected connection from "10.70.12.193:37380" (error "tls: failed to verify client's certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"kube-ca\")", ServerName "kubelet-etcd-pgh03.eagleaccess.com")

etcd is listening on both 2379 and 2380 on all etcd nodes. The firewalls on all nodes are open for:
ports: 2376/tcp 2379/tcp 2380/tcp 6443/tcp 9099/tcp 10250/tcp 10254/tcp 30000-32767/tcp 8472/udp 30000-32767/udp
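
On these RHEL 8 nodes the firewalld config was applied roughly like this (a sketch; run on every node, and adjust per role if you want to be stricter):

firewall-cmd --permanent --add-port=2376/tcp --add-port=2379/tcp --add-port=2380/tcp
firewall-cmd --permanent --add-port=6443/tcp --add-port=9099/tcp
firewall-cmd --permanent --add-port=10250/tcp --add-port=10254/tcp
firewall-cmd --permanent --add-port=30000-32767/tcp --add-port=30000-32767/udp
firewall-cmd --permanent --add-port=8472/udp
firewall-cmd --reload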

How do I fix the SSL issue?

TIA.

Can you share the exact steps involved when you say "rebuilt an 8-node cluster: 2 cp, 3 etcd, and 3 workers. In general the cluster seems fine, but I am still running into this when registering the cluster to Rancher from the dashboard"?

If you create a cluster in Rancher, you get the option to generate a docker run command to run on your nodes with the chosen roles, but the way you describe it, that doesn’t line up with "rebuilt an 8-node cluster" and then running into this when registering the cluster to Rancher from the dashboard.

The error around certificates still indicates the nodes used were not cleaned properly, causing an issue with mismatched certificates.

The "not able to connect" error is odd, as it suddenly pops up now where it didn’t before. Are these newly created nodes or recycled existing nodes?

Sure. Thinking I had issues I was unable to resolve, I started over totally and built a new RKE cluster:
- rolled the 8 VMs back to a clean state (RHEL 8) via snapshots
- clean state is: firewall setup, rke admin acct/keys, rke binaries, kernel mods, docker-ce, ssh fwd, and helm (clean setup, Docker only installed, never run anything)
- edited/updated cluster.yml to use FQDNs (the original cluster.yml was generated via the command-line tool)
- executed rke up on one of the cp nodes
- after rke up, copied kube_config_cluster.yml to the nodes for kubectl install/setup
- went to the Rancher CP and created a new cluster, choosing existing nodes with the defaults
- used the customize node run command to generate the docker run command for each node type, and executed that command on each node

Hmmm, not sure if I’m following completely. Let me explain how this works and see if that matches what you are doing.

Having Rancher installed gives you multiple options to manage or create k8s clusters. One way is to import an existing cluster: for example, you have created a k8s cluster using a tool (RKE CLI, kubeadm, kops, or something else) and you want to manage it using Rancher. In that case you should use the Import option and run the provided kubectl command to deploy resources into the cluster so Rancher can manage it. Another option is to have Rancher create the cluster for you. For this you don’t need any other tool: just create a custom cluster and run the provided docker run command on every node you want to add to that cluster (with the role(s) you want).
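
Roughly what the two generated commands look like (placeholders only; the UI generates the exact commands, including the real token and checksum):

# Option 1: import an existing cluster - run the kubectl command Rancher shows under Import
kubectl apply -f https://rancher.somewhere.com/v3/import/<generated-id>.yaml

# Option 2: custom cluster created by Rancher - run the generated docker run command
# on each node, with the role flags you selected for that node
sudo docker run -d --privileged --restart=unless-stopped --net=host \
  -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run \
  rancher/rancher-agent:v2.5.7 \
  --server https://rancher.somewhere.com --token <token> --ca-checksum <checksum> \
  --etcd --controlplane --worker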

It seems to me that you are mixing those options currently. Either import an existing cluster or create the cluster using Rancher only.

Let me know if I didn’t get it right.

OMG. I was mixing methods. Now that I have seen the error of my ways, it’s so easy. I now have a nice new 8-node cluster.

Thanks Seb, have a great weekend.
