New nodes are endlessly "Waiting to register with Kubernetes"

We have a Rancher-provisioned cluster (VMware node driver), and when we scale up new worker nodes they sit forever at “Waiting to register with Kubernetes”. This is Rancher 2.5.6, and the cluster itself is v1.19.7-rancher1-1.

I SSHed into the newly created node and looked at the rancher-agent container logs; there are no errors:

[root@mynewnode1 ~]# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
06f23a8797e5 rancher/rancher-agent:v2.5.6 "run.sh --server htt…" 2 minutes ago Up 2 minutes quirky_robinson
[root@mynewnode1 ~]# docker logs 06f23a8797e5
INFO: Arguments: --server https://rancher.acme.com --token REDACTED --ca-checksum 8xxxxxxxxxxxxxxxxe1076 -r -n m-skl8s
INFO: Environment: CATTLE_ADDRESS=10.5.34.204 CATTLE_AGENT_CONNECT=true CATTLE_INTERNAL_ADDRESS= CATTLE_NODE_NAME=m-skl8s CATTLE_SERVER=https://rancher.acme.com CATTLE_TOKEN=REDACTED
INFO: Using resolv.conf: search acme.com nameserver 10.6.1.5 nameserver 10.6.1.6
INFO: https://rancher.acme.com/ping is accessible
INFO: rancher.acme.com resolves to 10.6.215.162
INFO: Value from https://rancher.acme.com/v3/settings/cacerts is an x509 certificate
time="2021-05-05T18:04:55Z" level=info msg="Listening on /tmp/log.sock"
time="2021-05-05T18:04:55Z" level=info msg="Rancher agent version v2.5.6 is starting"
time="2021-05-05T18:04:55Z" level=info msg="Option customConfig=map[address:10.5.34.204 internalAddress: label:map[] roles:[] taints:[]]"
time="2021-05-05T18:04:55Z" level=info msg="Option etcd=false"
time="2021-05-05T18:04:55Z" level=info msg="Option controlPlane=false"
time="2021-05-05T18:04:55Z" level=info msg="Option worker=false"
time="2021-05-05T18:04:55Z" level=info msg="Option requestedHostname=m-skl8s"
time="2021-05-05T18:04:55Z" level=info msg="Connecting to wss://rancher.acme.com/v3/connect with token wxxxxxxxxxxxxxxxxxxxd"
time="2021-05-05T18:04:55Z" level=info msg="Connecting to proxy" url="wss://rancher.acme.com/v3/connect"
time="2021-05-05T18:04:55Z" level=info msg="Starting plan monitor, checking every 15 seconds"

I enabled debugging, but it just shows the rancher-agent pinging. Is there a way to set the log level to debug not just at runtime but persistently, so I can restart the container and see what might be wrong?

[root@mynewnode1 ~]# docker exec -it 06f23a8797e5 loglevel --set debug
OK
[root@mynewnode1 ~]# docker logs -f 06f23a8797e5
time="2021-05-05T18:14:20Z" level=debug msg="Wrote ping"
time="2021-05-05T18:14:25Z" level=debug msg="Wrote ping"
time="2021-05-05T18:14:25Z" level=debug msg="Get agent config: &rkeworker.NodeConfig{ClusterName:\"c-np6ds\", Certs:\"\", Processes:map[string]types.Process(nil), Files:[]types.File(nil), NodeVersion:0, AgentCheckInterval:15}"
time="2021-05-05T18:14:30Z" level=debug msg="Wrote ping"
time="2021-05-05T18:14:35Z" level=debug msg="Wrote ping"
time="2021-05-05T18:14:40Z" level=debug msg="Wrote ping"
time="2021-05-05T18:14:40Z" level=debug msg="Get agent config: &rkeworker.NodeConfig{ClusterName:\"c-np6ds\", Certs:\"\", Processes:map[string]types.Process(nil), Files:[]types.File(nil), NodeVersion:0, AgentCheckInterval:15}"
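
The only persistent option I can think of is to recreate the agent container with a debug environment variable set. A rough sketch, re-using the arguments from the log above, and assuming the agent honours CATTLE_DEBUG (I have not confirmed that; also note that on a node-driver-provisioned node Rancher normally manages this container itself):

# sketch only: stop the current agent and re-run the registration command with CATTLE_DEBUG=true,
# so the debug level survives container restarts
docker stop 06f23a8797e5 && docker rm 06f23a8797e5
sudo docker run -d --privileged --restart=unless-stopped --net=host \
  -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run \
  -e CATTLE_DEBUG=true \
  rancher/rancher-agent:v2.5.6 --server https://rancher.acme.com --token REDACTED --ca-checksum 8xxxxxxxxxxxxxxxxe1076 -r -n m-skl8s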

Does it eventually work if you leave it running longer than 2 minutes? Is the status of the cluster you are adding the worker to “Ready”? How many nodes, with which roles, are in the cluster? You could also be running into rancher/rancher#31999 on GitHub: “After Rancher upgrade new clusters are up but the nodes will be stuck at ‘waiting to register’”.
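
If you have kubectl access to the Rancher management (“local”) cluster, you could also check the downstream cluster object for a stuck condition. A rough sketch, using the c-np6ds cluster ID from your debug output (adjust to your cluster):

# run against the Rancher management ("local") cluster
kubectl get clusters.management.cattle.io c-np6ds -o yaml
# look at status.conditions for anything stuck at False/Unknown (e.g. Provisioned, Updated, Waiting)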

I came back today and the nodes did finally register successfully. So it does work; it just takes hours.

How fortuitous: I see that Rancher v2.5.8 officially came out 7 hours ago and the fix for #31999 is included in it. Let me try the upgrade and see if it fixes this.
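
For reference, assuming Rancher was installed with the Helm chart into the usual cattle-system namespace (adjust to however your instance was installed), the upgrade is roughly:

helm repo update
helm upgrade rancher rancher-latest/rancher -n cattle-system --version 2.5.8 --reuse-values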

Thanks!

I am observing this issue again after upgrading Rancher to v2.5.12.

Using Rancher v2.4.8, I had created a 9-node Kubernetes cluster on custom nodes (Linux VMs) with RKE (Kubernetes v1.18.12). The cluster has been running for almost 2 years; it is the only downstream cluster managed by Rancher.

Recently, I upgraded Rancher to v2.5.12 and Kubernetes to v1.20.15.
This was the cluster state before adding a new node:

NAME STATUS ROLES AGE VERSION
Ready controlplane,etcd 668d v1.20.15
Ready controlplane,etcd 668d v1.20.15
Ready worker 221d v1.20.15
Ready worker 668d v1.20.15
Ready worker 668d v1.20.15
Ready worker 668d v1.20.15
Ready controlplane,etcd 668d v1.20.15
Ready worker 668d v1.20.15
Ready worker 668d v1.20.15

I attempted to add a node to the cluster with only the “worker” role: in the Rancher UI → Edit Cluster → “Node Options”, I ticked only the “Worker” check box and then ran the generated command on the new node:

sudo docker run -d --privileged --restart=unless-stopped --net=host -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run rancher/rancher-agent:v2.5.12 --server https://<rancher_name.com> --token <xxxx…> --worker

The “rancher-agent” container on the node is stuck:
#########################################################################
INFO: Arguments: --server https://<rancher_name.com> --token REDACTED --worker
INFO: Environment: CATTLE_ADDRESS=<rancher_node_ip> CATTLE_INTERNAL_ADDRESS= CATTLE_NODE_NAME=<node_name> CATTLE_ROLE=,worker CATTLE_SERVER=https://<rancher_name.com> CATTLE_TOKEN=REDACTED
INFO: Using resolv.conf: nameserver 8.8.8.8
INFO: https://<rancher_name.com>/ping is accessible
INFO: <rancher_name> resolves to <node_ip_1> <node_ip_2> <node_ip_3>
time="2022-10-11T08:10:49Z" level=info msg="Listening on /tmp/log.sock"
time="2022-10-11T08:10:49Z" level=info msg="Rancher agent version v2.5.12 is starting"
time="2022-10-11T08:10:49Z" level=info msg="Option customConfig=map[address:<node_ip_1> internalAddress: label:map[] roles:[worker] taints:[]]"
time="2022-10-11T08:10:49Z" level=info msg="Option etcd=false"
time="2022-10-11T08:10:49Z" level=info msg="Option controlPlane=false"
time="2022-10-11T08:10:49Z" level=info msg="Option worker=true"
time="2022-10-11T08:10:49Z" level=info msg="Option requestedHostname=<host_name>"
time="2022-10-11T08:10:49Z" level=info msg="Connecting to wss://<rancher_name.com>/v3/connect/register with token xxxx…"
time="2022-10-11T08:10:49Z" level=info msg="Connecting to proxy" url="wss://<rancher_name.com>/v3/connect/register"
time="2022-10-11T08:10:51Z" level=info msg="Waiting for node to register. Either cluster is not ready for registering, cluster is currently provisioning, or etcd, controlplane and worker node have to be registered"
time="2022-10-11T08:10:53Z" level=info msg="Starting plan monitor, checking every 15 seconds"
##########################################################################

debug logs:
#################################################################
time="2022-10-14T07:11:28Z" level=debug msg="Get agent config: &rkeworker.NodeConfig{ClusterName:\"c-gmgbl\", Certs:\"\", Processes:map[string]types.Process(nil), Files:[]types.File(nil), NodeVersion:0, AgentCheckInterval:15}"
time="2022-10-14T07:11:30Z" level=debug msg="Wrote ping"
time="2022-10-14T07:11:35Z" level=debug msg="Wrote ping"
time="2022-10-14T07:11:40Z" level=debug msg="Wrote ping"
time="2022-10-14T07:11:43Z" level=debug msg="Get agent config: &rkeworker.NodeConfig{ClusterName:\"c-gmgbl\", Certs:\"\", Processes:map[string]types.Process(nil), Files:[]types.File(nil), NodeVersion:0, AgentCheckInterval:15}"
time="2022-10-14T07:11:45Z" level=debug msg="Wrote ping"
time="2022-10-14T07:11:50Z" level=debug msg="Wrote ping"
time="2022-10-14T07:11:55Z" level=debug msg="Wrote ping"
############################################################

In the Rancher UI, the state of the new node is “Registering” with the message “Waiting to register with Kubernetes”.
I am not sure where to look on the Rancher side for any interaction with the rancher-agent on the new node.
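
The closest thing I have found is grepping the Rancher server logs themselves. A rough sketch, assuming a Helm/Kubernetes install of Rancher with the usual cattle-system namespace and app=rancher label (for a single-node Docker install, use docker logs on the rancher/rancher container instead):

kubectl -n cattle-system logs -l app=rancher --tail=1000 | grep -i <node_name>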

I waited a couple of days and there was no progress. I deleted the node from the Rancher UI and ran the command again on the node. No change.
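
In case it helps anyone retrying this: between attempts it is probably worth cleaning the host roughly as described in the Rancher node cleanup documentation. This is only a sketch; the docs list the full set of containers, mounts and directories to remove:

docker rm -f $(docker ps -qa)
docker volume rm $(docker volume ls -q)
sudo rm -rf /etc/kubernetes /etc/cni /opt/cni /opt/rke /var/lib/etcd /var/lib/cni /var/lib/rancher
sudo reboot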

I deleted the node again and ran the docker command once more, this time adding all roles to the node:

sudo docker run -d --privileged --restart=unless-stopped --net=host -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run rancher/rancher-agent:v2.5.12 --server https://<rancher_name.com> --token <xxxx…> --etcd --controlplane --worker

This time, the node successfully registered with the cluster.

NAME STATUS ROLES AGE VERSION
Ready controlplane,etcd 668d v1.20.15
Ready controlplane,etcd 668d v1.20.15
Ready worker 221d v1.20.15
Ready worker 668d v1.20.15
Ready worker 668d v1.20.15
Ready worker 668d v1.20.15
Ready controlplane,etcd 668d v1.20.15
Ready worker 668d v1.20.15
Ready worker 668d v1.20.15
<new_node> Ready controlplane,etcd,worker 4h31m v1.20.15

If another node with only the “worker” role is added now, it will probably register cleanly. I am wondering whether the Rancher upgrade process initializes/resets some cluster info, so that adding new nodes with only specific roles right after an upgrade runs into issues.
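
One place that might show what Rancher is still waiting for is the node objects it keeps on the management (“local”) cluster, in a namespace named after the downstream cluster ID. A rough sketch, using the c-gmgbl ID from the debug logs above:

# run against the Rancher management ("local") cluster
kubectl get nodes.management.cattle.io -n c-gmgbl
kubectl get nodes.management.cattle.io -n c-gmgbl <node_object_name> -o yaml
# check the spec (requested roles) and status.conditions of the stuck node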

Also, since there are already 3 nodes running with the “controlplane,etcd” roles, I wanted to remove those roles from the newly added node and keep only the “worker” role, but it seems there is no easy way to do this.
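
The only path I can see is to replace the node rather than edit its roles in place, since as far as I can tell the roles of a registered custom node cannot be changed afterwards. A rough sketch:

kubectl drain <new_node> --ignore-daemonsets
# delete the node from the cluster in the Rancher UI so Rancher/RKE also removes its etcd member,
# clean the host (see the cleanup sketch above), then re-register it with only the worker role:
sudo docker run -d --privileged --restart=unless-stopped --net=host \
  -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run \
  rancher/rancher-agent:v2.5.12 --server https://<rancher_name.com> --token <xxxx…> --worker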

thanks