"Waiting to register with Kubernetes" for days

Hello,

I’ve removed and replaced a controller node in our on-premises bare-metal cluster. I removed it in Rancher, removed the Docker containers and all traces of Rancher/Kubernetes from the node, and attempted to re-add it. I’ve done this several times, and also reinstalled the node from bare metal once.

The OS is Ubuntu 18.04 LTS, Rancher is 2.2.2, and Docker is 18.9.7.

When I copy and paste the docker run command to re-add the node (Worker unchecked, etcd and Control Plane checked), the Rancher web UI shows the cluster reconfiguring for the new node but stops at “Waiting to register with Kubernetes.” On one attempt it sat there for days.
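
For reference, the command Rancher generates is roughly of this shape (the server URL, token, and CA checksum below are placeholders standing in for the real values from the UI):

```bash
# roughly what the Rancher UI generates for an etcd + control plane node
# <...> values are placeholders; use the ones from your own Rancher UI
sudo docker run -d --privileged --restart=unless-stopped --net=host \
  -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run \
  rancher/rancher-agent:v2.2.2 \
  --server https://<rancher-server-url> \
  --token <registration-token> \
  --ca-checksum <ca-checksum> \
  --etcd --controlplane
```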

If I click on the node in Rancher, the gauges (“speedos”) and all information other than the IP address are blank. The four indicators for disk pressure, memory pressure, etc. show red for “Kubelet.”

From the kubectl CLI everything looks fine. The cluster is fielding jobs as it should, and kubectl describe node shows no problem at all with the added controller node:

```
Name:               km-alpha-m01
Roles:              controlplane,etcd
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/hostname=km-alpha-m01
                    node-role.kubernetes.io/controlplane=true
                    node-role.kubernetes.io/etcd=true
Annotations:        flannel.alpha.coreos.com/backend-data: {"VtepMAC":"5e:e0:82:75:89:bb"}
                    flannel.alpha.coreos.com/backend-type: vxlan
                    flannel.alpha.coreos.com/kube-subnet-manager: true
                    flannel.alpha.coreos.com/public-ip: 10.6.164.81
                    node.alpha.kubernetes.io/ttl: 0
                    rke.cattle.io/external-ip: 10.6.164.81
                    rke.cattle.io/internal-ip: 10.6.164.81
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Thu, 02 Apr 2020 10:01:34 -0700
Taints:             node-role.kubernetes.io/etcd=true:NoExecute
                    node-role.kubernetes.io/controlplane=true:NoSchedule
Unschedulable:      false
Conditions:
  Type            Status  LastHeartbeatTime                Last TransitionTime              Reason                      Message
  ----            ------  -----------------                -------------------              ------                      -------
  MemoryPressure  False   Thu, 02 Apr 2020 11:53:02 -0700  Thu, 02 Apr 2020 10:01:34 -0700  KubeletHasSufficientMemory  kubelet has sufficient memory available
  DiskPressure    False   Thu, 02 Apr 2020 11:53:02 -0700  Thu, 02 Apr 2020 10:01:34 -0700  KubeletHasNoDiskPressure    kubelet has no disk pressure
  PIDPressure     False   Thu, 02 Apr 2020 11:53:02 -0700  Thu, 02 Apr 2020 10:01:34 -0700  KubeletHasSufficientPID     kubelet has sufficient PID available
  Ready           True    Thu, 02 Apr 2020 11:53:02 -0700  Thu, 02 Apr 2020 11:22:30 -0700  KubeletReady                kubelet is posting ready status
Addresses:
  InternalIP:  10.6.164.81
  Hostname:    km-alpha-m01
Capacity:
  cpu:                8
  ephemeral-storage:  278436884Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             49435416Ki
  pods:               110
Allocatable:
  cpu:                8
  ephemeral-storage:  256607431870
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             49333016Ki
  pods:               110
System Info:
  Machine ID:                 3d6ff2f75c7d3ae927580249a28e7e05
  System UUID:                4C4C4544-0053-4A10-805A-C8C04F4C4E31
  Boot ID:                    63918dd0-e35e-4224-8af3-46b040be0ed0
  Kernel Version:             4.15.0-91-generic
  OS Image:                   Ubuntu 18.04.4 LTS
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  docker://19.3.6
  Kubelet Version:            v1.13.5
  Kube-Proxy Version:         v1.13.5
PodCIDR:                      10.42.0.0/24
Non-terminated Pods:          (4 in total)
  Namespace          Name                                    CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------          ----                                    ------------  ----------  ---------------  -------------  ---
  cattle-prometheus  exporter-node-cluster-monitoring-xwqsz  100m (1%)     200m (2%)   30Mi (0%)        200Mi (0%)     111m
  cattle-system      cattle-node-agent-fs9vs                 0 (0%)        0 (0%)      0 (0%)           0 (0%)         111m
  cattle-system      kube-api-auth-dbhtc                     0 (0%)        0 (0%)      0 (0%)           0 (0%)         111m
  kube-system        canal-sl8vc                             250m (3%)     0 (0%)      0 (0%)           0 (0%)         111m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests   Limits
  --------           --------   ------
  cpu                350m (4%)  200m (2%)
  memory             30Mi (0%)  200Mi (0%)
  ephemeral-storage  0 (0%)     0 (0%)
Events:
  Type    Reason                   Age  From                   Message
  ----    ------                   ---  ----                   -------
  Normal  Starting                 30m  kubelet, km-alpha-m01  Starting kubelet.
  Normal  NodeHasSufficientMemory  30m  kubelet, km-alpha-m01  Node km-alpha-m01 status is now: NodeHasSufficientMemory
  Normal  NodeHasNoDiskPressure    30m  kubelet, km-alpha-m01  Node km-alpha-m01 status is now: NodeHasNoDiskPressure
  Normal  NodeHasSufficientPID     30m  kubelet, km-alpha-m01  Node km-alpha-m01 status is now: NodeHasSufficientPID
  Normal  NodeNotReady             30m  kubelet, km-alpha-m01  Node km-alpha-m01 status is now: NodeNotReady
  Normal  NodeAllocatableEnforced  30m  kubelet, km-alpha-m01  Updated Node Allocatable limit across pods
  Normal  NodeReady                30m  kubelet, km-alpha-m01  Node km-alpha-m01 status is now: NodeReady
```

I’ve also tried “docker restart kubelet” on the new node.

Any guidance would be appreciated.

OK, I removed the stuck controller node, waited until the cluster was stable, and upgraded (replaced) my Rancher instance with the latest stable version, 2.3.6.

I re-added the controller/etcd node with exactly the same results: the new node is stuck at “Waiting to register with Kubernetes,” and if I click on the node the Kubelet flag is red.

What does this flag mean, anyway?
(screenshot: the node detail page with the Kubelet indicator shown in red)

docker ps on the VM shows the kubelet container running, and kubectl describe node also shows the kubelet starting and running normally.

Hoping to hear from you.

I suggest checking the logs of the kubelet container; they should show you the error.
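
Something along these lines on the stuck node, for example (the Rancher agent container names vary, so look them up with docker ps first; the container ID below is a placeholder):

```bash
# tail the kubelet container logs on the stuck node
docker logs --tail 100 -f kubelet

# the Rancher node/cluster agent containers are also worth a look for registration problems
docker ps -a | grep -i agent
docker logs --tail 100 <agent-container-id>
```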

Hello,
when I do similar operations on my Rancher clusters’ nodes, before re-adding them I run a “rancher_node_cleanup.sh” script like the one below, which I found on the Rancher website a while ago, back in the early 2.0 days.

```bash
#!/bin/bash
# Docker cleanup
docker rm -f $(docker ps -qa)
docker rmi -f $(docker images -q)
docker volume rm $(docker volume ls -q)

# Mounts cleanup
sudo -v
for mount in $(mount | grep tmpfs | grep '/var/lib/kubelet' | awk '{ print $3 }') /var/lib/kubelet /var/lib/rancher; do
  sudo umount $mount
done

# Directory cleanup
sudo rm -rf /etc/ceph \
            /etc/cni \
            /etc/kubernetes \
            /opt/cni \
            /opt/rke \
            /run/secrets/kubernetes.io \
            /run/calico \
            /run/flannel \
            /var/lib/calico \
            /var/lib/etcd \
            /var/lib/cni \
            /var/lib/kubelet \
            /var/lib/rancher/rke/log \
            /var/log/containers \
            /var/log/pods \
            /var/run/calico
```
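
For what it’s worth, I run it roughly like this and then reboot before re-adding the node (the reboot is just my habit, to make sure no stale mounts or interfaces survive, not something the script requires):

```bash
chmod +x rancher_node_cleanup.sh
sudo ./rancher_node_cleanup.sh
sudo reboot
```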

Adalberto

The kubelet logs didn’t show any errors.

But I did notice that the re-added node was running a newer version of Docker. I upgraded Rancher from 2.2.2 to 2.3.6 and ran all OS/package updates on each controller node. Once all three controller nodes were on Docker 19.3.6, the stuck node finished registering. For good measure I did the same on all the worker nodes, and now everything appears to be well. In fact, the flapping Rancher connection to the cluster (a different issue; see my other posts) also seems to have gone away. Possibly fixed in Rancher 2.3.6?
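
In case it helps anyone else hitting this, a quick way to sanity-check that every node reports the same runtime version (the runtime version is part of the node status, so there is no need to log in to each machine):

```bash
# container runtime version per node, straight from the node status
kubectl get nodes -o custom-columns=NAME:.metadata.name,RUNTIME:.status.nodeInfo.containerRuntimeVersion

# or on each node directly
docker version --format '{{.Server.Version}}'
```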