Hey!
I’ve been stuck for a couple of days now trying to create a new custom RKE2 cluster with Kubernetes v1.28.15+rke2r1 from Rancher v2.8.5.
Quick Facts
- Rancher v2.8.5
- Kubernetes v1.28.15+rke2r1
- 1x Control Plane Node
- 3x Worker Nodes
Issue Description
The cluster is stuck in state “Upgrading” with the message “Non-ready bootstrap machine(s) custom- and join url to be available on bootstrap node”. All nodes are in state “Waiting for Node Ref”.
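In case it helps, the provisioning objects behind those states can be inspected from the Rancher (local) cluster with something like this (a sketch; run it wherever the local cluster’s kubeconfig is available, and fleet-default is simply the namespace that also shows up in the machine-plan secret in the worker logs below):
kubectl -n fleet-default get clusters.provisioning.cattle.io
kubectl -n fleet-default get machines.cluster.x-k8s.io -o wide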
The control-plane node started successfully and the system pods are up and running (after I fiddled around with the node taints a bit because of rancher/rancher#46228, “[BUG] helm-operation pod uses incorrect taint for RKE2 Control Plane/etcd nodes, blocking provisioning”).
I can also talk to the Kubernetes API and see all pods.
[user@control-plane-1 ~]$ sudo /var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml get nodes
NAME STATUS ROLES AGE VERSION
control-plane-1 Ready control-plane,etcd,master 4h26m v1.28.15+rke2r1
[user@control-plane-1 ~]$ sudo /var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
calico-system calico-kube-controllers-596b689854-v2hdq 1/1 Running 0 4h26m
calico-system calico-node-gz6xj 1/1 Running 0 4h26m
calico-system calico-typha-7fcd988b75-5zv8d 1/1 Running 0 4h26m
cattle-system cattle-cluster-agent-74b58f855c-594s2 1/1 Running 0 3h27m
cattle-system rancher-webhook-fd7599678-2nl4b 1/1 Running 0 4h24m
kube-system cloud-controller-manager-control-plane-1 1/1 Running 0 4h26m
kube-system etcd-control-plane-1 1/1 Running 0 4h26m
kube-system helm-install-rke2-calico-crd-rf2dz 0/1 Completed 0 4h26m
kube-system helm-install-rke2-calico-wmkqh 0/1 Completed 1 4h26m
kube-system helm-install-rke2-coredns-bb7wk 0/1 Completed 0 4h26m
kube-system helm-install-rke2-metrics-server-5tlvh 0/1 Completed 0 4h26m
kube-system helm-install-rke2-snapshot-controller-2fphw 0/1 Completed 1 4h26m
kube-system helm-install-rke2-snapshot-controller-crd-fs59x 0/1 Completed 0 4h26m
kube-system helm-install-rke2-snapshot-validation-webhook-qlkzs 0/1 Completed 0 4h26m
kube-system kube-apiserver-control-plane-1 1/1 Running 0 4h26m
kube-system kube-controller-manager-control-plane-1 1/1 Running 0 4h26m
kube-system kube-proxy-control-plane-1 1/1 Running 0 4h26m
kube-system kube-scheduler-control-plane-1 1/1 Running 0 4h26m
kube-system rke2-coredns-rke2-coredns-7cf94cdd9f-dp22m 1/1 Running 0 4h26m
kube-system rke2-coredns-rke2-coredns-autoscaler-694dcd9546-nnh69 1/1 Running 0 4h26m
kube-system rke2-metrics-server-7694cf7d77-s52jc 1/1 Running 0 4h25m
kube-system rke2-snapshot-controller-5c9df4d7d6-rj4vh 1/1 Running 0 4h25m
kube-system rke2-snapshot-validation-webhook-54f487ff94-pkmtn 1/1 Running 0 4h25m
tigera-operator tigera-operator-55b858dcff-dvtn9 1/1 Running 0 4h26m
[user@control-plane-1 ~]$
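I don’t know whether it is related to the “join url” part of the provisioning message, but the downstream cluster agent can be checked the same way; this only assumes the standard deployment name visible in the pod list above:
sudo /var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml -n cattle-system logs deploy/cattle-cluster-agent --tail=100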
New worker nodes are stuck with this:
[user@worker-1 ~]$ sudo journalctl -u rancher-system-agent.service --since today
Jan 06 11:52:07 worker-1 systemd[1]: Started Rancher System Agent.
Jan 06 11:52:07 worker-1 rancher-system-agent[192252]: time="2025-01-06T11:52:07+01:00" level=info msg="Rancher System Agent version v0.3.6 (41c07d0) is starting"
Jan 06 11:52:07 worker-1 rancher-system-agent[192252]: time="2025-01-06T11:52:07+01:00" level=info msg="Using directory /var/lib/rancher/agent/work for work"
Jan 06 11:52:07 worker-1 rancher-system-agent[192252]: time="2025-01-06T11:52:07+01:00" level=debug msg="Instantiated new image utility with imagesDir: /var/lib/rancher/agent/images, imageCredentialProviderConfig: /var/lib/rancher/credentialprovider/config.yaml, imageCredentialProviderBinDir: /var/lib/rancher/credentialprovider/bin, agentRegistriesFile: /etc/rancher/agent/registries.yaml"
Jan 06 11:52:07 worker-1 rancher-system-agent[192252]: time="2025-01-06T11:52:07+01:00" level=info msg="Starting remote watch of plans"
Jan 06 11:52:07 worker-1 rancher-system-agent[192252]: time="2025-01-06T11:52:07+01:00" level=info msg="Starting /v1, Kind=Secret controller"
Jan 06 11:52:07 worker-1 rancher-system-agent[192252]: time="2025-01-06T11:52:07+01:00" level=debug msg="[K8s] Processing secret custom-37a31230f094-machine-plan in namespace fleet-default at generation 0 with resource version 720855112"
Jan 06 11:52:12 worker-1 rancher-system-agent[192252]: time="2025-01-06T11:52:12+01:00" level=debug msg="[K8s] Processing secret custom-37a31230f094-machine-plan in namespace fleet-default at generation 0 with resource version 720855112"
[ ... ]
Jan 06 16:00:59 worker-1 rancher-system-agent[192252]: time="2025-01-06T16:00:59+01:00" level=debug msg="[K8s] Processing secret custom-37a31230f094-machine-plan in namespace fleet-default at generation 0 with resource version 720855112"
Jan 06 16:01:04 worker-1 rancher-system-agent[192252]: time="2025-01-06T16:01:04+01:00" level=debug msg="[K8s] Processing secret custom-37a31230f094-machine-plan in namespace fleet-default at generation 0 with resource version 720855112"
Jan 06 16:01:09 worker-1 rancher-system-agent[192252]: time="2025-01-06T16:01:09+01:00" level=debug msg="[K8s] Processing secret custom-37a31230f094-machine-plan in namespace fleet-default at generation 0 with resource version 720855112"
[user@worker-1 ~]$
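The agent just keeps re-processing the same machine-plan secret at the same resource version and never executes anything, so my assumption is that Rancher never put an actual plan into it. That should be verifiable from the Rancher (local) cluster with something like this (the plan key name is an assumption on my side):
kubectl -n fleet-default get secret custom-37a31230f094-machine-plan -o jsonpath='{.data.plan}' | base64 -d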
Additional Things I Tried
- opened up all ports according to the port requirements documentation (https://ranchermanager.docs.rancher.com/getting-started/installation-and-upgrade/installation-requirements/port-requirements#ports-for-rancher-server-nodes-on-rke2, I’m not allowed to post more than 2 links…)
- deactivated firewalld altogether
- used one (and then two) of the worker nodes as additional control-plane nodes
- installed the control-plane node with all three roles “control-plane”, “etcd”, and “worker”
None of this really changed anything. No additional node shows up in the cluster; it’s always only the first (control-plane) node that has joined.
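To rule out a plain connectivity problem between the workers and the control-plane node, a quick check of the two RKE2 join ports (6443 for the Kubernetes API, 9345 for the RKE2 supervisor) from a worker would be something like this (<control-plane-ip> is just a placeholder):
nc -zv <control-plane-ip> 6443
nc -zv <control-plane-ip> 9345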
Updating Rancher to v2.9.x
I’m a little desperate right now and wondering whether this might be a Rancher bug. I would like to update Rancher, but I still have a cluster on Kubernetes 1.26 that I cannot upgrade just yet (for internal reasons).
Support for that version was dropped in Rancher v2.9 (see rancher/rancher#45882, “Remove support for K8s 1.25 and 1.26 for Rancher 2.9.0”). What would happen if I upgrade Rancher anyway while that cluster still runs 1.26? Would I be able to update it to 1.27+ afterwards?
Any help is very welcome! Thanks in advance!