New Custom RKE2 cluster (v2.8.5, v1.28.15+rke2r1) stuck in "Upgrading"

Hey!

For a couple of days now, I’ve been stuck trying to create a new custom RKE2 cluster with Kubernetes v1.28.15+rke2r1 from Rancher v2.8.5.

Quick Facts

  • Rancher v2.8.5
  • Kubernetes v1.28.15+rke2r1
  • 1x Control Plane Node
  • 3x Worker Nodes

Issue Description

The cluster is stuck in the state “Upgrading” with the message “Non-ready bootstrap machine(s) custom- and join url to be available on bootstrap node”. All nodes are in the state “Waiting for Node Ref”.
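In case it helps with debugging: the “Waiting for Node Ref” message suggests the underlying CAPI Machine objects never get a node reference. This is how I’ve been inspecting them from the Rancher local cluster (a sketch; fleet-default is the default namespace for provisioning v2 objects, and <machine-name> is a placeholder):

kubectl -n fleet-default get machines.cluster.x-k8s.io
kubectl -n fleet-default describe machines.cluster.x-k8s.io <machine-name>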

The control plane node started successfully and the system pods are up and running (after I fiddled with the node taints a bit because of “[BUG] helm-operation pod uses incorrect taint for RKE2 Control Plane/etcd nodes, blocking provisioning”, rancher/rancher#46228 on GitHub).
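(For completeness, the taint fiddling was roughly the following; the exact taint keys depend on how the node registered, so treat this as a sketch:)

# show the taints currently set on the node
sudo /var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml describe node control-plane-1 | grep -A 2 Taints
# temporarily drop the etcd taint so the helm-operation pod could schedule
sudo /var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml taint nodes control-plane-1 node-role.kubernetes.io/etcd:NoExecute-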

I can also talk to the Kubernetes API and see all pods:

[user@control-plane-1 ~]$ sudo /var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml get nodes
NAME                   STATUS   ROLES                       AGE     VERSION
control-plane-1        Ready    control-plane,etcd,master   4h26m   v1.28.15+rke2r1
[user@control-plane-1 ~]$ sudo /var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml get pods -A
NAMESPACE         NAME                                                    READY   STATUS      RESTARTS   AGE
calico-system     calico-kube-controllers-596b689854-v2hdq                1/1     Running     0          4h26m
calico-system     calico-node-gz6xj                                       1/1     Running     0          4h26m
calico-system     calico-typha-7fcd988b75-5zv8d                           1/1     Running     0          4h26m
cattle-system     cattle-cluster-agent-74b58f855c-594s2                   1/1     Running     0          3h27m
cattle-system     rancher-webhook-fd7599678-2nl4b                         1/1     Running     0          4h24m
kube-system       cloud-controller-manager-control-plane-1                1/1     Running     0          4h26m
kube-system       etcd-control-plane-1                                    1/1     Running     0          4h26m
kube-system       helm-install-rke2-calico-crd-rf2dz                      0/1     Completed   0          4h26m
kube-system       helm-install-rke2-calico-wmkqh                          0/1     Completed   1          4h26m
kube-system       helm-install-rke2-coredns-bb7wk                         0/1     Completed   0          4h26m
kube-system       helm-install-rke2-metrics-server-5tlvh                  0/1     Completed   0          4h26m
kube-system       helm-install-rke2-snapshot-controller-2fphw             0/1     Completed   1          4h26m
kube-system       helm-install-rke2-snapshot-controller-crd-fs59x         0/1     Completed   0          4h26m
kube-system       helm-install-rke2-snapshot-validation-webhook-qlkzs     0/1     Completed   0          4h26m
kube-system       kube-apiserver-control-plane-1                          1/1     Running     0          4h26m
kube-system       kube-controller-manager-control-plane-1                 1/1     Running     0          4h26m
kube-system       kube-proxy-control-plane-1                              1/1     Running     0          4h26m
kube-system       kube-scheduler-control-plane-1                          1/1     Running     0          4h26m
kube-system       rke2-coredns-rke2-coredns-7cf94cdd9f-dp22m              1/1     Running     0          4h26m
kube-system       rke2-coredns-rke2-coredns-autoscaler-694dcd9546-nnh69   1/1     Running     0          4h26m
kube-system       rke2-metrics-server-7694cf7d77-s52jc                    1/1     Running     0          4h25m
kube-system       rke2-snapshot-controller-5c9df4d7d6-rj4vh               1/1     Running     0          4h25m
kube-system       rke2-snapshot-validation-webhook-54f487ff94-pkmtn       1/1     Running     0          4h25m
tigera-operator   tigera-operator-55b858dcff-dvtn9                        1/1     Running     0          4h26m
[user@control-plane-1 ~]$ 

The new worker nodes are stuck with this:

[user@worker-1 ~]$ sudo journalctl -u rancher-system-agent.service --since today
Jan 06 11:52:07 worker-1 systemd[1]: Started Rancher System Agent.
Jan 06 11:52:07 worker-1 rancher-system-agent[192252]: time="2025-01-06T11:52:07+01:00" level=info msg="Rancher System Agent version v0.3.6 (41c07d0) is starting"
Jan 06 11:52:07 worker-1 rancher-system-agent[192252]: time="2025-01-06T11:52:07+01:00" level=info msg="Using directory /var/lib/rancher/agent/work for work"
Jan 06 11:52:07 worker-1 rancher-system-agent[192252]: time="2025-01-06T11:52:07+01:00" level=debug msg="Instantiated new image utility with imagesDir: /var/lib/rancher/agent/images, imageCredentialProviderConfig: /var/lib/rancher/credentialprovider/config.yaml, imageCredentialProviderBinDir: /var/lib/rancher/credentialprovider/bin, agentRegistriesFile: /etc/rancher/agent/registries.yaml"
Jan 06 11:52:07 worker-1 rancher-system-agent[192252]: time="2025-01-06T11:52:07+01:00" level=info msg="Starting remote watch of plans"
Jan 06 11:52:07 worker-1 rancher-system-agent[192252]: time="2025-01-06T11:52:07+01:00" level=info msg="Starting /v1, Kind=Secret controller"
Jan 06 11:52:07 worker-1 rancher-system-agent[192252]: time="2025-01-06T11:52:07+01:00" level=debug msg="[K8s] Processing secret custom-37a31230f094-machine-plan in namespace fleet-default at generation 0 with resource version 720855112"
Jan 06 11:52:12 worker-1 rancher-system-agent[192252]: time="2025-01-06T11:52:12+01:00" level=debug msg="[K8s] Processing secret custom-37a31230f094-machine-plan in namespace fleet-default at generation 0 with resource version 720855112"
[ ... ]
Jan 06 16:00:59 worker-1 rancher-system-agent[192252]: time="2025-01-06T16:00:59+01:00" level=debug msg="[K8s] Processing secret custom-37a31230f094-machine-plan in namespace fleet-default at generation 0 with resource version 720855112"
Jan 06 16:01:04 worker-1 rancher-system-agent[192252]: time="2025-01-06T16:01:04+01:00" level=debug msg="[K8s] Processing secret custom-37a31230f094-machine-plan in namespace fleet-default at generation 0 with resource version 720855112"
Jan 06 16:01:09 worker-1 rancher-system-agent[192252]: time="2025-01-06T16:01:09+01:00" level=debug msg="[K8s] Processing secret custom-37a31230f094-machine-plan in namespace fleet-default at generation 0 with resource version 720855112"
[user@worker-1 ~]$ 
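The agent connects fine and keeps re-reading the same plan secret at generation 0, so it looks like Rancher never populates a plan for the node. For reference, this is how I confirmed what the agent is pointed at (paths are the defaults on my install, so treat this as a sketch):

sudo cat /etc/rancher/agent/config.yaml
sudo journalctl -u rancher-system-agent.service --since today | grep -i -e error -e warn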

Additional Things I Tried

  • opened up all ports according to the port requirements documentation (https://ranchermanager.docs.rancher.com/getting-started/installation-and-upgrade/installation-requirements/port-requirements#ports-for-rancher-server-nodes-on-rke2; I’m not allowed to post more than 2 links…); see the connectivity check below the list
  • deactivated firewalld altogether
  • used one (and then two) of the worker nodes as additional control plane nodes
  • installed the control-plane node with all three roles “control-plane”, “etcd”, and “worker”
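
To double-check the firewall/ports item, I also verified basic reachability from a worker node (hostnames are placeholders; 9345 is the RKE2 supervisor port, 6443 the Kubernetes API, 443 the Rancher server):

nc -zv control-plane-1 9345
nc -zv control-plane-1 6443
curl -ks https://<rancher-url>/healthz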

None of this really changed anything. No additional node shows up in the cluster; it’s always only the first (control plane) node that joins.

Updating Rancher to v2.9.x

I’m a little desperate right now and am starting to wonder if this might be a Rancher bug. I would like to upgrade, but I have another cluster that is still on Kubernetes 1.26 and cannot be upgraded just yet (for internal reasons).

Support for this version was dropped in Rancher v2.9 (see “Remove support for K8s 1.25 and 1.26 for Rancher 2.9.0”, rancher/rancher#45882 on GitHub). What would happen if I upgraded Rancher anyway while the cluster still runs 1.26? Would I be able to upgrade it to 1.27+ afterwards?

Any help is very welcome! Thanks in advance!

Looking at this again, I think I’m missing the fleet-agent, right? When and where should it be installed? It should happen automatically, right?
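
For reference, this is how I checked for it on the downstream cluster (as far as I know, the fleet-agent normally runs in the cattle-fleet-system namespace once registration completes):

sudo /var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml get ns
sudo /var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml get pods -n cattle-fleet-system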