Rancher v2.6.5 - Kubernetes Cluster update failed - controller node not found

Single Node using Docker on AWS EC2 instance
Rancher: v2.6.5
Docker version: 20.10.12
OS: Ubuntu 20.04.4 CIS AMI

Kubernetes Cluster RKE Cluster - Amazon EC2
OS: Flatcar
Kubernetes version: 1.21.10

Hello everyone,
We have been using Rancher for more than 2 years and currently manage 5 Kubernetes clusters.
A few weeks ago we upgraded Rancher step by step from v2.6.2 to v2.6.5.
To upgrade from v2.6.2 to v2.6.3 we had to use the workaround described in the release notes.
All of our Kubernetes clusters are deployed using Terraform and the Rancher2 provider.

Since we started using Rancher v2.6.5, it is no longer possible to perform Kubernetes updates via the Rancher UI or the Terraform module.
The update process starts and the controller node is shown as cordoned in the Rancher UI, but after a short time we get the following error and the cluster goes into an error state.

2022/06/15 11:26:25 [ERROR] Host playground-controller-1 failed to report Ready status with error: [controlplane] Error getting node playground-controller-1:  "playground-controller-1" not found
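
For anyone trying to narrow this down, here is a minimal Python sketch that pulls the host name out of a log line like the one above and checks whether a node with exactly that name appears in the output of `kubectl get nodes -o name`. The function names and the sample node list are hypothetical and only for illustration, and the regular expression assumes the log format shown above. It only tells you whether the names match, not why the upgrade fails.

```python
import re

# Assumption: the log line has the form shown above,
# "... Host <name> failed to report Ready status ...".
HOST_RE = re.compile(r"Host (\S+) failed to report Ready status")


def host_from_log(line):
    """Return the host name from the error line, or None if it does not match."""
    match = HOST_RE.search(line)
    return match.group(1) if match else None


def node_is_registered(host, kubectl_names):
    """Check whether host appears in `kubectl get nodes -o name` style output.

    kubectl prints each entry as "node/<name>", so the prefix is stripped first.
    """
    names = {n.strip().split("/", 1)[-1] for n in kubectl_names if n.strip()}
    return host in names


line = (
    "2022/06/15 11:26:25 [ERROR] Host playground-controller-1 failed to report "
    "Ready status with error: [controlplane] Error getting node "
    'playground-controller-1:  "playground-controller-1" not found'
)
host = host_from_log(line)

# Hypothetical node list, not taken from our cluster; replace it with the
# real lines printed by `kubectl get nodes -o name`.
sample_nodes = [
    "node/ip-10-0-0-12.eu-central-1.compute.internal",
    "node/playground-worker-1",
]
print(host, node_is_registered(host, sample_nodes))
```

If the check returns False against the real node list, the name Rancher is looking for differs from the name the node is registered under in the Kubernetes API, which would be worth mentioning when reporting the problem.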

Waiting does not help: the problem does not resolve itself, and in a few tests we waited several days. In the Rancher logs I can see that Rancher retries the upgrade over and over, but each attempt fails because it can no longer find the controller node.

The nodes are all green in the Rancher UI.
Uncordoning the controller node unfortunately doesn’t change anything either.

I can reproduce this error at any time with the following steps:

  1. Deployed a new Rancher (v2.6.2) on an EC2 instance.
  2. Deployed a Kubernetes cluster (v1.21.10-rancher1-1) with the same Terraform module.
  3. Created a backup of Rancher and a snapshot of the cluster.
  4. Upgraded Rancher step by step from v2.6.2 to v2.6.5, using the workaround for the v2.6.2 to v2.6.3 upgrade.
  5. Edited the cluster via the UI and selected v1.21.12-rancher1-1 as the Kubernetes version.

We have also tried deploying the cluster with a different Kubernetes version (1.22.9) and then upgrading to version 1.23.6 via the UI or Terraform.
The update always got stuck with the same error.

Does anyone have any idea what the error could be or what we are doing wrong?
If necessary, I can also provide the logs from the update.
We have been trying to solve the problem for 2 weeks, unfortunately without success, and we have not found a solution on the Internet either. We are grateful for any help.


After further testing, we found that this problem has only occurred since Rancher version 2.6.4. Does nobody else have this problem, or an idea what it could be?

Hey, it seems we’re in the same boat.
We have some additional details here:

But we have yet to receive an official response from Rancher.