Rancher v2.6.5 - Kubernetes Cluster update failed - controller node not found

Daniela · June 15, 2022, 11:52am

Single Node using Docker on AWS EC2 instance
Rancher: v2.6.5
Docker version: 20.10.12
OS: Ubuntu 20.04.4 CIS ami

Kubernetes Cluster RKE Cluster - Amazon EC2
OS: Flatcar
Kubernetes version: 1.21.10

Hello everyone,
We have been using Rancher for more than 2 years and currently manage 5 Kubernetes clusters.
A few weeks ago we upgraded Rancher step by step from v2.6.2 to v2.6.5.
To upgrade from v2.6.2 to v2.6.3 we had to use the workaround as described in the release notes.
All of our Kubernetes clusters are deployed using Terraform and the Rancher2 provider.

Since we started using Rancher v2.6.5, it is no longer possible to perform Kubernetes updates via the Rancher UI or the Terraform module.
The update process starts and the controller node is shown as cordoned in the Rancher UI, but after a short time we get the following error and the cluster goes into an error state.

2022/06/15 11:26:25 [ERROR] Host playground-controller-1 failed to report Ready status with error: [controlplane] Error getting node playground-controller-1:  "playground-controller-1" not found

Waiting does not help: the problem does not resolve itself. In a few tests, we even waited several days. In the Rancher logs I can see that Rancher retries the upgrade over and over, but each attempt fails because it can’t find the controller node anymore.

The nodes are all green in the Rancher UI.
Uncordoning the controller node unfortunately doesn’t change anything either.
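One way to narrow this down is to compare the node name from the error with the names actually registered in the Kubernetes API. The following is a minimal Python sketch, not a diagnosis: it assumes the list of registered nodes is pasted in from `kubectl get nodes -o name`, and the sample node names in `SAMPLE_KUBECTL_OUTPUT` are hypothetical placeholders, not taken from our cluster.

```python
import re

# Matches the "not found" part of the RKE error line quoted above.
NOT_FOUND = re.compile(
    r'Error getting node (?P<node>\S+):\s+"(?P<quoted>[^"]+)" not found'
)

# The error line from this post.
LOG_LINE = (
    '2022/06/15 11:26:25 [ERROR] Host playground-controller-1 failed to report '
    'Ready status with error: [controlplane] Error getting node '
    'playground-controller-1:  "playground-controller-1" not found'
)

# Hypothetical output of `kubectl get nodes -o name` (illustration only).
SAMPLE_KUBECTL_OUTPUT = "node/some-other-name\nnode/playground-worker-1\n"


def missing_node(log_line):
    """Return the node name the upgrade failed to look up, or None."""
    match = NOT_FOUND.search(log_line)
    return match.group("node") if match else None


def registered_names(kubectl_output):
    """Parse `kubectl get nodes -o name` output (lines like `node/<name>`)."""
    names = []
    for line in kubectl_output.splitlines():
        line = line.strip()
        if line:
            # Drop the "node/" resource prefix if present.
            names.append(line.split("/", 1)[-1])
    return names


def check(log_line, kubectl_output):
    """Return (expected name, whether it is registered, all registered names)."""
    expected = missing_node(log_line)
    names = registered_names(kubectl_output)
    return expected, expected in names, names


if __name__ == "__main__":
    expected, found, names = check(LOG_LINE, SAMPLE_KUBECTL_OUTPUT)
    print(f"expected node: {expected}")
    print(f"registered in the API: {found}")
    print(f"registered names: {names}")
```

If the expected name is missing from the registered names, the node may be registered under a different name than the one the upgrade looks up. Whether that is the case here is not confirmed.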

I can reproduce this error at any time with the following steps:

1. Deployed a new Rancher (v2.6.2) on an EC2 instance.
2. Deployed a Kubernetes cluster (v1.21.10-rancher1-1) with the same Terraform module.
3. Created a backup of Rancher and a snapshot of the cluster.
4. Upgraded Rancher step by step from v2.6.2 to v2.6.5, using the workaround for the v2.6.2 to v2.6.3 upgrade.
5. Edited the cluster via the UI and selected v1.21.12-rancher1-1 as the Kubernetes version.

We have also tried deploying the cluster with a different Kubernetes version (1.22.9) and then upgrading to version 1.23.6 via the UI or Terraform.
The update always got stuck with the same error.

Does anyone have any idea what the error could be or what we are doing wrong?
If necessary, I can also provide the logs from the update.
We have been trying to solve the problem for 2 weeks now, but without success, and we have not found a solution online either. We are grateful for any help.

Thanks,
Daniela

Daniela · July 8, 2022, 6:22am

After further testing, we found that this problem has only occurred since Rancher version 2.6.4. Does nobody else have this problem, or an idea what it could be?

Aransh · November 23, 2022, 6:18pm

Hey, it seems we’re in the same boat.
There are some additional details here:

github.com/rancher/rancher

It is no longer possible to update Kubernetes version since Rancher 2.6.4

opened 06:15AM - 08 Jul 22 UTC

killerquitsche

**Rancher Server Setup**
- Rancher version: v2.6.5
- Installation option (Docker install/Helm Chart): Docker (single Node install on AWS EC2 instance)
- Docker version: 20.10.12
- OS: Ubuntu 20.04.4 CIS AMI

**Information about the Cluster**
- Kubernetes version: 1.21.10
- Cluster Type (Local/Downstream): Downstream (1x etcd, 1x controller, 3x Worker)
- Infrastructure Provider = Rancher provisioning the nodes using AWS node driver
- Hosted: RKE Cluster - Amazon EC2
- OS on each node: Flatcar

**User Information**
- What is the role of the user logged in? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom): Admin

**Describe the bug**
Cluster goes into Error state during a Kubernetes update.

**To Reproduce**
- Provision an EC2 instance and install Rancher in version 2.6.4 or 2.6.5 (docker single node install)
- Deploy a Kubernetes Cluster in Rancher
- Edit the cluster configuration and increase the Kubernetes version to initiate an update

**Result**
- The update process starts, the controller node is shown as cordoned in the Rancher UI, but after a short time we get the following error and the cluster goes into error state.

`Host xxxxx-controller-1 failed to report Ready status with error: [controlplane] Error getting node xxxxx-controller-1: "xxxxxxx-controller-1" not found`

**Expected Result**
The controller node should be successfully updated at some point and in the Rancher UI the cordoned should disappear at the controller node and update the next node until the whole cluster is updated. After that, the cluster should be available again normally.

**Additional context**
We have already tried different Rancher version and Kubernetes versions. We noticed that this problem only appears in Rancher 2.6.4 and higher.
- Rancher 2.6.3, Kubernetes Cluster 1.21.10 to 1.21.12: No Problems
- Rancher 2.6.4, Kubernetes Cluster 1.21.10 to 1.21.12: Update does not work
- Rancher 2.6.4, Kubernetes Cluster 1.22.4 to 1.22.5: Update does not work
- Rancher 2.6.5, Kubernetes Cluster 1.21.13, update Kubernetes Cluster to 1.22.4: Update does not work
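The test matrix from the issue can also be written down as data, which makes the version boundary explicit. This is only an illustrative Python sketch: the tuples restate the four results reported above, and the helper names (`version_key`, `first_failing`, `last_working`) are made up for this example.

```python
# (Rancher version, Kubernetes from, Kubernetes to, upgrade succeeded)
# Values restate the results reported in the issue above.
RESULTS = [
    ("2.6.3", "1.21.10", "1.21.12", True),
    ("2.6.4", "1.21.10", "1.21.12", False),
    ("2.6.4", "1.22.4", "1.22.5", False),
    ("2.6.5", "1.21.13", "1.22.4", False),
]


def version_key(version):
    """Turn '2.6.4' into (2, 6, 4) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def first_failing(results):
    """Lowest Rancher version with at least one failed upgrade, or None."""
    failing = [version_key(r[0]) for r in results if not r[3]]
    return ".".join(str(p) for p in min(failing)) if failing else None


def last_working(results):
    """Highest Rancher version with at least one successful upgrade, or None."""
    working = [version_key(r[0]) for r in results if r[3]]
    return ".".join(str(p) for p in max(working)) if working else None


if __name__ == "__main__":
    print("last working Rancher version:", last_working(RESULTS))
    print("first failing Rancher version:", first_failing(RESULTS))
```

With the four reported results, the boundary falls between Rancher 2.6.3 and 2.6.4, which matches the observation in the issue.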

But we have yet to receive an official response from Rancher.
