Single Node using Docker on AWS EC2 instance
Docker version: 20.10.12
OS: Ubuntu 20.04.4 CIS AMI
Kubernetes cluster: RKE cluster on Amazon EC2
Kubernetes version: 1.21.10
We have been using Rancher for more than 2 years and currently manage 5 Kubernetes clusters.
A few weeks ago we upgraded Rancher step by step from v2.6.2 to v2.6.5.
To upgrade from v2.6.2 to v2.6.3 we had to use the workaround as described in the release notes.
All of our Kubernetes clusters are deployed using Terraform and the Rancher2 provider.
Since upgrading to Rancher v2.6.5, it is no longer possible to upgrade Kubernetes via the Rancher UI or the Terraform module.
The upgrade process starts and the controller node is shown as cordoned in the Rancher UI, but after a short time we get the following error and the cluster goes into an error state.
2022/06/15 11:26:25 [ERROR] Host playground-controller-1 failed to report Ready status with error: [controlplane] Error getting node playground-controller-1: "playground-controller-1" not found
Even if we wait, the problem does not resolve itself. In some tests we waited several days. In the Rancher logs I see that Rancher retries the upgrade over and over, but it fails each time because it can no longer find the controller node.
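To see which node names Rancher keeps failing on, and how often, the repeated errors can be pulled out of the Rancher log. This is only a minimal sketch: the pattern is modeled on the single error line quoted above, other Rancher versions may format the message differently, and `find_missing_nodes` is a hypothetical helper name, not part of any Rancher tooling.

```python
import re

# Pattern modeled on the one error line quoted in this post (assumption:
# all retries are logged in the same format).
ERROR_RE = re.compile(
    r'^(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \[ERROR\] '
    r'Host (?P<host>\S+) failed to report Ready status with error: '
    r'\[(?P<plane>[^\]]+)\] Error getting node (?P<node>\S+): '
)


def find_missing_nodes(log_lines):
    """Return {node_name: count} for 'Error getting node ... not found' lines."""
    counts = {}
    for line in log_lines:
        match = ERROR_RE.match(line)
        # Only count lines that end with the "not found" message.
        if match and line.rstrip().endswith("not found"):
            node = match.group("node")
            counts[node] = counts.get(node, 0) + 1
    return counts
```

Feeding it the lines of the Rancher container log gives a quick count per node name, which makes it easier to confirm that every retry fails on the same node.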
The nodes are all green in the Rancher UI.
Uncordoning the controller node unfortunately does not change anything either.
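One check that might help narrow this down is whether the node is registered in Kubernetes under a different name than the one Rancher looks up. The sketch below reads the output of `kubectl get nodes -o json` and compares it with the expected name. It is only an assumption that a name mismatch is involved; the helper names are hypothetical.

```python
import json


def registered_node_names(kubectl_json):
    """Map each node's metadata.name to its kubernetes.io/hostname label."""
    items = json.loads(kubectl_json).get("items", [])
    return {
        item["metadata"]["name"]: item["metadata"]
        .get("labels", {})
        .get("kubernetes.io/hostname")
        for item in items
    }


def is_registered(kubectl_json, expected_name):
    """True if a node with exactly this metadata.name exists in the cluster."""
    return expected_name in registered_node_names(kubectl_json)
```

For example, saving the output with `kubectl get nodes -o json > nodes.json` and calling `is_registered(open("nodes.json").read(), "playground-controller-1")` shows whether the name from the error message exists in the cluster at all.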
I can reproduce this error at any time with the following steps:
- Deployed a new Rancher (v2.6.2) on an EC2 instance.
- Deployed a Kubernetes cluster (v1.21.10-rancher1-1) with the same Terraform module.
- Created a backup of Rancher and a snapshot of the cluster.
- Upgraded Rancher step by step from v2.6.2 to v2.6.5, using the workaround for the v2.6.2 to v2.6.3 upgrade.
- Edited the cluster via the UI and selected v1.21.12-rancher1-1 as the Kubernetes version.
We have also tried deploying the cluster with a different Kubernetes version (1.22.9) and then upgrading to version 1.23.6 via the UI or Terraform.
The upgrade always gets stuck with the same error.
Does anyone have any idea what the error could be or what we are doing wrong?
If necessary, I can also provide the logs from the update.
We have been trying to solve this problem for 2 weeks without success, and we have not found a solution online either. We are grateful for any help.