RKE cluster upgrade breaks Rancher 2.4.5

Hello there,

At the moment I’m trying to upgrade the Kubernetes version of our RKE cluster from 1.17.2 to 1.17.6 or 1.18.3, but in none of my attempts did the Rancher component survive the upgrade.
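For reference, each attempt consisted of nothing more than bumping the version in the cluster.yml and re-running rke up. A minimal sketch (the exact -rancher suffix below is illustrative; `rke config --list-version --all` shows which versions a given rke binary supports):

```yaml
# cluster.yml (excerpt) – the only line changed for the upgrade
kubernetes_version: "v1.17.6-rancher2-1"
```

followed by `rke up --config cluster.yml`.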

After the RKE update I monitored the status of the cluster with kubectl and observed that the cattle-system namespace changed to the Terminating status and was then automatically deleted.
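In case it helps with reproducing, this is roughly how I watched it happen (a plain kubectl sketch, nothing special):

```sh
# Watch the namespace during the upgrade; it switches to Terminating
# and then disappears from the list entirely.
kubectl get namespace cattle-system -w

# Or just print the current phase:
kubectl get namespace cattle-system -o jsonpath='{.status.phase}'
```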

Rancher was installed using the following instructions:
https://rancher.com/docs/rancher/v2.x/en/installation/other-installation-methods/air-gap/install-rancher/
Version: 2.4.5

For the upgrade I use the RKE CLI. The environment is semi-air-gapped: the Docker images are pulled through an Artifactory mirror, while the Helm charts can be fetched directly from the internet via a configured HTTP proxy.
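The mirror is wired up through the private_registries section of the cluster.yml, roughly along these lines (the hostname is a placeholder; the real values are in the gist linked below):

```yaml
# cluster.yml (excerpt) – route all system images through the mirror
private_registries:
  - url: artifactory.example.com   # placeholder for our internal mirror
    is_default: true               # use it for all system images
```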

See the cluster configuration here: https://gist.github.com/twecker137/a96b190b2d9b6bcf30511dcc2ca3b22a

The cluster was installed using rke version v1.0.4.

To upgrade it I have tried version v1.1.3.

The upgrade succeeds in the broadest sense, but I had to remove the directory /vol1/custom/log/kubernetes/kube-audit from the kube-api extra_binds; otherwise a duplicate bind error occurred.
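For context, the bind I had to drop looked like this in the cluster.yml (sketch of the relevant excerpt):

```yaml
# cluster.yml (excerpt) – the extra_bind that triggered the
# duplicate bind error under rke v1.1.3 and had to be removed
services:
  kube-api:
    extra_binds:
      - "/vol1/custom/log/kubernetes/kube-audit:/vol1/custom/log/kubernetes/kube-audit"
```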

Can anyone help me figure out what could be causing this? Restoring the cluster with rke etcd snapshot-restore was not successful either: as soon as I used any Kubernetes version other than the original 1.17.2, the cattle-system namespace was deleted, while the other components were preserved.
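The restore attempts looked roughly like this (snapshot name is illustrative):

```sh
# Restore the etcd snapshot taken before the upgrade;
# rke then brings the cluster back up from it.
rke etcd snapshot-restore --config cluster.yml --name snapshot-before-upgrade
```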

We’re taking a look. Will try to reproduce.

Is the cluster.yaml (as given in the original issue) used to deploy the cluster using rke v1.0.4?

Yes, or to be more precise: we originally installed the cluster with rke v1.0.0 and Kubernetes v1.16.3-rancher1-1 and already upgraded it successfully, with the same file and rke v1.0.4, to the current Kubernetes v1.17.2-rancher1-2.
There have been no changes to this file since the initial deployment, apart from the backup section and the user addons, which are usually active. We had to comment those out during the cluster restore (see the sketch below).
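Concretely, these are the kinds of sections we commented out for the restore (a sketch with placeholder values, not our exact file):

```yaml
# cluster.yml (excerpt) – commented out during the restore
# services:
#   etcd:
#     backup_config:          # recurring etcd snapshots
#       interval_hours: 12
#       retention: 6
# addons_include:             # user addons normally applied by rke up
#   - ./addons/user-addon.yaml
```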