Unable to build or upgrade cluster due to validation error

We’ve been using Rancher for a couple of months, and recently noticed that we can’t build or update clusters in our Rancher HA install. When editing a cluster, the version list is populated, but any attempt to upgrade fails with a message like:

Failed to validate cluster: v1.16.6-rancher1-1 is an unsupported Kubernetes version and system images are not populated: calico ctl image is not populated

(The example cluster is currently on v1.16.3-rancher1-1.)

I’ve tried v1.16.4-rancher1-1, v1.16.6-rancher1-1 and v1.17.2-rancher1-1 with the same result. I’ve also tried upgrading Rancher from 2.3.3 to 2.3.4 and upgrading the underlying Kubernetes in the Rancher cluster, with no improvement.

When building a new cluster, the error is:

Can not find RKE state file: open /var/lib/rancher/management-state/rke/rke-132122861/cluster.rkestate: no such file or directory

I’ve tried refreshing the RKE data, and I can see that the error comes from https://github.com/rancher/rke/blob/25e7f987775dbd0e71dac82d63c0df62f65ca053/cluster/validation.go#L302, but I don’t know why the data isn’t populated.

Anyone have any ideas?

Digging a little further, this seems to apply only to Calico-configured clusters, which I should have realized from the error. I can build canal or flannel clusters without issues.
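To illustrate why only one network plugin would be affected, here is a rough sketch of a per-plugin image check. This is not RKE's actual code: the function name, the image table, and the example image strings are all made up for illustration. The only assumption is that validation looks at the images for the selected plugin and nothing else, so a single missing Calico entry would break Calico clusters while canal and flannel pass.

```python
# Hypothetical sketch, NOT RKE's real implementation. The image names and
# the REQUIRED_IMAGES table are illustrative assumptions.
REQUIRED_IMAGES = {
    "calico": ["calico node", "calico cni", "calico ctl"],
    "canal": ["canal node", "canal cni", "canal flannel"],
    "flannel": ["flannel", "flannel cni"],
}


def validate_network_images(plugin, images):
    """Return an error message if a required image for `plugin` is empty.

    Only the images belonging to the selected plugin are checked, so a gap
    in one plugin's metadata does not affect clusters using another plugin.
    Returns None when every required image is populated.
    """
    for name in REQUIRED_IMAGES.get(plugin, []):
        if not images.get(name):
            return f"{name} image is not populated"
    return None


# Made-up metadata in which only the "calico ctl" entry is missing.
images = {
    "calico node": "example/calico-node:tag",
    "calico cni": "example/calico-cni:tag",
    "calico ctl": "",
    "canal node": "example/canal-node:tag",
    "canal cni": "example/canal-cni:tag",
    "canal flannel": "example/canal-flannel:tag",
    "flannel": "example/flannel:tag",
    "flannel cni": "example/flannel-cni:tag",
}

for plugin in ("calico", "canal", "flannel"):
    print(plugin, "->", validate_network_images(plugin, images))
```

With this made-up data, only the calico check reports a problem, which matches the behavior described above.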

I posted this on Slack, and it turned out to be a bug: https://github.com/rancher/rancher/issues/25106