Failed Rancher upgrade

I run a small bare-metal Kubernetes cluster managed by a single Rancher instance running in Docker on a VM. We take nightly tarballs of /var/lib/rancher, and in the past I’ve been able to upgrade by:

  • stopping and deleting the Rancher container
  • pulling the latest stable Rancher image
  • starting a new Rancher container (roughly the commands sketched below)
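A minimal sketch of that flow, assuming the container is named rancher and the data sits in a host bind mount at /var/lib/rancher (both assumptions; adjust to your setup):

    # stop and remove the old container; the data survives in the bind mount
    docker stop rancher
    docker rm rancher
    # pull the current image and start a replacement on the same data
    docker pull rancher/rancher:latest
    docker run -d --name rancher --restart=unless-stopped \
      -p 80:80 -p 443:443 --privileged \
      -v /var/lib/rancher:/var/lib/rancher \
      rancher/rancher:latest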

If the VM is also due for an upgrade, I’ve been able to:

  • rebuild/replace the VM
  • pull the Rancher image and start a container
  • stop the Rancher container
  • delete the /var/lib/rancher contents and replace them from the tarball
  • start the Rancher container again (see the sketch after this list)
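The restore step, sketched under the same assumptions and with a hypothetical tarball path; the -C / works if the tarball was created from / (tar strips the leading slash, so entries unpack back under /var/lib/rancher):

    # the container must be stopped while its data is swapped out
    docker stop rancher
    # drop the freshly-initialized data and unpack last night's tarball
    rm -rf /var/lib/rancher/*
    tar xzf /backups/rancher-nightly.tar.gz -C /
    docker start rancher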

This time it’s failing. The Docker container goes into a restart loop, dying roughly every 30 seconds. Each attempt logs pretty much this:

    INFO: Running k3s server --cluster-init --cluster-reset
    2024/11/20 18:25:13 [INFO] Rancher version v2.10.0 (df45e368c82d4027410fa4700371982b9236b7c8) is starting
    2024/11/20 18:25:13 [INFO] Rancher arguments {ACMEDomains:[] AddLocal:true Embedded:false BindHost: HTTPListenPort:80 HTTPSListenPort:443 K8sMode:auto Debug:false Trace:false NoCACerts:false AuditLogPath:/var/log/auditlog/rancher-api-audit.log AuditLogMaxage:10 AuditLogMaxsize:100 AuditLogMaxbackup:10 AuditLevel:0 Features: ClusterRegistry:}
    2024/11/20 18:25:13 [INFO] Listening on /tmp/log.sock
    2024/11/20 18:25:13 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
    2024/11/20 18:25:15 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
    2024/11/20 18:25:17 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
    2024/11/20 18:25:19 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
    2024/11/20 18:25:27 [INFO] Running in single server mode, will not peer connections
    2024/11/20 18:25:30 [FATAL] Internal error occurred: failed calling webhook "rancher.cattle.io.namespaces.create-non-kubesystem": failed to call webhook: Post "https://rancher-webhook.cattle-system.svc:443/v1/webhook/validation/namespaces?timeout=10s": proxy error from 127.0.0.1:6443 while dialing 10.42.0.76:9443, code 502: 502 Bad Gateway

Later attempts also print instructions for removing the reset flag file if you want to try resetting again. Online searches turn up plenty of references to the “failed calling webhook” error, but not for the “502 Bad Gateway” part. Grateful for any guidance: at this point our prod cluster is running fine and I have kubectl access to it, but no graceful way to manage nodes, users/RBAC, and so on.

Forgot to mention: the previously running Rancher was 2.8.5, now 2.10.
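One detail worth noting: the webhook in that FATAL line runs inside Rancher’s embedded k3s (the “local” cluster), not in the downstream prod cluster, so the place to look is inside the container itself. A sketch of what could be run between restarts, assuming the container is named rancher and a default rancher-webhook deployment:

    # the Rancher image bundles kubectl wired to the embedded k3s
    docker exec rancher kubectl -n cattle-system get pods
    docker exec rancher kubectl -n cattle-system logs deploy/rancher-webhook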

Some things to consider:

  • always use version-tagged Docker images (e.g. rancher/rancher:v2.8.5) rather than latest
  • use the Rancher built-in backup mechanism
  • read the changelogs/upgrade instructions for each release carefully before upgrading (there were quite a few changes in 2.9 and 2.10), and make sure your downstream clusters are on a supported k8s version; stepping through 2.9.x rather than jumping straight from 2.8 to 2.10 is also the safer path

In your case, using the 2.8.x image and restoring your backup should get you up and running again (see the sketch below).
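A sketch of that rollback, assuming the same bind-mount layout as above and the last nightly tarball taken before the 2.10 attempt (path is hypothetical):

    # remove the failed 2.10 container
    docker stop rancher
    docker rm rancher
    # the 2.10 start attempts have already modified the data, so restore it
    rm -rf /var/lib/rancher/*
    tar xzf /backups/rancher-pre-upgrade.tar.gz -C /
    # start the old, version-tagged image on the restored data
    docker run -d --name rancher --restart=unless-stopped \
      -p 80:80 -p 443:443 --privileged \
      -v /var/lib/rancher:/var/lib/rancher \
      rancher/rancher:v2.8.5

Once 2.8.5 is healthy again, stepping through the latest 2.9.x before retrying 2.10 gives each release’s migrations a chance to run.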