I run a small bare-metal Kubernetes cluster managed by a single Rancher instance running in Docker on a VM. We take nightly tarballs of /var/lib/rancher, and in the past I’ve been able to upgrade by (rough commands sketched after the list):
- stopping and deleting the Rancher container
- pulling the latest stable Rancher image
- starting a new Rancher container
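In shell terms that’s roughly the following; the container name, the `:stable` tag, and the port/volume flags are placeholders that mirror whatever the original `docker run` used:

```bash
# Stop and remove the existing Rancher container ("rancher" is a placeholder name)
docker stop rancher
docker rm rancher

# Pull the latest stable image
docker pull rancher/rancher:stable

# Start a new container on the same data, with the same flags as the original install
# (ports, --privileged, and a /var/lib/rancher bind mount are the usual single-node setup)
docker run -d --name rancher --restart=unless-stopped \
  -p 80:80 -p 443:443 \
  --privileged \
  -v /var/lib/rancher:/var/lib/rancher \
  rancher/rancher:stable
```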
If the VM is also due for an upgrade, I’ve been able to (again, rough commands after the list):
- rebuild/replace the VM
- pull and start the Rancher container
- stop the Rancher container
- delete and replace the contents of /var/lib/rancher from the tarball
- start the Rancher container
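Spelled out, the rebuild path looks something like this; the tarball name and extraction flags are placeholders, and the nightly archive is assumed to unpack back into /var/lib/rancher on the host:

```bash
# On the rebuilt VM: pull and start Rancher once so the image and data directory exist
docker pull rancher/rancher:stable
docker run -d --name rancher --restart=unless-stopped \
  -p 80:80 -p 443:443 --privileged \
  -v /var/lib/rancher:/var/lib/rancher \
  rancher/rancher:stable

# Stop it and swap in the backed-up state
docker stop rancher
rm -rf /var/lib/rancher/*
tar -xpf rancher-nightly.tar.gz -C /   # assumes archive paths are relative to /, e.g. var/lib/rancher/...

# Bring Rancher back up on the restored data
docker start rancher
```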
This time it’s failing. The Docker container goes into a roughly 30-second restart loop, and each attempt logs pretty much the same thing:
```
INFO: Running k3s server --cluster-init --cluster-reset
2024/11/20 18:25:13 [INFO] Rancher version v2.10.0 (df45e368c82d4027410fa4700371982b9236b7c8) is starting
2024/11/20 18:25:13 [INFO] Rancher arguments {ACMEDomains:[] AddLocal:true Embedded:false BindHost: HTTPListenPort:80 HTTPSListenPort:443 K8sMode:auto Debug:false Trace:false NoCACerts:false AuditLogPath:/var/log/auditlog/rancher-api-audit.log AuditLogMaxage:10 AuditLogMaxsize:100 AuditLogMaxbackup:10 AuditLevel:0 Features: ClusterRegistry:}
2024/11/20 18:25:13 [INFO] Listening on /tmp/log.sock
2024/11/20 18:25:13 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2024/11/20 18:25:15 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2024/11/20 18:25:17 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2024/11/20 18:25:19 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2024/11/20 18:25:27 [INFO] Running in single server mode, will not peer connections
2024/11/20 18:25:30 [FATAL] Internal error occurred: failed calling webhook "rancher.cattle.io.namespaces.create-non-kubesystem": failed to call webhook: Post "https://rancher-webhook.cattle-system.svc:443/v1/webhook/validation/namespaces?timeout=10s": proxy error from 127.0.0.1:6443 while dialing 10.42.0.76:9443, code 502: 502 Bad Gateway
```
Later attempts also print instructions for removing the reset flag file if you want to try resetting again. Online searches turn up plenty of references to the “failed calling webhook” error, but not to the “502 Bad Gateway” part. Grateful for any guidance: at this point our prod cluster is running fine and I have kubectl access to it, but no graceful way to manage nodes, users, RBAC, and so on.

Forgot to mention: the previously running Rancher was 2.8.5, and the new one is 2.10.
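For clarity, “kubectl access” above means talking to the downstream cluster’s API server directly with a non-Rancher kubeconfig, not going through the (currently down) Rancher proxy; the kubeconfig path below is just a placeholder:

```bash
# Direct kubeconfig against the prod cluster's API server; still works with Rancher down
kubectl --kubeconfig ~/.kube/prod-direct.yaml get nodes
```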