Failed Rancher upgrade

I run a small bare-metal Kubernetes cluster managed by a single Rancher instance running in Docker on a VM. We take nightly tarballs of /var/lib/rancher, and in the past I’ve been able to upgrade by:

  • stopping and deleting the Rancher container
  • pulling the latest stable Rancher image
  • starting a new Rancher container (roughly the commands sketched below)
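A minimal sketch of that flow, assuming the container is named rancher and the data sits in a host bind mount at /var/lib/rancher (both assumptions; adjust to your setup):

    # stop and remove the old container; the data survives in the bind mount
    docker stop rancher
    docker rm rancher
    # pull the current image and start a replacement on the same data
    docker pull rancher/rancher:latest
    docker run -d --name rancher --restart=unless-stopped \
      -p 80:80 -p 443:443 --privileged \
      -v /var/lib/rancher:/var/lib/rancher \
      rancher/rancher:latest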

If the VM is also due for an upgrade, I’ve been able to:

  • rebuild/replace the VM
  • pull the Rancher image and start a container
  • stop the Rancher container
  • delete the /var/lib/rancher contents and replace them from the tarball
  • start the Rancher container again (see the sketch after this list)
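The restore step, sketched under the same assumptions and with a hypothetical tarball path; the -C / works if the tarball was created from / (tar strips the leading slash, so entries unpack back under /var/lib/rancher):

    # the container must be stopped while its data is swapped out
    docker stop rancher
    # drop the freshly-initialized data and unpack last night's tarball
    rm -rf /var/lib/rancher/*
    tar xzf /backups/rancher-nightly.tar.gz -C /
    docker start rancher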

This time it’s failing. The Docker container goes into a restart loop, dying roughly every 30 seconds. Each attempt logs pretty much this:

    INFO: Running k3s server --cluster-init --cluster-reset
    2024/11/20 18:25:13 [INFO] Rancher version v2.10.0 (df45e368c82d4027410fa4700371982b9236b7c8) is starting
    2024/11/20 18:25:13 [INFO] Rancher arguments {ACMEDomains:[] AddLocal:true Embedded:false BindHost: HTTPListenPort:80 HTTPSListenPort:443 K8sMode:auto Debug:false Trace:false NoCACerts:false AuditLogPath:/var/log/auditlog/rancher-api-audit.log AuditLogMaxage:10 AuditLogMaxsize:100 AuditLogMaxbackup:10 AuditLevel:0 Features: ClusterRegistry:}
    2024/11/20 18:25:13 [INFO] Listening on /tmp/log.sock
    2024/11/20 18:25:13 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
    2024/11/20 18:25:15 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
    2024/11/20 18:25:17 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
    2024/11/20 18:25:19 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
    2024/11/20 18:25:27 [INFO] Running in single server mode, will not peer connections
    2024/11/20 18:25:30 [FATAL] Internal error occurred: failed calling webhook "rancher.cattle.io.namespaces.create-non-kubesystem": failed to call webhook: Post "https://rancher-webhook.cattle-system.svc:443/v1/webhook/validation/namespaces?timeout=10s": proxy error from 127.0.0.1:6443 while dialing 10.42.0.76:9443, code 502: 502 Bad Gateway

Later attempts also print instructions for removing the reset flag file if you want to try resetting again. Online searches turn up plenty of references to the “failed calling webhook” error, but not for the “502 Bad Gateway” part. Grateful for any guidance: at this point our prod cluster is running fine and I have kubectl access to it, but no graceful way to manage nodes, users/RBAC, and so on.

Forgot to mention: the previously running Rancher was 2.8.5, now 2.10.
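One detail worth noting: the webhook in that FATAL line runs inside Rancher’s embedded k3s (the “local” cluster), not in the downstream prod cluster, so the place to look is inside the container itself. A sketch of what could be run between restarts, assuming the container is named rancher and a default rancher-webhook deployment:

    # the Rancher image bundles kubectl wired to the embedded k3s
    docker exec rancher kubectl -n cattle-system get pods
    docker exec rancher kubectl -n cattle-system logs deploy/rancher-webhook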

Some things to consider:

  • always use version-tagged Docker images (e.g. rancher/rancher:v2.8.5) rather than latest
  • use the Rancher built-in backup mechanism
  • read the changelogs/upgrade instructions for each release carefully before upgrading (there were quite a few changes in 2.9 and 2.10), and make sure your downstream clusters are on a supported k8s version; stepping through 2.9.x rather than jumping straight from 2.8 to 2.10 is also the safer path

In your case, using the 2.8.x image and restoring your backup should get you up and running again (see the sketch below).
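A sketch of that rollback, assuming the same bind-mount layout as above and the last nightly tarball taken before the 2.10 attempt (path is hypothetical):

    # remove the failed 2.10 container
    docker stop rancher
    docker rm rancher
    # the 2.10 start attempts have already modified the data, so restore it
    rm -rf /var/lib/rancher/*
    tar xzf /backups/rancher-pre-upgrade.tar.gz -C /
    # start the old, version-tagged image on the restored data
    docker run -d --name rancher --restart=unless-stopped \
      -p 80:80 -p 443:443 --privileged \
      -v /var/lib/rancher:/var/lib/rancher \
      rancher/rancher:v2.8.5

Once 2.8.5 is healthy again, stepping through the latest 2.9.x before retrying 2.10 gives each release’s migrations a chance to run.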