Rancher 2.7 on Docker fails to start after server reboot

Hi everybody,

A few days ago we “installed” Rancher on a dedicated VM (Ubuntu 22.04 LTS) via Docker with this command:

docker run -d --restart=unless-stopped \
  -p 80:80 -p 443:443 \
  -v /mnt/data/rancher:/var/lib/rancher \
  --privileged \
  rancher/rancher:v2.7-head

After that we created a custom K8s cluster with Rancher, and everything worked fine for days.
Today we had to reboot the VM running the Rancher Docker container. Since the reboot the container fails to start and enters a restart loop. This is the output from docker logs:

INFO: Running k3s server --cluster-init --cluster-reset
2023/02/02 19:05:49 [INFO] Rancher version 9bb8d5674 (9bb8d5674) is starting
2023/02/02 19:05:49 [INFO] Rancher arguments {ACMEDomains:[] AddLocal:true Embedded:false BindHost: HTTPListenPort:80 HTTPSListenPort:443 K8sMode:auto Debug:false Trace:false NoCACerts:false AuditLogPath:/var/log/auditlog/rancher-api-audit.log AuditLogMaxage:10 AuditLogMaxsize:100 AuditLogMaxbackup:10 AuditLevel:0 Features: ClusterRegistry:}
2023/02/02 19:05:49 [INFO] Listening on /tmp/log.sock
2023/02/02 19:05:49 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 19:05:51 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 19:05:53 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 19:05:55 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 19:06:16 [FATAL] 2 errors occurred:
	* Internal error occurred: failed calling webhook "rancher.cattle.io.clusters.management.cattle.io": failed to call webhook: Post "https://rancher-webhook.cattle-system.svc:443/v1/webhook/mutation/clusters.management.cattle.io?timeout=10s": proxy error from 127.0.0.1:6443 while dialing 10.42.0.12:9443, code 503: 503 Service Unavailable
	* Internal error occurred: failed calling webhook "rancher.cattle.io.clusters.management.cattle.io": failed to call webhook: Post "https://rancher-webhook.cattle-system.svc:443/v1/webhook/mutation/clusters.management.cattle.io?timeout=10s": context deadline exceeded


[… the same startup sequence and FATAL webhook errors repeat on every container restart …]

Does anybody have an idea what’s going wrong here?
Thanks in advance,
T0mcat

Additional info:
We ran some further tests… the error is completely reproducible, even on fresh Rancher 2.7 containers:

  • Install Ubuntu Server 22.04 LTS, no additional software except the SSH server
  • Install Docker (apt install docker.io)
  • Use the command above to spin up a Rancher 2.7 container
  • Wait for Rancher to become ready, log in to the web frontend; everything looks fine.
  • Reboot the server
  • The container is now in a restart loop; docker logs looks like this:
INFO: Running k3s server --cluster-init --cluster-reset
2023/02/02 20:26:59 [INFO] Rancher version 9bb8d5674 (9bb8d5674) is starting
2023/02/02 20:26:59 [INFO] Rancher arguments {ACMEDomains:[] AddLocal:true Embedded:false BindHost: HTTPListenPort:80 HTTPSListenPort:443 K8sMode:auto Debug:false Trace:false NoCACerts:false AuditLogPath:/var/log/auditlog/rancher-api-audit.log AuditLogMaxage:10 AuditLogMaxsize:100 AuditLogMaxbackup:10 AuditLevel:0 Features: ClusterRegistry:}
2023/02/02 20:26:59 [INFO] Listening on /tmp/log.sock
2023/02/02 20:26:59 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 20:27:01 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 20:27:03 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 20:27:05 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 20:27:07 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 20:27:09 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 20:27:11 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 20:27:28 [FATAL] 1 error occurred:
	* Internal error occurred: failed calling webhook "rancher.cattle.io.clusters.management.cattle.io": failed to call webhook: Post "https://rancher-webhook.cattle-system.svc:443/v1/webhook/mutation/clusters.management.cattle.io?timeout=10s": proxy error from 127.0.0.1:6443 while dialing 10.42.0.12:9443, code 503: 503 Service Unavailable

Running the identical tests with Rancher 2.6:
After the server reboot everything works as expected. Rancher comes up normally, the web frontend is available after a short time, and this is in docker logs:

INFO: Running k3s server --cluster-init --cluster-reset
2023/02/02 20:35:24 [INFO] Listening on /tmp/log.sock
2023/02/02 20:35:24 [INFO] Rancher version f7024783a (f7024783a) is starting
2023/02/02 20:35:24 [INFO] Rancher arguments {ACMEDomains:[] AddLocal:true Embedded:false BindHost: HTTPListenPort:80 HTTPSListenPort:443 K8sMode:auto Debug:false Trace:false NoCACerts:false AuditLogPath:/var/log/auditlog/rancher-api-audit.log AuditLogMaxage:10 AuditLogMaxsize:100 AuditLogMaxbackup:10 AuditLevel:0 Features: ClusterRegistry:}
2023/02/02 20:35:24 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 20:35:26 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 20:35:28 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 20:35:30 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 20:35:32 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 20:35:34 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 20:35:36 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 20:35:38 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 20:35:55 [INFO] Running in single server mode, will not peer connections
2023/02/02 20:35:55 [INFO] Applying CRD features.management.cattle.io
2023/02/02 20:36:14 [INFO] Applying CRD navlinks.ui.cattle.io
2023/02/02 20:36:14 [INFO] Applying CRD clusters.management.cattle.io
2023/02/02 20:36:14 [INFO] Applying CRD apiservices.management.cattle.io
2023/02/02 20:36:14 [INFO] Applying CRD clusterregistrationtokens.management.cattle.io
2023/02/02 20:36:14 [INFO] Applying CRD settings.management.cattle.io
2023/02/02 20:36:14 [INFO] Applying CRD preferences.management.cattle.io
2023/02/02 20:36:15 [INFO] Applying CRD features.management.cattle.io
2023/02/02 20:36:15 [INFO] Applying CRD clusterrepos.catalog.cattle.io
2023/02/02 20:36:15 [INFO] Applying CRD operations.catalog.cattle.io
2023/02/02 20:36:15 [INFO] Applying CRD apps.catalog.cattle.io
2023/02/02 20:36:15 [INFO] Applying CRD fleetworkspaces.management.cattle.io
2023/02/02 20:36:15 [INFO] Applying CRD managedcharts.management.cattle.io
2023/02/02 20:36:15 [INFO] Applying CRD clusters.provisioning.cattle.io
2023/02/02 20:36:15 [INFO] Applying CRD clusters.provisioning.cattle.io
[....]

So everything is OK with 2.6, the container “survives” a reboot… so IMHO there is some sort of bug in the Rancher 2.7 Docker image, isn’t there?

You need to persist the ca-certificates as well with 2.7, so add:

-v ...ca-certificates:/var/lib/ca-certificates 

to your docker invocation.
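Applied to the command from the original post, that could look like the sketch below. The host path for the certificates volume (/mnt/data/rancher-ca-certs) is just an example; pick whatever fits your layout — the container-side path /var/lib/ca-certificates is the part that matters:

```shell
# Same invocation as before, plus a persistent volume for the
# CA certificates (host path is an example, adjust as needed).
docker run -d --restart=unless-stopped \
  -p 80:80 -p 443:443 \
  -v /mnt/data/rancher:/var/lib/rancher \
  -v /mnt/data/rancher-ca-certs:/var/lib/ca-certificates \
  --privileged \
  rancher/rancher:v2.7-head
```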

@bpedersen2
Thanks for your reply… after this I ran an identical test with exactly the same procedure described above… and now everything works just fine; I’m not able to reproduce this behaviour any more.
Even without using an external volume for ca-certificates the container survives a server reboot. I can stop/start the container, or remove the container and run a completely new one… in all cases it simply works… perhaps a newer image nowadays? (The image tag “:2.7-head” definitely pulled the latest one from Docker Hub.)

For production usage you probably want 2.7.1 (that is the latest stable release); -head is an RC for 2.7.2 (always check the GitHub releases…).
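In other words, pin the tag instead of tracking the moving -head image, for example:

```shell
# Pull the pinned stable release instead of the moving -head tag.
docker pull rancher/rancher:v2.7.1

# Optional: record exactly which image digest the tag resolved to,
# so you know what you deployed even if the tag moves later.
docker image inspect --format '{{index .RepoDigests 0}}' rancher/rancher:v2.7.1
```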

Absolutely… tests with v2.7.1 were also successful, so I will use this one in production.