After that we created a custom K8s cluster with Rancher, and everything worked fine for days.
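For context: Rancher itself runs as a single Docker container, started with (roughly) the standard single-node install command from the Rancher docs. This is a sketch, not the exact command we used; the container name "rancher" is a placeholder that I'll reuse in the examples below:

docker run -d --restart=unless-stopped \
  --name rancher \
  -p 80:80 -p 443:443 \
  --privileged \
  rancher/rancher:2.7-head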
Today we had to reboot the VM that runs the Rancher Docker container. After the reboot the container fails to start and enters a restart loop. This is the output from docker logs:
INFO: Running k3s server --cluster-init --cluster-reset
2023/02/02 19:05:49 [INFO] Rancher version 9bb8d5674 (9bb8d5674) is starting
2023/02/02 19:05:49 [INFO] Rancher arguments {ACMEDomains:[] AddLocal:true Embedded:false BindHost: HTTPListenPort:80 HTTPSListenPort:443 K8sMode:auto Debug:false Trace:false NoCACerts:false AuditLogPath:/var/log/auditlog/rancher-api-audit.log AuditLogMaxage:10 AuditLogMaxsize:100 AuditLogMaxbackup:10 AuditLevel:0 Features: ClusterRegistry:}
2023/02/02 19:05:49 [INFO] Listening on /tmp/log.sock
2023/02/02 19:05:49 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 19:05:51 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 19:05:53 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 19:05:55 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 19:06:16 [FATAL] 2 errors occurred:
* Internal error occurred: failed calling webhook "rancher.cattle.io.clusters.management.cattle.io": failed to call webhook: Post "https://rancher-webhook.cattle-system.svc:443/v1/webhook/mutation/clusters.management.cattle.io?timeout=10s": proxy error from 127.0.0.1:6443 while dialing 10.42.0.12:9443, code 503: 503 Service Unavailable
* Internal error occurred: failed calling webhook "rancher.cattle.io.clusters.management.cattle.io": failed to call webhook: Post "https://rancher-webhook.cattle-system.svc:443/v1/webhook/mutation/clusters.management.cattle.io?timeout=10s": context deadline exceeded
INFO: Running k3s server --cluster-init --cluster-reset
2023/02/02 19:06:35 [INFO] Rancher version 9bb8d5674 (9bb8d5674) is starting
2023/02/02 19:06:35 [INFO] Listening on /tmp/log.sock
2023/02/02 19:06:35 [INFO] Rancher arguments {ACMEDomains:[] AddLocal:true Embedded:false BindHost: HTTPListenPort:80 HTTPSListenPort:443 K8sMode:auto Debug:false Trace:false NoCACerts:false AuditLogPath:/var/log/auditlog/rancher-api-audit.log AuditLogMaxage:10 AuditLogMaxsize:100 AuditLogMaxbackup:10 AuditLevel:0 Features: ClusterRegistry:}
2023/02/02 19:06:35 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 19:06:37 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 19:06:39 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 19:06:41 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 19:07:02 [FATAL] 2 errors occurred:
* Internal error occurred: failed calling webhook "rancher.cattle.io.clusters.management.cattle.io": failed to call webhook: Post "https://rancher-webhook.cattle-system.svc:443/v1/webhook/mutation/clusters.management.cattle.io?timeout=10s": proxy error from 127.0.0.1:6443 while dialing 10.42.0.12:9443, code 503: 503 Service Unavailable
* Internal error occurred: failed calling webhook "rancher.cattle.io.clusters.management.cattle.io": failed to call webhook: Post "https://rancher-webhook.cattle-system.svc:443/v1/webhook/mutation/clusters.management.cattle.io?timeout=10s": context deadline exceeded
INFO: Running k3s server --cluster-init --cluster-reset
2023/02/02 19:07:20 [INFO] Rancher version 9bb8d5674 (9bb8d5674) is starting
2023/02/02 19:07:20 [INFO] Rancher arguments {ACMEDomains:[] AddLocal:true Embedded:false BindHost: HTTPListenPort:80 HTTPSListenPort:443 K8sMode:auto Debug:false Trace:false NoCACerts:false AuditLogPath:/var/log/auditlog/rancher-api-audit.log AuditLogMaxage:10 AuditLogMaxsize:100 AuditLogMaxbackup:10 AuditLevel:0 Features: ClusterRegistry:}
2023/02/02 19:07:20 [INFO] Listening on /tmp/log.sock
2023/02/02 19:07:20 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 19:07:22 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 19:07:24 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 19:07:26 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 19:07:28 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 19:07:30 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
Does anybody have an idea what's going wrong here?
Thanks in advance
T0mcat
Take the above command for spinning up a Rancher 2.7 container.
Wait for Rancher to become ready and log in to the web frontend; everything looks fine.
Reboot the server.
The container ends up in a restart loop, and docker logs looks like this:
INFO: Running k3s server --cluster-init --cluster-reset
2023/02/02 20:26:59 [INFO] Rancher version 9bb8d5674 (9bb8d5674) is starting
2023/02/02 20:26:59 [INFO] Rancher arguments {ACMEDomains:[] AddLocal:true Embedded:false BindHost: HTTPListenPort:80 HTTPSListenPort:443 K8sMode:auto Debug:false Trace:false NoCACerts:false AuditLogPath:/var/log/auditlog/rancher-api-audit.log AuditLogMaxage:10 AuditLogMaxsize:100 AuditLogMaxbackup:10 AuditLevel:0 Features: ClusterRegistry:}
2023/02/02 20:26:59 [INFO] Listening on /tmp/log.sock
2023/02/02 20:26:59 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 20:27:01 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 20:27:03 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 20:27:05 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 20:27:07 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 20:27:09 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 20:27:11 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 20:27:28 [FATAL] 1 error occurred:
* Internal error occurred: failed calling webhook "rancher.cattle.io.clusters.management.cattle.io": failed to call webhook: Post "https://rancher-webhook.cattle-system.svc:443/v1/webhook/mutation/clusters.management.cattle.io?timeout=10s": proxy error from 127.0.0.1:6443 while dialing 10.42.0.12:9443, code 503: 503 Service Unavailable
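The FATAL error points at the rancher-webhook pod inside the embedded k3s cluster. One way to check its state is to query the embedded k3s from inside the container while it is still in its startup wait window. A sketch, assuming the container is named "rancher" as above; the app=rancher-webhook label selector is also an assumption on my side:

docker exec rancher kubectl --kubeconfig /etc/rancher/k3s/k3s.yaml get pods -n cattle-system
docker exec rancher kubectl --kubeconfig /etc/rancher/k3s/k3s.yaml logs -n cattle-system -l app=rancher-webhook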
I then did the identical test with Rancher 2.6:
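That is, the same run command as above, just with a 2.6 image tag (the exact tag here is an example):

docker run -d --restart=unless-stopped --name rancher -p 80:80 -p 443:443 --privileged rancher/rancher:2.6-head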
After the server reboot everything works as expected: Rancher comes up normally and the web frontend is available after a short time. This is the docker logs output:
INFO: Running k3s server --cluster-init --cluster-reset
2023/02/02 20:35:24 [INFO] Listening on /tmp/log.sock
2023/02/02 20:35:24 [INFO] Rancher version f7024783a (f7024783a) is starting
2023/02/02 20:35:24 [INFO] Rancher arguments {ACMEDomains:[] AddLocal:true Embedded:false BindHost: HTTPListenPort:80 HTTPSListenPort:443 K8sMode:auto Debug:false Trace:false NoCACerts:false AuditLogPath:/var/log/auditlog/rancher-api-audit.log AuditLogMaxage:10 AuditLogMaxsize:100 AuditLogMaxbackup:10 AuditLevel:0 Features: ClusterRegistry:}
2023/02/02 20:35:24 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 20:35:26 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 20:35:28 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 20:35:30 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 20:35:32 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 20:35:34 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 20:35:36 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 20:35:38 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 20:35:55 [INFO] Running in single server mode, will not peer connections
2023/02/02 20:35:55 [INFO] Applying CRD features.management.cattle.io
2023/02/02 20:36:14 [INFO] Applying CRD navlinks.ui.cattle.io
2023/02/02 20:36:14 [INFO] Applying CRD clusters.management.cattle.io
2023/02/02 20:36:14 [INFO] Applying CRD apiservices.management.cattle.io
2023/02/02 20:36:14 [INFO] Applying CRD clusterregistrationtokens.management.cattle.io
2023/02/02 20:36:14 [INFO] Applying CRD settings.management.cattle.io
2023/02/02 20:36:14 [INFO] Applying CRD preferences.management.cattle.io
2023/02/02 20:36:15 [INFO] Applying CRD features.management.cattle.io
2023/02/02 20:36:15 [INFO] Applying CRD clusterrepos.catalog.cattle.io
2023/02/02 20:36:15 [INFO] Applying CRD operations.catalog.cattle.io
2023/02/02 20:36:15 [INFO] Applying CRD apps.catalog.cattle.io
2023/02/02 20:36:15 [INFO] Applying CRD fleetworkspaces.management.cattle.io
2023/02/02 20:36:15 [INFO] Applying CRD managedcharts.management.cattle.io
2023/02/02 20:36:15 [INFO] Applying CRD clusters.provisioning.cattle.io
2023/02/02 20:36:15 [INFO] Applying CRD clusters.provisioning.cattle.io
[....]
So everything is OK with 2.6 and the container “survives” a reboot. IMHO there is some sort of bug in the Rancher 2.7 Docker image, isn't there?
@bpedersen2
Thanks for your reply. After this I ran the identical test again, with exactly the same procedure described above, and now everything works just fine; I'm not able to reproduce this behaviour any more.
Even without using an external volume for the CA certificates, the container survives a server reboot. I can stop/start the container, or remove it and run a completely new one; in all cases it simply works. Perhaps a newer image fixed it in the meantime? (The image tag “:2.7-head” definitely pulled the latest one from Docker Hub.)
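For anyone finding this thread later: independent of the bug, persisting Rancher's data directory is probably a good idea anyway, so the installation survives re-creating the container. A sketch (the host path /opt/rancher is just an example):

docker run -d --restart=unless-stopped \
  --name rancher \
  -p 80:80 -p 443:443 \
  -v /opt/rancher:/var/lib/rancher \
  --privileged \
  rancher/rancher:2.7-head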