After that we created a custom K8s cluster with Rancher, and everything worked fine for days.
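For context: Rancher itself runs as a single Docker container, started with (roughly) the standard single-node install command from the Rancher docs. This is a sketch, not the exact command we used; the container name "rancher" is a placeholder that I'll reuse in the examples below:

docker run -d --restart=unless-stopped \
  --name rancher \
  -p 80:80 -p 443:443 \
  --privileged \
  rancher/rancher:2.7-head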
Today we had to reboot the VM that runs the Rancher Docker container. After the reboot the container fails to start and enters a restart loop. This is the output from docker logs:
INFO: Running k3s server --cluster-init --cluster-reset
2023/02/02 19:05:49 [INFO] Rancher version 9bb8d5674 (9bb8d5674) is starting
2023/02/02 19:05:49 [INFO] Rancher arguments {ACMEDomains:[] AddLocal:true Embedded:false BindHost: HTTPListenPort:80 HTTPSListenPort:443 K8sMode:auto Debug:false Trace:false NoCACerts:false AuditLogPath:/var/log/auditlog/rancher-api-audit.log AuditLogMaxage:10 AuditLogMaxsize:100 AuditLogMaxbackup:10 AuditLevel:0 Features: ClusterRegistry:}
2023/02/02 19:05:49 [INFO] Listening on /tmp/log.sock
2023/02/02 19:05:49 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 19:05:51 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 19:05:53 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 19:05:55 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 19:06:16 [FATAL] 2 errors occurred:
* Internal error occurred: failed calling webhook "rancher.cattle.io.clusters.management.cattle.io": failed to call webhook: Post "https://rancher-webhook.cattle-system.svc:443/v1/webhook/mutation/clusters.management.cattle.io?timeout=10s": proxy error from 127.0.0.1:6443 while dialing 10.42.0.12:9443, code 503: 503 Service Unavailable
* Internal error occurred: failed calling webhook "rancher.cattle.io.clusters.management.cattle.io": failed to call webhook: Post "https://rancher-webhook.cattle-system.svc:443/v1/webhook/mutation/clusters.management.cattle.io?timeout=10s": context deadline exceeded
INFO: Running k3s server --cluster-init --cluster-reset
2023/02/02 19:06:35 [INFO] Rancher version 9bb8d5674 (9bb8d5674) is starting
2023/02/02 19:06:35 [INFO] Listening on /tmp/log.sock
2023/02/02 19:06:35 [INFO] Rancher arguments {ACMEDomains:[] AddLocal:true Embedded:false BindHost: HTTPListenPort:80 HTTPSListenPort:443 K8sMode:auto Debug:false Trace:false NoCACerts:false AuditLogPath:/var/log/auditlog/rancher-api-audit.log AuditLogMaxage:10 AuditLogMaxsize:100 AuditLogMaxbackup:10 AuditLevel:0 Features: ClusterRegistry:}
2023/02/02 19:06:35 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 19:06:37 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 19:06:39 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 19:06:41 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 19:07:02 [FATAL] 2 errors occurred:
* Internal error occurred: failed calling webhook "rancher.cattle.io.clusters.management.cattle.io": failed to call webhook: Post "https://rancher-webhook.cattle-system.svc:443/v1/webhook/mutation/clusters.management.cattle.io?timeout=10s": proxy error from 127.0.0.1:6443 while dialing 10.42.0.12:9443, code 503: 503 Service Unavailable
* Internal error occurred: failed calling webhook "rancher.cattle.io.clusters.management.cattle.io": failed to call webhook: Post "https://rancher-webhook.cattle-system.svc:443/v1/webhook/mutation/clusters.management.cattle.io?timeout=10s": context deadline exceeded
INFO: Running k3s server --cluster-init --cluster-reset
2023/02/02 19:07:20 [INFO] Rancher version 9bb8d5674 (9bb8d5674) is starting
2023/02/02 19:07:20 [INFO] Rancher arguments {ACMEDomains:[] AddLocal:true Embedded:false BindHost: HTTPListenPort:80 HTTPSListenPort:443 K8sMode:auto Debug:false Trace:false NoCACerts:false AuditLogPath:/var/log/auditlog/rancher-api-audit.log AuditLogMaxage:10 AuditLogMaxsize:100 AuditLogMaxbackup:10 AuditLevel:0 Features: ClusterRegistry:}
2023/02/02 19:07:20 [INFO] Listening on /tmp/log.sock
2023/02/02 19:07:20 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 19:07:22 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 19:07:24 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 19:07:26 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 19:07:28 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 19:07:30 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
Does anybody have an idea what's going wrong here?
Thanks in advance
T0mcat
Take the above command for spinning up a Rancher 2.7 container.
Wait for Rancher to become ready and log in to the web frontend; everything looks fine.
Reboot the server.
The container ends up in a restart loop, and docker logs looks like this:
INFO: Running k3s server --cluster-init --cluster-reset
2023/02/02 20:26:59 [INFO] Rancher version 9bb8d5674 (9bb8d5674) is starting
2023/02/02 20:26:59 [INFO] Rancher arguments {ACMEDomains:[] AddLocal:true Embedded:false BindHost: HTTPListenPort:80 HTTPSListenPort:443 K8sMode:auto Debug:false Trace:false NoCACerts:false AuditLogPath:/var/log/auditlog/rancher-api-audit.log AuditLogMaxage:10 AuditLogMaxsize:100 AuditLogMaxbackup:10 AuditLevel:0 Features: ClusterRegistry:}
2023/02/02 20:26:59 [INFO] Listening on /tmp/log.sock
2023/02/02 20:26:59 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 20:27:01 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 20:27:03 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 20:27:05 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 20:27:07 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 20:27:09 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 20:27:11 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 20:27:28 [FATAL] 1 error occurred:
* Internal error occurred: failed calling webhook "rancher.cattle.io.clusters.management.cattle.io": failed to call webhook: Post "https://rancher-webhook.cattle-system.svc:443/v1/webhook/mutation/clusters.management.cattle.io?timeout=10s": proxy error from 127.0.0.1:6443 while dialing 10.42.0.12:9443, code 503: 503 Service Unavailable
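The FATAL error points at the rancher-webhook pod inside the embedded k3s cluster. One way to check its state is to query the embedded k3s from inside the container while it is still in its startup wait window. A sketch, assuming the container is named "rancher" as above; the app=rancher-webhook label selector is also an assumption on my side:

docker exec rancher kubectl --kubeconfig /etc/rancher/k3s/k3s.yaml get pods -n cattle-system
docker exec rancher kubectl --kubeconfig /etc/rancher/k3s/k3s.yaml logs -n cattle-system -l app=rancher-webhook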
I then did the identical test with Rancher 2.6:
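That is, the same run command as above, just with a 2.6 image tag (the exact tag here is an example):

docker run -d --restart=unless-stopped --name rancher -p 80:80 -p 443:443 --privileged rancher/rancher:2.6-head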
After the server reboot everything works as expected: Rancher comes up normally and the web frontend is available after a short time. This is the docker logs output:
INFO: Running k3s server --cluster-init --cluster-reset
2023/02/02 20:35:24 [INFO] Listening on /tmp/log.sock
2023/02/02 20:35:24 [INFO] Rancher version f7024783a (f7024783a) is starting
2023/02/02 20:35:24 [INFO] Rancher arguments {ACMEDomains:[] AddLocal:true Embedded:false BindHost: HTTPListenPort:80 HTTPSListenPort:443 K8sMode:auto Debug:false Trace:false NoCACerts:false AuditLogPath:/var/log/auditlog/rancher-api-audit.log AuditLogMaxage:10 AuditLogMaxsize:100 AuditLogMaxbackup:10 AuditLevel:0 Features: ClusterRegistry:}
2023/02/02 20:35:24 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 20:35:26 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 20:35:28 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 20:35:30 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 20:35:32 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 20:35:34 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 20:35:36 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 20:35:38 [INFO] Waiting for server to become available: Get "https://127.0.0.1:6444/version?timeout=15m0s": dial tcp 127.0.0.1:6444: connect: connection refused
2023/02/02 20:35:55 [INFO] Running in single server mode, will not peer connections
2023/02/02 20:35:55 [INFO] Applying CRD features.management.cattle.io
2023/02/02 20:36:14 [INFO] Applying CRD navlinks.ui.cattle.io
2023/02/02 20:36:14 [INFO] Applying CRD clusters.management.cattle.io
2023/02/02 20:36:14 [INFO] Applying CRD apiservices.management.cattle.io
2023/02/02 20:36:14 [INFO] Applying CRD clusterregistrationtokens.management.cattle.io
2023/02/02 20:36:14 [INFO] Applying CRD settings.management.cattle.io
2023/02/02 20:36:14 [INFO] Applying CRD preferences.management.cattle.io
2023/02/02 20:36:15 [INFO] Applying CRD features.management.cattle.io
2023/02/02 20:36:15 [INFO] Applying CRD clusterrepos.catalog.cattle.io
2023/02/02 20:36:15 [INFO] Applying CRD operations.catalog.cattle.io
2023/02/02 20:36:15 [INFO] Applying CRD apps.catalog.cattle.io
2023/02/02 20:36:15 [INFO] Applying CRD fleetworkspaces.management.cattle.io
2023/02/02 20:36:15 [INFO] Applying CRD managedcharts.management.cattle.io
2023/02/02 20:36:15 [INFO] Applying CRD clusters.provisioning.cattle.io
2023/02/02 20:36:15 [INFO] Applying CRD clusters.provisioning.cattle.io
[....]
So everything is OK with 2.6 and the container “survives” a reboot. IMHO there is some sort of bug in the Rancher 2.7 Docker image, isn't there?
@bpedersen2
Thanks for your reply. After this I ran the identical test again, with exactly the same procedure described above, and now everything works just fine; I'm not able to reproduce this behaviour any more.
Even without using an external volume for the CA certificates, the container survives a server reboot. I can stop/start the container, or remove it and run a completely new one; in all cases it simply works. Perhaps a newer image fixed it in the meantime? (The image tag “:2.7-head” definitely pulled the latest one from Docker Hub.)
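For anyone finding this thread later: independent of the bug, persisting Rancher's data directory is probably a good idea anyway, so the installation survives re-creating the container. A sketch (the host path /opt/rancher is just an example):

docker run -d --restart=unless-stopped \
  --name rancher \
  -p 80:80 -p 443:443 \
  -v /opt/rancher:/var/lib/rancher \
  --privileged \
  rancher/rancher:2.7-head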