Issues accessing rancher UI on PROD

Hey hey,

Sorry to bother you, but I’m having an issue on an old cluster that is still in production: the apps are still working, but the Rancher UI itself is not. It is hosted on RancherOS, and since it’s in PROD I was hoping someone who actually knows what the issue is could help, rather than me trying things out and breaking PROD.

This is a node that I use solely for the UI that then connects to other clusters that have the same installation.

[rancher@rancher ~]$ sudo system-docker ps
CONTAINER ID        IMAGE                              COMMAND                  CREATED             STATUS              PORTS               NAMES
f021e68356e0        rancher/os-console:v1.5.6          "/usr/bin/ros entr..."   4 months ago        Up 4 months                             console
b81a1b6aac10        rancher/os-docker:19.03.11         "ros user-docker"        12 months ago       Up 4 months                             docker
058b0f8b1ebb        rancher/os-base:v1.5.6             "/usr/bin/ros entr..."   12 months ago       Up 4 months                             ntp
cc61faa9647a        rancher/os-base:v1.5.6             "/usr/bin/ros entr..."   12 months ago       Up 4 months                             network
ef17d6ffb9eb        rancher/os-base:v1.5.6             "/usr/bin/ros entr..."   12 months ago       Up 4 months                             udev
2e7b4e362c13        rancher/container-crontab:v0.4.0   "container-crontab"      12 months ago       Up 4 months                             system-cron
ee7aaf96d97d        rancher/os-syslog:v1.5.6           "/usr/bin/entrypoi..."   12 months ago       Up 4 months                             syslog
7fb2f561a810        rancher/os-acpid:v1.5.6            "/usr/bin/ros entr..."   12 months ago       Up 4 months                             acpid
[rancher@rancher ~]$ docker ps
CONTAINER ID        IMAGE                    COMMAND                  CREATED             STATUS              PORTS                                      NAMES
261d4c9fee6e        rancher/rancher:latest   "entrypoint.sh --acm…"   10 months ago       Up 11 minutes       0.0.0.0:80->80/tcp, 0.0.0.0:443->443/tcp   flamboyant_wing

The rancher/rancher:latest logs are as follows:

I1206 14:02:29.327047      38 shared_informer.go:230] Caches are synced for HPA 
W1206 14:02:29.328741      38 actual_state_of_world.go:506] Failed to update statusUpdateNeeded field in actual state of world: Failed to set statusUpdateNeeded to needed true, because nodeName="local-node" does not exist
I1206 14:02:29.345929      38 shared_informer.go:230] Caches are synced for service account 
I1206 14:02:29.353691      38 shared_informer.go:230] Caches are synced for node 
I1206 14:02:29.353723      38 range_allocator.go:172] Starting range CIDR allocator
I1206 14:02:29.353727      38 shared_informer.go:223] Waiting for caches to sync for cidrallocator
I1206 14:02:29.353731      38 shared_informer.go:230] Caches are synced for cidrallocator 
I1206 14:02:29.357886      38 shared_informer.go:230] Caches are synced for TTL 
I1206 14:02:29.375084      38 shared_informer.go:230] Caches are synced for namespace 
I1206 14:02:29.377548      38 shared_informer.go:230] Caches are synced for certificate-csrsigning 
I1206 14:02:29.385322      38 shared_informer.go:230] Caches are synced for endpoint_slice 
I1206 14:02:29.387341      38 shared_informer.go:230] Caches are synced for GC 
I1206 14:02:29.389206      38 shared_informer.go:230] Caches are synced for deployment 
I1206 14:02:29.395017      38 shared_informer.go:230] Caches are synced for PV protection 
I1206 14:02:29.407208      38 shared_informer.go:230] Caches are synced for ReplicaSet 
I1206 14:02:29.414114      38 shared_informer.go:230] Caches are synced for certificate-csrapproving 
I1206 14:02:29.419590      38 shared_informer.go:230] Caches are synced for job 
E1206 14:02:29.424622      38 kubelet.go:2268] node "local-node" not found
I1206 14:02:29.461912      38 shared_informer.go:230] Caches are synced for ReplicationController 
I1206 14:02:29.483138      38 shared_informer.go:230] Caches are synced for endpoint 
I1206 14:02:29.510171      38 shared_informer.go:230] Caches are synced for disruption 
I1206 14:02:29.510237      38 disruption.go:339] Sending events to api server.
E1206 14:02:29.524771      38 kubelet.go:2268] node "local-node" not found
I1206 14:02:29.582355      38 log.go:172] http: TLS handshake error from 127.0.0.1:60144: remote error: tls: bad certificate
E1206 14:02:29.582374       7 leaderelection.go:321] error retrieving resource lock kube-system/cattle-controllers: Get "https://127.0.0.1:6443/api/v1/namespaces/kube-system/configmaps/cattle-controllers?timeout=15m0s": x509: certificate has expired or is not yet valid: current time 2021-12-06T14:02:29Z is after 2021-11-22T16:54:55Z
I1206 14:02:29.600870      38 log.go:172] http: TLS handshake error from 127.0.0.1:60148: remote error: tls: bad certificate
E1206 14:02:29.600902      38 controller.go:136] failed to ensure node lease exists, will retry in 7s, error: Get https://127.0.0.1:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/local-node?timeout=10s: x509: certificate has expired or is not yet valid
E1206 14:02:29.625587      38 kubelet.go:2268] node "local-node" not found
I1206 14:02:29.659975      38 shared_informer.go:230] Caches are synced for ClusterRoleAggregator 
E1206 14:02:29.725745      38 kubelet.go:2268] node "local-node" not found
I1206 14:02:29.752689      38 kubelet_node_status.go:294] Setting node annotation to enable volume controller attach/detach
I1206 14:02:29.760026      38 kubelet_node_status.go:70] Attempting to register node local-node
I1206 14:02:29.761301      38 log.go:172] http: TLS handshake error from 127.0.0.1:60166: remote error: tls: bad certificate
E1206 14:02:29.761345      38 kubelet_node_status.go:92] Unable to register node "local-node" with API server: Post https://127.0.0.1:6443/api/v1/nodes: x509: certificate has expired or is not yet valid
I1206 14:02:29.816976      38 shared_informer.go:230] Caches are synced for persistent volume 
E1206 14:02:29.825875      38 kubelet.go:2268] node "local-node" not found
I1206 14:02:29.833350      38 shared_informer.go:230] Caches are synced for PVC protection 
I1206 14:02:29.881269      38 shared_informer.go:230] Caches are synced for expand 
I1206 14:02:29.891124      38 shared_informer.go:230] Caches are synced for stateful set 
I1206 14:02:29.893302      38 shared_informer.go:230] Caches are synced for attach detach 
I1206 14:02:29.911933      38 shared_informer.go:230] Caches are synced for taint 
I1206 14:02:29.912011      38 node_lifecycle_controller.go:1433] Initializing eviction metric for zone: 
W1206 14:02:29.912093      38 node_lifecycle_controller.go:1048] Missing timestamp for Node local-node. Assuming now as a timestamp.
I1206 14:02:29.912097      38 taint_manager.go:187] Starting NoExecuteTaintManager
I1206 14:02:29.912139      38 node_lifecycle_controller.go:1199] Controller detected that all Nodes are not-Ready. Entering master disruption mode.
I1206 14:02:29.912204      38 event.go:278] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"local-node", UID:"ddc88ab3-ad4e-49af-90c3-80c6609f2a45", APIVersion:"v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'RegisteredNode' Node local-node event: Registered Node local-node in Controller
E1206 14:02:29.925993      38 kubelet.go:2268] node "local-node" not found
I1206 14:02:29.955987      38 shared_informer.go:230] Caches are synced for daemon sets 
I1206 14:02:29.980826      38 log.go:172] http: TLS handshake error from 127.0.0.1:60174: remote error: tls: bad certificate
E1206 14:02:30.026129      38 kubelet.go:2268] node "local-node" not found
E1206 14:02:30.126272      38 kubelet.go:2268] node "local-node" not found
E1206 14:02:30.226424      38 kubelet.go:2268] node "local-node" not found
E1206 14:02:30.326551      38 kubelet.go:2268] node "local-node" not found
I1206 14:02:30.347250      38 request.go:621] Throttling request took 1.040385828s, request: GET:https://127.0.0.1:6444/apis/node.k8s.io/v1beta1?timeout=32s
E1206 14:02:30.426719      38 kubelet.go:2268] node "local-node" not found
E1206 14:02:30.526860      38 kubelet.go:2268] node "local-node" not found
E1206 14:02:30.627012      38 kubelet.go:2268] node "local-node" not found
E1206 14:02:30.727209      38 kubelet.go:2268] node "local-node" not found
I1206 14:02:30.800123      38 shared_informer.go:223] Waiting for caches to sync for garbage collector
E1206 14:02:30.827346      38 kubelet.go:2268] node "local-node" not found
E1206 14:02:30.927510      38 kubelet.go:2268] node "local-node" not found
I1206 14:02:30.973886      38 log.go:172] http: TLS handshake error from 127.0.0.1:60200: remote error: tls: bad certificate
time="2021-12-06T14:02:30.973951284Z" level=info msg="waiting for node local-node: Get https://127.0.0.1:6443/api/v1/nodes/local-node: x509: certificate has expired or is not yet valid"
I1206 14:02:30.980388      38 log.go:172] http: TLS handshake error from 127.0.0.1:60202: remote error: tls: bad certificate
E1206 14:02:31.027812      38 kubelet.go:2268] node "local-node" not found
I1206 14:02:31.127254      38 log.go:172] http: TLS handshake error from 127.0.0.1:60208: remote error: tls: bad certificate
E1206 14:02:31.127419       7 reflector.go:128] pkg/mod/github.com/rancher/client-go@v1.19.0-rancher.2/tools/cache/reflector.go:157: Failed to watch *summary.SummarizedObject: failed to list *summary.SummarizedObject: Get "https://127.0.0.1:6443/apis/apiregistration.k8s.io/v1/apiservices?limit=500&resourceVersion=0&timeout=15m0s": x509: certificate has expired or is not yet valid: current time 2021-12-06T14:02:31Z is after 2021-11-22T16:54:55Z
E1206 14:02:31.127947      38 kubelet.go:2268] node "local-node" not found
E1206 14:02:31.228077      38 kubelet.go:2268] node "local-node" not found
E1206 14:02:31.328215      38 kubelet.go:2268] node "local-node" not found
E1206 14:02:31.428344      38 kubelet.go:2268] node "local-node" not found
I1206 14:02:31.470413      38 log.go:172] http: TLS handshake error from 127.0.0.1:60220: remote error: tls: bad certificate
E1206 14:02:31.470456       7 reflector.go:128] pkg/mod/github.com/rancher/client-go@v1.19.0-rancher.2/tools/cache/reflector.go:157: Failed to watch *summary.SummarizedObject: failed to list *summary.SummarizedObject: Get "https://127.0.0.1:6443/api/v1/podtemplates?resourceVersion=200290079&timeout=15m0s": x509: certificate has expired or is not yet valid: current time 2021-12-06T14:02:31Z is after 2021-11-22T16:54:55Z
E1206 14:02:31.528489      38 kubelet.go:2268] node "local-node" not found
I1206 14:02:31.582257      38 log.go:172] http: TLS handshake error from 127.0.0.1:60222: remote error: tls: bad certificate
E1206 14:02:31.582278       7 leaderelection.go:321] error retrieving resource lock kube-system/cattle-controllers: Get "https://127.0.0.1:6443/api/v1/namespaces/kube-system/configmaps/cattle-controllers?timeout=15m0s": x509: certificate has expired or is not yet valid: current time 2021-12-06T14:02:31Z is after 2021-11-22T16:54:55Z
E1206 14:02:31.628646      38 kubelet.go:2268] node "local-node" not found
E1206 14:02:31.728823      38 kubelet.go:2268] node "local-node" not found
E1206 14:02:31.828990      38 kubelet.go:2268] node "local-node" not found
E1206 14:02:31.929138      38 kubelet.go:2268] node "local-node" not found
I1206 14:02:31.980692      38 log.go:172] http: TLS handshake error from 127.0.0.1:60232: remote error: tls: bad certificate
E1206 14:02:32.029318      38 kubelet.go:2268] node "local-node" not found
E1206 14:02:32.129436      38 kubelet.go:2268] node "local-node" not found
I1206 14:02:32.215834      38 log.go:172] http: TLS handshake error from 127.0.0.1:60240: remote error: tls: bad certificate
E1206 14:02:32.229587      38 kubelet.go:2268] node "local-node" not found
I1206 14:02:32.245322      38 log.go:172] http: TLS handshake error from 127.0.0.1:60242: remote error: tls: bad certificate
E1206 14:02:32.245571       7 reflector.go:128] pkg/mod/github.com/rancher/client-go@v1.19.0-rancher.2/tools/cache/reflector.go:157: Failed to watch *v3.ClusterRoleTemplateBinding: failed to list *v3.ClusterRoleTemplateBinding: Get "https://127.0.0.1:6443/apis/management.cattle.io/v3/clusterroletemplatebindings?resourceVersion=200290094": x509: certificate has expired or is not yet valid: current time 2021-12-06T14:02:32Z is after 2021-11-22T16:54:55Z
E1206 14:02:32.329721      38 kubelet.go:2268] node "local-node" not found
E1206 14:02:32.429869      38 kubelet.go:2268] node "local-node" not found
E1206 14:02:32.530048      38 kubelet.go:2268] node "local-node" not found
E1206 14:02:32.630198      38 kubelet.go:2268] node "local-node" not found
I1206 14:02:32.640085      38 shared_informer.go:230] Caches are synced for resource quota 
I1206 14:02:32.700370      38 shared_informer.go:230] Caches are synced for garbage collector 
I1206 14:02:32.704422      38 shared_informer.go:230] Caches are synced for garbage collector 
I1206 14:02:32.704471      38 garbagecollector.go:142] Garbage collector: all resource monitors have synced. Proceeding to collect garbage
I1206 14:02:32.714892      38 shared_informer.go:230] Caches are synced for resource quota 
I1206 14:02:32.723345      38 log.go:172] http: TLS handshake error from 127.0.0.1:60256: remote error: tls: bad certificate
E1206 14:02:32.723417       7 reflector.go:128] pkg/mod/github.com/rancher/client-go@v1.19.0-rancher.2/tools/cache/reflector.go:157: Failed to watch *v3.SourceCodeCredential: failed to list *v3.SourceCodeCredential: Get "https://127.0.0.1:6443/apis/project.cattle.io/v3/sourcecodecredentials?limit=500": x509: certificate has expired or is not yet valid: current time 2021-12-06T14:02:32Z is after 2021-11-22T16:54:55Z
E1206 14:02:32.730351      38 kubelet.go:2268] node "local-node" not found
E1206 14:02:32.830488      38 kubelet.go:2268] node "local-node" not found
E1206 14:02:32.930620      38 kubelet.go:2268] node "local-node" not found
I1206 14:02:32.950454      38 log.go:172] http: TLS handshake error from 127.0.0.1:60258: remote error: tls: bad certificate
time="2021-12-06T14:02:32.950484448Z" level=error msg="Unable to watch for tunnel endpoints: Get https://127.0.0.1:6443/api/v1/namespaces/default/endpoints?fieldSelector=metadata.name%3Dkubernetes&resourceVersion=0&watch=true: x509: certificate has expired or is not yet valid"
I1206 14:02:32.981698      38 log.go:172] http: TLS handshake error from 127.0.0.1:60260: remote error: tls: bad certificate
I1206 14:02:32.982220      38 log.go:172] http: TLS handshake error from 127.0.0.1:60262: remote error: tls: bad certificate
time="2021-12-06T14:02:32.982259189Z" level=info msg="waiting for node local-node: Get https://127.0.0.1:6443/api/v1/nodes/local-node: x509: certificate has expired or is not yet valid"
E1206 14:02:33.016995      38 eviction_manager.go:260] eviction manager: failed to get summary stats: failed to get node info: node "local-node" not found
E1206 14:02:33.030759      38 kubelet.go:2268] node "local-node" not found
E1206 14:02:33.130907      38 kubelet.go:2268] node "local-node" not found
I1206 14:02:33.139800      38 log.go:172] http: TLS handshake error from 127.0.0.1:60264: remote error: tls: bad certificate
E1206 14:02:33.139922       7 reflector.go:128] pkg/mod/github.com/rancher/client-go@v1.19.0-rancher.2/tools/cache/reflector.go:157: Failed to watch *summary.SummarizedObject: failed to list *summary.SummarizedObject: Get "https://127.0.0.1:6443/apis/monitoring.coreos.com/v1/prometheuses?limit=500&resourceVersion=0&timeout=15m0s": x509: certificate has expired or is not yet valid: current time 2021-12-06T14:02:33Z is after 2021-11-22T16:54:55Z
E1206 14:02:33.231098      38 kubelet.go:2268] node "local-node" not found
E1206 14:02:33.331303      38 kubelet.go:2268] node "local-node" not found
E1206 14:02:33.431495      38 kubelet.go:2268] node "local-node" not found
E1206 14:02:33.531726      38 kubelet.go:2268] node "local-node" not found
I1206 14:02:33.579713       7 leaderelection.go:278] failed to renew lease kube-system/cattle-controllers: timed out waiting for the condition
E1206 14:02:33.579824       7 leaderelection.go:297] Failed to release lock: resource name may not be empty
2021/12/06 14:02:33 [FATAL] leaderelection lost for cattle-controllers

Looking at this, I understand that a certificate issue is causing the container to fail, but I don’t understand which certificate is failing or how to regenerate it without breaking the working clusters.
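One way to narrow this down is to pull the expiry timestamp out of the x509 errors themselves. This is only a minimal sketch, assuming a POSIX shell with sed; the function name expired_cert_dates is made up for illustration and is not part of any Rancher tooling.

```shell
# Print each distinct certificate expiry timestamp found in x509 errors.
# Reads log text on stdin and matches messages of the form
# "... current time <now> is after <expiry>".
expired_cert_dates() {
  sed -n 's/.*x509: certificate has expired or is not yet valid: current time [^ ]* is after \([^ ]*\).*/\1/p' \
    | sort -u
}
```

It can be fed from the container logs, for example `sudo docker logs 261d4c9fee6e 2>&1 | expired_cert_dates`. On the excerpt above it prints a single date, 2021-11-22T16:54:55Z, and every failing request goes to https://127.0.0.1:6443, which suggests the expired certificate belongs to the K3s API server embedded in the rancher/rancher container rather than to anything on the downstream clusters.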

I’ve already tried running the following and restarting, but it didn’t solve the problem:

sudo docker exec -it 261d4c9fee6e sh -c "mv /var/lib/rancher/k3s/server/tls/dynamic-cert.json /var/lib/rancher/k3s/server/tls/dynamic-cert.json.v2"
sudo docker restart 261d4c9fee6e

Does anyone know how this could be fixed?

Thanks

I’ve seen a lot of posts about this issue, and so far the solutions all seem to revolve around disabling NTP to regenerate the certificate, e.g. [2.2.9] Rancher container restarting every 12 seconds, expired certificates · Issue #26984 · rancher/rancher · GitHub

But since mine is on RancherOS, I haven’t yet found how to disable it. Any clues, guys?

I’ve restarted this node, and for some reason docker ps is no longer responding. From the logs I see it’s hanging at “Loading containers: start.”.

I’m probably gonna consider this node dead and unrecoverable… gonna try to import the clusters into another instance of the Rancher UI, but due to the big version delta I’m not sure it will even work.

If anyone knows how to import a cluster that runs RKE on RancherOS into a more recent Rancher UI, please let me know.

I couldn’t fully recover the RancherOS node that had the rancher/rancher instance.
What I ended up doing was taking a full backup of the data volume, booting a new machine (this time using Ubuntu instead of RancherOS) and copying the data volume there.
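For reference, the backup step can be sketched roughly like this. It assumes the Rancher data (whatever is mounted at /var/lib/rancher in the container) is reachable as a plain directory on the host, that the container is ideally stopped while the archive is taken, and that the destination directory is outside the data directory. The function name backup_rancher_data and the archive naming are made up for illustration.

```shell
# Archive a data directory into a timestamped tarball and print its path.
# Usage: backup_rancher_data <data_dir> <dest_dir>
backup_rancher_data() {
  src=$1
  dest=$2
  archive="$dest/rancher-data-$(date +%Y%m%d-%H%M%S).tar.gz"
  mkdir -p "$dest" || return 1
  # -C stores paths relative to the data directory, so the archive can be
  # unpacked into a different location on the new machine.
  tar -czf "$archive" -C "$src" . || return 1
  echo "$archive"
}
```

On the new machine the archive can then be unpacked into the directory that will be mounted into the container, for example `sudo mkdir -p /opt/rancher && sudo tar -xzf <archive> -C /opt/rancher`.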

I installed Docker fresh and started a new container to replace the old machine with the following command:
sudo docker run -d --restart=unless-stopped -p 80:80 -p 443:443 -v /opt/rancher:/var/lib/rancher --privileged rancher/rancher:latest --acme-domain

(Keep IPs in mind. I don’t know the impact of changing IPs, so I booted the new machine with the same IP as the old one and stopped the old one to prevent IP collisions.)

With this I had a working container, but the UI still wasn’t showing up in the browser, so I then followed this from within the rancher/rancher container: Expired K3s certificates are not automatically rotated causing connection issues

After that, things got a bit better, but I was still getting errors on the console and wasn’t able to connect to the other clusters.

Finally I found this: Rancher Docs: Rotation of Expired Webhook Certificates

Running this, also from within the rancher/rancher container, solved the issue.

!! For some reason this last step removed my local cluster from the UI !!

Besides the removal of the local cluster from the UI, everything now seems to be working nicely. In my case the local cluster didn’t have anything on it anyway, so that was fine by me.
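For anyone repeating these steps, it can help to check which certificate files are actually expired before and after each rotation. The sketch below is an assumption-laden helper, not something from the linked articles: it assumes openssl is installed wherever you run it (for example on the host, against the mounted data directory), and the function name list_cert_expiry is made up for illustration.

```shell
# Print state, expiry date and path for every .crt file in a directory.
# Usage: list_cert_expiry <tls_dir>
list_cert_expiry() {
  for crt in "$1"/*.crt; do
    [ -e "$crt" ] || continue   # the glob matched nothing
    # -enddate prints "notAfter=<date>"; skip files that are not certificates
    end=$(openssl x509 -enddate -noout -in "$crt" 2>/dev/null) || continue
    # -checkend 0 exits non-zero when the certificate has already expired
    if openssl x509 -checkend 0 -noout -in "$crt" >/dev/null 2>&1; then
      state=valid
    else
      state=EXPIRED
    fi
    printf '%s\t%s\t%s\n' "$state" "${end#notAfter=}" "$crt"
  done
}
```

With the volume mount from the docker run command above, the K3s TLS directory would be /opt/rancher/k3s/server/tls on the host, so `list_cert_expiry /opt/rancher/k3s/server/tls` lists each certificate there. Only the first certificate in each file is inspected.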

Hope this helps others.

Thanks