Rancher 2.2.2 certificate expiration issues

Hi,

Our Rancher 2.2.2 installation (upgraded from 2.0.8 one year ago) went into an error state: the etcd server was not starting, complaining of an expire

We removed the expired certificates (localhost.crt and token-node.crt) from /var/lib/rancher/management-state/tls and restarted the Rancher container. This fixed Rancher, but it now cannot connect to the single cluster it manages (it seems that Rancher doesn’t have the correct credentials for the cluster).

The cluster seems to be alive, but we have no way to check it.

Does anyone have an idea how to recover the communication between Rancher and Kubernetes?

Thank you in advance

I have the exact same issue with a Rancher v2.2.9 Docker installation. The localhost.crt and token-node.crt certs in /var/lib/rancher/management-state/tls have expired, so the Rancher container is restarting every 11-12 seconds.

The logs show many cert errors repeated over and over until the process stops:

2020/05/07 07:15:24 [INFO] Waiting for server to become available: Get https://localhost:6443/version?timeout=30s: x509: certificate has expired or is not yet valid
2020-05-07 07:15:24.815346 I | http: TLS handshake error from 127.0.0.1:43826: remote error: tls: bad certificate
2020-05-07 07:15:24.828573 I | http: TLS handshake error from 127.0.0.1:43876: remote error: tls: bad certificate
E0507 07:15:24.856329       5 reflector.go:134] k8s.io/client-go/informers/factory.go:127: Failed to list *v1.ReplicaSet: Get https://localhost:6443/apis/apps/v1/replicasets?limit=500&resourceVersion=0: x509: certificate has expired or is not yet valid
E0507 07:15:24.857714       5 reflector.go:134] k8s.io/kubernetes/cmd/kube-scheduler/app/server.go:178: Failed to list *v1.Pod: Get https://localhost:6443/api/v1/pods?fieldSelector=status.phase%21%3DFailed%2Cstatus.phase%21%3DSucceeded&limit=500&resourceVersion=0: x509: certificate has expired or is not yet valid
E0507 07:15:24.861677       5 reflector.go:134] k8s.io/client-go/informers/factory.go:127: Failed to list *v1.Node: Get https://localhost:6443/api/v1/nodes?limit=500&resourceVersion=0: x509: certificate has expired or is not yet valid
E0507 07:15:24.862446       5 reflector.go:134] k8s.io/client-go/informers/factory.go:127: Failed to list *v1.PersistentVolume: Get https://localhost:6443/api/v1/persistentvolumes?limit=500&resourceVersion=0: x509: certificate has expired or is not yet valid
E0507 07:15:24.863244       5 reflector.go:134] k8s.io/client-go/informers/factory.go:127: Failed to list *v1.PersistentVolumeClaim: Get https://localhost:6443/api/v1/persistentvolumeclaims?limit=500&resourceVersion=0: x509: certificate has expired or is not yet valid
2020-05-07 07:15:24.863976 I | http: TLS handshake error from 127.0.0.1:43888: remote error: tls: bad certificate

If I set the system clock to a date before the certificates expired, Rancher can start up. However, that’s only viable as a very short-term workaround.
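
Before changing the clock or removing anything, it may help to confirm which certificates have actually expired. The following is only a sketch: it assumes openssl is available and that the TLS directory mentioned in this thread is reachable from where the script runs (on a Docker installation that may mean running it inside the Rancher container or against a bind-mounted data directory). The TLS_DIR variable and the check_certs function are names made up for this example.

```shell
#!/bin/sh
# Sketch: report the expiry status of every .crt file in a directory.
# TLS_DIR is an assumed default taken from the paths quoted in this thread;
# override it if your installation stores the files elsewhere.
TLS_DIR="${TLS_DIR:-/var/lib/rancher/management-state/tls}"

check_certs() {
    dir="$1"
    for crt in "$dir"/*.crt; do
        # Skip the unexpanded glob when the directory has no .crt files
        [ -e "$crt" ] || continue
        if ! openssl x509 -in "$crt" -noout >/dev/null 2>&1; then
            # File could not be parsed as an X.509 certificate
            status="unreadable"
        elif openssl x509 -in "$crt" -noout -checkend 0 >/dev/null 2>&1; then
            # -checkend 0 exits 0 when the certificate has not yet expired
            status="valid"
        else
            status="EXPIRED"
        fi
        # -enddate prints a line such as "notAfter=<date>"
        end=$(openssl x509 -in "$crt" -noout -enddate 2>/dev/null)
        printf '%s\t%s\t%s\n' "$status" "$crt" "$end"
    done
}

check_certs "$TLS_DIR"
```

Each output line lists a status, the file path, and the notAfter date, so expired files such as localhost.crt or token-node.crt should stand out.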

@vincent Have you guys come across this issue before? This is the second Rancher server I’ve seen it on, and I can’t find any official workaround or fix. My solution the first time was to deploy a new Rancher server and build a new cluster because I was crunched for time.

A community member in Slack (not a Rancher staffer) suggested the following (NOTE: I haven’t personally tried this yet, so as a warning to others, I’m not recommending it at this time, just asking a question!):

rm /etc/kubernetes/ssl/*
rm /var/lib/rancher/management-state/certs/bundle.json
rm /var/lib/rancher/management-state/tls/token-node.crt
rm /var/lib/rancher/management-state/tls/localhost.crt
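
A more cautious variant of the commands above would move the files aside instead of deleting them, so they can be put back if something goes wrong. This is an untested sketch, not an official procedure: the backup_and_remove function and its two arguments are hypothetical names, it covers only the three files under management-state, and /etc/kubernetes/ssl is left out.

```shell
#!/bin/sh
# Sketch: move the three files named above into a backup directory
# instead of deleting them outright.
# Usage: backup_and_remove <state-dir> <backup-dir>
backup_and_remove() {
    state="$1"
    backup="$2"
    mkdir -p "$backup" || return 1
    for f in certs/bundle.json tls/token-node.crt tls/localhost.crt; do
        src="$state/$f"
        if [ -e "$src" ]; then
            # Recreate the sub-directory layout inside the backup directory
            mkdir -p "$backup/$(dirname "$f")"
            mv "$src" "$backup/$f" && echo "moved $src"
        else
            echo "skipped $src (not found)"
        fi
    done
}
```

It could be called as, for example, backup_and_remove /var/lib/rancher/management-state /root/rancher-cert-backup (the backup location is arbitrary), followed by a restart of the Rancher container.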

Would it be possible for someone at Rancher to verify if the above is a potential solution, or perhaps suggest a safe alternative?

Just a quick update on this: I tried the workaround today and can confirm that removing the 3 files fixed my problem :tada:. The /etc/kubernetes/ssl path didn’t exist for me, though.

I’ve logged an issue on GitHub with all the details.

I have the same issue (but my Rancher server version is v2.3.2) and tried to solve it by following one of the suggested solutions from the GitHub ticket mentioned.
See my comment at https://github.com/rancher/rancher/issues/26984#issuecomment-718898677

Unfortunately, my Rancher server is now stuck at:
2020/10/29 16:55:39 [INFO] Waiting for server to become available: Get https://localhost:6443/version?timeout=30s: x509: certificate signed by unknown authority

Does anybody have an idea how to fix this Rancher server issue?

As mentioned here: https://github.com/rancher/rancher/issues/26984#issuecomment-720320606
I was able to resolve my issue by deleting some other files too (/var/lib/rancher/management-state/tls) and restarting the Rancher server.