Rancher 2.2.2 certificate expiration issues

Hi,

Our Rancher 2.2.2 installation (upgraded from 2.0.8 a year ago) went into an error state: the etcd server was not starting, complaining about an expired certificate.

We removed the certificates that had indeed expired (localhost.crt and token-node.crt) from /var/rancher/lib/state-management/tls and restarted the Rancher container. This fixed Rancher itself; however, it can no longer connect to the single cluster it manages (it seems Rancher no longer has the correct credentials for the cluster).

The cluster itself seems alive, but we have no way to verify it.
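For what it's worth, a crude way to check that the downstream kube-apiserver is still answering, without going through Rancher at all, is to probe its /healthz endpoint directly. This is just a sketch: the node IP is yours to fill in, and port 6443 is only the RKE default. Even a 401/403 response proves the apiserver is up and serving TLS.

```shell
# probe_apiserver HOST PORT -> prints the HTTP status code of /healthz
# ("000" means no answer at all, i.e. the apiserver is likely down).
probe_apiserver() {
  curl -k -s -o /dev/null -m 5 -w '%{http_code}' "https://$1:$2/healthz"
}

# Example (assumed node IP, RKE default port):
# probe_apiserver 10.0.0.5 6443
```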

Does someone have an idea how to recover the communication between Rancher and Kubernetes?

Thank you in advance


I have the exact same issue with a Rancher v2.2.9 Docker installation. The localhost.crt and token-node.crt certs in /var/lib/rancher/state-management/tls have expired, so the Rancher container is restarting every 11-12 seconds.
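In case it helps others diagnose the same state: a small openssl helper (my own sketch, nothing official) will print each cert's expiry date and flag the expired ones.

```shell
# check_cert FILE -> prints the cert's notAfter date, then OK or EXPIRED.
# `openssl x509 -checkend 0` exits non-zero once the notAfter date has passed.
check_cert() {
  openssl x509 -noout -enddate -in "$1" || return 1
  if openssl x509 -noout -checkend 0 -in "$1" >/dev/null; then
    echo "OK: $1"
  else
    echo "EXPIRED: $1"
  fi
}

# e.g. (path from my installation; yours may differ):
# for c in /var/lib/rancher/state-management/tls/*.crt; do check_cert "$c"; done
```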

The logs show many cert errors repeated over and over until the process stops:

2020/05/07 07:15:24 [INFO] Waiting for server to become available: Get https://localhost:6443/version?timeout=30s: x509: certificate has expired or is not yet valid
2020-05-07 07:15:24.815346 I | http: TLS handshake error from 127.0.0.1:43826: remote error: tls: bad certificate
2020-05-07 07:15:24.828573 I | http: TLS handshake error from 127.0.0.1:43876: remote error: tls: bad certificate
E0507 07:15:24.856329       5 reflector.go:134] k8s.io/client-go/informers/factory.go:127: Failed to list *v1.ReplicaSet: Get https://localhost:6443/apis/apps/v1/replicasets?limit=500&resourceVersion=0: x509: certificate has expired or is not yet valid
E0507 07:15:24.857714       5 reflector.go:134] k8s.io/kubernetes/cmd/kube-scheduler/app/server.go:178: Failed to list *v1.Pod: Get https://localhost:6443/api/v1/pods?fieldSelector=status.phase%21%3DFailed%2Cstatus.phase%21%3DSucceeded&limit=500&resourceVersion=0: x509: certificate has expired or is not yet valid
E0507 07:15:24.861677       5 reflector.go:134] k8s.io/client-go/informers/factory.go:127: Failed to list *v1.Node: Get https://localhost:6443/api/v1/nodes?limit=500&resourceVersion=0: x509: certificate has expired or is not yet valid
E0507 07:15:24.862446       5 reflector.go:134] k8s.io/client-go/informers/factory.go:127: Failed to list *v1.PersistentVolume: Get https://localhost:6443/api/v1/persistentvolumes?limit=500&resourceVersion=0: x509: certificate has expired or is not yet valid
E0507 07:15:24.863244       5 reflector.go:134] k8s.io/client-go/informers/factory.go:127: Failed to list *v1.PersistentVolumeClaim: Get https://localhost:6443/api/v1/persistentvolumeclaims?limit=500&resourceVersion=0: x509: certificate has expired or is not yet valid
2020-05-07 07:15:24.863976 I | http: TLS handshake error from 127.0.0.1:43888: remote error: tls: bad certificate

If I set the system clock to a date in the past before the certificates expired Rancher can start up, however that’s only a viable workaround for the very, very short term.

@vincent Have you guys come across this issue before? This is the second Rancher server I’ve seen it on, and I can’t find any official workaround or fix. My solution the first time was to deploy a new Rancher server and build a new cluster, because I was crunched for time.

A community member in Slack (not a Rancher staffer) suggested the following (NOTE: I haven’t personally tried this yet, so just a warning to others: I’m not recommending it at this time, just asking a question!):

rm /etc/kubernetes/ssl/*
rm /var/lib/rancher/management-state/certs/bundle.json
rm /var/lib/rancher/management-state/tls/token-node.crt
rm /var/lib/rancher/management-state/tls/localhost.crt

Would it be possible for someone at Rancher to verify if the above is a potential solution, or perhaps suggest a safe alternative?
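If anyone does try the rm commands above, a slightly safer variant is to move the files aside instead of deleting them, so they can be restored if it makes things worse. A sketch (the management-state path is taken from the commands above and may differ per install; the function name and directory arguments are just illustrative):

```shell
# Move the three management-state files aside instead of rm'ing them,
# skipping any that don't exist on this host.
backup_certs() {
  state_dir="$1"
  backup_dir="$2"
  mkdir -p "$backup_dir"
  for f in "$state_dir/certs/bundle.json" \
           "$state_dir/tls/token-node.crt" \
           "$state_dir/tls/localhost.crt"; do
    if [ -e "$f" ]; then
      mv "$f" "$backup_dir/" && echo "moved: $f"
    else
      echo "skipped (not found): $f"
    fi
  done
}

# e.g. backup_certs /var/lib/rancher/management-state /root/rancher-cert-backup
# (and similarly for /etc/kubernetes/ssl/* if that path exists on your host)
```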

Just a quick update on this: I applied the workaround today and can confirm it fixed my problem :tada: I removed the 3 files, though the /etc/kubernetes/ssl path didn’t exist for me.

I’ve logged an issue on GitHub with all the details.