Our Rancher 2.2.2 installation (upgraded from 2.0.8 one year ago) went into an error state: the etcd server was not starting, complaining about an expired certificate.
We removed the (indeed expired) certificates localhost.crt and token-node.crt from /var/lib/rancher/state-management/tls and restarted the Rancher container. This fixed Rancher itself; however, it can no longer connect to the single cluster it manages (it seems that Rancher doesn't have the correct credentials for the cluster).
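For anyone in the same spot: before removing anything, a minimal sketch like the one below can confirm which of those certificates have actually expired. It assumes the 'cryptography' Python package is installed and that the state directory is reachable on the host at the path mentioned in this thread; adjust the path to wherever your Rancher container's state volume is mounted.

```python
# Minimal sketch (not an official Rancher tool): print the validity window of each
# certificate in Rancher's state TLS directory, so you can see which ones are expired.
# The path below is taken from this thread and is an assumption about your host layout.
from datetime import datetime
from pathlib import Path

from cryptography import x509

STATE_TLS_DIR = Path("/var/lib/rancher/state-management/tls")  # assumed host-side path

for crt in sorted(STATE_TLS_DIR.glob("*.crt")):
    cert = x509.load_pem_x509_certificate(crt.read_bytes())
    expired = cert.not_valid_after < datetime.utcnow()
    status = "EXPIRED" if expired else "still valid"
    print(f"{crt.name}: {cert.not_valid_before} -> {cert.not_valid_after} ({status})")
```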
The cluster itself seems alive, but we have no way to check it.
Does someone have an idea how to recover the communication between Rancher and Kubernetes?
I have the exact same issue with a Rancher v2.2.9 Docker installation. The localhost.crt and token-node.crt certs in /var/lib/rancher/state-management/tls have expired, so the Rancher container is restarting every 11-12 seconds.
The logs show many cert errors repeated over and over until the process stops:
2020/05/07 07:15:24 [INFO] Waiting for server to become available: Get https://localhost:6443/version?timeout=30s: x509: certificate has expired or is not yet valid
2020-05-07 07:15:24.815346 I | http: TLS handshake error from 127.0.0.1:43826: remote error: tls: bad certificate
2020-05-07 07:15:24.828573 I | http: TLS handshake error from 127.0.0.1:43876: remote error: tls: bad certificate
E0507 07:15:24.856329 5 reflector.go:134] k8s.io/client-go/informers/factory.go:127: Failed to list *v1.ReplicaSet: Get https://localhost:6443/apis/apps/v1/replicasets?limit=500&resourceVersion=0: x509: certificate has expired or is not yet valid
E0507 07:15:24.857714 5 reflector.go:134] k8s.io/kubernetes/cmd/kube-scheduler/app/server.go:178: Failed to list *v1.Pod: Get https://localhost:6443/api/v1/pods?fieldSelector=status.phase%21%3DFailed%2Cstatus.phase%21%3DSucceeded&limit=500&resourceVersion=0: x509: certificate has expired or is not yet valid
E0507 07:15:24.861677 5 reflector.go:134] k8s.io/client-go/informers/factory.go:127: Failed to list *v1.Node: Get https://localhost:6443/api/v1/nodes?limit=500&resourceVersion=0: x509: certificate has expired or is not yet valid
E0507 07:15:24.862446 5 reflector.go:134] k8s.io/client-go/informers/factory.go:127: Failed to list *v1.PersistentVolume: Get https://localhost:6443/api/v1/persistentvolumes?limit=500&resourceVersion=0: x509: certificate has expired or is not yet valid
E0507 07:15:24.863244 5 reflector.go:134] k8s.io/client-go/informers/factory.go:127: Failed to list *v1.PersistentVolumeClaim: Get https://localhost:6443/api/v1/persistentvolumeclaims?limit=500&resourceVersion=0: x509: certificate has expired or is not yet valid
2020-05-07 07:15:24.863976 I | http: TLS handshake error from 127.0.0.1:43888: remote error: tls: bad certificate
If I set the system clock to a date in the past, before the certificates expired, Rancher can start up; however, that's only a viable workaround for the very short term.
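If you do resort to that clock trick, a tiny helper along the same lines (same assumed path and 'cryptography' package as the sketch above) prints the latest point in time at which every certificate in that directory is still valid, i.e. the newest date you could temporarily set the clock to:

```python
# Rough helper for the (short-term!) clock-rollback workaround: the earliest
# 'not valid after' timestamp among the certs is the latest moment at which
# they are all still accepted. Same path assumption as the earlier sketch.
from pathlib import Path

from cryptography import x509

STATE_TLS_DIR = Path("/var/lib/rancher/state-management/tls")  # assumed host-side path

expiries = [
    x509.load_pem_x509_certificate(p.read_bytes()).not_valid_after
    for p in STATE_TLS_DIR.glob("*.crt")
]
print("all certs valid until:", min(expiries))
```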
@vincent Have you guys come across this issue before? This is the second Rancher server I've seen it on, and I can't find any official workaround or fix. My solution the first time was to deploy a new Rancher server and build a new cluster because I was crunched for time.
A community member in Slack (not a Rancher staffer) suggested the following (NOTE: I haven't personally tried this yet, so just a warning to others: I'm not recommending it at this time, just asking a question!):
Just a quick update on this: I tried the workaround today by removing the 3 files, and it did fix my problem; however, the /etc/kubernetes/ssl path didn't exist for me.
I’ve logged an Issue on GitHub with all the details;
Unfortunately, my Rancher server is now stuck on: 2020/10/29 16:55:39 [INFO] Waiting for server to become available: Get https://localhost:6443/version?timeout=30s: x509: certificate signed by unknown authority
Does anybody have an idea how to fix such a Rancher server issue?
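In general, "x509: certificate signed by unknown authority" means the certificate presented on localhost:6443 was not signed by a CA the client trusts. A hedged diagnostic sketch, assuming the same state directory as above and a CA file named ca.crt next to the serving cert (both file names are assumptions; adjust to whatever your installation actually contains), is to compare the issuer of localhost.crt against that CA's subject:

```python
# Hedged diagnostic sketch: check whether localhost.crt was issued by the CA
# certificate sitting in the same state directory. The path and the file names
# (localhost.crt, ca.crt) are assumptions based on this thread -- adjust them.
from pathlib import Path

from cryptography import x509

TLS_DIR = Path("/var/lib/rancher/state-management/tls")  # assumed host-side path

server_cert = x509.load_pem_x509_certificate((TLS_DIR / "localhost.crt").read_bytes())
ca_cert = x509.load_pem_x509_certificate((TLS_DIR / "ca.crt").read_bytes())  # assumed CA file name

print("localhost.crt issuer:", server_cert.issuer.rfc4514_string())
print("ca.crt subject      :", ca_cert.subject.rfc4514_string())
if server_cert.issuer != ca_cert.subject:
    print("-> issuer does not match this CA; the serving cert was signed by a different/older CA")
```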