Another expired certs issue

Garry.Cyre · June 29, 2021, 2:53pm

Hi All

I’m running Rancher 2.4.5 and recently ran into the expired certificates issue. The UI is not accessible and there are error messages in the logs pointing to expired certs.

I’ve tried a number of solutions found online, including this one:

github.com/rancher/rancher

[2.2.9] Rancher container restarting every 12 seconds, expired certificates

opened 07:36AM - 08 May 20 UTC

closed 02:41PM - 14 Apr 22 UTC

justincarter

status/stale

**What kind of request is this (question/bug/enhancement/feature request):** Bug

**Steps to reproduce (least amount of steps as possible):**
- Install Rancher v2.0.0, upgrade to v2.0.2 -> v2.0.4 -> v2.0.8
- Upgrade to v2.1.6
- One year after Rancher v2.0.0 was installed, certificates expire and cluster becomes "unavailable"
- Upgrade to v2.1.9; did not fix certificate expiry/rotation issue
- Upgrade to v2.2.2, certificates rotated and cluster is available again, everything working
- One year after Rancher v2.2.2 was installed, the Rancher Server UI became unavailable due to the container restarting every 12 seconds
- Perform a backup of /var/lib/rancher; two certs inside the backup are expired and Rancher does not auto-renew them:
  - /var/lib/rancher/management-state/tls/localhost.crt
  - /var/lib/rancher/management-state/tls/token-node.crt

(I think you could simulate the above timeline by setting the system clock to a date in the past and then moving it forward at the appropriate time to reproduce a ~1 year jump).

**Result:** Running Rancher v2.2.9 as a single Docker container install, the Rancher Server UI becomes unavailable ("connection refused" in the browser) and the container is restarting every 12 seconds. Rancher is unusable.

**Environment information**
- Rancher version (`rancher/rancher`/`rancher/server` image tag or shown bottom left in the UI): rancher/rancher v2.2.9
- Installation option (single install/HA): Single install (Docker container)

**Possible Workarounds:**

**_Workaround 1)_** Set the system clock to a date in the past so that the certificate is not seen as expired. For me, on an Ubuntu server, that was achievable by disabling NTP and then setting the date and time manually:

```
sudo timedatectl set-ntp off
sudo date --set="2020-05-05 09:03:00.000"
```

This allowed the container to start up correctly and the Rancher Server UI was usable again, but this is only a short-term workaround at best.

**_Workaround 2)_** **NOTE:** I'm *not* advocating anyone use these commands on their particular installation, I'm just providing it as feedback for review by Rancher staff, because for me it solved the issue I was having... This workaround was suggested to me by a community member on Rancher's Slack.

```
rm /etc/kubernetes/ssl/*
rm /var/lib/rancher/management-state/certs/bundle.json
rm /var/lib/rancher/management-state/tls/token-node.crt
rm /var/lib/rancher/management-state/tls/localhost.crt
```

Inside the rancher container I did not have a `/etc/kubernetes/ssl` directory so I could not run that first command. The other three files did exist (and were originally visible inside the backup of `/var/lib/rancher`). Actual command I ran to remove the files (NOTE: again, please don't take this as advice, I'm just providing it for reference):

```
sudo docker exec -it acd7 sh -c "rm /var/lib/rancher/management-state/certs/bundle.json; rm /var/lib/rancher/management-state/tls/token-node.crt; rm /var/lib/rancher/management-state/tls/localhost.crt"
```

Then I enabled NTP again with `sudo timedatectl set-ntp on` to set the system clock back to the real/current time, and restarted the container with `sudo docker restart acd7`. Rancher started up correctly and was available again, clusters were visible (two AWS EC2 clusters attached to this server).

**Other details that may be helpful:**

**Images on server**

```
$ sudo docker images
REPOSITORY        TAG      IMAGE ID       CREATED         SIZE
busybox           latest   020584afccce   6 months ago    1.22 MB
rancher/rancher   v2.2.9   944b5893d458   6 months ago    483 MB
rancher/rancher   v2.1.9   9a79850e485c   12 months ago   541 MB
rancher/rancher   v2.2.2   cb5cf64e84cc   12 months ago   495 MB
alpine            latest   caf27325b298   15 months ago   5.53 MB
rancher/rancher   v2.1.6   d14ff1038a54   15 months ago   542 MB
rancher/rancher   v2.0.8   817b51fbc1fc   20 months ago   529 MB
rancher/rancher   v2.0.4   975f0d475e47   22 months ago   530 MB
rancher/rancher   v2.0.2   88526c7bea4e   23 months ago   521 MB
rancher/rancher   v2.0.0   3141e5c66ee8   2 years ago     535 MB
```

**Rancher Logs**

When the problem first occurred, Rancher starts up then shows many "bad certificate"/"certificate has expired or is not yet valid" errors:

```
2020/05/07 07:15:22 [INFO] Rancher version v2.2.9 is starting
2020/05/07 07:15:22 [INFO] Rancher arguments {ACMEDomains:[redacted] AddLocal:auto Embedded:false KubeConfig: HTTPListenPort:80 HTTPSListenPort:443 K8sMode:auto Debug:false NoCACerts:false ListenConfig:<nil> AuditLogPath:/var/log/auditlog/rancher-api-audit.log AuditLogMaxage:10 AuditLogMaxsize:100 AuditLogMaxbackup:10 AuditLevel:0}
2020/05/07 07:15:22 [INFO] Listening on /tmp/log.sock
2020/05/07 07:15:22 [INFO] Running etcd --data-dir=management-state/etcd
...
I0507 07:15:24.805853 5 naming_controller.go:284] Starting NamingConditionController
I0507 07:15:24.805873 5 establishing_controller.go:73] Starting EstablishingController
2020/05/07 07:15:24 [INFO] Waiting for server to become available: Get https://localhost:6443/version?timeout=30s: x509: certificate has expired or is not yet valid
2020-05-07 07:15:24.815346 I | http: TLS handshake error from 127.0.0.1:43826: remote error: tls: bad certificate
2020-05-07 07:15:24.828573 I | http: TLS handshake error from 127.0.0.1:43876: remote error: tls: bad certificate
E0507 07:15:24.856329 5 reflector.go:134] k8s.io/client-go/informers/factory.go:127: Failed to list *v1.ReplicaSet: Get https://localhost:6443/apis/apps/v1/replicasets?limit=500&resourceVersion=0: x509: certificate has expired or is not yet valid
E0507 07:15:24.857714 5 reflector.go:134] k8s.io/kubernetes/cmd/kube-scheduler/app/server.go:178: Failed to list *v1.Pod: Get https://localhost:6443/api/v1/pods?fieldSelector=status.phase%21%3DFailed%2Cstatus.phase%21%3DSucceeded&limit=500&resourceVersion=0: x509: certificate has expired or is not yet valid
E0507 07:15:24.861677 5 reflector.go:134] k8s.io/client-go/informers/factory.go:127: Failed to list *v1.Node: Get https://localhost:6443/api/v1/nodes?limit=500&resourceVersion=0: x509: certificate has expired or is not yet valid
E0507 07:15:24.862446 5 reflector.go:134] k8s.io/client-go/informers/factory.go:127: Failed to list *v1.PersistentVolume: Get https://localhost:6443/api/v1/persistentvolumes?limit=500&resourceVersion=0: x509: certificate has expired or is not yet valid
E0507 07:15:24.863244 5 reflector.go:134] k8s.io/client-go/informers/factory.go:127: Failed to list *v1.PersistentVolumeClaim: Get https://localhost:6443/api/v1/persistentvolumeclaims?limit=500&resourceVersion=0: x509: certificate has expired or is not yet valid
2020-05-07 07:15:24.863976 I | http: TLS handshake error from 127.0.0.1:43888: remote error: tls: bad certificate
E0507 07:15:24.864317 5 reflector.go:134] k8s.io/client-go/informers/factory.go:127: Failed to list *v1.ReplicationController: Get https://localhost:6443/api/v1/replicationcontrollers?limit=500&resourceVersion=0: x509: certificate has expired or is not yet valid
...
E0507 07:15:33.926893 5 reflector.go:134] k8s.io/client-go/informers/factory.go:127: Failed to list *v1.StorageClass: Get https://localhost:6443/apis/storage.k8s.io/v1/storageclasses?limit=500&resourceVersion=0: x509: certificate has expired or is not yet valid
2020-05-07 07:15:33.926916 I | http: TLS handshake error from 127.0.0.1:44320: remote error: tls: bad certificate
E0507 07:15:33.932574 5 reflector.go:134] k8s.io/client-go/informers/factory.go:127: Failed to list *v1.Service: Get https://localhost:6443/api/v1/services?limit=500&resourceVersion=0: x509: certificate has expired or is not yet valid
2020-05-07 07:15:33.932599 I | http: TLS handshake error from 127.0.0.1:44324: remote error: tls: bad certificate
2020-05-07 07:15:34.822709 I | http: TLS handshake error from 127.0.0.1:44328: remote error: tls: bad certificate
2020-05-07 07:15:34.825263 I | http: TLS handshake error from 127.0.0.1:44332: remote error: tls: bad certificate
F0507 07:15:34.825392 5 controllermanager.go:184] error building controller context: failed to wait for apiserver being healthy: timed out waiting for the condition: failed to get apiserver /healthz status: Get https://localhost:6443/healthz?timeout=32s: x509: certificate has expired or is not yet valid
```

I also have a copy of the logs showing the first startup after Workaround 2 above was performed; I can provide this on request if needed.
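Before deleting any of the files named in the issue above, it may be worth confirming which ones have actually expired. The following is only a sketch, not something taken from the issue: `cert_status` is a hypothetical helper name, and it assumes the `openssl` CLI is installed wherever the certificate files are readable (for example, against a backup copy of `/var/lib/rancher`).

```shell
# cert_status FILE
# Prints the certificate's notAfter date, then "still valid" or "EXPIRED".
# Exit status: 0 = still valid, 1 = expired, 2 = file unreadable or not a cert.
# Assumption: the openssl CLI is on PATH.
cert_status() (
  cert="$1"
  [ -r "$cert" ] || { echo "cannot read $cert" >&2; return 2; }

  # -enddate prints a line such as "notAfter=<date>".
  openssl x509 -noout -enddate -in "$cert" || return 2

  # -checkend 0 exits 0 only if the cert has not expired as of now.
  if openssl x509 -noout -checkend 0 -in "$cert" >/dev/null; then
    echo "still valid"
  else
    echo "EXPIRED"
    return 1
  fi
)
```

Possible usage against a backup, with the paths taken from the issue: `cert_status ./management-state/tls/localhost.crt` and `cert_status ./management-state/tls/token-node.crt`.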

I’m able to bring the UI back up, but the certificate is still invalid.
[screenshot: invalid_cert]

I cannot connect to the cluster using kubectl

```
kubectl --kubeconfig=config get nodes -o wide
Unable to connect to the server: x509: certificate has expired or is not yet valid: current time 2021-06-29T10:50:55-04:00 is after 2021-06-26T20:11:34Z
```
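The error carries both the client's current time and the certificate's expiry time, so it is possible to work out how long ago the certificate lapsed. A minimal sketch, assuming GNU `date` (the `-d` flag; BSD/macOS `date` differs) and a hypothetical helper name `expired_for`:

```shell
# expired_for "ERROR MESSAGE"
# Extracts the two timestamps from "current time X is after Y" and prints
# how long the certificate has been expired, as "<days>d <hours>h <minutes>m".
# Assumption: GNU date is available (date -d parses the timestamps).
expired_for() (
  msg="$1"
  now_ts=$(printf '%s\n' "$msg" | sed -n 's/.*current time \([^ ]*\) is after \([^ ]*\).*/\1/p')
  exp_ts=$(printf '%s\n' "$msg" | sed -n 's/.*current time \([^ ]*\) is after \([^ ]*\).*/\2/p')

  # Bail out if the message does not have the expected shape.
  [ -n "$now_ts" ] && [ -n "$exp_ts" ] || return 1

  # Convert both timestamps to seconds since the epoch.
  now_s=$(date -d "$now_ts" +%s) || return 1
  exp_s=$(date -d "$exp_ts" +%s) || return 1

  secs=$((now_s - exp_s))
  printf '%dd %dh %dm\n' "$((secs / 86400))" "$((secs % 86400 / 3600))" "$((secs % 3600 / 60))"
)
```

Fed the message above, this prints `2d 18h 39m`: the certificate expired on 26 June, almost three days before the failed `kubectl` call.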

A managed cluster hosted in Rancher is stuck in “Updating”, but it is still accessible to end users and is able to host apps.

How can I update this certificate?

thanks
Garry

eabellom · July 10, 2021, 3:42am

I have the same problem, on the same version. After a year of running fine, this same issue appeared. I tried the utility from “How to change Rancher 2.x server-url”, running “bash rancher-single-tool.sh -t'upgrade' -r'--acme-domain newhostname.company.com'” to try to force the same domain, but on restart it reports "[INFO] Waiting for k3s to start" and it is now in a worse state … Have you been able to solve it?

eabellom · July 12, 2021, 8:04am

This resolved it for me:

github.com/rancher/rancher

[2.2.9] Rancher container restarting every 12 seconds, expired certificates

opened 07:36AM - 08 May 20 UTC

justincarter

