No such container kubelet

I have just renewed the certificate for the Rancher web UI because we couldn't get onto it. The certificate on the Rancher web site has now been renewed successfully.

As part of the initial diagnostic process, however, we tried to restore from an etcd snapshot, because as I recall I could not see the cluster at the time. We also renewed the Kubernetes cluster certificates (downstream cluster).
After that we got the following message.

Looking at the Docker logs of the rancher-agent on the master (controlplane/etcd) node, we are getting this error:

time="2021-11-01T22:21:32Z" level=info msg="Execing [/usr/bin/nsenter --mount=/proc/56751/ns/mnt -F -- /data/var.lib.docker/overlay2/918631e37526c96f1583a6c68b333dfdc5d43497ef746aa7d5fbd00e78d775ee/merged/usr/bin/share-mnt --stage2 /var/lib/kubelet /var/lib/rancher -- norun]"
time="2021-11-02T11:21:32+13:00" level=info msg="Root Directory is shared, skipping stage2"
+ docker start kubelet
Error response from daemon: {"message":"No such container: kubelet"}
Error: failed to start containers: kubelet
+ sleep 2
+ docker start kubelet
Error response from daemon: {"message":"No such container: kubelet"}
Error: failed to start containers: kubelet
+ sleep 2
+ docker start kubelet

I am not quite sure why it is not starting up.

I can still see the nodes' status in the Rancher UI.

Any help would be greatly appreciated. I hope this is sufficient; I am happy to provide more. Thank you.

Another key piece of information is that this is running version 2.4.8. We will be upgrading it, and I have etcd snapshots.

We recently restarted all nodes in the downstream cluster. When we check the logs, we see that there are 3 rancher agents on the one node that is supposed to be the etcd/controlplane node.

Is there any documentation or guidance on what I can do? Is there any extra information needed in order to get help? It is much appreciated.

The agent is waiting for the kubelet container to be created so that it can start it, but that won't happen while the cluster is in the Updating state. Whatever put the cluster into that state is the root of the problem. Once you find and fix that, provisioning of the nodes can resume.
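To confirm from the agent logs that it is only stuck in this wait loop, and not failing for some other reason, something like the following sketch could count the "No such container" errors per container name. The function name `summarize_wait_loop` and the `sample` text are illustrative assumptions; the sample lines are copied from the log output in the question.

```python
import re

# Matches the Docker daemon error printed when the container does not exist yet.
MISSING = re.compile(r"No such container: (?P<name>[\w.-]+)")


def summarize_wait_loop(log_text):
    """Count 'No such container' errors per container name (hypothetical helper)."""
    counts = {}
    for line in log_text.splitlines():
        match = MISSING.search(line)
        if match:
            name = match.group("name")
            counts[name] = counts.get(name, 0) + 1
    return counts


# Sample lines taken from the agent log quoted in the question.
sample = """+ docker start kubelet
Error response from daemon: {"message":"No such container: kubelet"}
Error: failed to start containers: kubelet
+ sleep 2
+ docker start kubelet
Error response from daemon: {"message":"No such container: kubelet"}
Error: failed to start containers: kubelet
"""

print(summarize_wait_loop(sample))  # {'kubelet': 2}
```

If kubelet is the only name reported and the count keeps growing, the agent is simply retrying every two seconds while it waits for the container to be created.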

Please describe exactly what you executed on your setup and what error you got after renewing the certificate. Renewing the Rancher certificate does not renew the cluster's certificates; there is a separate process for that (Rancher Docs: Certificate Rotation).

So I guess you tried restoring an etcd snapshot. How old was that snapshot? And what are the Rancher logs from the start of that attempt until the error?

Thank you for the reply.

We managed to get the Docker containers up and running again. We had to renew the Rancher UI certificate and then restore from the etcd backup. This got the containers up and running again. We still have some issues with the logs. I can raise these in another post if we have no joy with them. Thank you once again for the reply.