Problem with restoring RKE cluster from ETCD snapshot

michaelkoro · April 17, 2019, 5:03pm

Hello all,
I’m trying to restore an RKE cluster from an etcd snapshot, and I’m having some trouble.

I’ve followed the instructions in the docs -
https://rancher.com/docs/rancher/v2.x/en/backups/restorations/ha-restoration/

Unfortunately, after step 5 (bringing up the cluster), and rebooting the node, all the pods in my cluster seem to be stuck in a pending state.
I’ve also noticed that the calico node agent is stuck in a CrashLoopBackOff state (I’m guessing all other pods rely on the overlay network working properly, which makes sense). After reading the logs I discovered that the calico-node pod is getting an unauthorized response when trying to access the datastore (kubernetes).
I guess it has something to do with the service-accounts not being restored correctly, even though it seems there shouldn’t be any problems in that area, but still.

rke version - 0.1.14
hyperkube version - 1.11.5
calico version - 3.1.3
rancher version (that is deployed inside the cluster) - 2.1.4

Does anyone have an idea that might help?

hussein.galal · April 17, 2019, 5:41pm

It looks like the service account token key was invalidated for some reason. You can try to recover the cluster by removing the service account tokens in the following namespaces:

kube-system
cattle-system
ingress-nginx

kubectl get secret -n kube-system | awk '{ if ($2 == "kubernetes.io/service-account-token") print "kubectl -n kube-system delete secret", $1 }'
kubectl get secret -n cattle-system | awk '{ if ($2 == "kubernetes.io/service-account-token") print "kubectl -n cattle-system delete secret", $1 }'
kubectl get secret -n ingress-nginx | awk '{ if ($2 == "kubernetes.io/service-account-token") print "kubectl -n ingress-nginx delete secret", $1 }'

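These commands only print the kubectl delete commands; run the printed commands (or pipe the output to sh) to actually remove the tokens. After that, delete the pods in these namespaces so that they are recreated with valid tokens.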
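The two steps above (remove the service account token secrets, then recreate the pods) could be sketched as one script. This is only a rough sketch, not a tested recovery procedure: the function names token_secrets and cleanup_namespace are made up for illustration, and it assumes kubectl is already pointed at the restored cluster and that the plain-text output of kubectl get secret has the columns NAME TYPE DATA AGE.

```shell
#!/bin/sh
# Hypothetical helper: reads `kubectl get secret` output on stdin
# (assumed columns: NAME TYPE DATA AGE) and prints the names of the
# secrets whose type is a service account token.
token_secrets() {
  awk '$2 == "kubernetes.io/service-account-token" { print $1 }'
}

# Hypothetical helper: deletes every service account token secret in the
# given namespace, then deletes all pods there so that their controllers
# recreate them with freshly issued tokens.
cleanup_namespace() {
  ns="$1"
  kubectl -n "$ns" get secret --no-headers | token_secrets |
    while read -r name; do
      kubectl -n "$ns" delete secret "$name"
    done
  kubectl -n "$ns" delete pods --all
}
```

Nothing is deleted until cleanup_namespace is called, for example once each for kube-system, cattle-system and ingress-nginx. Deleting all pods in a namespace is disruptive, so it may be worth running kubectl get secret by hand first and checking what would be removed.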

Fraser_Goffin · April 20, 2019, 11:27am

Cycling (and restoring) Etcd nodes is definitely the hardest part of managing a custom K8s cluster. We have a mandatory control which means we must cycle our whole production estate at least once a month, and of course any time a serious enough security vulnerability is identified by our CISO (and we need to cover off DR). Etcd is always the problem child, since automation here is more problematic than for the other node types. We have a number of use cases that demonstrate the kinds of things we see, which we have logged as a support ticket with Rancher. We also have a site engineer coming to us this week who will hopefully get to the bottom of our issues for us. Undoubtedly some of these will be specific to our environment, but I suspect most are not. When we get some positive answers I’ll post them back here.

In the meantime, be aware that Etcd backup and restore is a little different in v2.2.2. If you have not moved to that version yet, keep that in mind for when you do.
