Problem with restoring RKE cluster from ETCD snapshot

#1

Hello all,
I’m trying to restore an RKE cluster from an etcd snapshot, and I’m running into trouble.

I’ve followed the instructions in the docs -
https://rancher.com/docs/rancher/v2.x/en/backups/restorations/ha-restoration/

Unfortunately, after step 5 (bringing up the cluster) and rebooting the node, all the pods in my cluster seem to be stuck in the Pending state.
I’ve also noticed that the calico-node pod is stuck in CrashLoopBackOff (I’m guessing all the other pods rely on the overlay network working properly, which makes sense). After reading the logs, I discovered that the calico-node pod gets an unauthorized response when it tries to access the datastore (kubernetes).
I guess it has something to do with the service accounts not being restored correctly, even though it seems there shouldn’t be any problems in that area.

rke version - 0.1.14
hyperkube version - 1.11.5
calico version - 3.1.3
rancher version (that is deployed inside the cluster) - 2.1.4

Does anyone have an idea that might help?

#2

Looks like the service account token key was invalidated for some reason. You can recover the cluster by removing the service account tokens in the following namespaces:

kube-system
cattle-system
ingress-nginx

kubectl get secret -n kube-system | awk '{ if ($2 == "kubernetes.io/service-account-token") print "kubectl -n kube-system delete secret", $1 }'
kubectl get secret -n cattle-system | awk '{ if ($2 == "kubernetes.io/service-account-token") print "kubectl -n cattle-system delete secret", $1 }'
kubectl get secret -n ingress-nginx | awk '{ if ($2 == "kubernetes.io/service-account-token") print "kubectl -n ingress-nginx delete secret", $1 }'

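Note that these commands only print the kubectl delete commands; run the printed commands to actually remove the secrets. After that, delete the pods in these namespaces so they are recreated with the right token.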
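
The steps above could be wrapped up roughly like this. It is only a sketch, not a tested procedure: the function names (gen_delete_cmds, print_all_delete_cmds, recreate_pods) are made up for illustration, and it assumes that `kubectl get secret -n <namespace>` prints its columns in the order NAME TYPE DATA AGE.

```shell
# Read `kubectl get secret -n <ns>` output on stdin and print one
# delete command per service account token secret.
# Assumes the column order NAME TYPE DATA AGE.
gen_delete_cmds() {
  awk -v ns="$1" '$2 == "kubernetes.io/service-account-token" {
    print "kubectl -n " ns " delete secret " $1
  }'
}

# Print (do not run) the delete commands for the three namespaces above.
print_all_delete_cmds() {
  for ns in kube-system cattle-system ingress-nginx; do
    kubectl get secret -n "$ns" | gen_delete_cmds "$ns"
  done
}

# Delete all pods in the three namespaces so their controllers
# recreate them with freshly issued tokens.
recreate_pods() {
  for ns in kube-system cattle-system ingress-nginx; do
    kubectl delete pods --all -n "$ns"
  done
}
```

Calling print_all_delete_cmds lets you review the commands first; piping its output to sh would execute them, and recreate_pods would then be run afterwards. Nothing is deleted until you do one of those explicitly.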

#3

Cycling (and restoring) etcd nodes is definitely the hardest part of managing a custom K8s cluster. We have a mandatory control which means we must cycle our whole production estate at least once a month, and of course any time a serious enough security vulnerability is identified by our CISO (and we need to cover off DR as well). Etcd is always the problem child, since automation here is more problematic than for the other node types. We have a number of use cases that demonstrate the kinds of things we see, which we have logged as a support ticket with Rancher. We also have a site engineer coming to us this week who will hopefully get to the bottom of our issues. Undoubtedly some of these will be specific to our environment, but I suspect most are not. When we get some positive answers I’ll post them back here.

In the meantime, be aware that etcd backup and restore is a little different in v2.2.2. If you have moved to that version, take that into account; if not, keep it in mind for when you do.