Restore a cluster from etcd snapshot

#1

Hi All,

I have a “single node cluster” with all roles (etcd + controlplane + worker) registered in my Rancher 2 server.

I have found that the cluster’s etcd snapshots are saved in /opt/rke/. Now my question is:

If I lose the node/cluster, how can I restore it from these snapshots?
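For reference, here is a quick way to see which local snapshots exist. The helper name is mine, and /opt/rke/etcd-snapshots is assumed to be the default local snapshot path for an RKE-provisioned cluster; adjust for your node:

```shell
#!/bin/sh
# list_snapshots DIR: print the snapshot files in DIR, newest first.
# (Helper name is mine; the default path below is an assumption for a
# default RKE install.)
list_snapshots() {
  ls -1t "$1" 2>/dev/null || true
}

list_snapshots "${SNAPSHOT_DIR:-/opt/rke/etcd-snapshots}"
```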

Thanks,
François.

#2

Got this question myself. In my case, etcd is on its own single node, and I can see the snapshots, but if I reimage this machine, how do I restore from those snapshots?

The Rancher 2.x documentation suggests you can only handle this scenario if you have S3 backups of the snapshots. If you only have a local copy of a snapshot, how can this be done? Would it be possible to deploy a new etcd node and manually copy the snapshots into the new node’s /opt/rke/etcd-snapshots folder?

#3

Answered it myself; here’s what I did:

Copied out /opt/rke/etcd-snapshots somewhere safe.
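Concretely, that first step can be as simple as the sketch below. The helper name and both directory defaults are my assumptions for a default RKE install; adjust them for your node:

```shell
#!/bin/sh
# backup_snapshots SRC DST: copy everything in SRC into DST, preserving
# timestamps and permissions (-a) so the files look unchanged when they
# are copied back later.
backup_snapshots() {
  mkdir -p "$2"
  cp -a "$1"/. "$2"/
}

# Defaults are assumptions for a default RKE install; adjust for your node.
SNAPSHOT_DIR="${SNAPSHOT_DIR:-/opt/rke/etcd-snapshots}"
BACKUP_DIR="${BACKUP_DIR:-/root/etcd-snapshot-backup}"
if [ -d "$SNAPSHOT_DIR" ]; then
  backup_snapshots "$SNAPSHOT_DIR" "$BACKUP_DIR"
fi
```

Copying to another machine entirely (scp/rsync) is safer still, since the point of the exercise is surviving a reimage of this host.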

Ran my “cleanup.sh”, which blows away all the Rancher/RKE Docker containers and cleans up the folders to bring the server back to a clean slate. Rancher’s access to the cluster was a hot mess of timeouts at this point.
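The actual cleanup.sh isn’t shown in this thread; a minimal sketch of what such a script typically does might look like the following. The directory list is an assumption based on common Rancher/RKE installs (verify against your environment), and it defaults to a dry run so you can see what it would remove first:

```shell
#!/bin/sh
# Hedged sketch of a node cleanup script (the real "cleanup.sh" is not
# shown in the thread). Paths are assumptions for a typical Rancher/RKE
# node; check them before running for real.
DRY_RUN="${DRY_RUN:-1}"   # default to dry-run: only print what would happen

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

# Remove every container on the node (RKE runs its components as containers).
run sh -c 'docker rm -f $(docker ps -aq)'

# Wipe the state directories so the node comes back as a clean slate.
for d in /etc/kubernetes /var/lib/etcd /var/lib/rancher /var/lib/kubelet /opt/rke; do
  run rm -rf "$d"
done
```

Run it once with the default DRY_RUN=1 to review the plan, then with DRY_RUN=0 to actually wipe the node.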

Eventually the Rancher API server calmed down. The cluster was unavailable, but I was able to “Edit Cluster” and find the command to create a new etcd node. I created that node and tried to restore from snapshots. Rancher apparently still had the original list of snapshots despite /opt/rke/etcd-snapshots being empty, but when restoring from the latest one it complained: “Failed to start backup-container … blahblahblah … stat /backup/ failed”

I then re-populated /opt/rke/etcd-snapshots with the data from my safe location. It took a few minutes, with the controlplane becoming unavailable in the middle, but the whole restore process completed successfully in the background. I was able to view my workloads, which showed the exact same ones as before, and hit their services successfully… that was cool.

I would suggest pre-populating /opt/rke/etcd-snapshots on your server ahead of time to shortcut that process. I’m going to try this again to validate that it goes more smoothly with a pre-populated etcd-snapshots folder.
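Pre-populating is just the copy in the other direction: put the saved files back where the new etcd node expects to find them. As before, the helper name and default paths are my assumptions for a default RKE install:

```shell
#!/bin/sh
# restore_snapshots SRC DST: copy the saved snapshot files back into the
# directory the new etcd node expects. Helper name is mine; paths are
# assumptions for a default RKE install.
restore_snapshots() {
  mkdir -p "$2"
  cp -a "$1"/. "$2"/
}

BACKUP_DIR="${BACKUP_DIR:-/root/etcd-snapshot-backup}"
SNAPSHOT_DIR="${SNAPSHOT_DIR:-/opt/rke/etcd-snapshots}"
if [ -d "$BACKUP_DIR" ]; then
  restore_snapshots "$BACKUP_DIR" "$SNAPSHOT_DIR"
fi
```

Do this on the new etcd node before clicking “Restore Snapshot”, so the files are in place when Rancher’s backup container goes looking for them.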

#4

Update: it looks like when you add a new etcd node, it can’t really “register” with Rancher until you begin the snapshot restore process, or at least that’s how it seemed to me. The node was stuck in limbo trying to register with Rancher until I went and hit “Restore Snapshot”; then I noticed the etcd container start on my etcd host, and the restore process got underway on its own. Pre-populating /opt/rke/etcd-snapshots seemed to help a bit.