Answered it myself; here’s what I did:
Copied out /opt/rke/etcd-snapshots somewhere safe.
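That backup step was just a straight copy; a minimal sketch, assuming a hypothetical destination path (anything off of /opt/rke works):

```shell
#!/bin/sh
# Hypothetical helper: copy the RKE etcd snapshot dir to a safe location.
# Usage: backup_snapshots /opt/rke/etcd-snapshots /root/etcd-snapshots-backup
backup_snapshots() {
  src=$1
  dst=$2
  mkdir -p "$dst"
  # -a preserves permissions and timestamps on the snapshot files
  cp -a "$src/." "$dst/"
}
```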
Ran my “cleanup.sh”, which blows away all the Rancher/RKE Docker containers and cleans up the leftover directories to bring the server back to a clean slate. Rancher’s access to the cluster was a hot mess of timeouts at this point.
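For reference, my cleanup.sh is essentially the standard RKE node cleanup: remove all containers, then wipe the state directories RKE leaves behind. A hedged sketch, not my exact script — the directory list follows Rancher’s node-cleanup docs (there are a few more, e.g. CNI-specific ones), and I’ve added an optional root prefix so the directory wipe can be exercised against a scratch tree instead of the real filesystem:

```shell
#!/bin/sh
# Sketch of a cleanup.sh: wipe Rancher/RKE state from a node.
# Optional $1 is a root prefix; container removal only happens on a
# real run (empty prefix), so tests against a temp dir are safe.
cleanup_node() {
  prefix=${1:-}
  if [ -z "$prefix" ] && command -v docker >/dev/null 2>&1; then
    # Remove every container on the node
    docker ps -aq | xargs -r docker rm -f
  fi
  # State directories RKE/Rancher leave behind (subset; see Rancher docs)
  for d in /etc/kubernetes /etc/cni /opt/cni /opt/rke \
           /var/lib/etcd /var/lib/cni /var/lib/kubelet /var/lib/rancher; do
    rm -rf "${prefix}${d}"
  done
}
```

Note that /opt/rke holds the etcd-snapshots folder, which is exactly why the copy in the previous step has to happen first.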
Eventually the Rancher API server calmed down. The cluster was still unavailable, but I was able to go to “Edit Cluster” and find the command to create a new etcd node. I created said node and tried to restore from snapshots. Rancher apparently still had the original list of snapshots despite /opt/rke/etcd-snapshots being empty, but when restoring from the latest one it complained: “Failed to start backup-container … stat /backup/ failed”.
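For anyone following along: the command Rancher shows under “Edit Cluster” is a docker run of the rancher-agent image with a role flag, roughly of the shape below. The version, URL, token, and checksum here are placeholders — copy the actual command from your own Rancher UI, since the token is cluster-specific.

```shell
# Placeholder values throughout; the real command comes from the
# "Edit Cluster" registration screen in the Rancher UI.
sudo docker run -d --privileged --restart=unless-stopped --net=host \
  -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run \
  rancher/rancher-agent:<version> \
  --server https://<rancher-url> --token <token> --ca-checksum <checksum> \
  --etcd
```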
I then re-populated /opt/rke/etcd-snapshots with the data from my safe location. The restore took a few minutes, with the controlplane becoming unavailable in the midst of it, but the whole process completed successfully in the background. I was able to view my workloads, showing the exact same ones as before, and hit their services successfully… that was cool.
I would suggest pre-populating /opt/rke/etcd-snapshots on your server ahead of time to shortcut that process. I’m going to try this again to validate that it goes smoother with a pre-populated etcd-snapshots folder.