ETCD persistence

Matthew_Ingle · April 2, 2019, 1:44am

We are looking into switching to Rancher from our current Kops deployment model for ease of multi-cluster management.

I have a few questions about how etcd persistence is handled when some or all of the etcd nodes are terminated. In Kops we have an EBS volume that backs all of our etcd state and is automatically reattached to the master nodes on a restart. In Rancher, the etcd data appears to be ephemeral, and upon restart it is simply synced from any available node that is currently the master etcd member.

Are there any best practices for handling this in a production deployment of a Rancher cluster? Is the assumption that if all of the etcd volumes are terminated for any reason, one should simply rely on the snapshot feature to recover the cluster state? Is there a way to have the etcd data store backed by a persistence layer?

Thank you for your time.

damlub · April 9, 2019, 8:48am

You could try implementing this: https://rancher.com/docs/rancher/v2.x/en/backups/backups/ha-backups/

Hope this helps a bit.

shubbard343 · April 10, 2019, 4:30pm

In a production environment running in HA, you should be able to reboot any of the etcd nodes, and when they come back up they will re-sync any differences with the remaining nodes. As long as enough nodes survive to maintain quorum, the rebooted node can reconcile itself. We have several Rancher-managed clusters, and we reboot the nodes regularly for maintenance and have not had any issues with etcd corruption. I believe that the etcd data is mounted as a volume on the host, so it doesn’t lose all of its data on a reboot.

And yes, you should definitely configure etcd snapshots and back them up off the hosts.
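To make the quorum point concrete, here is a minimal sketch of the majority arithmetic etcd relies on. The function names are made up for illustration and are not part of etcd, RKE, or Rancher; the only fact assumed is that quorum is a strict majority of the configured members.

```python
def quorum_size(members: int) -> int:
    """Smallest number of members that forms a strict majority."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """How many members can be lost while the rest still hold quorum."""
    return members - quorum_size(members)


def can_resync(members: int, healthy: int) -> bool:
    """True if the healthy members still hold quorum, so a rebooted
    node can catch up from them instead of needing a snapshot restore."""
    return healthy >= quorum_size(members)


if __name__ == "__main__":
    for n in (1, 3, 5):
        print(f"{n} members: quorum={quorum_size(n)}, "
              f"tolerates {fault_tolerance(n)} failure(s)")
```

With three etcd nodes, one can be down at a time; if two or more are lost at once, quorum is gone and a snapshot restore is the fallback.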

Fraser_Goffin · June 14, 2019, 7:17pm

There are many options. For example, out of the box Rancher can be configured to create recurring snapshots and upload them to S3. Recovering from a complete failure of your cluster would then be a matter of downloading the backup onto a single node and running rke snapshot-restore … In our case we back up to S3 using a K8s CronJob (we prefer managing this ourselves because there are particular aspects which Rancher’s solution doesn’t support) and have a CD pipeline that handles a cluster recovery. A single node failure is even easier to handle since, as was mentioned above, as long as quorum is maintained etcd will sync itself.
