ETCD persistence

#1

We are looking into switching from our current Kops deployment model to Rancher for easier multi-cluster management.

I have a few questions about how etcd persistence is handled when some or all of the etcd nodes are terminated. In Kops, an EBS volume backs all of our etcd state and is automatically reattached to the master nodes on restart. With Rancher, the etcd data appears to be ephemeral, and on restart it is simply synced from any available node that is currently the master etcd member.

Are there any best practices for handling this in a production deployment of a Rancher cluster? Is the assumption that if all of the etcd volumes are terminated for any reason, one should simply rely on the snapshot feature to recover the cluster state? Is there a way to have the etcd data store backed by a persistence layer?

Thank you for your time.

#2

You could try implementing this: https://rancher.com/docs/rancher/v2.x/en/backups/backups/ha-backups/

Hope this helps a bit.

#3

In a production environment running in HA, you should be able to reboot any of the etcd nodes, and when they come back up they will re-sync any differences with the remaining nodes. As long as enough nodes survive to maintain quorum, the rebooted node can reconcile itself. We have several Rancher-managed clusters, we reboot the nodes regularly for maintenance, and we have not had any issues with etcd corruption. I believe the etcd data is mounted as a volume on the host, so it doesn’t lose all of its data on a reboot.
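
To make the quorum point concrete, here is a small sketch of the arithmetic. etcd needs a majority of its voting members to stay available, so the number of nodes you can lose at once follows directly from the cluster size. The function names are made up for illustration and are not part of any Rancher or etcd tooling.

```python
def quorum_size(members: int) -> int:
    """Smallest majority of a cluster with `members` voting members."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def failures_tolerated(members: int) -> int:
    """How many members can be down while a majority is still up."""
    return members - quorum_size(members)


if __name__ == "__main__":
    # Typical etcd cluster sizes; even sizes add no extra fault tolerance.
    for n in (1, 3, 4, 5):
        print(f"{n} etcd nodes: quorum={quorum_size(n)}, "
              f"can lose {failures_tolerated(n)}")
```

So with three etcd nodes you can reboot one at a time safely, but losing two at once (or all of them) means restoring from a snapshot.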

And yes, you should definitely configure etcd snapshots and back them up off the hosts.
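
As a rough illustration of "back them up off the hosts", the sketch below copies the newest snapshot file from a local directory to a second location. Everything here is an assumption on my part: the function name is hypothetical, both directories are placeholders you would pass in yourself, and the destination is assumed to be something that lives off the node (for example a mounted network share). It is not how Rancher itself ships snapshots.

```python
import shutil
from pathlib import Path
from typing import Optional


def copy_latest_snapshot(snapshot_dir: str, backup_dir: str) -> Optional[Path]:
    """Copy the most recently modified file in snapshot_dir into backup_dir.

    Both paths are placeholders supplied by the caller. Returns the path
    of the copy, or None if snapshot_dir contains no files.
    """
    candidates = [p for p in Path(snapshot_dir).iterdir() if p.is_file()]
    if not candidates:
        return None

    # Pick the newest snapshot by modification time.
    latest = max(candidates, key=lambda p: p.stat().st_mtime)

    target_dir = Path(backup_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    # copy2 also preserves timestamps, which helps when pruning old backups.
    return Path(shutil.copy2(latest, target_dir / latest.name))
```

You would run something like this from cron or a systemd timer on each etcd node, pointing the second argument at storage that survives the node being terminated.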