3 node etcd cluster recovery from snapshot

michal.behun · October 3, 2019, 8:36am

Hi. We have a custom cluster created from the Rancher UI with 3 etcd nodes.
Automatic snapshots are created on all nodes. They have the same name, but they differ slightly in size across the nodes.
We can easily back up the snapshots from all 3 etcd nodes.
My question is: since they differ in size, which etcd node's snapshot should we use for recovery?
Thanks for any answer.

Fraser_Goffin · October 8, 2019, 7:57am

How different are the sizes? IME small differences in size are not unusual, and you should be able to use any of them for backup and restore. Etcd is eventually consistent, so it is possible that the one you choose is not completely consistent with all the others. But tbh any backup can only restore to a specific point in time, so in many cases you will likely need to replay deployments or other changes made since the last backup to recover the cluster state.

vincent · October 8, 2019, 3:17pm

etcd is not at all eventually consistent. Strong distributed consistency is its main (and nearly only) feature as a database. A write is not committed until a majority of nodes acknowledge it; writes are serialized, so multiple writes are guaranteed to be applied in the same order on every node; and a read is guaranteed to return a consistent value.
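To make the majority rule concrete, here is a minimal sketch of the arithmetic only. It is not etcd's implementation, and the function names are made up for illustration.

```python
def quorum_size(cluster_size: int) -> int:
    """Smallest number of members that forms a strict majority."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    return cluster_size // 2 + 1


def is_committed(acks: int, cluster_size: int) -> bool:
    """A write counts as committed once a majority has acknowledged it."""
    return acks >= quorum_size(cluster_size)


def fault_tolerance(cluster_size: int) -> int:
    """How many members can be lost while a majority is still reachable."""
    return cluster_size - quorum_size(cluster_size)
```

For the 3 node cluster in this thread, `quorum_size(3)` is 2, so a write needs two acknowledgements and the cluster can lose one member and still accept writes.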

The backups are all taken independently at slightly different times, so small variations are normal. You only really need one of them anyway; backing them all up is mostly a precaution in case one fails.
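If you still want a tie-breaker between the copies, one option (an assumption on my part, not an official Rancher procedure) is to prefer the snapshot with the highest etcd revision rather than comparing file sizes. The sketch below assumes you have already collected a revision number for each snapshot, for example from something like `etcdctl snapshot status`. The `SnapshotInfo` type, node names, and numbers are hypothetical.

```python
from typing import NamedTuple, Sequence


class SnapshotInfo(NamedTuple):
    """Hypothetical per-snapshot metadata collected by the operator."""
    node: str
    revision: int
    size_bytes: int


def pick_snapshot(snapshots: Sequence[SnapshotInfo]) -> SnapshotInfo:
    """Return the snapshot with the highest revision.

    File size is deliberately ignored; on a tie the first entry wins.
    """
    if not snapshots:
        raise ValueError("no snapshots to choose from")
    return max(snapshots, key=lambda s: s.revision)


# Made-up example values for three etcd nodes.
candidates = [
    SnapshotInfo(node="etcd-1", revision=120450, size_bytes=8_404_992),
    SnapshotInfo(node="etcd-2", revision=120463, size_bytes=8_392_704),
    SnapshotInfo(node="etcd-3", revision=120447, size_bytes=8_409_088),
]
chosen = pick_snapshot(candidates)
```

Here `chosen` is the `etcd-2` entry even though its file is the smallest, which is the point: the revision says how recent the data is, the size does not.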

Fraser_Goffin · October 15, 2019, 9:10am

@vincent thx for the clarification re: read/write consistency for Etcd.

zbz17 · October 16, 2019, 12:43pm

@vincent

If we are restoring from snapshot, should we be doing so with only one node in the cluster? We are having major issues getting restore from snapshots to work. Most of the time, the nodes all become available; however, our calico-node workload keeps failing. As a result, we see failures in other workloads like coredns, coredns-autoscaler, and metrics-server.

Is there some official process for restoring from snapshot? I know that for the Rancher cluster (built using the rke CLI), I have to restore one etcd first, then restore the cluster on that one node, and then restore the rest of the cluster. Is there a similar requirement for restoring a “Rancher Deployed Kubernetes” cluster?

This GitHub issue is exactly what I am experiencing: https://github.com/rancher/rancher/issues/23456

Thanks so much

pamela · June 26, 2020, 4:50am

I assume this includes changes done in Rancher-created clusters.
@vincent as far as I understand, Rancher’s etcd also contains the etcd data of Rancher-created clusters. Is this correct?

vincent · June 26, 2020, 5:20am

The etcd for the cluster Rancher is installed in contains info about:

global config/features (e.g. multi-cluster apps)
the Clusters that are registered to it
the users/groups/roles that should be pushed down to those clusters
a few features that are for individual clusters but not stored in them (mostly project definitions and related things like project-scoped secrets).

The standard k8s stuff (deployments, services, CRDs, etc) lives in the etcd for each individual cluster.
