3-node etcd cluster recovery from snapshot

Hi. We have a custom cluster, created from the Rancher UI, with 3 etcd nodes.
Automatic snapshots are created on all nodes. They have the same name but differ slightly in size between nodes.
We can easily back up the snapshots from all 3 etcd nodes.
My question is: since the snapshots differ in size, which etcd node's snapshot should we use for recovery?
Thanks for any answers.

How different are the sizes? It is not unusual IME to see small differences in size, and you should be able to use any of them for backup and restore. Etcd is eventually consistent, so it is possible that the one you choose is not completely consistent with all the others. But tbh any backup can only restore to a specific point in time, so in many cases you will need to replay deployments or other changes made since the last backup to recover the cluster state.

etcd is not at all eventually consistent. Strong distributed consistency is its main (and nearly only) feature as a database. A write is not committed until a majority of nodes acknowledge it; writes are serialized, so every node is guaranteed to see them in the same order; and a read is guaranteed to return a consistent value.
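To make the majority rule concrete, here is a minimal Python sketch of the quorum arithmetic only. It is not etcd's code (etcd uses Raft internally), and the function names `quorum_size`, `is_committed`, and `tolerated_failures` are made up for illustration.

```python
def quorum_size(members: int) -> int:
    """Smallest number of members that forms a strict majority."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def is_committed(acks: int, members: int) -> bool:
    """A write counts as committed once a majority has acknowledged it."""
    return acks >= quorum_size(members)


def tolerated_failures(members: int) -> int:
    """How many members can be down while a majority is still reachable."""
    return members - quorum_size(members)
```

For the 3-node cluster in this thread, `quorum_size(3)` is 2, so a write acknowledged by two nodes is committed and the cluster keeps accepting writes with one node down.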

The backups are all taken independently at slightly different times, so small variations are normal. You only really need one of them anyway; backing them all up is mostly a precaution in case one fails.
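Since any one snapshot is enough, a small helper can simply record what was collected from each node and pick one. This is a rough Python sketch, assuming the snapshot files have already been copied to local paths. The rule of preferring the most recently modified non-empty file is an assumption for illustration, not something Rancher prescribes, and the names `describe_snapshot` and `pick_snapshot` are hypothetical.

```python
import hashlib
import os
from typing import Any, Dict, List


def describe_snapshot(path: str) -> Dict[str, Any]:
    """Return size, modification time and SHA-256 digest of one snapshot file."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in 1 MiB chunks so large snapshots are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(1024 * 1024), b""):
            digest.update(chunk)
    stat = os.stat(path)
    return {
        "path": path,
        "size": stat.st_size,
        "mtime": stat.st_mtime,
        "sha256": digest.hexdigest(),
    }


def pick_snapshot(paths: List[str]) -> str:
    """Pick the newest non-empty snapshot (assumed rule, see the text above)."""
    infos = [describe_snapshot(p) for p in paths]
    usable = [info for info in infos if info["size"] > 0]
    if not usable:
        raise ValueError("no non-empty snapshot found")
    return max(usable, key=lambda info: info["mtime"])["path"]
```

Differing sizes and digests between nodes are expected here, so the digest is only useful for checking that a copy was not corrupted in transit. Modification times can also change when files are copied, so treat the selection rule as illustrative.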

@vincent thx for the clarification re: read/write consistency for etcd.

@vincent

If we are restoring from a snapshot, should we be doing so with only one node in the cluster? We are having major issues getting restores from snapshots to work. Most of the time the nodes all become available, but our calico-node workload keeps failing. As a result, we see failures in other workloads such as coredns, coredns-autoscaler, and metrics-server.

Is there an official process for restoring from a snapshot? I know that for the Rancher cluster (built using the rke CLI), I have to restore one etcd node first, then restore the cluster on that one node, and then restore the rest of the cluster. Is there a similar requirement for restoring a “Rancher Deployed Kubernetes” cluster?

This GitHub issue is exactly what I am experiencing: https://github.com/rancher/rancher/issues/23456

Thanks so much

I assume this includes changes done in Rancher-created clusters.
@vincent as far as I understand, Rancher's etcd also contains the etcd data of Rancher-created clusters. Is this correct?

The etcd for the cluster Rancher is installed in contains info about:

  • global config/features (e.g. multi-cluster apps)
  • the Clusters that are registered to it
  • the users/groups/roles that should be pushed down to those clusters
  • a few features that are for individual clusters but not stored in them (mostly project definitions and related things like project-scoped secrets).

The standard k8s stuff (deployments, services, CRDs, etc) lives in the etcd for each individual cluster.
