ETCD backup restore fails

Hi,

I am using Rancher 2.4.5 with 3 etcd nodes and 2 control plane nodes.
I was trying to restore an etcd backup of a Rancher-launched Kubernetes cluster and got this error:

Cluster health check failed: Failed to communicate with API server: Get https://10.100.192.63:6443/api/v1/namespaces/kube-system?timeout=30s: dial tcp 10.100.192.62:6443: connect: connection refused; [etcd] Failed to restore etcd snapshot: Failed to run etcd restore container, exit status is: 1, container logs: {"level":"info","ts":1597771609.581993,"caller":"snapshot/v3_snapshot.go:287","msg":"restoring snapshot","path":"/opt/rke/etcd-snapshots/c-d6r59-rl-c6vv5_2020-08-17T23:34:35Z","wal-dir":"/opt/rke/etcd-snapshots-restore/member/wal","data-dir":"/opt/rke/etcd-snapshots-restore/","snap-dir":"/opt/rke/etcd-snapshots-restore/member/snap"} Error: snapshot missing hash but --skip-hash-check=false

I have searched the net for this error, and in every case the cause was a ".zip" extension left at the end of the file name. That is not my case, since I selected the backup from the list in the Rancher UI.
All the etcd pods are down, so I can't provide any more logs.

Regards,

Paulo Leal

I was a Rancher 1.6 Kubernetes user, and back then I was told to use a shared NFS folder to save the backup files. That worked fine, so I assumed the same strategy would work on Rancher 2.4.5: on the etcd nodes, the /opt/rke/etcd-snapshots folder mounts the same NFS folder. Is it possible that this affected the backup? Is it still possible to recover from it?

Best regards,

Paulo Leal

Did you check the NFS share to make sure the backup is actually there? You never mention that. If you have a backup somewhere, I once tricked a restore by renaming another snapshot to one of the names shown in the web UI. Other than that, you can try bringing up the etcd nodes via docker on the hosts and doing the restore manually from those containers.
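
Roughly, the manual route would look something like this (only a sketch: the etcd image tag, the restore data dir, and the mounts are assumptions from a typical RKE setup, so adjust them to your hosts):

```
# On one of the etcd hosts: restore the snapshot into a fresh data dir.
# ETCDCTL_API=3 selects the v3 CLI shipped in the etcd image.
docker run --rm \
  -e ETCDCTL_API=3 \
  -v /opt/rke/etcd-snapshots:/backup \
  -v /var/lib/etcd-restore:/var/lib/etcd-restore \
  quay.io/coreos/etcd:v3.4.3 \
  etcdctl snapshot restore "/backup/c-d6r59-rl-c6vv5_2020-08-17T23:34:35Z" \
    --data-dir /var/lib/etcd-restore

# --skip-hash-check=true would silence the exact error you hit, but it is
# intended for copies of a live data dir rather than real snapshots, so
# treat it as a last resort.
```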

Hi Lewis,

Yes, the file is there. There is also a decompressed version of the file.
I believe the problem is that, because all the nodes save their backup files to the same NFS folder, only one version of the backup (from one of the etcd nodes) actually gets saved. When Rancher then tries to restore that backup, it does not necessarily do it from the same etcd server, so the hash check fails.
I am not sure about this; it is just something that came to mind.
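
One way I thought of checking this (just a sketch, assuming etcdctl v3 is available through the etcd image on one of the hosts; the image tag is a guess on my side) is to inspect the snapshot file on the share directly:

```
# A valid v3 snapshot reports its hash, revision and key count.
# If this errors out, the file on the NFS share is probably incomplete
# or not a real snapshot file.
docker run --rm \
  -e ETCDCTL_API=3 \
  -v /opt/rke/etcd-snapshots:/backup \
  quay.io/coreos/etcd:v3.4.3 \
  etcdctl snapshot status "/backup/c-d6r59-rl-c6vv5_2020-08-17T23:34:35Z" -w table
```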

Yeah. In my experience, Rancher creates a backup at the same location on all etcd nodes. I have never heard it suggested to mount the same NFS share on all the nodes; what was suggested to me was that if S3 is not available, you copy the files off to an NFS share. You could try a manual repair:

https://rancher.com/docs/rancher/v2.x/en/cluster-admin/restoring-etcd/#recovering-etcd-without-a-snapshot
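
To illustrate the copy-off approach (the NFS mount point and the per-node layout here are assumptions on my part, not something Rancher sets up): keep the local snapshots in /opt/rke/etcd-snapshots as usual and sync them to a per-node folder on the share, for example from a cron job on each etcd node:

```
# Copy local snapshots to a per-node directory on the NFS mount so the
# etcd nodes never overwrite each other's files.
# /mnt/nfs/etcd-snapshots is an assumed mount point; adjust to your share.
rsync -a /opt/rke/etcd-snapshots/ "/mnt/nfs/etcd-snapshots/$(hostname)/"
```

That way each node keeps its own intact copy instead of three nodes writing into the same directory.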

As I said, I was a Rancher 1.6 Kubernetes user, and there the docs recommend using an NFS share to save the backup files: https://rancher.com/docs/rancher/v1.6/en/kubernetes/backups/#configuring-remote-backups
On Rancher 2.x there is no such recommendation, but I thought it would be a good policy anyway. It turns out it is not.
It could indeed be a good feature (equivalent to the S3 backup): if I had problems with all my etcd machines, I could get the backup from the NFS share and restore the cluster.