ETCD backup restore fails

paulo.leal · August 18, 2020, 5:46pm

Hi,

I am using Rancher 2.4.5 with 3 etcd nodes and 2 control plane nodes.
I was trying to restore the etcd backup of a Rancher Launched Kubernetes cluster and got this error:

Cluster health check failed: Failed to communicate with API server: Get https://10.100.192.63:6443/api/v1/namespaces/kube-system?timeout=30s: dial tcp 10.100.192.62:6443: connect: connection refused; [etcd] Failed to restore etcd snapshot: Failed to run etcd restore container, exit status is: 1, container logs: {"level":"info","ts":1597771609.581993,"caller":"snapshot/v3_snapshot.go:287","msg":"restoring snapshot","path":"/opt/rke/etcd-snapshots/c-d6r59-rl-c6vv5_2020-08-17T23:34:35Z","wal-dir":"/opt/rke/etcd-snapshots-restore/member/wal","data-dir":"/opt/rke/etcd-snapshots-restore/","snap-dir":"/opt/rke/etcd-snapshots-restore/member/snap"} Error: snapshot missing hash but --skip-hash-check=false

I have searched the net for this error, and in every case I found the cause was a “.zip” extension at the end of the file name. That is not my case, since I selected the backup from the list in the Rancher UI.
All the etcd pods are down, so I can’t provide any more logs.

Regards,

Paulo Leal

paulo.leal · August 18, 2020, 6:16pm

I was a Rancher 1.6 k8s user, and back then I was told to use a shared NFS folder to save the backup files. It worked fine, so I adopted the same strategy on Rancher 2.4.5: on every etcd node, the /opt/rke/etcd-snapshots folder is a mount of the same NFS folder. Is it possible that this affected the backup? Is it still possible to recover the backup?

Best regards,

Paulo Leal

Lewis_Carroll · August 27, 2020, 1:21am

Did you check the NFS share to make sure that the backup is there? You never mention that. If you have a backup somewhere, I once tricked a restore by renaming another backup file to one of the names listed in the web UI. Other than that, you can try bringing up the etcd nodes via docker on the hosts and doing the restore manually via the pods.

paulo.leal · August 27, 2020, 1:45am

Hi Lewis,

Yes, the file is there. There is also the decompressed version of the file.
I believe the problem is that when all nodes save their backup files to the same NFS folder, only one version of the backup (from one of the etcd nodes) is actually saved. Then, when Rancher tries to restore this backup, it does not do it from the same etcd server, so the hash check fails.
I am not sure about this. It is just something that came to my mind.
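
One way to test this hypothesis is to check the snapshot file on the NFS share directly. The sketch below is not from Rancher or etcd; it assumes the etcd v3 snapshot layout, in which `etcdctl snapshot save` appends a 32-byte SHA-256 digest to the database file, so that a file carrying a hash has a size of 32 modulo 512. The function names are hypothetical.

```python
import hashlib
import os

SHA256_SIZE = hashlib.sha256().digest_size  # 32 bytes


def snapshot_has_hash(path):
    """Heuristic: True when the file size suggests a trailing SHA-256 digest.

    Assumption: the database part of the snapshot is a multiple of 512 bytes.
    """
    return os.path.getsize(path) % 512 == SHA256_SIZE


def snapshot_hash_matches(path, chunk_size=1 << 20):
    """Recompute SHA-256 over all but the last 32 bytes and compare it
    with the digest stored in those last 32 bytes."""
    size = os.path.getsize(path)
    if size < SHA256_SIZE:
        return False
    remaining = size - SHA256_SIZE
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while remaining > 0:
            chunk = f.read(min(chunk_size, remaining))
            if not chunk:
                break  # file shrank while reading
            digest.update(chunk)
            remaining -= len(chunk)
        stored = f.read(SHA256_SIZE)
    return digest.digest() == stored
```

Calling `snapshot_has_hash("/opt/rke/etcd-snapshots/c-d6r59-rl-c6vv5_2020-08-17T23:34:35Z")` on an etcd node would show whether the file still has the size a hashed snapshot should have. If it returns False, the file was probably truncated or overwritten, which would be consistent with the "snapshot missing hash" error.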

Lewis_Carroll · August 27, 2020, 12:59pm

Yeah. In my experience Rancher creates a backup at the same location on all etcd nodes. I have never heard it suggested to mount the same NFS share on all the nodes. What was suggested to me was, if S3 is not available, to copy the files off to an NFS share. You could try a manual repair.

https://rancher.com/docs/rancher/v2.x/en/cluster-admin/restoring-etcd/#recovering-etcd-without-a-snapshot
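
On the point about copying files off to an NFS share instead of mounting it on every node: a minimal sketch of such a copy job, assuming each node writes into its own subfolder so that nodes never overwrite each other's files. The function name, the NFS mount point, and the node name are hypothetical; only /opt/rke/etcd-snapshots comes from this thread.

```python
import shutil
from pathlib import Path


def copy_snapshots(src_dir, dest_root, node_name):
    """Copy snapshot files into dest_root/node_name.

    Files that already exist at the destination with the same size are
    skipped. Returns the list of destination paths that were copied.
    """
    dest_dir = Path(dest_root) / node_name
    dest_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(Path(src_dir).iterdir()):
        if not src.is_file():
            continue
        dest = dest_dir / src.name
        if dest.exists() and dest.stat().st_size == src.stat().st_size:
            continue  # already copied on an earlier run
        shutil.copy2(src, dest)  # copy2 also preserves timestamps
        copied.append(dest)
    return copied
```

It could be run periodically on each etcd node, for example as `copy_snapshots("/opt/rke/etcd-snapshots", "/mnt/nfs/etcd-backups", "etcd-1")`. The sketch does not guard against copying a snapshot that is still being written.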

paulo.leal · August 27, 2020, 3:57pm

As I said, I was a Rancher 1.6 Kubernetes user. The 1.6 docs recommend using NFS to save the backup files: https://rancher.com/docs/rancher/v1.6/en/kubernetes/backups/#configuring-remote-backups
For Rancher 2.* there is no such recommendation, but I thought it would be a good policy. It turned out it is not.
It would still be a good feature (equivalent to the S3 backup): if I had problems with all my etcd machines, I could get the backup from my NFS share and restore the cluster.
