We have a 4-node Custom cluster and are having issues with the snapshot. Rancher was reporting that it could not find one of the nodes (which was no longer in the cluster).
So we SSH’d into the cluster and removed the offending etcd member that no longer existed (remove member <node_id>). All 3 remaining etcd nodes claim to be ‘healthy’, but the cluster can’t recover: Rancher is still unhappy, and because the 4th node never spins up we reduced the node count. We tried rebooting the etcd leader to force a leader change to see if that would resolve the problem, but Rancher is still not happy.
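For anyone trying to reproduce the stale-member check, here is a rough Python sketch of the idea, not the exact steps we ran. It assumes the comma-separated output of `etcdctl member list` (ID, status, name, peer URL, client URL). The member IDs, names and IPs in `SAMPLE` are made up for illustration, and the helper names are hypothetical.

```python
from urllib.parse import urlparse

# Made-up sample in the comma-separated layout assumed for `etcdctl member list`.
SAMPLE = """
1a2b3c4d5e6f7a8b, started, etcd-node1, https://10.0.0.1:2380, https://10.0.0.1:2379
2b3c4d5e6f7a8b9c, started, etcd-node2, https://10.0.0.2:2380, https://10.0.0.2:2379
3c4d5e6f7a8b9c0d, started, etcd-node3, https://10.0.0.3:2380, https://10.0.0.3:2379
4d5e6f7a8b9c0d1e, started, etcd-node4, https://10.0.0.4:2380, https://10.0.0.4:2379
"""


def parse_member_list(output):
    """Turn each comma-separated member line into a small dict."""
    members = []
    for line in output.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) < 4:
            continue  # skip blank or unexpected lines
        members.append({
            "id": fields[0],
            "status": fields[1],
            "name": fields[2],
            "peer_url": fields[3],
        })
    return members


def stale_members(members, live_ips):
    """Return members whose peer URL host is not one of the live node IPs."""
    return [
        member for member in members
        if urlparse(member["peer_url"]).hostname not in live_ips
    ]


if __name__ == "__main__":
    live = {"10.0.0.1", "10.0.0.2", "10.0.0.3"}  # hypothetical surviving nodes
    for member in stale_members(parse_member_list(SAMPLE), live):
        print(member["id"], member["peer_url"])
```

The ID it prints would be the one to pass to the member removal command. It only compares IPs, so it says nothing about what Rancher itself still has recorded for the cluster.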
<Red Herring?> When I click on the Kubeconfig File for the cluster in Rancher, it still shows 4 nodes: 3 have the correct IPs, and the 4th is the one that we removed.</Red Herring?>
Is there something we need to do to reconfigure the information that Rancher holds about the cluster? I could not find a way to do this, either through the UI or by searching for info.
Should we follow the guide to spin down to one node and back up again (sorry, I do not have the link available) and just accept a small outage?
Or do we spin up a new cluster, get Rancher to add it as a Custom cluster again, and then delete the offending cluster from Rancher?
Would rebooting the Rancher HA help (have you tried turning it off and on again)? Sorry, it is getting late.
The error message from within Rancher:
From the existing nodes (Custom) running on RancherOS (1.5.4 and 1.5.5), 3 nodes running etcd, worker and control plane: `Failed to reconcile etcd plane: Etcd plane nodes are replaced. Stopping provisioning. Please restore your cluster from backup.`
Any guidance would be appreciated.