Hello community,
We have an RKE cluster of 3 nodes (all roles on each node) provisioned via the vSphere cloud provider.
The cluster is for development, so all nodes live on a single physical host.
Today we decided to upgrade vSphere on the host, so we needed to shut down all VMs, including the k8s cluster nodes.
We powered the Rancher server and the cluster nodes back up, but the cluster didn't come up successfully. After some investigation it turned out to be a network issue: the DHCP lease time was 2 minutes (for a specific reason that is out of scope here), so the nodes' IP addresses got shuffled, and there was no connectivity because the SSL certificates of the services on each node don't have SANs for the new IPs of the adjacent nodes (if we understood the logs correctly).
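In case it's useful, this is roughly how we verified the SAN mismatch. The certificate path and naming follow the default RKE layout under /etc/kubernetes/ssl, and `<node-ip>` is a placeholder for the dashed node IP, so treat both as assumptions about a standard setup:

```
# Compare the node's current address with what the certs were issued for
ip -4 addr show

# Assumption: default RKE cert location/naming; adjust for your setup
openssl x509 -in /etc/kubernetes/ssl/kube-etcd-<node-ip>.pem -noout -text \
  | grep -A1 'Subject Alternative Name'
```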
But before we realized the cause of the instability, we triggered a cluster restore from the latest snapshot. After we fixed the network issue the cluster became almost healthy, but the etcd containers appear to be affected by the restore we initiated: etcd-Serve-backup and etcd-download-backup containers keep spawning over and over on the nodes, the cluster is stuck in the Updating state, and the web UI shows an alert saying that one of the etcd components is unhealthy.
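For diagnosis we've been watching the containers and etcd state on each node roughly like this (the `docker exec etcd etcdctl ...` calls assume RKE's etcd container ships etcdctl with its TLS environment preconfigured, which we believe is the default but haven't verified on every version):

```
# List restore-related containers that keep getting recreated
docker ps -a --filter "name=etcd" --format "table {{.Names}}\t{{.Status}}"

# See why the backup containers keep respawning
docker logs --tail 50 etcd-download-backup
docker logs --tail 50 etcd-Serve-backup

# Check cluster membership and health from inside the etcd container
docker exec etcd etcdctl member list
docker exec etcd etcdctl endpoint health
```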
Please suggest the correct way to stop the restore process, and advise whether that will be enough to get all etcd instances healthy again. Or would it be easier/faster to just recreate the cluster?
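For context, the restore was triggered from the Rancher web UI. We haven't tried anything via the standalone RKE CLI, and we're not sure it even applies to a Rancher-provisioned cluster, but for reference this is the kind of invocation we'd expect (the snapshot name is a placeholder):

```
# Assumption: only valid for clusters managed directly with the RKE CLI,
# not necessarily for Rancher-provisioned ones
rke etcd snapshot-restore --config cluster.yml --name <snapshot-name>
```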
TIA