We recently upgraded Rancher from v2.0.0 to v2.0.8 (stable); the installation contains 3 clusters. One particular cluster faceplanted halfway through because the node went OOM. We were able to get it past the Rancher upgrade after adding memory and temporary swap. The etcd/control nodes, however, still had a massive number of old v2.0.0 agents running.
When we change a cluster setting, the cluster goes into the "This cluster is currently Updating" state and then stops with "Cluster must have at least one etcd plane host". We can see the following in the Rancher logs:
2018/08/30 11:44:26 [ERROR] cluster [c-kf8xn] provisioning: Removing host [172.16.184.45] from node lists
2018/08/30 11:44:26 [ERROR] cluster [c-kf8xn] provisioning: Removing host [172.16.184.46] from node lists
2018/08/30 11:44:26 [ERROR] cluster [c-kf8xn] provisioning: Removing host [172.16.184.44] from node lists
We have 3 nodes with the etcd and controlplane roles, but we no longer see any active rancher-agents on any of the control/etcd nodes:
[###~]$ kubectl get nodes
NAME        STATUS   ROLES               AGE    VERSION
REDACTED1   Ready    controlplane,etcd   107d   v1.10.1
REDACTED2   Ready    controlplane,etcd   107d   v1.10.1
REDACTED3   Ready    controlplane,etcd   107d   v1.10.1
root@REDACTED1~# docker ps -a | grep agent
2638b40fa8e2 rancher/rancher-agent:v2.0.0 "run.sh -- share-r..." 3 months ago Exited (137) 3 hours ago share-mnt
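In case it's useful for diagnosing this, the exited container's original command and environment can still be read with `docker inspect` (container ID taken from the `docker ps -a` output above); a sketch, assuming the container hasn't been removed:

```shell
# Show the command the exited v2.0.0 agent was started with.
docker inspect --format '{{.Path}} {{join .Args " "}}' 2638b40fa8e2

# Dump its environment (registration URL, token, etc.) for reference.
docker inspect --format '{{range .Config.Env}}{{println .}}{{end}}' 2638b40fa8e2
```

That at least preserves the original registration parameters before attempting anything destructive.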
I cannot spin up the old rancher-agent:v2.0.0 container anymore on any of the nodes.
Can I manually start a rancher-agent:v2.0.8 container on these three nodes to get the cluster to see them again, or do I have to try to recover the exited v2.0.0 one?