We recently upgraded Rancher from v2.0.0 to v2.0.8 (stable); the installation contains 3 clusters. One particular cluster faceplanted halfway through because the node went OOM. We were able to get it past the Rancher upgrade after adding memory and temporary swap. The etcd/control nodes, however, still had a massive number of old v2.0.0 agents running.
When we change a cluster setting, the cluster goes into the "This cluster is currently Updating" state and then stops with "Cluster must have at least one etcd plane host". We can see the following in the Rancher logs:
2018/08/30 11:44:26 [ERROR] cluster [c-kf8xn] provisioning: Removing host [172.16.184.45] from node lists
2018/08/30 11:44:26 [ERROR] cluster [c-kf8xn] provisioning: Removing host [172.16.184.46] from node lists
2018/08/30 11:44:26 [ERROR] cluster [c-kf8xn] provisioning: Removing host [172.16.184.44] from node lists
We have 3 nodes with the etcd and controlplane roles, but we no longer see any active rancher-agents on any of the control/etcd nodes:
[###~]$ kubectl get nodes
NAME        STATUS   ROLES               AGE    VERSION
REDACTED1   Ready    controlplane,etcd   107d   v1.10.1
REDACTED2   Ready    controlplane,etcd   107d   v1.10.1
REDACTED3   Ready    controlplane,etcd   107d   v1.10.1
root@REDACTED1~# docker ps -a | grep agent
2638b40fa8e2 rancher/rancher-agent:v2.0.0 "run.sh -- share-r..." 3 months ago Exited (137) 3 hours ago share-mnt
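In case it's useful for diagnosing this, the exited container's original command and environment can still be read with `docker inspect` (container ID taken from the `docker ps -a` output above); a sketch, assuming the container hasn't been removed:

```shell
# Show the command the exited v2.0.0 agent was started with.
docker inspect --format '{{.Path}} {{join .Args " "}}' 2638b40fa8e2

# Dump its environment (registration URL, token, etc.) for reference.
docker inspect --format '{{range .Config.Env}}{{println .}}{{end}}' 2638b40fa8e2
```

That at least preserves the original registration parameters before attempting anything destructive.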
I cannot spin up the old rancher-agent:v2.0.0 container anymore on any of the nodes.
Can I manually start a rancher-agent:v2.0.8 container on these three nodes to get the cluster to see them again, or do I have to try to recover the exited v2.0.0 one?