After building a new HA Rancher v1.6.14 cluster and adding K8s hosts (with plane isolation), we rolled (terminated) one of the Rancher server nodes in an ASG in order to update the AMI. The replacement Rancher server came up fine, but the K8s nodes were listed in a disconnected state.
The replacement Rancher server node has a new IP address, and when checking the docker logs
of the Rancher server container we see the expected "Cluster membership changed"
message. However, the K8s hosts remain disconnected in the GUI, even though it shows some containers running on them.
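In case it helps anyone reproduce this, here is roughly how we checked for that message (a minimal sketch; rancher-server is a placeholder for whatever your server container is actually named):

# Find the Rancher server container, then grep its logs for the membership change
docker ps --filter "ancestor=rancher/server"
docker logs rancher-server 2>&1 | grep -i "cluster membership changed"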
After rebooting the K8s hosts, they show up as connected. However, the K8s cluster performs in a degraded state (kubectl
timeouts). Interestingly, we noticed that rancher-kubernetes-agent
remains in an unhealthy state, and we see the following log entries from the Rancher server container:
2018-03-01 00:24:22,720 ERROR [:] [] [] [] [TaskScheduler-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [2] count [3]
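To confirm the unhealthy agent from the host side, we ran checks along these lines (container names are assumed from Rancher's defaults; the kubectl calls are where we see the timeouts mentioned above):

# On each K8s host: agent container state and recent logs
docker ps -a --filter "name=rancher-agent"
docker logs --tail 50 rancher-agent

# From a workstation with kubeconfig pointed at the cluster
kubectl get nodes
kubectl get componentstatuses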
On the K8s nodes, after the new Rancher server host comes online, what is more interesting is the docker ps -a
output:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
5e6b23bfbf34 rancher/k8s:v1.7.7-rancher1 "/usr/bin/entry.sh ku" 4 hours ago Dead r-kubernetes-kubelet-3-c6e2b297
a0367533999e rancher/k8s:v1.7.7-rancher1 "/usr/bin/entry.sh ku" 42 hours ago Up 16 hours r-kubernetes-proxy-8-b8736b50
b71180afff6b rancher/healthcheck:v0.3.3 "/.r/r /rancher-entry" 42 hours ago Up 42 hours r-healthcheck-healthcheck-8-17e4d795
.....
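As a workaround we tried clearing the dead kubelet container so the Rancher agent could reschedule it (no guarantee this is the right fix; the container ID is taken from the listing above):

# List dead containers, then remove the dead kubelet so it can be recreated
docker ps -a --filter "status=dead"
docker rm 5e6b23bfbf34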
I then tested an in-place upgrade of the Rancher server (same --advertise-address
IP) and node agents, following this documentation: http://rancher.com/docs/rancher/v1.6/en/upgrading/#multi-nodes. Again, I hit a similar issue: I was able to get K8s running on the cluster, but it suffered degraded performance.
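For reference, the per-node upgrade sequence looked roughly like this (a sketch based on that doc; the target version, container name, and DB host/credentials are placeholders for our actual values):

# On each Rancher server node, one at a time
docker stop rancher-server
docker pull rancher/server:v1.6.15
docker run -d --name rancher-server --restart=unless-stopped \
  -p 8080:8080 -p 9345:9345 \
  rancher/server:v1.6.15 \
  --db-host mysql.example.com --db-port 3306 \
  --db-user cattle --db-pass <password> --db-name cattle \
  --advertise-address <node-ip>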
However, after building a completely new Rancher server with a clean database, everything works as expected. I checked the AWS ELB and we are using a Classic Load Balancer with no stickiness. So far I have been unable to do a complete in-place upgrade, or to replace the Rancher server nodes, without affecting the stability of the cluster.
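This is how we verified the stickiness setting (the ELB name is a placeholder):

# Classic ELB: an empty PolicyNames list means no stickiness policy is attached
aws elb describe-load-balancers --load-balancer-names rancher-elb \
  --query 'LoadBalancerDescriptions[].ListenerDescriptions[].PolicyNames'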