After building a new HA Rancher v1.6.14 cluster and adding K8s hosts (plane isolation), we tried to roll (terminate) one of the Rancher server nodes in an ASG in order to update the AMI. The Rancher server came up fine, but the K8s nodes were listed in a disconnected state.
Even though the Rancher server nodes have new IP addresses, when checking the docker logs of the Rancher server container we see the "Cluster membership changed" message (which is expected). The K8s hosts remain disconnected in the GUI, even though it shows that some containers are running.
After rebooting the K8s hosts, they show up as connected. However, the K8s cluster is performing in a degraded state (kubectl timeouts). Interestingly, we noticed that rancher-kubernetes-agent remains in an unhealthy state, and we see the following logs from the Rancher server container.
2018-03-01 00:24:22,720 ERROR [:]    [TaskScheduler-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent  count 
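To see how often these ping failures occur, the Rancher server log can be filtered for this message. The following is a minimal sketch, assuming the log lines follow the same layout as the line above (a 23-character timestamp at the start of the line); the function name `ping_failure_times` is hypothetical and not part of Rancher.

```python
from datetime import datetime

# Message text copied from the Rancher server log line above.
PING_FAILURE = "Failed to get ping from agent"


def ping_failure_times(log_lines):
    """Return the timestamps of PingMonitorImpl ping-failure log lines.

    Assumes each line starts with a timestamp such as
    '2018-03-01 00:24:22,720' (23 characters), as in the log above.
    """
    times = []
    for line in log_lines:
        if "PingMonitorImpl" not in line or PING_FAILURE not in line:
            continue
        stamp = line[:23]
        try:
            times.append(datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f"))
        except ValueError:
            # Skip lines whose prefix is not a timestamp (e.g. wrapped output).
            continue
    return times


sample_log = [
    "2018-03-01 00:24:22,720 ERROR [:]    [TaskScheduler-1] "
    "[i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent  count ",
    "some unrelated log line",
]
failures = ping_failure_times(sample_log)
print(len(failures), "ping failure(s)")
```

Feeding it the output of `docker logs` for the Rancher server container shows whether the failures are continuous or clustered around the time the server node was replaced.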
From the K8s nodes after the new Rancher server host comes online:
What is more interesting is the docker ps -a listing:
CONTAINER ID   IMAGE                         COMMAND                  CREATED        STATUS        PORTS   NAMES
5e6b23bfbf34   rancher/k8s:v1.7.7-rancher1   "/usr/bin/entry.sh ku"   4 hours ago    Dead                  r-kubernetes-kubelet-3-c6e2b297
a0367533999e   rancher/k8s:v1.7.7-rancher1   "/usr/bin/entry.sh ku"   42 hours ago   Up 16 hours           r-kubernetes-proxy-8-b8736b50
b71180afff6b   rancher/healthcheck:v0.3.3    "/.r/r /rancher-entry"   42 hours ago   Up 42 hours           r-healthcheck-healthcheck-8-17e4d795
.....
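To spot containers like the Dead kubelet across several hosts, the listing can be checked programmatically. This is a minimal sketch, assuming the input comes from `docker ps -a --format '{{.Names}}\t{{.Status}}'` (one tab-separated name and status per line); the helper names `find_not_running` and `list_containers` are hypothetical. It only flags containers whose status does not start with "Up", so it will not detect a container that is running but failing Rancher health checks.

```python
import subprocess


def find_not_running(ps_lines):
    """Return (name, status) pairs for containers whose status is not 'Up ...'.

    Expects lines in the form '<name>\\t<status>'.
    """
    flagged = []
    for line in ps_lines:
        line = line.strip()
        if not line:
            continue
        name, _, status = line.partition("\t")
        if not status.startswith("Up"):
            flagged.append((name, status))
    return flagged


def list_containers():
    """Run docker ps -a on the local host and return its output lines."""
    result = subprocess.run(
        ["docker", "ps", "-a", "--format", "{{.Names}}\t{{.Status}}"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.splitlines()


# Sample data taken from the listing above, so the sketch runs without Docker.
sample = [
    "r-kubernetes-kubelet-3-c6e2b297\tDead",
    "r-kubernetes-proxy-8-b8736b50\tUp 16 hours",
    "r-healthcheck-healthcheck-8-17e4d795\tUp 42 hours",
]
flagged = find_not_running(sample)
for name, status in flagged:
    print(f"{name}: {status}")
```

On a real K8s host, `find_not_running(list_containers())` would replace the sample data.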
Afterward, I tested an in-place upgrade of the Rancher server (same --advertise-address IP) and node agents, following this documentation: http://rancher.com/docs/rancher/v1.6/en/upgrading/#multi-nodes. Again, I experienced a similar issue: I was able to get K8s running on the cluster, but with degraded performance.
However, after building a completely new Rancher server with a clean database, everything works as expected. I checked the AWS ELB, and we are using a classic ELB with no stickiness. So far I have been unable to do a complete in-place upgrade or upgrade the Rancher server nodes without affecting the stability of the cluster.