After building a new HA Rancher v1.6.14 cluster and adding K8s hosts (with plane isolation), we rolled (terminated) one of the Rancher server nodes in an ASG in order to update the AMI. The replacement Rancher server came up fine, but the K8s nodes were listed in a disconnected state.
The replacement Rancher server node has a new IP address, and when checking the docker logs
of the Rancher server container we see the expected "Cluster membership changed"
message. However, the K8s hosts remain disconnected in the GUI, even though it shows some containers running on them.
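In case it helps anyone reproduce this, here is roughly how we checked for that message (a minimal sketch; rancher-server is a placeholder for whatever your server container is actually named):

# Find the Rancher server container, then grep its logs for the membership change
docker ps --filter "ancestor=rancher/server"
docker logs rancher-server 2>&1 | grep -i "cluster membership changed"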
After rebooting the K8s hosts, they show up as connected. However, the K8s cluster performs in a degraded state (kubectl
timeouts). Interestingly, we noticed that rancher-kubernetes-agent
remains in an unhealthy state, and we see the following log entries from the Rancher server container:
2018-03-01 00:24:22,720 ERROR [:] [] [] [] [TaskScheduler-1] [i.c.p.a.s.ping.impl.PingMonitorImpl ] Failed to get ping from agent [2] count [3]
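To confirm the unhealthy agent from the host side, we ran checks along these lines (container names are assumed from Rancher's defaults; the kubectl calls are where we see the timeouts mentioned above):

# On each K8s host: agent container state and recent logs
docker ps -a --filter "name=rancher-agent"
docker logs --tail 50 rancher-agent

# From a workstation with kubeconfig pointed at the cluster
kubectl get nodes
kubectl get componentstatuses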
On the K8s nodes, after the new Rancher server host comes online, what is more interesting is the docker ps -a
output:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
5e6b23bfbf34 rancher/k8s:v1.7.7-rancher1 "/usr/bin/entry.sh ku" 4 hours ago Dead r-kubernetes-kubelet-3-c6e2b297
a0367533999e rancher/k8s:v1.7.7-rancher1 "/usr/bin/entry.sh ku" 42 hours ago Up 16 hours r-kubernetes-proxy-8-b8736b50
b71180afff6b rancher/healthcheck:v0.3.3 "/.r/r /rancher-entry" 42 hours ago Up 42 hours r-healthcheck-healthcheck-8-17e4d795
.....
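As a workaround we tried clearing the dead kubelet container so the Rancher agent could reschedule it (no guarantee this is the right fix; the container ID is taken from the listing above):

# List dead containers, then remove the dead kubelet so it can be recreated
docker ps -a --filter "status=dead"
docker rm 5e6b23bfbf34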
I then tested an in-place upgrade of the Rancher server (same --advertise-address
IP) and node agents, following this documentation: http://rancher.com/docs/rancher/v1.6/en/upgrading/#multi-nodes. Again, I hit a similar issue: I was able to get K8s running on the cluster, but it suffered degraded performance.
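For reference, the per-node upgrade sequence looked roughly like this (a sketch based on that doc; the target version, container name, and DB host/credentials are placeholders for our actual values):

# On each Rancher server node, one at a time
docker stop rancher-server
docker pull rancher/server:v1.6.15
docker run -d --name rancher-server --restart=unless-stopped \
  -p 8080:8080 -p 9345:9345 \
  rancher/server:v1.6.15 \
  --db-host mysql.example.com --db-port 3306 \
  --db-user cattle --db-pass <password> --db-name cattle \
  --advertise-address <node-ip>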
However, after building a completely new Rancher server with a clean database, everything works as expected. I checked the AWS ELB and we are using a Classic Load Balancer with no stickiness. So far I have been unable to do a complete in-place upgrade, or to replace the Rancher server nodes, without affecting the stability of the cluster.
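This is how we verified the stickiness setting (the ELB name is a placeholder):

# Classic ELB: an empty PolicyNames list means no stickiness policy is attached
aws elb describe-load-balancers --load-balancer-names rancher-elb \
  --query 'LoadBalancerDescriptions[].ListenerDescriptions[].PolicyNames'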