After restarting one of the server nodes in HA AWS EC2 Cluster, it loses the "All" Role and the nginx ingress controller

Sarma_K · December 3, 2018, 9:16am

Hi,

May I know if there are any known issues with AWS HA Cluster related to restart on “All” role node.

I am getting into following issues:

I have an HA cluster (with 3 nodes playing “All” role) and is front-ended with an NGINX Load balancer. Now when I restarted one of the nodes, to test HA, I found the following things,

The node lost the role “All”. It shows the node as when checked from kubectl. (shows as worker on the UI). I checked the labels and the labels are not present.
The nginx controller pod is not running on this node.

I see the following error messages on the cattle-agent pod:
kubectl logs -f cattle-node-agent-hbj7p -n cattle-system
INFO: Environment: CATTLE_ADDRESS=XXXX CATTLE_AGENT_CONNECT=true CATTLE_CA_CHECKSUM=0f670bc85b079944ff18a99ec5f2b6125c4b2b798348fe77d724df89e7c723f7 CATTLE_CLUSTER=false CATTLE_INTERNAL_ADDRESS= CATTLE_K8S_MANAGED=true CATTLE_NODE_NAME=XXXXX CATTLE_SERVER=XXXXXX
INFO: Using resolv.conf: nameserver 172.28.0.2 search xxxxx.compute.internal
INFO: XXXXXX/ping is accessible
INFO: Value from XXXXXXX/v3/settings/cacerts is an x509 certificate
time=“2018-12-03T08:25:55Z” level=info msg=“Rancher agent version v2.1.1 is starting”
time=“2018-12-03T08:25:55Z” level=info msg=“Option customConfig=map[roles:[] label:map[] address:172.28.0.149 internalAddress:]”
time=“2018-12-03T08:25:55Z” level=info msg=“Option etcd=false”
time=“2018-12-03T08:25:55Z” level=info msg=“Option controlPlane=false”
time=“2018-12-03T08:25:55Z” level=info msg=“Option worker=false”
time=“2018-12-03T08:25:55Z” level=info msg=“Option requestedHostname=ip-172-28-0-149.XXXXX”
time=“2018-12-03T08:25:55Z” level=info msg=“Listening on /tmp/log.sock”
time=“2018-12-03T08:25:55Z” level=info msg=“Connecting to wss:///v3/connect with token 25kb9fwwmqprnkqbz82fklzrt6mpsjsj7bgpjhbx8bn5rdrp22j9gq”
time=“2018-12-03T08:25:55Z” level=info msg=“Connecting to proxy” url=“wss:///v3/connect”
time=“2018-12-03T08:25:55Z” level=error msg=“Failed to connect to proxy” error=“dial tcp :443: connect: connection refused”

Topic		Replies	Views
HA-Cluster not reliable [solved] Rancher 2.x	1	1841	August 17, 2018
Cattle-pods failing Rancher 2.x	2	1652	October 25, 2019
[SOLVED] HA failover not working Rancher 2.x	2	2222	May 22, 2018
Cluster api access is stuck on a missing node Rancher 2.x	4	1044	April 15, 2022
HA - UI failure with single etcd node reboot Rancher 2.x	7	951	November 19, 2019

After restarting one of the server nodes in HA AWS EC2 Cluster, it loses the "All" Role and the nginx ingress controller

Related Topics