I am attempting to deploy Rancher HA (v2.3.0) to 3 nodes on RancherOS (1.5.4) in an AWS VPC using certificates generated by our internal CA. We are using a layer 4 NLB to load balance access to Rancher HA. RKE deployment and Rancher installation proceed with no issues; however, the cattle-cluster-agent and cattle-node-agent pods go into crash loops. The only error message from the cluster-agent pods is:
INFO: Environment: CATTLE_ADDRESS=10.42.2.4 CATTLE_CA_CHECKSUM=64e928b78c0904a2093075305fad43749c91e33b391c651f2404361a4ecaf178 CATTLE_CLUSTER=true CATTLE_INTERNAL_ADDRESS= CATTLE_K8S_MANAGED=true CATTLE_NODE_NAME=cattle-cluster-agent-84bfd8c8f9-c6h9r CATTLE_SERVER=https://*redacted*.*redacted*.com
INFO: Using resolv.conf: nameserver 10.43.0.10 search cattle-system.svc.cluster.local svc.cluster.local cluster.local options ndots:5
ERROR: https://redacted.redacted.com/ping is not accessible (Failed to connect to redacted.redacted.com port 443: Connection timed out)
Node-agent pods show a similar error using the VPC name server. DNS resolution is successful since we have a Route53 resolver pointing to our internal DNS servers.
AWS security groups are set to allow all egress traffic from these nodes as well as ingress from the NLB over port 80/443 and all traffic from the nodes themselves.
Any ideas on why the connection is timing out? Apologies if the above formatting is not correct.
UPDATE: This also happens with Rancher v2.2.8
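A small diagnostic script, added here and not part of the post or of Rancher itself, that walks the same path the agent takes (resolve CATTLE_SERVER, TCP connect on 443, fetch /cacerts and /ping) so a timeout can be pinned to DNS, the load balancer path, or TLS. The host name in the sample line is a constructed placeholder, and the checksum step assumes CATTLE_CA_CHECKSUM is the sha256 of the /cacerts response.

```shell
#!/bin/sh
# Added diagnostic (not from the original post). Run it from a node and from
# inside a cattle-system pod to separate DNS, TCP and TLS failures behind the
# agent's "/ping is not accessible" error. Host names are placeholders.

# Pull the CATTLE_SERVER value out of an agent "INFO: Environment:" log line.
cattle_server_from_log() {
  printf '%s\n' "$1" | tr ' ' '\n' | sed -n 's/^CATTLE_SERVER=//p'
}

# Reduce https://host[:port]/... to "host port" (port defaults to 443).
server_host_port() {
  hp=$(printf '%s' "$1" | sed -e 's#^[a-z]*://##' -e 's#/.*$##')
  host=${hp%%:*}
  port=${hp#*:}
  [ "$port" = "$hp" ] && port=443
  printf '%s %s\n' "$host" "$port"
}

# Live checks in the order the agent would fail: resolve, connect, then TLS
# against the CA bundle the server publishes at /cacerts.
check_rancher_reachability() {
  set -- $(server_host_port "$1")
  host=$1; port=$2
  echo "DNS: $(getent hosts "$host" || echo "cannot resolve $host")"
  # 5 s connect timeout, certificate check disabled, to isolate the network
  # path; a failure here matches the post's "Connection timed out".
  if curl -sS -k --connect-timeout 5 -o /dev/null "https://$host:$port/ping"; then
    echo "TCP/TLS: connected to $host:$port"
  else
    echo "TCP: connect to $host:$port failed (curl rc=$?)"; return 1
  fi
  # Assumption: the agent compares CATTLE_CA_CHECKSUM with sha256(/cacerts).
  curl -sfk "https://$host:$port/cacerts" -o /tmp/cacerts.pem
  echo "CA sha256: $(sha256sum /tmp/cacerts.pem | awk '{print $1}') (compare to CATTLE_CA_CHECKSUM)"
  curl -sf --cacert /tmp/cacerts.pem "https://$host:$port/ping" && echo "ping OK with served CA"
}

# Run the live checks only when an agent log line is passed as the argument.
if [ $# -gt 0 ]; then
  check_rancher_reachability "$(cattle_server_from_log "$1")"
fi
```

If the node connects but the pod times out, the difference lies in the pod egress path to the NLB rather than in the security groups, which is the first thing to compare.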