Hello everyone,
I’m troubleshooting an issue with my RKE2 cluster setup in High Availability (HA) mode and would greatly appreciate your insights. Here’s an updated description of the problem:
Environment:
**Cluster Setup**:
- RKE2 in HA mode with 3 master nodes and 3 agent nodes
- 1 NGINX VM acting as a load balancer
Current state:
All master and worker nodes deploy successfully, join the cluster, and register against the load balancer using its IP.
master01 Ready control-plane,etcd,master 16h v1.29.12+rke2r1
master02 Ready control-plane,etcd,master 15h v1.29.12+rke2r1
master03 Ready control-plane,etcd,master 15h v1.29.12+rke2r1
worker01 Ready <none> 16h v1.29.12+rke2r1
worker02 Ready <none> 16h v1.29.12+rke2r1
worker03 Ready <none> 15h v1.29.12+rke2r1
- The cluster was deployed with the Ansible role: Ansible LabLabs rke2.
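For reference, in this kind of HA setup the additional servers join through the load balancer via /etc/rancher/rke2/config.yaml. A minimal sketch of that join configuration (placeholder values, not the actual role output; 9345 is RKE2's supervisor/registration port):

```yaml
# /etc/rancher/rke2/config.yaml on master02/master03 (sketch only;
# <lb-ip> and <cluster-token> are placeholders, not real values)
server: https://<lb-ip>:9345   # registration goes through the NGINX LB
token: <cluster-token>
tls-san:
  - <lb-ip>                    # so the API server cert is valid for the LB address
```

Agents use the same `server:`/`token:` shape, minus the `tls-san` entry.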
Issues:
- Persistent kube-proxy errors across all master nodes.
- Network-related “no route to host” errors in the logs.
Logs from all master nodes:
- Kube-proxy errors:
E0121 16:37:29.216029 1 server.go:1039] "Failed to retrieve node info" err="Get \"https://127.0.0.1:6443/api/v1/nodes/master01\": dial tcp 127.0.0.1:6443: connect: connection refused"
E0121 16:37:30.319575 1 server.go:1039] "Failed to retrieve node info" err="Get \"https://127.0.0.1:6443/api/v1/nodes/master01\": dial tcp 127.0.0.1:6443: connect: connection refused"
- RKE2 server “no route to host” errors, from systemctl status rke2-server:
Jan 22 09:26:56 master01 rke2[752364]: time="2025-01-22T09:26:56+01:00" level=error msg="Sending HTTP/1.1 502 response to 127.0.0.1:52638: dial tcp 10.42.2.3:10250: connect: no route to host"
Jan 22 09:26:56 master01 rke2[752364]: time="2025-01-22T09:26:56+01:00" level=error msg="Sending HTTP/1.1 502 response to 127.0.0.1:52640: dial tcp 10.42.2.3:10250: connect: no route to host"
Jan 22 09:26:58 master01 rke2[752364]: time="2025-01-22T09:26:58+01:00" level=error msg="Sending HTTP/1.1 502 response to 127.0.0.1:52678: dial tcp 10.42.2.3:10250: connect: no route to host"
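For context when reading these addresses: 10.42.0.0/16 is RKE2's default cluster-cidr (the pod overlay network), so the unreachable 10.42.2.3 is a pod-network address rather than a node address. A small self-contained sketch to check which range an IP falls in (the CIDRs below are RKE2 defaults and may differ if the deployment overrides them):

```shell
#!/usr/bin/env bash
# Check whether an IPv4 address falls inside a CIDR block.
# 10.42.0.0/16 = RKE2 default cluster-cidr (pods), 10.43.0.0/16 = default service-cidr.

ip_to_int() {
  # Convert dotted-quad to a 32-bit integer.
  local IFS=.
  set -- $1
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

in_cidr() {
  # in_cidr <ip> <cidr> -> prints whether <ip> is inside <cidr>.
  local net bits mask
  net=${2%/*}
  bits=${2#*/}
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  if [ $(( $(ip_to_int "$1") & mask )) -eq $(( $(ip_to_int "$net") & mask )) ]; then
    echo "$1 is inside $2"
  else
    echo "$1 is outside $2"
  fi
}

in_cidr 10.42.2.3 10.42.0.0/16   # the address from the 502 errors -> inside the pod CIDR
in_cidr 10.43.0.1 10.42.0.0/16   # a service-CIDR address -> outside the pod CIDR
```

Since the failing destination is a pod IP, the “no route to host” is about reaching the CNI overlay, not the physical node network.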
Observations:
- Connection issues across masters: the kube-proxy errors appear on all master nodes, indicating trouble connecting to the API server (127.0.0.1:6443) during initialization.
- No route to host: all master nodes log “no route to host” errors when attempting to connect to a specific in-cluster IP (10.42.2.3:10250).
Questions:
- In an HA RKE2 setup, should kube-proxy requests always route through the load balancer when accessing the API server (127.0.0.1:6443)?
- Is it expected for kube-proxy to query other master nodes directly, or should all requests go through the load balancer in HA mode?
- Should all inter-node communication (e.g., API requests) route through the load balancer in an HA deployment?
- What could be causing the “no route to host” errors? Is it because kube-proxy is failing? Isn’t the role of kube-proxy to route between node IPs and the RKE2 pod virtual network?
- If direct communication between nodes is required, how should it be configured to avoid “no route to host” errors?
- What are the key settings in the kube-proxy or RKE2 configuration files that need to be adjusted to ensure proper behavior in HA mode?
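On the last question, for anyone comparing configurations: RKE2 exposes kube-proxy and network settings through /etc/rancher/rke2/config.yaml rather than a separate kube-proxy config file. A sketch of the relevant keys (the values shown are RKE2 defaults or illustrative assumptions, not recommendations for this cluster):

```yaml
# /etc/rancher/rke2/config.yaml (server nodes) — shape only, values illustrative
cluster-cidr: 10.42.0.0/16     # RKE2 default pod network
service-cidr: 10.43.0.0/16     # RKE2 default service network
kube-proxy-arg:
  - "proxy-mode=iptables"      # assumption: the default mode, shown for illustration
```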