Seeking Help with Kube-Proxy and Network Errors on HA RKE2 Cluster Deployment

Hello everyone,

I’m troubleshooting an issue with my RKE2 cluster setup in High Availability (HA) mode and would greatly appreciate your insights. Here’s an updated description of the problem:

Environment:

**Cluster Setup**:
  • RKE2 in HA mode with 3 master nodes and 3 agent nodes

  • 1 NGINX VM acting as a load balancer in front of the master nodes (a minimal sketch of its config is shown below).
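For context, the load balancer is meant as plain TCP (stream) passthrough of both the Kubernetes API port (6443) and the RKE2 supervisor/registration port (9345). The sketch below uses placeholder IPs and only illustrates that kind of setup, it is not my exact file:

```nginx
# Minimal sketch of a TCP (stream) load balancer for RKE2 HA.
# Placeholder IPs -- assumes both the Kubernetes API (6443) and the
# RKE2 supervisor/registration port (9345) are passed through to the masters.
stream {
    upstream rke2_apiserver {
        server 10.0.0.11:6443;   # master01
        server 10.0.0.12:6443;   # master02
        server 10.0.0.13:6443;   # master03
    }
    upstream rke2_supervisor {
        server 10.0.0.11:9345;
        server 10.0.0.12:9345;
        server 10.0.0.13:9345;
    }
    server {
        listen 6443;
        proxy_pass rke2_apiserver;
    }
    server {
        listen 9345;
        proxy_pass rke2_supervisor;
    }
}
```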

Current state:

All my master and worker nodes deploy successfully, join the cluster, and register against the load balancer using its IP:

master01   Ready    control-plane,etcd,master   16h   v1.29.12+rke2r1
master02   Ready    control-plane,etcd,master   15h   v1.29.12+rke2r1
master03   Ready    control-plane,etcd,master   15h   v1.29.12+rke2r1
worker01   Ready    <none>                      16h   v1.29.12+rke2r1
worker02   Ready    <none>                      16h   v1.29.12+rke2r1
worker03   Ready    <none>                      15h   v1.29.12+rke2r1
  • The cluster was deployed with the LabLabs RKE2 Ansible role (a sketch of the resulting per-node config is shown below).

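For reference, each node ends up with an /etc/rancher/rke2/config.yaml that points it at the load balancer; the sketch below uses representative placeholder values rather than my actual settings:

```yaml
# /etc/rancher/rke2/config.yaml -- representative sketch, not my exact file.
# On the joining servers (master02/03) and on all agents, "server" points at
# the load balancer's IP and the RKE2 supervisor port, not at any single master.
server: https://10.0.0.100:9345   # load balancer IP (placeholder)
token: <cluster-join-token>

# Server nodes only: the LB address must be in the API server certificate,
# otherwise clients connecting through the LB get TLS errors.
tls-san:
  - 10.0.0.100
```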
Issues:

  • Persistent kube-proxy errors across all master nodes

  • Network-related “no route to host” errors in the logs.

Logs from all Master Nodes:

  1. Kube-proxy errors:
E0121 16:37:29.216029       1 server.go:1039] "Failed to retrieve node info" err="Get \"https://127.0.0.1:6443/api/v1/nodes/master01\": dial tcp 127.0.0.1:6443: connect: connection refused"
E0121 16:37:30.319575       1 server.go:1039] "Failed to retrieve node info" err="Get \"https://127.0.0.1:6443/api/v1/nodes/master01\": dial tcp 127.0.0.1:6443: connect: connection refused"
  2. RKE2 server “no route to host” errors, seen in systemctl status rke2-server:
Jan 22 09:26:56 master01 rke2[752364]: time="2025-01-22T09:26:56+01:00" level=error msg="Sending HTTP/1.1 502 response to 127.0.0.1:52638: dial tcp 10.42.2.3:10250: connect: no route to host"
Jan 22 09:26:56 master01 rke2[752364]: time="2025-01-22T09:26:56+01:00" level=error msg="Sending HTTP/1.1 502 response to 127.0.0.1:52640: dial tcp 10.42.2.3:10250: connect: no route to host"
Jan 22 09:26:58 master01 rke2[752364]: time="2025-01-22T09:26:58+01:00" level=error msg="Sending HTTP/1.1 502 response to 127.0.0.1:52678: dial tcp 10.42.2.3:10250: connect: no route to host"

Observations:

  1. Connection issues across masters: the kube-proxy errors appear on all master nodes and show kube-proxy failing to reach the API server at 127.0.0.1:6443 (connection refused) during initialization.
  2. No route to host: all master nodes log “no route to host” errors when attempting to connect to 10.42.2.3:10250. Since 10.42.0.0/16 is the default RKE2 cluster (pod) CIDR, this appears to be pod/overlay-network traffic rather than a node IP; some diagnostics I’m looking at are listed below.
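The diagnostics in question, run on each master (standard tooling; nothing here is specific to my environment except the default pod CIDR):

```sh
# Is the local kube-apiserver actually listening on 6443 on this master?
ss -tlnp | grep 6443

# Are there routes for the pod CIDR (10.42.0.0/16 by default) via the CNI?
ip route | grep 10.42

# Is a host firewall in the way? (The RKE2 docs recommend disabling firewalld;
# Canal/Flannel VXLAN traffic between nodes uses UDP 8472.)
systemctl is-active firewalld

# Recent "no route to host" occurrences from the RKE2 supervisor
journalctl -u rke2-server --since "1 hour ago" | grep -i "no route to host"
```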

Questions:

  • In an HA RKE2 setup, is kube-proxy supposed to reach the API server through the load balancer, or is it expected to use the local endpoint (127.0.0.1:6443) that shows up in the logs?

  • Is it expected for kube-proxy to directly query other master nodes, or should all requests go through the load balancer in HA mode?

  • Should all inter-node communication (e.g., API requests) route through the load balancer in an HA deployment?

  • What could explain the “no route to host” errors? Could they be a consequence of kube-proxy failing? Isn’t kube-proxy’s role precisely to route traffic between node IPs and the RKE2 pod (overlay) network?

  • If direct communication between nodes is required, how should this be configured to avoid “no route to host” errors?

  • What are the key settings in the kube-proxy or RKE2 configuration files that need to be adjusted to ensure proper behavior in HA mode? (The settings I’m currently aware of are listed below.)
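For concreteness, these are the RKE2 config.yaml settings I’m aware of that touch kube-proxy and the cluster network. The values below are just the defaults as I understand them, shown to make the question concrete rather than as a proposed fix:

```yaml
# /etc/rancher/rke2/config.yaml (server nodes) -- defaults as I understand them.
cluster-cidr: 10.42.0.0/16     # pod network (where 10.42.2.3 lives)
service-cidr: 10.43.0.0/16
cni: canal                     # default CNI; inter-node VXLAN on UDP 8472

# Arbitrary flags can be passed straight through to kube-proxy if needed:
kube-proxy-arg:
  - proxy-mode=iptables
```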