Rancher Not Using Internal Address for Container Traffic

I’m trying to figure out how to make Rancher use a specific subnet for all traffic that doesn’t need to be accessible outside of the K8s cluster. The servers my cluster runs on have 3 networks: 10.0.2.0/24 for management of the physical hosts, 10.0.3.0/24 for external Docker traffic, and 10.0.4.0/24 for internal Docker traffic. The internal network is 10 Gb and the external network is 1 Gb, so I want all the internal traffic using the faster network. This is especially important for things like Longhorn that need to move a lot of data.

My nodes are just Docker on bare metal, started with a command like this, and when I do a tcpdump against the 1 Gb interface I see all of the traffic there.

Note that rancher-private resolves to a 10.0.4.0/24 address, and I do see traffic on port 6443 between the Rancher agents on the 10 Gb network, but nothing else.

sudo docker run -d --privileged --restart=unless-stopped --net=host -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run rancher/rancher-agent:v2.3.2 --server https://rancher-private.dev.example.com --token zf4bkvvjkn4q5547gkgc6x8bd5nnl47zthl6t5lmthv7gs4h5q6qzz --ca-checksum 75e28964c7f30bfbb2e3e30e458b557c3d6197664159767356b486a428893c00 --address 10.0.3.11 --internal-address 10.0.4.11 --worker
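
A quick way to confirm where the overlay traffic is actually flowing (the interface names and the second node IP below are just placeholders for my hosts; Canal/flannel uses UDP 8472 for VXLAN by default):

# Watch the 1 Gb NIC (replace eno1 with yours) for VXLAN overlay traffic
sudo tcpdump -ni eno1 udp port 8472

# Watch the 10 Gb NIC the same way; ideally the overlay traffic shows up here instead
sudo tcpdump -ni eno2 udp port 8472

# Longhorn replica traffic between two nodes can be spotted by filtering on the node IPs
sudo tcpdump -ni eno1 host 10.0.3.11 and host 10.0.3.12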


Pretty much the same issue here.

When using --internal-address and --address, it should route properly. This sounds like a bug. If you can confirm the behavior, please open an issue at https://github.com/rancher/rancher/issues/new

@JasonvanBrackel

Nr. 1 is the desired IP address, which is set in the command starting the worker node, as shown below.
Nr. 2 is the other IP address that I do not want to be part of Rancher at all. The necessary setup for the network traffic is done outside of Rancher. All Rancher needs to do is use the first, desired IP address, which is 10.1.1.2 in the example below. It should ignore the other address entirely, and it should not show up in any YAMLs or anywhere within Rancher (which it currently does).

Starting worker node with:

sudo docker run -d --privileged --restart=unless-stopped --net=host \
-v /etc/kubernetes:/etc/kubernetes \
-v /var/run:/var/run rancher/rancher-agent:v2.3.3 \
--server https://10.1.1.1:8443 --address 10.1.1.2 --internal-address 10.1.1.2 \
--token theactualtoken --ca-checksum thcachecksum --worker
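
As a sanity check after the node joins, this shows which addresses Kubernetes actually registered for it (the node name below is just an example):

kubectl get nodes -o wide

kubectl get node worker-1 -o jsonpath='{.status.addresses}'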

It’s been a while, but I’m finally back to testing this. My issue is where the traffic is flowing, not which IP Rancher is accessible on. When I install something like Longhorn, all of the Longhorn traffic flows over the 10.0.3.0 network instead of the 10.0.4.0 network. That’s a problem because 10.0.3.0 is on 1 Gb and 10.0.4.0 is on 10 Gb, making Longhorn perform really slowly. All of this behavior has been confirmed using tcpdump. I’ve submitted https://github.com/rancher/rancher/issues/27109 for the issue.

Also need to note that the 10 Gb network does not have internet access, so I can’t just run everything over that network.

For anyone else needing to control which adapter the overlay network runs on, here is the config for your RKE cluster file if you are using Canal, the default CNI.

network:
  plugin: canal
  options:
    canal_iface: eth1
    canal_flannel_backend_type: vxlan
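
A rough way to verify the setting took effect, assuming the default VXLAN backend and RKE’s canal daemonset: the flannel.1 device on each node should be bound to the 10 Gb interface.

# "local" and "dev" in the output should show the 10 Gb interface and its address
ip -d link show flannel.1

# Check that the interface option actually reached the canal pods
kubectl -n kube-system describe daemonset canal | grep -i iface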

In this example eth1 is your 10 Gb network, right? And are services such as MetalLB still able to use the 1 Gb network?

Sorry, but no, this does not work for me. I’m using some bare-metal nodes from Hetzner Cloud which are connected via a private network (10.1.0.0/16), and I can add nodes just fine with:

# Control node
docker run -d --privileged --restart=unless-stopped --net=host -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run  rancher/rancher-agent:v2.5.7 --server https://my.rancher.control.node --token [REDACTED] --address ens10 --internal-address ens10 --etcd --controlplane --worker
# Worker node
docker run -d --privileged --restart=unless-stopped --net=host -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run  rancher/rancher-agent:v2.5.7 --server https://my.rancher.control.node --token [REDACTED] --address ens10 --internal-address ens10 --worker

I tried applying the configuration from this GitHub issue, but no success either. When I deploy anything, I can, for instance, only see logs from pods running on one node, but not from the other. The timeout error message from kubectl shows the public IP of the node, which is no surprise as I block access to that port via firewall rules.

This is the relevant part of my cluster config:

  network:
    canal_network_provider:
      iface: ens10
    flannel_network_provider:
      iface: ens10
    mtu: 0
    options:
      canal_iface: ens10
      flannel_backend_type: vxlan
      flannel_iface: ens10
    plugin: canal

ens10 is the name of the virtual(?) network interface that is connected to the 10.1.0.0/16 network.
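
For the kubectl logs timeout specifically: logs and exec go from the API server to the kubelet on the node’s registered address, port 10250, so if the node registered its public IP and that port is firewalled, logs from that node will time out. A rough check (the node name and 10.1.0.3 are just examples from my setup):

kubectl get node worker-1 -o jsonpath='{.status.addresses}'

# From the control node, test whether the kubelet port is reachable on the private address
nc -zv 10.1.0.3 10250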


It seems the trick is to utilize TWO private networks and not rely on Canal as the CNI. I’ve had success with Weave as a CNI, described in detail in this GitHub issue.
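
In case it saves someone the click-through, the RKE network section for Weave looks roughly like this (the password is optional and the value below is just a placeholder; check the RKE docs for the full set of options):

network:
  plugin: weave
  weave_network_provider:
    password: "example-weave-password"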
