Cross Host Networking Issues

OK, so I have a Cattle cluster going, and for some reason I cannot connect to my application across the managed network. I have my service container, which we’ll call service, and my worker containers, which we’ll call worker1, worker2, etc., spread across my three hosts.
When the service and a worker are on the same host, I can ping the service from the worker and connect to my TCP port as expected. If the worker and service containers are on different hosts, but still on the managed network, I lose all connectivity between the two, whilst the ones on the same host continue to work.
As a test, I moved the service container to the host network. I can now ping it from workers on other hosts, but my TCP connection still fails.
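
For anyone wanting to reproduce the "ping works, TCP doesn't" check from inside a worker container, here is a minimal sketch in Python. The function name `can_connect` is just something made up for illustration, and the host name and port you pass in are whatever your own service uses; nothing here is Rancher-specific.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout.

    Any socket-level failure (refused, timed out, name not resolved)
    is reported as False rather than raised.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling something like `can_connect("service", 8080)` from a worker on the same host and then from a worker on a different host (with "service" and 8080 replaced by your real name and port) should show the same split described above.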

Docker version: Docker version 1.12.6, build 78d1802
Rancher Version: 1.5.2
Rancher Agent: 1.2.1 <==Not sure why this didn’t upgrade…

Now the fun part: I converted the environment to Swarm mode and everything works just as I would expect, but when I go back to Cattle, things break in exactly the same way. The docs say: “All containers in the managed network are able to communicate with each other regardless of which host the container was deployed on.” So I expect my workers to be able to talk to my service across the managed network without issue.

The service and worker are two services in the same stack, and I made sure to check that there are no network policy restrictions in place, so this should be working. Not sure if it helps, but this setup was upgraded from previous Rancher releases, with 1.2.1 being the most recent. I’m not completely sure whether something went sideways during that upgrade, and whether that’s why I’m on the 1.2.1 Rancher agent.

I’m open to suggestions, maybe even moving to VXLAN instead of IPsec; I’m not above that.

Each image has its own version; 1.2.1 is the correct Agent for Rancher 1.5.1.

For IPsec, each host connects to every other host as needed. So each host needs:

  • A unique IP address, different from every other host (if two hosts have the same IP, something is wrong)
  • To be reachable from every other host at its registered IP (as shown in the UI), over 500/udp and 4500/udp
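
To make that checklist concrete, here is a rough sketch in Python that takes the registered IPs as you would read them off the UI, flags duplicates, and lists every host-to-host UDP path that has to be open. The function names and the example addresses are made up for illustration; this only builds the list of things to verify and does not probe the network itself.

```python
from collections import Counter
from itertools import permutations

# UDP ports the IPsec overlay needs between hosts (from the list above).
IPSEC_UDP_PORTS = (500, 4500)


def duplicate_ips(hosts):
    """Return registered IPs that are used by more than one host, sorted."""
    counts = Counter(hosts.values())
    return sorted(ip for ip, n in counts.items() if n > 1)


def required_paths(hosts, ports=IPSEC_UDP_PORTS):
    """List every (source host, destination IP, UDP port) that must be reachable.

    Every host must reach every other host, so this is the full mesh
    in both directions, once per port.
    """
    return [
        (src, hosts[dst], port)
        for src, dst in permutations(sorted(hosts), 2)
        for port in ports
    ]
```

With three hosts, for example `{"host1": "10.0.0.1", "host2": "10.0.0.2", "host3": "10.0.0.3"}`, that comes to 12 paths to check with whatever tool you trust for UDP (a packet capture on the receiving host is usually the most reliable).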

OK, I wasn’t sure whether the Rancher agent version would match the server version. Maybe some type of compatibility matrix could be published?

As for the unique IPs: yes, these hosts are on a regular Ethernet network, so there are no duplicate IPs here. Also, I can ping between the hosts; it’s only the TCP connections that are failing me. Given that, I have to assume that the ports aren’t blocked and the IPs are reachable.

On the off chance that something in my network has changed, since that’s not in my control, I went ahead and built a new environment using VXLAN instead of IPsec and moved my hosts over to it, and what do ya know, everything’s working exactly as expected. So the only thing I can figure is that either something in the IPsec transport is goofed with the latest update, or something in my network, which is Cisco Nexus based, is breaking parts of IPsec. Considering how old and, quite frankly, deprecated IPsec is, that is most likely the root cause, and since my company has decided to move to Cisco ACI, which is entirely VXLAN based, it makes sense for me to make the switch.

Coincidentally, I no longer have services like health check or scheduler getting stuck in the Initializing state, which I figure was due to the managed network issues I was having. Also, the load balancer service that is part of my application stack is working very well, so I’m going to dig back into keepalived to provide a VIP for the load-balanced solution.

@tibmeister If you run into a problem next time, please share the logs of the network-manager, ipsec-router, and ipsec-cni containers.
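
One way to gather those logs on each host is sketched below in Python, using the standard `docker ps` and `docker logs` commands. It assumes the infrastructure containers have names that contain the three strings above, which may not hold on every setup, so check with `docker ps` first; the function names and the `.log` file naming are made up for illustration.

```python
import subprocess

# Assumption: the relevant container names contain these substrings.
COMPONENTS = ("network-manager", "ipsec-router", "ipsec-cni")


def find_containers(component):
    """Return names of running containers whose name contains `component`."""
    out = subprocess.run(
        ["docker", "ps", "--filter", f"name={component}", "--format", "{{.Names}}"],
        check=True, capture_output=True, text=True,
    ).stdout
    return out.split()


def logs_command(container, tail=500):
    """Build the docker CLI call that prints the last `tail` log lines."""
    return ["docker", "logs", "--tail", str(tail), container]


def collect_logs(components=COMPONENTS, tail=500):
    """Write the recent logs of each matching container to <name>.log."""
    for component in components:
        for name in find_containers(component):
            result = subprocess.run(
                logs_command(name, tail), capture_output=True, text=True
            )
            # docker logs replays the container's stdout and stderr separately.
            with open(f"{name}.log", "w") as fh:
                fh.write(result.stdout + result.stderr)
```

Running `collect_logs()` on every host in the environment gives one file per container that can be attached to the thread.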