Cross-host intercontainer communication trouble

Thanks @vincent,

So if I want them to communicate, I have to switch the network in my other data center to something in 192.168?

That sounds odd to me; I thought the agent was doing something like local NAT translation.

More like they need public IP addresses or a VPN between networks. Just switching the IP subnet won’t help. Each host needs to be able to communicate (on UDP ports 500 and 4500, but you can just consider ping for now) with the registered IP (displayed in the host box in the UI) for every other host for the full overlay network to work.
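As a rough illustration of the point above, here is a minimal Python sketch that flags host pairs whose registered IPs are unlikely to reach each other directly. The host names, site names and addresses are made up for the example, and "flagged" only means "needs a VPN or a public IP"; it does not test real connectivity.

```python
from ipaddress import ip_address, ip_network
from itertools import combinations

# RFC 1918 private ranges: not routable across the internet without a VPN or NAT.
RFC1918 = [ip_network(n) for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]


def is_rfc1918(ip):
    """True if the address falls inside one of the RFC 1918 private ranges."""
    addr = ip_address(ip)
    return any(addr in net for net in RFC1918)


def suspect_pairs(hosts):
    """hosts maps a host name to (site, registered_ip).

    Returns the pairs that sit in different sites while at least one side
    is registered with a private address.
    """
    flagged = []
    for (a, (site_a, ip_a)), (b, (site_b, ip_b)) in combinations(sorted(hosts.items()), 2):
        if site_a != site_b and (is_rfc1918(ip_a) or is_rfc1918(ip_b)):
            flagged.append((a, b))
    return flagged


# Hypothetical inventory: the registered IP is the one shown in the host box in the UI.
HOSTS = {
    "dc1-host1": ("dc1", "192.168.1.10"),
    "dc1-host2": ("dc1", "192.168.1.11"),
    "dc2-host1": ("dc2", "10.0.0.5"),
}

for pair in suspect_pairs(HOSTS):
    print("needs VPN or public IP:", pair)
```

Hosts inside the same site are never flagged here, which matches the behaviour described above: containers in the same network can talk, containers across networks cannot.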

Right now you have 2 disparate networks, so only hosts in the same network will be able to communicate with each other. It is possible to use host labels to schedule containers such that the ones that need to talk to each other all live in the same network.

Thanks @vincent, I think I misunderstood the role of the agent; I thought it masked the differences between internal networks.

I can put a public IP on each of my internal hosts so that the agents can see each other, but this approach is not good enough for production: I would rather not expose each of my hosts to the internet.

If I understand correctly, short of setting up one big VPN spanning all hosts in the different datacenters, I can’t make containers in different datacenters communicate, right?

sébastien

Hello,

I have tried setting a public IP address for each of my hosts. I then registered each host with Rancher, adding the option -e CATTLE_AGENT_IP=<public_ip>.

All my hosts are registered and visible in the Rancher UI with their public IPs.

But I still have the problem that containers within one datacenter can communicate with each other but can’t reach (or ping) containers in the other datacenter.

Rancher creates containers in datacenter 1 or 2 without distinction, which breaks my stack.

Has anyone managed to get containers on hosts in different datacenters to communicate using the internal links of rancher-compose?

sébastien

I’m having another issue with the same stack as posted above. The application is only reachable if the LB proxy is on the same host as the service (webapp).
HAProxy seems to be set up correctly, and the LB is listening on port 80:

root@32125630f2d4:/# cat /etc/haproxy/haproxy.cfg
global
        log 127.0.0.1 local0
        log 127.0.0.1 local1 notice
        maxconn 4096
        maxpipes 1024
        chroot /var/lib/haproxy
        user haproxy
        group haproxy
        daemon

defaults
        log     global
        mode    tcp
        option  tcplog
        option  dontlognull
        option  redispatch
        option  forwardfor
        retries 3
        timeout connect 5000
        timeout client 50000
        timeout server 50000
        errorfile 400 /etc/haproxy/errors/400.http
        errorfile 403 /etc/haproxy/errors/403.http
        errorfile 408 /etc/haproxy/errors/408.http
        errorfile 500 /etc/haproxy/errors/500.http
        errorfile 502 /etc/haproxy/errors/502.http
        errorfile 503 /etc/haproxy/errors/503.http
        errorfile 504 /etc/haproxy/errors/504.http

frontend b2405680-6fb7-418c-996b-819e6b88b926_frontend
        bind 10.42.157.104:80
        mode http

        default_backend b2405680-6fb7-418c-996b-819e6b88b926_0_backend

backend b2405680-6fb7-418c-996b-819e6b88b926_0_backend
        mode http
        balance roundrobin
        server e89c505b-fd79-47c4-a1e5-bb15dd7f117a 10.42.198.58:6543
iptables-save output:

ubuntu@ip-172-20-0-67:~$ sudo iptables-save
# Generated by iptables-save v1.4.21 on Mon Oct  5 12:57:48 2015
*mangle
:PREROUTING ACCEPT [374102:639779789]
:INPUT ACCEPT [309133:546665524]
:FORWARD ACCEPT [64969:93114265]
:OUTPUT ACCEPT [180593:40406938]
:POSTROUTING ACCEPT [245562:133521203]
COMMIT
# Completed on Mon Oct  5 12:57:48 2015
# Generated by iptables-save v1.4.21 on Mon Oct  5 12:57:48 2015
*nat
:PREROUTING ACCEPT [54:3968]
:INPUT ACCEPT [1:84]
:OUTPUT ACCEPT [22:1651]
:POSTROUTING ACCEPT [74:4771]
:CATTLE_POSTROUTING - [0:0]
:CATTLE_PREROUTING - [0:0]
:DOCKER - [0:0]
-A PREROUTING -j CATTLE_PREROUTING
-A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER
-A OUTPUT ! -d 127.0.0.0/8 -m addrtype --dst-type LOCAL -j DOCKER
-A POSTROUTING -j CATTLE_POSTROUTING
-A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE
-A POSTROUTING -s 172.17.0.4/32 -d 172.17.0.4/32 -p udp -m udp --dport 4500 -j MASQUERADE
-A POSTROUTING -s 172.17.0.4/32 -d 172.17.0.4/32 -p udp -m udp --dport 500 -j MASQUERADE
-A CATTLE_POSTROUTING -s 10.42.0.0/16 -d 169.254.169.250/32 -j ACCEPT
-A CATTLE_POSTROUTING -s 10.42.0.0/16 ! -d 10.42.0.0/16 -p tcp -j MASQUERADE --to-ports 1024-65535
-A CATTLE_POSTROUTING -s 10.42.0.0/16 ! -d 10.42.0.0/16 -p udp -j MASQUERADE --to-ports 1024-65535
-A CATTLE_POSTROUTING -s 10.42.0.0/16 ! -d 10.42.0.0/16 -j MASQUERADE
-A CATTLE_POSTROUTING -s 172.17.0.0/16 ! -o docker0 -p tcp -j MASQUERADE --to-ports 1024-65535
-A CATTLE_POSTROUTING -s 172.17.0.0/16 ! -o docker0 -p udp -j MASQUERADE --to-ports 1024-65535
-A CATTLE_PREROUTING -p udp -m addrtype --dst-type LOCAL -m udp --dport 4500 -j DNAT --to-destination 10.42.177.254:4500
-A CATTLE_PREROUTING -p udp -m addrtype --dst-type LOCAL -m udp --dport 500 -j DNAT --to-destination 10.42.177.254:500
-A CATTLE_PREROUTING -p tcp -m addrtype --dst-type LOCAL -m tcp --dport 80 -j DNAT --to-destination 10.42.157.104:80
-A DOCKER ! -i docker0 -p udp -m udp --dport 4500 -j DNAT --to-destination 172.17.0.4:4500
-A DOCKER ! -i docker0 -p udp -m udp --dport 500 -j DNAT --to-destination 172.17.0.4:500
COMMIT
# Completed on Mon Oct  5 12:57:48 2015
# Generated by iptables-save v1.4.21 on Mon Oct  5 12:57:48 2015
*filter
:INPUT ACCEPT [309142:546671377]
:FORWARD ACCEPT [12:720]
:OUTPUT ACCEPT [180615:40418887]
:DOCKER - [0:0]
-A FORWARD -o docker0 -j DOCKER
-A FORWARD -o docker0 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -i docker0 ! -o docker0 -j ACCEPT
-A FORWARD -i docker0 -o docker0 -j ACCEPT
-A DOCKER -d 172.17.0.4/32 ! -i docker0 -o docker0 -p udp -m udp --dport 4500 -j ACCEPT
-A DOCKER -d 172.17.0.4/32 ! -i docker0 -o docker0 -p udp -m udp --dport 500 -j ACCEPT
COMMIT
# Completed on Mon Oct  5 12:57:48 2015
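The dump is long, so here is a small Python sketch that pulls out only the DNAT rules. It shows that port 80 on the host is forwarded to 10.42.157.104:80, the same address HAProxy binds to. The parser is a simplification written for this output and is not a general iptables parser.

```python
def dnat_rules(save_output):
    """Return (chain, proto, dport, destination) for each DNAT rule."""
    rules = []
    for line in save_output.splitlines():
        tokens = line.split()
        if len(tokens) < 2 or tokens[0] != "-A" or "DNAT" not in tokens:
            continue

        def value_after(flag):
            # iptables-save writes options as "flag value" pairs.
            if flag in tokens[:-1]:
                return tokens[tokens.index(flag) + 1]
            return None

        rules.append((tokens[1], value_after("-p"), value_after("--dport"),
                      value_after("--to-destination")))
    return rules


# Three lines copied from the nat table above; only two of them are DNAT rules.
SAMPLE = """\
-A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE
-A CATTLE_PREROUTING -p tcp -m addrtype --dst-type LOCAL -m tcp --dport 80 -j DNAT --to-destination 10.42.157.104:80
-A DOCKER ! -i docker0 -p udp -m udp --dport 4500 -j DNAT --to-destination 172.17.0.4:4500
"""

for rule in dnat_rules(SAMPLE):
    print(rule)
```

On a real host you could feed it the full text of `sudo iptables-save` instead of the sample.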

And the container running the service (webapp) on the other host (docker inspect):

[
{
    "Id": "98eff28178730175b29f18804de302f43f3022b97f80a0c390a96941b50d4c20",
    "Created": "2015-10-05T12:33:31.991420993Z",
    "Path": "/bin/sh",
    "Args": [
        "-c",
        "./docker/start.sh"
    ],
    "State": {
        "Running": true,
        "Paused": false,
        "Restarting": false,
        "OOMKilled": false,
        "Dead": false,
        "Pid": 24047,
        "ExitCode": 0,
        "Error": "",
        "StartedAt": "2015-10-05T12:33:32.140555869Z",
        "FinishedAt": "0001-01-01T00:00:00Z"
    },
    "Image": "75793386b8db4ba30202450b2a579c5098185db2edb5d1f73bf0d3596ccd41a4",
    "NetworkSettings": {
        "Bridge": "",
        "EndpointID": "fe649512bc7b6940934552fb4b848c8c8ec9814339fb7cc87fbb09557fed12dc",
        "Gateway": "172.17.42.1",
        "GlobalIPv6Address": "",
        "GlobalIPv6PrefixLen": 0,
        "HairpinMode": false,
        "IPAddress": "172.17.0.10",
        "IPPrefixLen": 16,
        "IPv6Gateway": "",
        "LinkLocalIPv6Address": "",
        "LinkLocalIPv6PrefixLen": 0,
        "MacAddress": "02:7f:75:e9:50:aa",
        "NetworkID": "199bf3d928ffae11ed77e9cf42c6a429fbdf31e613e7485a918e8b57063730ca",
        "PortMapping": null,
        "Ports": {
            "6543/tcp": null
        },
        "SandboxKey": "/var/run/docker/netns/98eff2817873",
        "SecondaryIPAddresses": null,
        "SecondaryIPv6Addresses": null
    },
    "ResolvConfPath": "/var/lib/docker/containers/98eff28178730175b29f18804de302f43f3022b97f80a0c390a96941b50d4c20/resolv.conf",
    "HostnamePath": "/var/lib/docker/containers/98eff28178730175b29f18804de302f43f3022b97f80a0c390a96941b50d4c20/hostname",
    "HostsPath": "/var/lib/docker/containers/98eff28178730175b29f18804de302f43f3022b97f80a0c390a96941b50d4c20/hosts",
    "LogPath": "/var/lib/docker/containers/98eff28178730175b29f18804de302f43f3022b97f80a0c390a96941b50d4c20/98eff28178730175b29f18804de302f43f3022b97f80a0c390a96941b50d4c20-json.log",
    "Name": "/09d02248-7380-4034-be6c-3cf57527d207",
    "RestartCount": 0,
    "Driver": "aufs",
    "ExecDriver": "native-0.2",
    "MountLabel": "",
    "ProcessLabel": "",
    "AppArmorProfile": "",
    "ExecIDs": null,
    "HostConfig": {
        "Binds": null,
        "ContainerIDFile": "",
        "LxcConf": null,
        "Memory": 0,
        "MemorySwap": 0,
        "CpuShares": 0,
        "CpuPeriod": 0,
        "CpusetCpus": "",
        "CpusetMems": "",
        "CpuQuota": 0,
        "BlkioWeight": 0,
        "OomKillDisable": false,
        "MemorySwappiness": null,
        "Privileged": false,
        "PortBindings": null,
        "Links": null,
        "PublishAllPorts": false,
        "Dns": null,
        "DnsSearch": null,
        "ExtraHosts": null,
        "VolumesFrom": null,
        "Devices": null,
        "NetworkMode": "default",
        "IpcMode": "",
        "PidMode": "",
        "UTSMode": "",
        "CapAdd": null,
        "CapDrop": null,
        "GroupAdd": null,
        "RestartPolicy": {
            "Name": "on-failure",
            "MaximumRetryCount": 5
        },
        "SecurityOpt": null,
        "ReadonlyRootfs": false,
        "Ulimits": null,
        "LogConfig": {
            "Type": "json-file",
            "Config": {}
        },
        "CgroupParent": "",
        "ConsoleSize": [
            0,
            0
        ]
    },
    "GraphDriver": {
        "Name": "aufs",
        "Data": null
    },
    "Mounts": [],
    "Config": {
        "Hostname": "98eff2817873",
        "Domainname": "",
        "User": "",
        "AttachStdin": false,
        "AttachStdout": false,
        "AttachStderr": false,
        "ExposedPorts": {
            "6543/tcp": {}
        },
        "PublishService": "",
        "Tty": true,
        "OpenStdin": true,
        "StdinOnce": false,
        "Env": [
            "INI=rancher",
            "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
            "DEBIAN_FRONTEND=noninteractive",
            "PYTHON_EGG_CACHE=/pack/app/.egg-cache"
        ],
        "Cmd": [
            "/bin/sh",
            "-c",
            "./docker/start.sh"
        ],
        "Image": "uberresearch/solr_webapp:rancher",
        "Volumes": null,
        "VolumeDriver": "",
        "WorkingDir": "/pack/app",
        "Entrypoint": null,
        "NetworkDisabled": false,
        "MacAddress": "02:7f:75:e9:50:aa",
        "OnBuild": null,
        "Labels": {
            "io.rancher.container.ip": "10.42.198.58/16",
            "io.rancher.container.uuid": "09d02248-7380-4034-be6c-3cf57527d207",
            "io.rancher.project.name": "webapp",
            "io.rancher.project_service.name": "webapp/solr-webapp",
            "io.rancher.scheduler.affinity:container_label_soft": "io.rancher.service.deployment.unit=53e4984f-5de4-41a0-a793-e3d55137f236",
            "io.rancher.service.deployment.unit": "53e4984f-5de4-41a0-a793-e3d55137f236",
            "io.rancher.service.launch.config": "io.rancher.service.primary.launch.config",
            "io.rancher.stack.name": "webapp",
            "io.rancher.stack_service.name": "webapp/solr-webapp"
        }
    }
}
]
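Two addresses in this output matter: the Docker bridge address in NetworkSettings.IPAddress (172.17.0.10) and the Rancher-managed address in the io.rancher.container.ip label (10.42.198.58/16), which is the one the HAProxy backend points at. A minimal Python sketch to pull both out of docker inspect output (the function name is mine, and the sample is trimmed to the relevant fields):

```python
import json


def overlay_addresses(inspect_json):
    """Map short container ID to its bridge IP and Rancher overlay IP."""
    result = {}
    for container in json.loads(inspect_json):
        labels = (container.get("Config") or {}).get("Labels") or {}
        managed = labels.get("io.rancher.container.ip")
        result[container["Id"][:12]] = {
            "bridge_ip": (container.get("NetworkSettings") or {}).get("IPAddress"),
            # The label carries a prefix length ("/16"); keep only the address.
            "overlay_ip": managed.split("/")[0] if managed else None,
        }
    return result


# Trimmed copy of the docker inspect output above.
SAMPLE = """
[
  {
    "Id": "98eff28178730175b29f18804de302f43f3022b97f80a0c390a96941b50d4c20",
    "NetworkSettings": {"IPAddress": "172.17.0.10"},
    "Config": {"Labels": {"io.rancher.container.ip": "10.42.198.58/16"}}
  }
]
"""

print(overlay_addresses(SAMPLE))
```

Comparing the overlay_ip value with the server line in haproxy.cfg confirms that the LB is targeting the right container.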

The hosts can ping each other, all ports are open, but the requests never reach the webapp. The result is always:

<html><body><h1>503 Service Unavailable</h1>
No server is available to handle this request.
</body></html>

By the way, I see the same racoon error as in “Racon fail in Network agent 0.4.1 on RancherOS 0.33”. Might that be the reason? (This is on Ubuntu 14.04.3 LTS, 3.19.30, though, not RancherOS.)

How do I proceed from here?

@Sebastien_Allamand @sdlarsen

We’ve just updated the troubleshooting docs with additional information on how agents/server and cross-host containers should be able to communicate. Can you read them to see if they help?

http://docs.rancher.com/rancher/faqs/troubleshooting/

We are running into a similar issue, and we’re unable to determine whether it is a network configuration issue or a Rancher issue.

Set up:

  • 2 VPCs on East and West
  • VPN tunnel between them
  • All hosts added to Rancher using their private IPs
  • Physical hosts can talk just fine using private IPs cross-VPC

Behavior:

  • Can ping the Network Agent’s IP cross-host when hosts are in same VPC
  • Cannot ping the Network Agent’s IP cross-host when hosts are in different VPCs
  • Can ping the private IP address of the host machine from a Network Agent container on a host in different VPC
    • So, in other words, a container can communicate with the physical host cross-VPC (using its private IP) but cannot communicate with another container on that host using the container IP on the 10.42.0.0 network

We have tested all the configuration options we can think of on the network side and do not see anything that should be the culprit.

  • The VPN tunnel routes correctly between both VPCs. We even added the route 10.42.0.0 255.255.0.0 just to test, with no luck
  • We are certain the security groups on the host machines are correct (UDP ports 500/4500 open, etc.)
  • We also added the 10.42.0.0 network to our route tables to point to the VPN
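Security group settings can look right while UDP still gets dropped somewhere on the path, so it may be worth testing actual UDP delivery between two hosts. Below is a minimal Python sketch of a listener/sender pair. The function names and the choice of port are assumptions; ports 500 and 4500 are already taken on a Rancher host, so pick a spare port that the security groups also allow.

```python
import socket


def open_listener(port, bind_addr="0.0.0.0"):
    """Bind a UDP socket on the receiving host (ports below 1024 need root)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((bind_addr, port))
    return sock


def receive_once(sock, timeout=10.0):
    """Wait for one datagram; return (payload, sender_ip) or None on timeout."""
    sock.settimeout(timeout)
    try:
        data, peer = sock.recvfrom(1024)
    except socket.timeout:
        return None
    return data, peer[0]


def udp_send(host, port, payload=b"hello"):
    """Send one datagram from the other host to the listener's address."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


# On the receiving host (hypothetical port 5001):
#     sock = open_listener(5001)
#     print(receive_once(sock, timeout=30.0))
# On the sending host, using the receiver's registered IP:
#     udp_send("<registered-ip-of-receiver>", 5001)
```

A `None` result on the receiver means the datagram never arrived within the timeout, which points at the network path rather than at Rancher.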

Trace route:

  • Tracing route from container on VPC West to Physical Host IP on VPC East shows routing first to 172.17.., then to the VPN on the West, then to the VPN on East (as it should)
  • Tracing route from container on VPC West to a container on VPC East, however, shows routing through 10.42.. and then just stops.

Additional info:

  • We ran a test where we added two hosts to Rancher using their public IPs, one in each VPC. Containers on those machines are able to communicate cross-VPC.

FWIW, we too had a similar issue with two preexisting Ubuntu machines on which we ran the rancher/agent container. The containers on those two machines simply could not communicate.

We then deployed two new machines from within Rancher and found that the new machines worked!

The investigation went something like:

  • Our test case was as follows: click “Execute shell” on the “Network Agent” container on machine1. Type ping <ip-of-network-agent-on-machine-2>. We knew this had to work, but it simply didn’t.
  • The iptables rules on both machines were fine, and we were able to verify this by running an nc -ul 501 on machine1 and echo hello | nc -u localhost 501 on machine2.
  • On machine1, list the ipsec configurations: swanctl --list-conns. Note that there is a configuration for machine2 (so the rancher agent was doing its job).
  • Then, list the ipsec active tunnels: swanctl --list-sas. We noted that there was no active tunnel (see image below)
  • Next we checked the ipsec logs: cat /var/log/rancher-net.log. I don’t have them at hand now but whenever we saw an attempt to establish the tunnel, it quickly errored out with: “No such file or directory”.
  • That error message isn’t particularly useful, but a Google search led us to believe it was caused by a missing kernel module on the host machine itself. Sure enough, we ssh’ed to the host (machine1) and tried a quick modprobe: modprobe authenc. We got back a “No such file or directory” error.
  • We then tried: depmod -a and still got: depmod: FATAL: could not search modules: No such file or directory.
  • At this point it was clear that we had an outdated kernel and that its modules had probably been cleaned up (Ubuntu machine).
  • The final solution was a simple upgrade of linux-image and a quick reboot.
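The depmod error above is what you would expect when the module directory for the running kernel is missing. A quick way to check for that condition, sketched in Python under the assumption that modules live in /lib/modules/<kernel release> (the usual layout on Ubuntu); the function names are mine:

```python
import os
import platform


def modules_dir(release=None, base="/lib/modules"):
    """Path where modules for the given (or running) kernel release should be."""
    return os.path.join(base, release or platform.release())


def kernel_modules_present(release=None, base="/lib/modules"):
    """True if the module directory for that kernel release exists."""
    return os.path.isdir(modules_dir(release, base))


if __name__ == "__main__":
    path = modules_dir()
    if kernel_modules_present():
        print("module directory found:", path)
    else:
        print("module directory missing:", path,
              "(modprobe and depmod will fail; upgrade the kernel image and reboot)")
```

This only checks that the directory exists. It does not verify that a particular module such as authenc can be loaded, so modprobe remains the definitive test.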

While it may not be the same problem you’re experiencing, I’m writing here as our investigation may help others to diagnose their issues.