Cross-host intercontainer communication trouble

Hi @denise

Thanks for your response.

In the Rancher UI I see my 2 hosts listed with the public IP (the same IP for both hosts) and not their private eth0 IPs.
My Rancher server is on a third, independent host (where I have not installed the rancher agent).

Do you think I need to register my hosts with their eth0 IP address instead of the public IP address?

If both hosts have the same public IP, then yes, it would be better to use their private IPs. You can just add the -e CATTLE_AGENT_IP=<private_ip> environment variable to the command used to add the host.
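
For example, the command from the Add Host screen would look roughly like this with the extra variable added (the agent tag, server URL, and registration token below are placeholders; copy the real command from the UI and only append the -e flag):

# Copy the real command from "Add Host" in the UI; only the CATTLE_AGENT_IP line is added.
sudo docker run -d --privileged \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -e CATTLE_AGENT_IP=<private_ip> \
  rancher/agent:v0.8.2 http://<rancher-server>:8080/v1/scripts/<registration-token>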

Thanks @denise, I think that may be the right approach.

Unfortunately, today I’m not able to test it because my rancher-server has filled the 30 GB of disk space on my server… I added a comment on this issue: https://github.com/rancher/rancher/issues/1676

Is there a way to purge MySQL? Or to export my data so I can launch another server?

Yep, I’m trying to find the exact SQL statement that you could run to clean up your DB. I just realized that you’re using v0.38.0 and most likely have the cleanup code. We’ll need to increase the frequency of your cleanup afterwards. Let’s get you onto v0.39.0 first.

Hi @denise,

Since my Rancher server has regained free space, I have relaunched the agent with the private IP and the overlay network is now working :smile: :smile:

I ran a small benchmark of requests, and after a while I couldn’t reach my Rancher LB.
When scaling up my app the LB reconfigured itself and works again, but the strange thing was that there were no error logs in the LB container. Is there a way to check the health of the LB?

thanks a lot for your help

Hi. I don’t know if my case is exactly identical, because I see other symptoms.
Version: 0.39
Also, my cross-host networking (one webapp, one load balancer) doesn’t work.
My stack configuration:
docker-compose:

solr-webapp:
  restart: on-failure:5
  environment:
    INI: rancher
  external_links:
  - Default/solr-projects-external:solr-projects-external
  - Default/solr-publications-external:solr-publications-external
  tty: true
  image: uberresearch/solr_webapp:rancher
  stdin_open: true
webapp-lb:
  ports:
  - 80:6543
  restart: always
  tty: true
  image: rancher/load-balancer-service
  links:
  - solr-webapp:solr-webapp
  stdin_open: true

rancher-compose:

solr-webapp:
  scale: 1
webapp-lb:
  scale: 1
  load_balancer_config:
    name: webapp-lb config

Two things I wonder about:

  1. When looking at the graph, the load balancer is not linked to the service
  2. The webapp uses external services that are (no longer) displayed. I’m sure they were on an earlier install

My rancher master is running on a public subnet in a VPC (AWS) and the workers on the private subnet. I’ve tried adding the workers using both the master’s public and private IP; the situation is the same. UDP ports 500/4500 are open. Is this kind of configuration not feasible somehow?

Where should I look?

@Sebastien_Allamand - You can look at the haproxy config by following the instructions in our troubleshooting FAQs. We plan on expanding them in the next week or two based on the issues that you and others have faced.

http://docs.rancher.com/rancher/faqs/troubleshooting/#how-can-i-see-the-configuration-of-my-load-balancer
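
In short, it comes down to finding the load balancer container on the host it was scheduled to and dumping the HAProxy configuration that Rancher generated, roughly like this (the container ID is whatever docker ps shows):

# On the host running the LB container:
docker ps | grep load-balancer
docker exec -it <lb-container-id> cat /etc/haproxy/haproxy.cfg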

@sdlarsen Typically the stack configuration has nothing to do with the actual issue of hosts not being able to communicate. Can you try logging into the network agent container on one host and pinging the network agent container on the other host?
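
Something along these lines on each host (the Network Agent container name is whatever docker ps shows for it; the target is the 10.42.x.x address of the agent on the other host, visible in the UI or via docker inspect):

# On host 1, ping the Network Agent on host 2 from inside the Network Agent container
docker exec -it <network-agent-container> ping -c 3 <ip-of-network-agent-on-host-2>
# Then repeat the test in the other direction from host 2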

How did your stack get created or where did you get this docker-compose.yml? When services are removed using the UI, it will typically update the docker-compose.yml that Rancher generates and remove those links.

When you refresh, can you see the links?

Hi @denise,

Thank you for your swift reply. The ping test showed a flaw in my network setup. Thank you for the suggestion and sorry for the noise.

By the way, is there any logging available that could have shown me this? Neither the UI nor rancher-compose complains.

Br.
Søren

@sdlarsen Unfortunately, there is nothing that really indicates that it’s not working. Due to the sheer number of people having similar networking issues, I’ve created https://github.com/rancher/rancher/issues/2222 as a feature request for rancher/server to do some kind of checking and produce an error message somewhere to make it easier to troubleshoot. :slight_smile:

@denise, excellent. Thank you.

Hello @denise,

I have a question: for the overlay network to work, can my hosts be on different private networks?
In my case I am trying to extend my network with hosts in 2 different datacenters, but the network agents in different datacenters can’t ping each other (and neither can my containers).

In my case, big & medium are in the same datacenter and can see each other, while lodz1, which is elsewhere, can’t see the others.

If you have any clue :wink:

sebastien

The agents will communicate with each other using the host IP that is shown in the UI/in your picture. So you are correct: containers on the 10.x host won’t be able to see containers on the other two, or vice versa.

Thanks @vincent,

So if I want to make them communicate, do I have to switch the network in my other datacenter to something in 192.168.x.x??

That sounds weird to me; I was thinking the agent was doing something like a local NAT translation.

More like they need public IP addresses or a VPN between networks. Just switching the IP subnet won’t help. Each host needs to be able to communicate (on UDP ports 500 and 4500, but you can just consider ping for now) with the registered IP (displayed in the host box in the UI) for every other host for the full overlay network to work.

Right now you have 2 disparate networks, so only hosts in the same network will be able to communicate with each other. It is possible to use host labels to schedule containers such that the ones that need to talk to each other all live in the same network.
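
A rough sketch of that approach (the label name and value are arbitrary; copy the real registration command from the Add Host screen, and check the scheduling docs for the exact compose label syntax on your version):

# Tag each host with its datacenter when registering the agent
sudo docker run -d --privileged \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -e CATTLE_HOST_LABELS='datacenter=dc1' \
  rancher/agent:v0.8.2 http://<rancher-server>:8080/v1/scripts/<registration-token>

# Then in docker-compose.yml, pin the services that must talk to each other to one datacenter:
#   labels:
#     io.rancher.scheduler.affinity:host_label: datacenter=dc1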

Thanks @vincent, I think I was misunderstanding the role of the agent; I was thinking it was masking the differences between the internal networks.

I can put a public IP on each of my internal hosts so that each agent can see the others, but this approach is not good enough for production; I would prefer not to expose each of my hosts on the internet.

If I understand correctly, short of setting up a big VPN covering the hosts in the different datacenters, I can’t make containers in different datacenters communicate, right?

sébastien

Hello,

I have tried to set a public IP address for each of my hosts. I then registered each host with Rancher, adding the option -e CATTLE_AGENT_IP=<public_ip>.

All my hosts are registered and visible in the Rancher UI with their public IP.

But I still have the problem that containers within one datacenter can communicate with each other, but can’t reach (or ping) containers in the other datacenter.

Rancher schedules the containers independently into datacenter 1 or 2, which means my stack doesn’t work.

Has anyone managed to get containers on hosts in different datacenters to communicate using the internal links of rancher-compose?

sébastien

I’m having another issue with the same stack as posted above. The application is only reachable if the LB proxy is on the same host as the service (webapp).
The HAProxy seems to be set up correctly and the LB is listening on port 80:

root@32125630f2d4:/# cat /etc/haproxy/haproxy.cfg 
global
	log 127.0.0.1 local0
    	log 127.0.0.1 local1 notice
        maxconn 4096
        maxpipes 1024
	chroot /var/lib/haproxy
	user haproxy
	group haproxy
	daemon

defaults
	log	global
	mode	tcp
	option	tcplog
        option  dontlognull
        option  redispatch
        option forwardfor
        retries 3
        timeout connect 5000
        timeout client 50000
        timeout server 50000
	errorfile 400 /etc/haproxy/errors/400.http
	errorfile 403 /etc/haproxy/errors/403.http
	errorfile 408 /etc/haproxy/errors/408.http
	errorfile 500 /etc/haproxy/errors/500.http
	errorfile 502 /etc/haproxy/errors/502.http
	errorfile 503 /etc/haproxy/errors/503.http
	errorfile 504 /etc/haproxy/errors/504.http

frontend b2405680-6fb7-418c-996b-819e6b88b926_frontend
        bind 10.42.157.104:80
        mode http

    	default_backend b2405680-6fb7-418c-996b-819e6b88b926_0_backend

backend b2405680-6fb7-418c-996b-819e6b88b926_0_backend
        mode http
        balance roundrobin
        server e89c505b-fd79-47c4-a1e5-bb15dd7f117a 10.42.198.58:6543

iptables-save output:

ubuntu@ip-172-20-0-67:~$ sudo iptables-save
# Generated by iptables-save v1.4.21 on Mon Oct  5 12:57:48 2015
*mangle
:PREROUTING ACCEPT [374102:639779789]
:INPUT ACCEPT [309133:546665524]
:FORWARD ACCEPT [64969:93114265]
:OUTPUT ACCEPT [180593:40406938]
:POSTROUTING ACCEPT [245562:133521203]
COMMIT
# Completed on Mon Oct  5 12:57:48 2015
# Generated by iptables-save v1.4.21 on Mon Oct  5 12:57:48 2015
*nat
:PREROUTING ACCEPT [54:3968]
:INPUT ACCEPT [1:84]
:OUTPUT ACCEPT [22:1651]
:POSTROUTING ACCEPT [74:4771]
:CATTLE_POSTROUTING - [0:0]
:CATTLE_PREROUTING - [0:0]
:DOCKER - [0:0]
-A PREROUTING -j CATTLE_PREROUTING
-A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER
-A OUTPUT ! -d 127.0.0.0/8 -m addrtype --dst-type LOCAL -j DOCKER
-A POSTROUTING -j CATTLE_POSTROUTING
-A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE
-A POSTROUTING -s 172.17.0.4/32 -d 172.17.0.4/32 -p udp -m udp --dport 4500 -j MASQUERADE
-A POSTROUTING -s 172.17.0.4/32 -d 172.17.0.4/32 -p udp -m udp --dport 500 -j MASQUERADE
-A CATTLE_POSTROUTING -s 10.42.0.0/16 -d 169.254.169.250/32 -j ACCEPT
-A CATTLE_POSTROUTING -s 10.42.0.0/16 ! -d 10.42.0.0/16 -p tcp -j MASQUERADE --to-ports 1024-65535
-A CATTLE_POSTROUTING -s 10.42.0.0/16 ! -d 10.42.0.0/16 -p udp -j MASQUERADE --to-ports 1024-65535
-A CATTLE_POSTROUTING -s 10.42.0.0/16 ! -d 10.42.0.0/16 -j MASQUERADE
-A CATTLE_POSTROUTING -s 172.17.0.0/16 ! -o docker0 -p tcp -j MASQUERADE --to-ports 1024-65535
-A CATTLE_POSTROUTING -s 172.17.0.0/16 ! -o docker0 -p udp -j MASQUERADE --to-ports 1024-65535
-A CATTLE_PREROUTING -p udp -m addrtype --dst-type LOCAL -m udp --dport 4500 -j DNAT --to-destination 10.42.177.254:4500
-A CATTLE_PREROUTING -p udp -m addrtype --dst-type LOCAL -m udp --dport 500 -j DNAT --to-destination 10.42.177.254:500
-A CATTLE_PREROUTING -p tcp -m addrtype --dst-type LOCAL -m tcp --dport 80 -j DNAT --to-destination 10.42.157.104:80
-A DOCKER ! -i docker0 -p udp -m udp --dport 4500 -j DNAT --to-destination 172.17.0.4:4500
-A DOCKER ! -i docker0 -p udp -m udp --dport 500 -j DNAT --to-destination 172.17.0.4:500
COMMIT
# Completed on Mon Oct  5 12:57:48 2015
# Generated by iptables-save v1.4.21 on Mon Oct  5 12:57:48 2015
*filter
:INPUT ACCEPT [309142:546671377]
:FORWARD ACCEPT [12:720]
:OUTPUT ACCEPT [180615:40418887]
:DOCKER - [0:0]
-A FORWARD -o docker0 -j DOCKER
-A FORWARD -o docker0 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -i docker0 ! -o docker0 -j ACCEPT
-A FORWARD -i docker0 -o docker0 -j ACCEPT
-A DOCKER -d 172.17.0.4/32 ! -i docker0 -o docker0 -p udp -m udp --dport 4500 -j ACCEPT
-A DOCKER -d 172.17.0.4/32 ! -i docker0 -o docker0 -p udp -m udp --dport 500 -j ACCEPT
COMMIT
# Completed on Mon Oct  5 12:57:48 2015

And the container running the service (webapp) on the other host (docker inspect):

[
{
    "Id": "98eff28178730175b29f18804de302f43f3022b97f80a0c390a96941b50d4c20",
    "Created": "2015-10-05T12:33:31.991420993Z",
    "Path": "/bin/sh",
    "Args": [
        "-c",
        "./docker/start.sh"
    ],
    "State": {
        "Running": true,
        "Paused": false,
        "Restarting": false,
        "OOMKilled": false,
        "Dead": false,
        "Pid": 24047,
        "ExitCode": 0,
        "Error": "",
        "StartedAt": "2015-10-05T12:33:32.140555869Z",
        "FinishedAt": "0001-01-01T00:00:00Z"
    },
    "Image": "75793386b8db4ba30202450b2a579c5098185db2edb5d1f73bf0d3596ccd41a4",
    "NetworkSettings": {
        "Bridge": "",
        "EndpointID": "fe649512bc7b6940934552fb4b848c8c8ec9814339fb7cc87fbb09557fed12dc",
        "Gateway": "172.17.42.1",
        "GlobalIPv6Address": "",
        "GlobalIPv6PrefixLen": 0,
        "HairpinMode": false,
        "IPAddress": "172.17.0.10",
        "IPPrefixLen": 16,
        "IPv6Gateway": "",
        "LinkLocalIPv6Address": "",
        "LinkLocalIPv6PrefixLen": 0,
        "MacAddress": "02:7f:75:e9:50:aa",
        "NetworkID": "199bf3d928ffae11ed77e9cf42c6a429fbdf31e613e7485a918e8b57063730ca",
        "PortMapping": null,
        "Ports": {
            "6543/tcp": null
        },
        "SandboxKey": "/var/run/docker/netns/98eff2817873",
        "SecondaryIPAddresses": null,
        "SecondaryIPv6Addresses": null
    },
    "ResolvConfPath": "/var/lib/docker/containers/98eff28178730175b29f18804de302f43f3022b97f80a0c390a96941b50d4c20/resolv.conf",
    "HostnamePath": "/var/lib/docker/containers/98eff28178730175b29f18804de302f43f3022b97f80a0c390a96941b50d4c20/hostname",
    "HostsPath": "/var/lib/docker/containers/98eff28178730175b29f18804de302f43f3022b97f80a0c390a96941b50d4c20/hosts",
    "LogPath": "/var/lib/docker/containers/98eff28178730175b29f18804de302f43f3022b97f80a0c390a96941b50d4c20/98eff28178730175b29f18804de302f43f3022b97f80a0c390a96941b50d4c20-json.log",
    "Name": "/09d02248-7380-4034-be6c-3cf57527d207",
    "RestartCount": 0,
    "Driver": "aufs",
    "ExecDriver": "native-0.2",
    "MountLabel": "",
    "ProcessLabel": "",
    "AppArmorProfile": "",
    "ExecIDs": null,
    "HostConfig": {
        "Binds": null,
        "ContainerIDFile": "",
        "LxcConf": null,
        "Memory": 0,
        "MemorySwap": 0,
        "CpuShares": 0,
        "CpuPeriod": 0,
        "CpusetCpus": "",
        "CpusetMems": "",
        "CpuQuota": 0,
        "BlkioWeight": 0,
        "OomKillDisable": false,
        "MemorySwappiness": null,
        "Privileged": false,
        "PortBindings": null,
        "Links": null,
        "PublishAllPorts": false,
        "Dns": null,
        "DnsSearch": null,
        "ExtraHosts": null,
        "VolumesFrom": null,
        "Devices": null,
        "NetworkMode": "default",
        "IpcMode": "",
        "PidMode": "",
        "UTSMode": "",
        "CapAdd": null,
        "CapDrop": null,
        "GroupAdd": null,
        "RestartPolicy": {
            "Name": "on-failure",
            "MaximumRetryCount": 5
        },
        "SecurityOpt": null,
        "ReadonlyRootfs": false,
        "Ulimits": null,
        "LogConfig": {
            "Type": "json-file",
            "Config": {}
        },
        "CgroupParent": "",
        "ConsoleSize": [
            0,
            0
        ]
    },
    "GraphDriver": {
        "Name": "aufs",
        "Data": null
    },
    "Mounts": [],
    "Config": {
        "Hostname": "98eff2817873",
        "Domainname": "",
        "User": "",
        "AttachStdin": false,
        "AttachStdout": false,
        "AttachStderr": false,
        "ExposedPorts": {
            "6543/tcp": {}
        },
        "PublishService": "",
        "Tty": true,
        "OpenStdin": true,
        "StdinOnce": false,
        "Env": [
            "INI=rancher",
            "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
            "DEBIAN_FRONTEND=noninteractive",
            "PYTHON_EGG_CACHE=/pack/app/.egg-cache"
        ],
        "Cmd": [
            "/bin/sh",
            "-c",
            "./docker/start.sh"
        ],
        "Image": "uberresearch/solr_webapp:rancher",
        "Volumes": null,
        "VolumeDriver": "",
        "WorkingDir": "/pack/app",
        "Entrypoint": null,
        "NetworkDisabled": false,
        "MacAddress": "02:7f:75:e9:50:aa",
        "OnBuild": null,
        "Labels": {
            "io.rancher.container.ip": "10.42.198.58/16",
            "io.rancher.container.uuid": "09d02248-7380-4034-be6c-3cf57527d207",
            "io.rancher.project.name": "webapp",
            "io.rancher.project_service.name": "webapp/solr-webapp",
            "io.rancher.scheduler.affinity:container_label_soft": "io.rancher.service.deployment.unit=53e4984f-5de4-41a0-a793-e3d55137f236",
            "io.rancher.service.deployment.unit": "53e4984f-5de4-41a0-a793-e3d55137f236",
            "io.rancher.service.launch.config": "io.rancher.service.primary.launch.config",
            "io.rancher.stack.name": "webapp",
            "io.rancher.stack_service.name": "webapp/solr-webapp"
        }
    }
}
]

The hosts can ping each other, all ports are open, but the requests never reach the webapp. The result is always:

<html><body><h1>503 Service Unavailable</h1>
No server is available to handle this request.
</body></html>

Btw, I see the same racoon error as in “Racon fail in Network agent 0.4.1 on RancherOS 0.33” - might that be the reason? (It’s on Ubuntu 14.04.3 LTS - 3.19.30 though - not RancherOS.)

How do I proceed from here?

@Sebastien_Allamand @sdlarsen

We’ve just updated the troubleshooting docs with additional information on how the agents/server and cross-host containers should be able to communicate. Can you read it to see if it might help you?

http://docs.rancher.com/rancher/faqs/troubleshooting/
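
For the 503 above specifically, one quick check is whether the backend address from the generated haproxy.cfg (10.42.198.58:6543 in the output you posted) is actually reachable over the overlay network from the host running the LB. A rough sketch (the container name is a placeholder, and nc may not be present in every image):

# On the host running the load balancer:
docker exec -it <network-agent-container> ping -c 3 10.42.198.58
# If ping works, test the actual backend port too (if nc is available):
docker exec -it <network-agent-container> nc -zv -w 3 10.42.198.58 6543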

We are running into a similar issue; we’re unable to determine whether it is a network configuration issue or a Rancher issue.

Set up:

  • 2 VPCs on East and West
  • VPN tunnel between them
  • All hosts added to rancher using their private IPs
  • Physical hosts can talk just fine using private IPs cross-VPC

Behavior:

  • Can ping the Network Agent’s IP cross-host when hosts are in same VPC
  • Cannot ping the Network Agent’s IP cross-host when hosts are in different VPCs
  • Can ping the private IP address of the host machine from a Network Agent container on a host in different VPC
    • So, in other words, a container can communicate with the physical host cross-VPC (using its private IP) but cannot communicate with another container on that host using the container IP on the 10.42.0.0 network

We have tested all the configuration options we can think of on the network side and do not see anything that should be the culprit.

  • The VPN tunnel routes correctly between both VPCs. We even added the route 10.42.0.0 255.255.0.0 just to test, with no luck
  • The security groups on the host machines are definitely correct (UDP ports 500/4500 open, etc.)
  • We also added the 10.42.0.0 network to our route tables to point to the VPN

Trace route:

  • Tracing the route from a container in VPC West to a physical host IP in VPC East shows routing first to 172.17.x.x, then to the VPN on the West side, then to the VPN on the East side (as it should)
  • Tracing the route from a container in VPC West to a container in VPC East, however, shows routing through 10.42.x.x and then just stops.

Additional info:

  • We ran a test where we added two hosts to rancher using their public IPs, one to each VPC. Containers on those machines are able to successfully communicate cross-VPC.

FWIW, we too had a similar issue with two preexisting ubuntu machines where we ran the rancher/agent container. The containers on those two machines simply could not communicate.

We then deployed two new machines from within Rancher and found that the new machines worked!

The investigation went something like:

  • Our test case was as follows: click “Execute shell” on the “Network Agent” container on machine1 and type ping <ip-of-network-agent-on-machine-2>. We knew this had to work, but it simply didn’t.
  • The iptables rules on both machines were fine, and we were able to verify this by running an nc -ul 501 on machine1 and echo hello | nc -u localhost 501 on machine2.
  • On machine1, list the ipsec configurations: swanctl --list-conns. Note that there is a configuration for machine2 (so the rancher agent was doing its job).
  • Then, list the ipsec active tunnels: swanctl --list-sas. We noted that there was no active tunnel (see image below)
  • Next we checked the ipsec logs: cat /var/log/rancher-net.log. I don’t have them at hand now but whenever we saw an attempt to establish the tunnel, it quickly errored out with: “No such file or directory”.
  • That error message isn’t particularly useful but a google search led us to believe it was a missing kernel module on the host machine itself. Sure enough, we ssh’ed to the host (machine1) and tried a quick modprobe: modprobe authenc. We got back a “No such file or directory” error.
  • We then tried: depmod -a and still got: depmod: FATAL: could not search modules: No such file or directory.
  • At this point it was clear that we had an outdated kernel and our modules had probably been cleaned (ubuntu machine).
  • The final solution was a simple upgrade of linux-image and a quick reboot.

While it may not be the same problem you’re experiencing, I’m writing it up here as our investigation may help others diagnose their own issues; a consolidated sketch of the commands we ran is below.
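
Roughly, the command sequence was (host IPs are placeholders; the exact package name in the last step depends on your distribution):

# 1. Verify host-to-host UDP reachability on a spare port (500/4500 are DNAT'ed to the agent)
nc -ul 501                              # on machine1
echo hello | nc -u <machine1-ip> 501    # on machine2

# 2. Inside the Network Agent container on machine1, inspect the IPsec state
swanctl --list-conns             # configured connections (one per peer host)
swanctl --list-sas               # tunnels that are actually established
cat /var/log/rancher-net.log     # look for errors when the tunnel is attempted

# 3. On the host itself, check that the kernel modules IPsec needs can still load
modprobe authenc
depmod -a

# 4. If those fail with "No such file or directory", the modules for the running kernel
#    are missing; upgrade the kernel image and reboot, e.g. on Ubuntu:
sudo apt-get update && sudo apt-get install -y linux-image-generic
sudo reboot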