Cross-host intercontainer communication trouble

I have installed the latest version (0.35) of Rancher and have problems with networking.
Containers started on the same host can ping each other, but containers on different hosts do not see each other.
Communication between the hosts works. I can control containers on one host through the Rancher server GUI running on another.

I started the containers in the Rancher GUI. All containers use the managed network.
I log in to the running containers with docker exec and run, for example, ping 10.42.78.96.

I see no errors in Network Agents logs. As a matter of fact, there are no logs for today at all.


There could be a couple of different things happening.

  1. Have you made sure to open UDP ports 500 and 4500 between the hosts? Note: Being able to add containers to one host through rancher server GUI does not mean that the IPsec networking (how containers communicate cross host) is working.

  2. It looks like your poweredge320 host is running rancher/server and rancher/agent. Is the IP shown on the screen the IP of your box? When launching rancher/agent on the same machine as rancher/server, you sometimes need to launch it with an extra environment variable. See http://docs.rancher.com/rancher/rancher-ui/infrastructure/hosts/custom/#samehost

  3. Assuming numbers 1 and 2 are okay: sometimes, having only one container on a host will not set up the networking properly. It’s not always reproducible and not consistent. Can you try adding another container to the pwr321 host? This will kind of reset the networking and might work.
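For item 1, you can probe the IPsec ports from one host toward the other, e.g. with `nmap -sU -p 500,4500 <other-host>`. If you'd rather not install anything, here is a rough Python sketch (`OTHER_HOST` is a placeholder for the other host's IP; note that UDP gives no positive confirmation, so a firewalled port usually shows up as "open|filtered" rather than "closed"):

```python
import socket

def udp_probe(host, port, timeout=2.0):
    """Best-effort UDP reachability check. Uses a connected socket so a
    returning ICMP 'port unreachable' surfaces as ConnectionRefusedError;
    a timeout means the packet was accepted or silently firewalled
    (UDP cannot distinguish the two)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.settimeout(timeout)
    try:
        s.connect((host, port))
        s.send(b"")
        s.recv(1024)
        return "open"            # something actually answered
    except ConnectionRefusedError:
        return "closed"          # ICMP port unreachable came back
    except socket.timeout:
        return "open|filtered"   # accepted, or dropped by a firewall
    finally:
        s.close()

# Placeholder: replace with the other Rancher host's IP.
OTHER_HOST = "127.0.0.1"
for port in (500, 4500):
    print(port, udp_probe(OTHER_HOST, port))
```

A "closed" result from the other host means something (a firewall or the host itself) is actively rejecting the IPsec traffic.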

Thank you, Denise.
I tried to re-register my hosts with the Rancher server, but I cannot register the host that the server is running on.
I start it like this:

sudo docker run -d -e CATTLE_AGENT_IP="172.19.5.5" --privileged -v /var/run/docker.sock:/var/run/docker.sock rancher/agent:v0.8.1 http://poweredge320.aics10.riken.jp:8000/v1/scripts/xxxxsomekeyxxxx

But the host doesn’t appear in the Hosts list in the GUI.
The rancher-agent container is constantly restarting itself.
In its logs I can see:

ERROR: Please re-register this agent
ERROR: Please re-register this agent
ERROR: Please re-register this agent

I also have 2 stopped containers.
One is rancher-agent-state. In its logs:

Get http:///var/run/docker.sock/v1.18/images/aea8977b285c/json: dial unix /var/run/docker.sock: no such file or directory. Are you trying to connect to a TLS-enabled daemon without TLS?

Another has a strange name, but it is also from the rancher/agent image, like the two mentioned above. In its logs I can see:

INFO: Running Agent Registration Process, CATTLE_URL=http://poweredge320.aics10.riken.jp:8000/v1
INFO: Checking for Docker version >= 1.6.0
INFO: Found Server version: 1.8.1
INFO: docker version: Client version: 1.6.0
INFO: docker version: Client API version: 1.18
INFO: docker version: Go version (client): go1.4.2
INFO: docker version: Git commit (client): 4749651
INFO: docker version: OS/Arch (client): linux/amd64
INFO: docker version: Server version: 1.8.1
INFO: docker version: Server API version: 1.20
INFO: docker version: Go version (server): go1.4.2
INFO: docker version: Git commit (server): d12ea79
INFO: docker version: OS/Arch (server): linux/amd64
INFO: docker info: Containers: 4
INFO: docker info: Images: 171
INFO: docker info: Storage Driver: aufs
INFO: docker info: Root Dir: /var/lib/docker/aufs
INFO: docker info: Backing Filesystem: extfs
INFO: docker info: Dirs: 188
INFO: docker info: Dirperm1 Supported: false
INFO: docker info: Execution Driver: native-0.2
INFO: docker info: Kernel Version: 3.13.0-63-generic
WARNING: No swap limit support
INFO: docker info: Operating System: Ubuntu precise (12.04.4 LTS)
INFO: docker info: CPUs: 4
INFO: docker info: Total Memory: 3.813 GiB
INFO: docker info: Name: poweredge320
INFO: docker info: ID: XXXXXXX
INFO: docker info: Http Proxy:
INFO: docker info: Https Proxy:
INFO: docker info: No Proxy:
INFO: Attempting to connect to: http://poweredge320.aics10.riken.jp:8000/v1
INFO: http://poweredge320.aics10.riken.jp:8000/v1 is accessible
INFO: Inspecting host capabilities
INFO: System: false
INFO: Host writable: true
INFO: Token: xxxxxxxx
INFO: Running registration
INFO: Printing Environment
INFO: ENV: CATTLE_ACCESS_KEY=XXXXXX
INFO: ENV: CATTLE_AGENT_IP=172.19.5.5
INFO: ENV: CATTLE_HOME=/var/lib/cattle
INFO: ENV: CATTLE_REGISTRATION_ACCESS_KEY=registrationToken
INFO: ENV: CATTLE_REGISTRATION_SECRET_KEY=xxxxxxx
INFO: ENV: CATTLE_SECRET_KEY=xxxxxxx
INFO: ENV: CATTLE_SYSTEMD=false
INFO: ENV: CATTLE_URL=http://poweredge320.aics10.riken.jp:8000/v1
INFO: ENV: DETECTED_CATTLE_AGENT_IP=172.17.42.1
INFO: ENV: RANCHER_AGENT_IMAGE=rancher/agent:v0.8.1
INFO: Launched Rancher Agent: XXXXXXX
Get http:///var/run/docker.sock/v1.18/images/aea8977b285c/json: dial unix /var/run/docker.sock: no such file or directory. Are you trying to connect to a TLS-enabled daemon without TLS?

Of course, /var/run/docker.sock exists on the server and should be mounted into the container.
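One way to rule out a bad bind-mount is to verify, both on the host and inside the agent container (via docker exec), that the path really is a Unix domain socket and not an empty directory Docker created in its place. A minimal sketch (the function name is mine):

```python
import os
import stat

def is_unix_socket(path="/var/run/docker.sock"):
    """True only if the path exists and really is a Unix domain socket.
    A failed bind-mount often leaves a plain (empty) directory at the
    same path instead, which produces exactly the 'no such file or
    directory' dial error seen in the agent logs."""
    try:
        mode = os.stat(path).st_mode
    except FileNotFoundError:
        return False
    return stat.S_ISSOCK(mode)
```

From a shell, `test -S /var/run/docker.sock && echo ok` does the same check.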

I once again reinstalled server and was able to register both hosts.
The agent on the same host as the server I started with the
-e CATTLE_AGENT_IP=172.19.5.5 option.
Now I see correct IP addresses in GUI and can ping containers between hosts.

As you suggested, I had to start a second container on the remote host for the ping to start working.

Thank you!
Peter

Just to follow-up.

When rancher/agent is launched, we end up starting 3 containers, but only one ends up running. You can remove the container with the Docker-generated name with no issues, but we need to keep the rancher-agent-state container.

As for the rancher-agent container getting the “re-register this agent”, what kind of OS/server are you running? There is an existing workaround, but I have yet to reproduce not being able to re-run the agent command.

You can read about the open issue here: https://github.com/rancher/rancher/issues/1528

On both host machines I have Ubuntu 12.04.4 LTS (GNU/Linux 3.13.0-63-generic x86_64).

Hi,

I’ve got the same problem. I have a stack started with rancher-compose. Everything works fine when all containers start on the same host, but if I add another host, I can’t get them to communicate.

I’ve opened UDP ports 500 & 4500 and tested them with nmap.

I can’t ping or telnet when containers are not on the same host, and I don’t really know how to debug it…
I don’t see any error logs in the agents.

My Rancher components :smile:

  - Rancher: v0.38.0
  - Cattle: v0.94.0
  - User Interface: v0.49.0
  - Rancher Compose: beta/latest

Thanks

sébastien

@Sebastien_Allamand

If you see a host on the UI with IP (172.17.42.1) or starting with 172.17.x.x, then please double check to see if the IP is the actual IP of the host. These IPs tend to be the docker internal IP and will not work. You will need to re-register your host with the correct IP. If you have issues re-registering and get the “re-register this agent” issue, please use the workaround.

Can you confirm that your hosts have the right IP? The above might happen if you have Rancher server on the same host as Rancher agent.
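The 172.17.x.x range comes from Docker's default docker0 bridge (172.17.0.0/16), which is local to each host. A quick sanity check for a registered host IP, as a sketch (assumes the default bridge subnet hasn't been reconfigured):

```python
import ipaddress

# Docker's default docker0 bridge subnet; addresses in it are
# host-local and useless for cross-host traffic.
DOCKER0_DEFAULT = ipaddress.ip_network("172.17.0.0/16")

def looks_like_docker_bridge_ip(ip):
    """True if the address falls inside Docker's default bridge subnet."""
    return ipaddress.ip_address(ip) in DOCKER0_DEFAULT

print(looks_like_docker_bridge_ip("172.17.42.1"))  # True  -> re-register
print(looks_like_docker_bridge_ip("172.19.5.5"))   # False -> looks fine
```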

Hi @denise

Thanks for your response.

In the Rancher UI I see my 2 hosts with the public IP (same IP for both hosts) and not their private eth0 IPs.
My Rancher server is on a third, independent host (on which I have not installed the Rancher agent).

Do you think I need to register my hosts with their eth0 IP addresses instead of the public IP address?

If they share the same public IP, then yes, it would be better to use their private IPs. You can just add the -e CATTLE_AGENT_IP=<private_ip> environment variable to the command used to add the host.
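If you script host registration, you can select the private address automatically before passing it as CATTLE_AGENT_IP. A sketch (the candidate list would come from something like `ip -4 addr` on the host; the function name is mine):

```python
import ipaddress

def pick_rfc1918(candidates):
    """Return the first private (RFC 1918) IPv4 address from a list of
    candidate addresses, or None if there isn't one."""
    rfc1918 = [ipaddress.ip_network(n)
               for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]
    for ip in candidates:
        addr = ipaddress.ip_address(ip)
        if any(addr in net for net in rfc1918):
            return ip
    return None

# The shared public IP is skipped; the eth0 private address wins.
print(pick_rfc1918(["8.8.8.8", "172.19.5.5"]))  # 172.19.5.5
```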

Thanks @denise, I think that may be the right way.

Unfortunately, today I’m not able to test it because my rancher-server has filled the 30 GB of data space on my server… I added a comment on this issue: https://github.com/rancher/rancher/issues/1676

Is there a way to purge MySQL? Or a way to export my data to launch another server?

Yep, I’m trying to find the exact SQL statement that you could run to clean up your DB. I just realized that you’re using v0.38.0 and most likely have the clean up code. We’ll need to increase the frequency of your cleanup afterwards. Let’s get you up on v0.39.0 first.

hi @denise,

Since my Rancher server has regained free space, I have relaunched the agent with the private IP, and the overlay network is now working :smile: :smile:

I ran a small benchmark of requests, and after a while I couldn’t reach my Rancher LB.
When scaling up my app, the LB reconfigured itself and worked again, but the strange thing was that there were no error logs in the LB container. Is there a way to check the health of the LB?

thanks a lot for your help

Hi. I don’t know if my case is exactly identical, because I see other symptoms.
Version: 0.39
Also, my cross-host networking (one webapp, one load balancer) doesn’t work.
My stack configuration:
docker-compose:

solr-webapp:
  restart: on-failure:5
  environment:
    INI: rancher
  external_links:
  - Default/solr-projects-external:solr-projects-external
  - Default/solr-publications-external:solr-publications-external
  tty: true
  image: uberresearch/solr_webapp:rancher
  stdin_open: true
webapp-lb:
  ports:
  - 80:6543
  restart: always
  tty: true
  image: rancher/load-balancer-service
  links:
  - solr-webapp:solr-webapp
  stdin_open: true

rancher-compose:

solr-webapp:
  scale: 1
webapp-lb:
  scale: 1
  load_balancer_config:
    name: webapp-lb config

Two things I wonder about:

  1. When looking at the graph, the load balancer is not linked to the service
  2. The webapp uses external services that are (no longer) displayed. I’m sure they were displayed on an earlier install.

My rancher master is running on a public subnet in a VPC (AWS) and the workers on the private subnet. I’ve tried adding the workers using the master public and private IP, the situation is the same. UDP port 500/4500 is open. Is this kind of configuration not feasible somehow?

Where should I look?

@Sebastien_Allamand - You can look at the haproxy config by following the instructions in our troubleshooting FAQS. We plan on expanding it in the next week or two based on issues that you’ve faced and others.

http://docs.rancher.com/rancher/faqs/troubleshooting/#how-can-i-see-the-configuration-of-my-load-balancer
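Beyond dumping the haproxy config, a cheap external check is whether the LB answers HTTP at all: any status code, even an error, proves haproxy itself replied. A minimal sketch (the URL you pass would be your LB endpoint; the function name is mine):

```python
import urllib.error
import urllib.request

def lb_answers(url, timeout=5):
    """True if the URL returns any HTTP response at all; False only on
    connection-level failures (refused, timed out, unreachable)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        return True   # an HTTP error status still proves haproxy replied
    except (urllib.error.URLError, OSError):
        return False
```

Polling this during a benchmark would pinpoint the moment the LB stops answering, even when its container logs stay silent.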

@sdlarsen Typically the stack configuration has nothing to do with the actual issue of hosts not being able to communicate. Can you try logging into the network agent containers on one host and pinging the other network agent container?

How did your stack get created or where did you get this docker-compose.yml? When services are removed using the UI, it will typically update the docker-compose.yml that Rancher generates and remove those links.

When you refresh, can you see the links?
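That network-agent ping test can also be scripted if you want to run it repeatedly. A sketch that shells out to the system ping, run from inside one network agent against the other agent's 10.42.x.x address (the flags assume Linux iputils ping, and the function name is mine):

```python
import subprocess

def can_ping(ip, count=2, timeout=3):
    """True when the target answers ICMP echo. Uses the system ping so
    the result matches exactly what a manual test in the container
    would show. -c/-W are Linux iputils flags."""
    try:
        result = subprocess.run(
            ["ping", "-c", str(count), "-W", str(timeout), ip],
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    except FileNotFoundError:
        return False  # no ping binary in this image
    return result.returncode == 0
```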

Hi @denise,

Thank you for your swift reply. The ping test showed a flaw in my network setup. Thank you for the suggestion and sorry for the noise.

By the way, is there any logging available that could have shown me this? Neither the UI nor rancher-compose complains.

Br.
Søren

@sdlarsen Unfortunately, there is nowhere that really indicates that it’s not working. Due to the sheer number of people having similar networking issues, I’ve created https://github.com/rancher/rancher/issues/2222 as a feature request for rancher/server to do some kind of checking and produce an error message somewhere, to make it easier to troubleshoot. :slight_smile:

@denise, excellent. Thank you.

Hello @denise,

I have a question: for the overlay network to work, can my hosts be on different private networks?
In my case I am trying to extend my network with hosts in two different datacenters, but network agents in different datacenters can’t ping each other (nor can my containers).

In my case, big & medium are in the same datacenter and can see each other, while lodz1, which is elsewhere, can’t see the others.

If you have any clue :wink:

sebastien

The agents communicate with each other using the host IP that is shown in the UI/in your picture. So you are correct: containers on the 10.x host won’t be able to see containers on the other two, and vice versa.