How to get Rancher overlay network working locally

I followed the steps here [1] to get a local Rancher server up and running. Having got a single node up and registered with Rancher as explained, I repeated the step to have a second host (VM).

I then started a service on each box that listened on an exposed port (visible via that 192.168.99 network), but could not access that port via the 10.42 network. Or rather, I could access the local port, but not the one on the other node.
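For concreteness, the shape of what I did was roughly this (host names, addresses and the port are placeholders; the real commands are in [1]):

  # two boot2docker VMs, both registered with the Rancher server as per [1]
  docker-machine create -d virtualbox host1
  docker-machine create -d virtualbox host2
  # then, from a shell inside a container on host1, try the containers
  # via their overlay (10.42.x.x) addresses
  ping 10.42.x.x            # the container on the same host responds
  ping 10.42.y.y            # the container on the other host does not
  # accessing the exposed port on 10.42.y.y fails the same way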

I’ve read the details here [2] on cross-host communication. In my setup, I don’t just have a rancher/agent-instance:v0.8.1 container; I have a rancher/agent:v1.0.1 container too, and I only see the iptables rules inside the second (rancher/agent) container, which differs from what is described on that page.
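For reference, this is roughly how I compared the two containers (the IDs are placeholders for whatever docker ps reports, and the exact command lines are from memory):

  docker ps | grep rancher/agent
  # NAT rules inside each agent container, to compare against [2]
  docker exec <agent-instance-id> iptables -t nat -S
  docker exec <agent-id> iptables -t nat -S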

I also see, in the logs of the rancher/agent-instance container, the same issue as is reported here [3], but I don’t know whether that is actually a problem, because the next command in the apply.sh script executes successfully.

I also see the same issue when deploying Rancher on EC2. Any ideas what I might be missing when it comes to getting the overlay network to actually work?

Many thanks,

Upayavira

[1] http://rancher.com/running-rancher-on-a-laptop/
[2] http://docs.rancher.com/rancher/latest/en/faqs/troubleshooting/
[3] https://github.com/rancher/rancher/issues/4546

@Upayavira:
Few things to check first:

  • What OS and kernel version are you running?
  • Are the UDP ports 500 and 4500 opened in the firewalls?
  • Did you launch the services on the hosts directly using docker run command or through GUI?
  • Can you check the ip addresses inside of the container of your service?
  • Can you check the ipsec/charon logs and see if there are any errors?

Regarding the agents: There are two agents running on the host. The “rancher/agent-instance” is the Network Agent, which gets deployed when a service is deployed. The “rancher/agent” is deployed when the host is added.
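For the OS and firewall checks, something along these lines on each host should show the basics (untested here, adjust to your setup):

  # OS and kernel version
  uname -a
  # any firewall rules that mention the IPsec ports (UDP 500/4500)?
  sudo iptables -S | grep -E '500|4500'
  # are both agent containers up?
  docker ps | grep rancher/agent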

  1. What OS version?
  • boot2docker v1.11.1
  2. UDP ports 500 and 4500 opened?
  • how would I know? I haven’t opened them specifically, I assume they will be opened automatically.
    I’ve seen rules listed in the iptables configs, but don’t know whether the rules I see are right.
  3. How did you launch the containers?
  • I launch services via the UI and via rancher-compose
  4. Can I check the IP addresses inside the container?
  • what do you mean here? Where should I check?
  5. Can you check the ipsec/charon logs?
  • /var/log/charon.log was interesting inside the agent-instance container. It showed that connections are successfully being made between the two hosts (via 192.168.99.100/101 IP addresses)
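For anyone following along, I assume the checks being asked for in points 4 and 5 are something like this (container IDs are placeholders):

  # IP addresses as seen from inside one of the service containers
  # (assumes the image ships the ip tool; the overlay 10.42.x.x address
  # should show up here alongside the docker-assigned one)
  docker exec <service-container-id> ip addr show eth0
  # the ipsec/charon logs live inside the Network Agent (agent-instance) container
  docker exec <network-agent-id> tail -n 50 /var/log/charon.log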

Any ideas what to try next?

I’ve NEVER seen inter-host networking work on ANY Rancher setup I’ve come across, and these have been set up by three independent people. It seems inter-host networking with the overlay network is just supposed to work, but that isn’t happening for me on my local system, nor on Rancher setups at two independent organisations.

What is going on, and how can I get it resolved? Without this, Rancher becomes a heap harder to use.

Does anyone have ideas how to work out why inter-host networking on the overlay (10.42) network is not responding?

@Upayavira: Sorry to hear that you are having a problem. I have been able to deploy catalog items across different hosts and have them work totally fine. We are aware of some of the cross-host communication issues and we are trying to root-cause them. Looking at the information available from the logs, we have not been able to pinpoint the exact reason, hence the delay. We are definitely working towards resolving this issue.

Okay, thx Leo. I’m happy to help debug this if I can, but it sounds like you have reproduced it already. Is there a ticket/bug/issue where you are tracking it? Thx again.

@Upayavira: No, I haven’t been able to reproduce the issue. Do you have any script that will help me reproduce the issue at my end? Or the exact steps? I really want to get to the bottom of this.

Here’s what I did:

I started with this blog post: http://rancher.com/running-rancher-on-a-laptop/

I ended up with Docker 1.10.2 and Rancher server v1.1.0-dev3. The network agent container is rancher/agent-instance:v0.8.1.

Having got a local Rancher VM up, and a single other node, I just added a second VM within VirtualBox on my Mac, using the docker-machine command listed on that page. Having connected that VM to Rancher, containers running on the other host could not access containers on the original docker host. I can give fuller details if you need more.

Or I can keep digging on my own system if that helps.

Thinking this through: the network agent can ping/access a container on the same host, so traffic is getting routed correctly from containers to the network agent and back. But it isn’t routing between hosts. The routing between hosts is handled by strongSwan, it seems. So, looking into /var/log/charon.log, if I try to ping the other side (10.42.1.89 on this occasion), I see these log lines appear:

Jun 9 22:32:43 10[KNL] creating acquire job for policy 10.42.12.175/32[udp/41159] === 10.42.1.89/32[udp/1025] with reqid {1234}
Jun 9 22:32:43 10[CFG] trap not found, unable to acquire reqid 1234

I did try asking Google what it made of the above error, but wasn’t able to make sense of it in the time I gave it, as all references I found to that error seemed at least 4 years old.
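For completeness, the test itself was nothing fancier than this (the network agent container ID is a placeholder):

  # from inside the Network Agent on one host, ping the overlay IP on the other
  docker exec -it <network-agent-id> ping -c 3 10.42.1.89
  # in a second shell, watch charon while the ping runs
  docker exec <network-agent-id> tail -f /var/log/charon.log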

@Upayavira:
I have an update regarding this issue.
According to https://wiki.strongswan.org/issues/183 we need the kernel to be compiled with CONFIG_INET_XFRM_MODE_TRANSPORT set. If you look at the boot2docker kernel config (/proc/config.gz), you will notice that this flag is not set (it is commented out). Hence strongSwan is not able to push the CHILD_SA even though the tunnel is established.

In short, cross host communication will not work using boot2docker image.

So to get rancher to work properly on your laptop, try using some other OS. I have tried installing Ubuntu (14.04.4 LTS) and it worked totally fine. For Ubuntu, the kernel config info is available at a different place (/boot/config-*).
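You can verify the flag yourself on either OS with something like this:

  # boot2docker exposes the running kernel’s config at /proc/config.gz
  zcat /proc/config.gz | grep INET_XFRM_MODE
  # on Ubuntu the config sits under /boot instead
  grep INET_XFRM_MODE /boot/config-$(uname -r)
  # for the IPsec overlay you want to see at least:
  #   CONFIG_INET_XFRM_MODE_TRANSPORT=y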

So please create different VirtualBox VMs using a different OS, start the Rancher server and agents, and test it out. I am sure you will see cross-host communication working.

Let me know how it goes.

Leo,

Thanks for this. I’ve been trying for a few days to find an easy way to create a new VirtualBox VM, which has turned out to be harder than expected. But I will persist and get this working. If I make a VM that is worth having, I’ll make an image public.

Reading this [1] made me understand how to build a boot2docker image.

I took the following steps:

  • cloned the boot2docker repo
  • switched to the v1.10.3 tag
  • edited the kernel_config to set CONFIG_INET_XFRM_MODE_TRANSPORT=y
  • built a new boot2docker.iso
  • started two VMs using this (host3/host4)
  • extended my Solr service to run on hosts 3/4
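From memory, the build boiled down to roughly this (see [1] for the authoritative steps; the last command is just how I pointed docker-machine at the custom ISO):

  git clone https://github.com/boot2docker/boot2docker.git
  cd boot2docker
  git checkout v1.10.3
  # edit kernel_config here: CONFIG_INET_XFRM_MODE_TRANSPORT=y
  docker build -t boot2docker .
  docker run --rm boot2docker > boot2docker.iso
  docker-machine create -d virtualbox \
      --virtualbox-boot2docker-url file://$PWD/boot2docker.iso host3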

When I attempted to ping the host4 network agent from the host3 network agent, I still got no response. Likewise, the Solr instances cannot see each other on their normal port 8983.

Any other ideas?

[1] https://github.com/boot2docker/boot2docker/blob/master/doc/BUILD.md

One last attempt. I read the post you referred to. I read carefully the configs that the poster had made, and noticed that he had CONFIG_INET_XFRM_MODE_TUNNEL=y also. I set that and tried it again, and YAY!!! It worked!! I’m now gonna switch back to latest master and try again. If that works, I’ll post a boot2docker.img file somewhere, and make a PR against boot2docker so that this can work in the future with less pain for others.
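If anyone wants to sanity-check their own rebuilt ISO, the quick test from inside the VM is something like:

  zcat /proc/config.gz | grep -E 'INET_XFRM_MODE_(TRANSPORT|TUNNEL)'
  # both of these should come back as =y:
  #   CONFIG_INET_XFRM_MODE_TRANSPORT=y
  #   CONFIG_INET_XFRM_MODE_TUNNEL=y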

Here’s a post about it: http://www.odoko.co.uk/local-rancher-with-overlay-networking/

It includes a reference to a boot2docker.iso in which overlay networking will behave correctly. Thanks for your help Leo!
