Rancher v1.0.0 cross host networking fails

olds463 · March 30, 2016, 3:22pm

Howdy,

I recently upgraded a working Rancher 0.63 install to v1.0.0 (GA). The first thing I noticed was that cross-host networking was broken: pings between network agents/containers on different hosts were failing. I did a clean install of both host servers (coreos-1 and coreos-2, latest versions) and of the Rancher server instance, but it still fails after the clean install. I can ping containers on the same host, but pings across hosts fail.

In the host process list I noticed a bunch of spawned /etc/init.d/rancher-net start processes, and their number seems to keep growing. I shelled into the rancher-agent container and took a look around. You can see a loop happening; here is the full dump:

http://pastebin.com/raw/gvpmm2vX

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0   4932  1632 ?        Ss   15:02   0:00 init  
root       328  0.0  0.2 120668  8996 ?        Sl   15:02   0:00 /var/lib/cattle/bin/rancher-metadata -log /var/log/rancher-metadata.log -answers /var/lib/cattle/etc/cattle/
root       400  0.0  0.2 202996  9484 ?        Sl   15:02   0:00 /var/lib/cattle/bin/rancher-dns -log /var/log/rancher-dns.log -answers /var/lib/cattle/etc/cattle/dns/answer
root       781  0.2  0.1 104652  4056 ?        Ssl  15:02   0:02 /usr/bin/monit -Ic /etc/monit/monitrc
root       820  0.5  0.0  18324  3240 ?        S    15:02   0:04 /bin/bash /etc/init.d/rancher-net start
root       865  3.5  0.0  18144  3048 ?        Ss   15:17   0:00 bash
root       900  0.0  0.0   4348   696 ?        S    15:17   0:00 sleep 1
root       902  0.0  0.0   4348   648 ?        S    15:17   0:00 sleep 1
root       903  0.0  0.0   4348   672 ?        S    15:17   0:00 sleep 1

(Paste truncated here; the full output is at the Pastebin link above.)
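To put a number on the growth, the `ps aux` output can be tallied by command line. Below is a minimal Python sketch, assuming the output has been captured as text; `count_commands` is a hypothetical helper name, not something shipped with Rancher, and the sample rows are copied from the dump above.

```python
from collections import Counter


def count_commands(ps_output: str) -> Counter:
    """Tally processes by full command line from `ps aux` text."""
    counts = Counter()
    lines = ps_output.strip().splitlines()
    for line in lines[1:]:  # skip the USER/PID/... header row
        # `ps aux` prints 10 fixed columns; the 11th field is the command.
        fields = line.split(None, 10)
        if len(fields) == 11:
            counts[fields[10].strip()] += 1
    return counts


sample = """\
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root       820  0.5  0.0  18324  3240 ?        S    15:02   0:04 /bin/bash /etc/init.d/rancher-net start
root       900  0.0  0.0   4348   696 ?        S    15:17   0:00 sleep 1
root       902  0.0  0.0   4348   648 ?        S    15:17   0:00 sleep 1
"""

# Print the most frequent command lines first.
for command, n in count_commands(sample).most_common():
    print(n, command)
```

Running this against dumps taken a few minutes apart and comparing the counts would show whether the rancher-net and sleep processes really keep accumulating.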

Probing around the init script for rancher-net, I tried running some of its ip xfrm commands by hand. They fail with:

$ ip xfrm state add src 1.1.1.1 dst 1.1.1.1 spi 42 proto esp mode tunnel aead "rfc4106(gcm(aes))" 0x0000000000000000000000000000000000000001 128 sel src 1.1.1.1 dst 1.1.1.1
RTNETLINK answers: Function not implemented

Iptables dump:
coreos-1 : http://pastebin.com/raw/0yDsp1Tt
coreos-2 : http://pastebin.com/raw/RchqBwg9

Thanks

olds463 · March 30, 2016, 3:45pm

Adding Docker logs for one of the network-agent containers.

You can see some interesting errors:

3/30/2016 11:02:21 AM INFO: Getting iptables
3/30/2016 11:02:22 AM INFO: Updating iptables
3/30/2016 11:02:22 AM INFO: Downloading http://rancher.[redacted]:8080/v1//configcontent//iptables current=
3/30/2016 11:02:22 AM INFO: Running /var/lib/cattle/download/iptables/iptables-3-bf64a7ed197a703a43d7e7f0579eec66fdcb1530ad911e47640f6a12688d7853/apply.sh
3/30/2016 11:02:22 AM SIOCSARP: Invalid argument
3/30/2016 11:02:22 AM arp: cannot set entry on line 3 of etherfile content-home/etc/cattle/ethers !

denise · March 30, 2016, 9:51pm

On the hosts that don’t have cross host networking, can you log into the Network agent and share the logs?

/var/log/rancher-net.log
/var/log/charon.log

sshipway · March 31, 2016, 3:26am

I have noticed something like this, and hosts were dropping out of the Environment. When I removed the Docker 1.10 hosts, the problem went away.

I am using CoreOS under VMware. If you have the update channel set to ‘alpha’, it will auto-update at the next boot and you get Docker 1.10.3. I have made sure I am only using the Stable channel now.

However, since moving up to 1.0, I have had trouble with Load Balancers not starting up and hanging in the Initialising state. This may be connected.

olds463 · March 31, 2016, 12:41pm

http://pastebin.com/raw/9zA0vkFG

Logs for

/var/log/rancher-net.log
/var/log/charon.log

denise · March 31, 2016, 6:39pm

@josh We will be spending some time in the next couple of days looking at the networking issues in v1.0.0 and CoreOS. You can follow this GitHub issue reported by @sshipway:

https://github.com/rancher/rancher/issues/4238

nlhkh · April 2, 2016, 4:21am

I think I am seeing the same issue, not with CoreOS.

Out of the box with the latest release of server v1.0.0 and agent-instance v0.8.1, networking is not working for me. I cannot ping anything, including popular domain names like google.com and internal domain names (service/container names).

I am not sure if this is an issue with the server or with agent-instance, but it leaves me stuck, because I can no longer add new machines to Rancher and have working networking. Existing machines with agent-instance:0.8.0 are still working normally.

denise · April 4, 2016, 6:18pm

@nlhkh Your issue has been resolved, and it came down to the fact that we don’t support IPv6, correct?

Carlos_Silva · April 6, 2016, 4:43pm

Hi,

Same problem here with CoreOS stable and Docker 1.9.1, which always worked with Rancher v0.59.1.

I tried both an upgrade and a new installation of Rancher 1.0, and cross-container links stopped working. I cannot ping linked containers on another machine, nor ping anything outside the CoreOS machines. Ping and DNS are not working.

I have reverted to the working environment.

olds463 · April 6, 2016, 5:19pm

On a side note, which distributions does the Rancher team QA against? I’m playing around with RancherOS (which solved the issue) as a possible CoreOS alternative.

denise · April 6, 2016, 5:34pm

We typically test against Ubuntu 14.04, Ubuntu 15.04, RHEL and CentOS 7, but we always use the latest Docker version (Docker 1.10.3).

When I tested with v1.0.0 and CoreOS Alpha (Docker 1.10.3), I did not see any networking issues, but I did see networking issues using CoreOS Beta/Stable which use Docker 1.9.1.

denise · April 6, 2016, 5:35pm

If you are having issues with networking and CoreOS, we recommend using the CoreOS version that has the latest Docker version (currently, that’s CoreOS Alpha).
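The pattern reported in this thread (networking issues on Docker 1.9.1, none on 1.10.3) can be checked per host before adding it to Rancher. Below is a minimal Python sketch; the threshold comes only from the reports in this thread, not from an official compatibility matrix, and `parse_version` and `meets_minimum` are hypothetical helper names.

```python
import re

# Assumption: threshold taken from the reports in this thread
# (1.10.3 worked, 1.9.1 showed cross-host networking issues).
MIN_DOCKER = (1, 10, 3)


def parse_version(text: str) -> tuple:
    """Extract (major, minor, patch) from a string like 'Docker version 1.9.1'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version found in {text!r}")
    return tuple(int(part) for part in match.groups())


def meets_minimum(text: str, minimum: tuple = MIN_DOCKER) -> bool:
    """Compare numerically so that 1.10.3 sorts above 1.9.1."""
    return parse_version(text) >= minimum


print(meets_minimum("Docker version 1.9.1"))   # False
print(meets_minimum("Docker version 1.10.3"))  # True
```

The versions are compared as integer tuples because a plain string comparison would rank "1.9.1" above "1.10.3".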

5f6b3fb8 · April 12, 2016, 10:22pm

I think I’m experiencing a similar problem to what’s been posted above. However, I’m running Oracle EL 7.2, Docker 1.10.3, Rancher 1.0.0.

I’ve managed to create a Kubernetes environment and add two hosts. The hosts have been active for several hours (I’m not seeing the issue where the hosts get stuck reconnecting). I’ve deployed some Selenium containers and a Selenium hub. The Selenium containers running on the same node as the hub can register with the hub successfully. The Selenium containers running on a different node fail to register. I’m seeing log messages similar to those posted by olds463, including “3/30/2016 11:02:22 AM arp: cannot set entry on line 3 of etherfile content-home/etc/cattle/ethers !”.

I have gotten a similar configuration to work in my environment using the CentOS scripts provided by Kubernetes. Those scripts pass some arguments to the Docker daemon, such as --ip-masq=false and --selinux-enabled=false, and also provide the bridge IP. I’m not sure if those apply with Rancher; I just followed the getting started tutorial.

Any help is appreciated!

EDIT 1:
I’m following the FAQ posted here. The IPs are correctly reported for the hosts. I’ve deployed an Ubuntu container on each node, and pinging between the nodes has about a 50% success rate. The network agent is up and running on each node. The iptables CATTLE_PREROUTING chain doesn’t look anything like the FAQ example.

Chain CATTLE_PREROUTING (1 references)
num  target  prot  opt  source         destination
1    MARK    all   --   !10.42.0.0/16  169.254.169.250  MAC 02:19:03:94:F9:97 MARK set 0x9097
2    MARK    all   --   !10.42.0.0/16  169.254.169.250  MAC 02:19:03:2F:B6:08 MARK set 0x4303
3    MARK    all   --   !10.42.0.0/16  169.254.169.250  MAC 02:19:03:B5:FE:AA MARK set 0xf27e
4    MARK    all   --   !10.42.0.0/16  169.254.169.250  MAC 02:19:03:FF:E5:21 MARK set 0x386e9

IPforward is 1 on all nodes

drbolsen · April 15, 2016, 12:24am

Exactly the same set of issues (broken network connectivity between containers within 10.42.x.x) with Rancher and CoreOS on Azure cloud:

"CATTLE_RANCHER_SERVER_IMAGE=v1.0.1"
"CATTLE_RANCHER_COMPOSE_VERSION=v0.7.4"
"CATTLE_CATTLE_VERSION=v0.159.7"

CoreOS stable 899.15.0

Docker version 1.9.1

High CPU usage, I guess because of the large number of spawned processes:

root 65418 0.0 0.1 18396 2084 ? S 07:41 0:02 /bin/bash /etc/init.d/rancher-net start

and

root 54155 0.0 0.0 6568 724 ? R 10:22 0:00 ip xfrm state add src 1.1.1.1 dst 1.1.1.1 spi

At this stage we are aiming to roll back to the previous release. Is there any potential issue with this?

xian · April 18, 2016, 4:54pm

Same problem here, identical environment CoreOS 899.15.0 + Rancher 1.0.

Any workaround available?

denise · April 19, 2016, 4:12am

@xian The workaround we suggest is to use a CoreOS version that has Docker v1.10.3, which is currently CoreOS Alpha (1010.1.0).

xian · April 19, 2016, 9:09am

Ok, thanks, will give it a try.

drbolsen · April 21, 2016, 11:30pm

Any insight into what might be causing this issue?
