Rancher v1.0.0 cross host networking fails

Howdy,

I recently upgraded a working Rancher 0.63 install to v1.0.0 (GA). The first thing I noticed was that cross-host networking was broken: pings between network agents/containers on different hosts were failing. I decided to clean install both host servers (coreos-1 and coreos-2, with the latest versions) and also the Rancher server instance. However, this still fails after a clean install. I can ping containers on the same host, but pings to containers on different hosts fail.

In the host process list I noticed a bunch of spawned /etc/init.d/rancher-net start processes, which seem to be growing in number. I shelled into the rancher-agent container and took a look around. You can see a loop happening; here is the full dump:

Probing around the init script for rancher-net, I tried running some of these ip xfrm commands by hand, and they fail with:

$ ip xfrm state add src 1.1.1.1 dst 1.1.1.1 spi 42 proto esp mode tunnel aead "rfc4106(gcm(aes))" 0x0000000000000000000000000000000000000001 128 sel src 1.1.1.1 dst 1.1.1.1
RTNETLINK answers: Function not implemented
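
One possible explanation for "Function not implemented" on an ip xfrm state add ... aead call is that the kernel cannot find the requested AEAD algorithm. Below is a minimal sketch for checking that, assuming the kernel lists its registered algorithms in /proc/crypto. Algorithms provided by modules may only appear there once loaded, so a miss is a hint rather than proof. The helper name has_crypto_alg is hypothetical and not part of Rancher.

```shell
#!/bin/sh
# Hypothetical helper (not part of Rancher): succeed if an algorithm name
# is registered in a /proc/crypto-style file.
#   $1 = algorithm name, e.g. 'rfc4106(gcm(aes))'
#   $2 = file to read (defaults to /proc/crypto)
has_crypto_alg() {
    awk -F ' *: *' -v alg="$1" '
        $1 == "name" && $2 == alg { found = 1 }
        END { exit !found }
    ' "${2:-/proc/crypto}"
}

# Example call on a host (commented out so the block only defines the helper):
# has_crypto_alg 'rfc4106(gcm(aes))' && echo "AES-GCM listed" || echo "AES-GCM not listed"
```

Running the commented example on each host would show whether the algorithm named in the failing command is registered there.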

Iptables dump:
coreos-1 : http://pastebin.com/raw/0yDsp1Tt
coreos-2 : http://pastebin.com/raw/RchqBwg9

Thanks

Adding docker logs for one of the network-agent containers

http://pastebin.com/Qs4jwu8c

You can see some interesting errors

3/30/2016 11:02:21 AM INFO: Getting iptables
3/30/2016 11:02:22 AM INFO: Updating iptables
3/30/2016 11:02:22 AM INFO: Downloading http://rancher.[redacted]:8080/v1//configcontent//iptables current=
3/30/2016 11:02:22 AM INFO: Running /var/lib/cattle/download/iptables/iptables-3-bf64a7ed197a703a43d7e7f0579eec66fdcb1530ad911e47640f6a12688d7853/apply.sh
3/30/2016 11:02:22 AM SIOCSARP: Invalid argument
3/30/2016 11:02:22 AM arp: cannot set entry on line 3 of etherfile content-home/etc/cattle/ethers !

On the hosts that don’t have cross host networking, can you log into the Network agent and share the logs?

/var/log/rancher-net.log
/var/log/charon.log

I have noticed something like this, with hosts dropping out of the Environment; when I removed the Docker 1.10 hosts, the problem went away.

I am using CoreOS under VMware; if you have the update set to ‘alpha’, it will auto-update at the next boot and you get Docker 1.10.3. I have made sure I am using the Stable branch only now.

However, since moving up to 1.0, I have had trouble with Load Balancers not starting up and hanging in the Initialising state. This may be connected.

Logs for

/var/log/rancher-net.log
/var/log/charon.log

@josh We will be spending some time in the next couple of days looking at the networking issues with v1.0.0 and CoreOS. You can follow this GitHub issue reported by @sshipway:

https://github.com/rancher/rancher/issues/4238

I think I am seeing the same issue, but not with CoreOS.

Out of the box with the latest release of server v1.0.0 and agent-instance v0.8.1, networking is not working for me. I cannot ping anything, including popular domain names like google.com, and the internal domain name (by service/container names).

I am not sure if this is an issue with the server instance or the agent-instance, but it has left me paralyzed because I can no longer add new machines to Rancher and have working networking. Existing machines with agent-instance:0.8.0 are still working normally.

@nlhkh Your issue was resolved by the fact that we don’t support IPv6, correct?

Hi,

Same problem here with CoreOS stable and Docker 1.9.1, which always worked with Rancher v0.59.1.

I tried both upgrading and a new installation of Rancher 1.0, and cross-container links stopped working. I cannot ping linked containers on another machine, nor can I ping anything outside the CoreOS machines. Ping and DNS are not working.

I reverted to the working environment.

On a side note, what distributions does the Rancher team QA against? I’m playing around with RancherOS (which solved the issue) as a possible CoreOS alternative.

We typically test against Ubuntu 14.04, Ubuntu 15.04, RHEL and CentOS 7, but we always use the latest Docker version (Docker 1.10.3).

When I tested with v1.0.0 and CoreOS Alpha (Docker 1.10.3), I did not see any networking issues, but I did see networking issues using CoreOS Beta/Stable which use Docker 1.9.1.

If you are having issues with networking and CoreOS, we recommend using the CoreOS version that has the latest Docker version (currently, that’s CoreOS Alpha).
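
A minimal sketch for checking whether a host's Docker version meets the 1.10.3 recommendation. It assumes a sort that supports -V (version sort), as GNU coreutils does; version_ge is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: succeed if version $1 is greater than or equal to $2.
# Relies on sort -V (version sort), which is assumed to be available.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example call on a host (commented out so the block only defines the helper):
# version_ge "$(docker version --format '{{.Server.Version}}')" 1.10.3 \
#     && echo "Docker is 1.10.3 or newer" || echo "Docker is older than 1.10.3"
```

The comparison only looks at the version strings, so it says nothing about whether networking actually works on that host.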

I think I’m experiencing a similar problem to what’s been posted above. However, I’m running Oracle EL 7.2, Docker 1.10.3, Rancher 1.0.0.

I’ve managed to create a Kubernetes environment and add two hosts. The hosts have been active for several hours (I’m not seeing the issue where the hosts get stuck reconnecting). I’ve deployed some Selenium containers and a Selenium hub. The Selenium containers that are running on the same node as the hub can register with the hub successfully. The Selenium containers that are running on a different node fail to register. I’m seeing similar log messages to what was posted by olds643, including “3/30/2016 11:02:22 AM arp: cannot set entry on line 3 of etherfile content-home/etc/cattle/ethers !”.

I have gotten a similar configuration to work in my environment using the CentOS scripts provided by Kubernetes. Those scripts pass some arguments to the Docker daemon, like --ip-masq=false and --selinux-enabled=false, and also provide the bridge IP. I’m not sure if those apply with Rancher; I just followed the getting started tutorial.

Any help is appreciated!

EDIT 1:
I’m following the FAQ posted here. The IPs are correctly reported for the hosts. I’ve deployed an Ubuntu container on each node, and pinging between the nodes has about a 50% success rate. The network agent is up and running on each node. The iptables CATTLE_PREROUTING chain doesn’t look anything like the FAQ example.

Chain CATTLE_PREROUTING (1 references)
num  target  prot  opt  source         destination
1    MARK    all   --   !10.42.0.0/16  169.254.169.250  MAC 02:19:03:94:F9:97 MARK set 0x9097
2    MARK    all   --   !10.42.0.0/16  169.254.169.250  MAC 02:19:03:2F:B6:08 MARK set 0x4303
3    MARK    all   --   !10.42.0.0/16  169.254.169.250  MAC 02:19:03:B5:FE:AA MARK set 0xf27e
4    MARK    all   --   !10.42.0.0/16  169.254.169.250  MAC 02:19:03:FF:E5:21 MARK set 0x386e9

IP forwarding is set to 1 on all nodes.

Exactly the same set of issues (broken network connectivity between containers within 10.42.x.x) with Rancher and CoreOS on Azure cloud:

"CATTLE_RANCHER_SERVER_IMAGE=v1.0.1"
"CATTLE_RANCHER_COMPOSE_VERSION=v0.7.4"
"CATTLE_CATTLE_VERSION=v0.159.7"

CoreOS stable 899.15.0

Docker version 1.9.1

CPU usage is high, I guess because of a large number of spawned processes:

root 65418 0.0 0.1 18396 2084 ? S 07:41 0:02 /bin/bash /etc/init.d/rancher-net start

and

root 54155 0.0 0.0 6568 724 ? R 10:22 0:00 ip xfrm state add src 1.1.1.1 dst 1.1.1.1 spi
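
To watch whether the number of spawned rancher-net processes keeps growing, here is a small sketch that counts them in ps output read from standard input. count_rancher_net is a hypothetical helper name; the [/] in the pattern keeps grep from matching its own command line when the input comes from ps.

```shell
#!/bin/sh
# Hypothetical helper: count "/etc/init.d/rancher-net start" lines in ps
# output supplied on standard input.
count_rancher_net() {
    # grep -c prints the count; "|| true" hides its non-zero exit on zero matches.
    grep -c '[/]etc/init.d/rancher-net start' || true
}

# Example call on a host (commented out so the block only defines the helper):
# ps aux | count_rancher_net
```

Running the commented example a few minutes apart would show whether the count is still climbing.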

At this stage we are aiming to roll back to the previous release. Is there any potential issue with this?

Same problem here, identical environment CoreOS 899.15.0 + Rancher 1.0.

Any workaround available?

@xian The workaround we suggest is to use a CoreOS version that has Docker v1.10.3, which is currently CoreOS Alpha (1010.1.0).

Ok, thanks, will give it a try.

Any insight into what might be causing this issue?