IPSec network fails silently on a host

Since we upgraded to Rancher 1.4, the IPSec network has been failing silently on random hosts at random intervals. As soon as we restart the IPSec stack the network connection comes back, but this worries us quite a bit.

Is this a known issue?


Can you provide a bit more info?

  • How many machines, where are they located, what OS, and what Docker version?
  • Logs from the network-related containers could be handy
  • An estimate of the “random intervals”, so we can try to reproduce it

Hi @svansteenis,

We are currently running 3 environments with approx. 10 machines each. Yesterday and today the IPSec connection dropped without the container crashing. I should have checked the logs on those machines but unfortunately didn’t, and the machines have been cycled by now.

We noticed that IPSec was down because we could neither ping nor otherwise connect to any 10.42.* address on that host from another host within the same network, while the other hosts were reachable.
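For reference, the check boils down to something like this (a sketch; `overlay_up` is a hypothetical helper and the 10.42.* address is a placeholder, not an IP from our setup):

```shell
# Sketch: probe a container's overlay (10.42.*) IP from another host.
# Prints "up" when the peer answers, "down" otherwise.
overlay_up() {
  if ping -c 2 -W 3 "$1" > /dev/null 2>&1; then
    echo up
  else
    echo down
  fi
}

# Example (placeholder IP): overlay_up 10.42.107.5
```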

The hosts are divided over three AWS VPCs with peering connections between each other and with the Rancher servers, which are in yet another VPC. Each VPC peering opens UDP ports 500 and 4500 between the VPCs and port 22 from the Rancher servers to the hosts.

We currently run Rancher in an HA setup with three hosts. All hosts are spread across eu-central-1a and eu-central-1b.

The servers run RancherOS 0.6 with Docker 1.12.3; the hosts run Ubuntu 14.04 with Docker 1.12.3 (no RancherOS because we had recurring I/O-wait issues with the overlay storage driver).

This morning we had the same issue again and I was able to grab some logs from the IPSec router. Based on this log, @svansteenis pointed me to a currently open issue: https://github.com/rancher/rancher/issues/7571, which might be related to or the cause of our troubles.

This might be a job for @leodotcloud.

I will debug this further.

Hi Leo,

If you need any extra information (extra logs, extra debug settings, etc.), just ask. We still have regular problems, and whenever one of our stacks has a problem, the first thing we do now is restart IPsec :confused:

Some extra feedback:

Due to the problems with IPSec, we removed our RabbitMQ cluster and our Event Store cluster from the IPSec network, disabled their hosts in Rancher, and disabled all Rancher services on those machines. Since then we haven’t had any crashes of IPSec or the Rancher environment, and our clusters are running smoothly again.

A small theory: is it possible that constant traffic (cluster chatter) over the IPSec network is causing the problems?

Hi, I have the same problem. Today communication between two EC2 servers went down.
Some feedback:

  • swanctl --list-sas: the connection between the servers had no CHILD_SA and was stuck in the “CONNECTING” state

  • swanctl --log shows a lot of information, but I could see a “delete job” that could not delete a CHILD_SA because it was not found

  • In the other direction: on the other server there is no connection at all, although swanctl --list-conns shows that a connection should exist

  • On one server there were even three connections in the “CONNECTING” state with another server
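When this happens again, a snapshot along these lines may help attach evidence to a bug report (a sketch; it assumes strongSwan’s swanctl is available inside the ipsec container, and `snapshot` is a hypothetical helper name):

```shell
# Sketch: capture strongSwan/xfrm state for a bug report. Run inside the
# ipsec container; each command's output (or its error) gets a header line.
snapshot() {
  for c in "swanctl --list-sas" "swanctl --list-conns" "ip -s xfrm state"; do
    echo "### $c"
    $c 2>&1 || true   # keep going even if a command is missing or fails
  done
}

# Usage: snapshot > ipsec-state.txt
```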

This happens to me:

  • between Kafka servers after 4–7 days (cluster chatter?)
  • between servers that Prometheus fetches data from

After a restart of the ipsec container, the connections are fine again for a few days.


We are trying our best to get to the bottom of this. For IPSec connectivity we use strongSwan, and there seems to be some kind of race condition during rekeying (which happens every few hours for security reasons). If the two ends of a tunnel initiate the process with an offset, duplicate SAs can result, and strongSwan is unable to detect this.

Related mail thread: https://lists.strongswan.org/pipermail/users/2012-October/003765.html
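A rough way to spot the duplicate SAs from the shell (a sketch; `dup_ike_sas` is a hypothetical helper, and the awk pattern assumes swanctl’s usual `--list-sas` layout, where IKE_SA lines are unindented and CHILD_SA lines are indented):

```shell
# Sketch: count IKE_SAs per connection name in `swanctl --list-sas` output
# (read from stdin); print any name that appears more than once.
dup_ike_sas() {
  awk -F: '/^[^ ].*: #/ { n[$1]++ } END { for (c in n) if (n[c] > 1) print c }'
}

# On a live host: swanctl --list-sas | dup_ike_sas
```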


@mvriel @salvax86 @JerryVerhoef if you have the setup in broken state (or next time if it happens), please find me on https://slack.rancher.io, I would like to collect some logs/info from the setup.

Also, if you are using CentOS/RHEL, please share the steps you used to set up the host (Docker installation, storage, etc.).

Curious whether anyone else has run into this since the last post, or whether it has been resolved in newer versions of Rancher? We are on Rancher v1.4.1 with Ubuntu 14.04.4 LTS hosts and Docker 1.12.3, in our own data center. We started hitting this last month when we switched all of our production traffic over to our Rancher cluster. Our production environment has 12 hosts. There appears to be a strong correlation with traffic.

When this does happen, we restart the ipsec service on the affected host. We have set up monitors that test outgoing traffic from each node at regular intervals, so we get alerted immediately and can restart the ipsec service right away. Not ideal on weekends, of course.
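The monitor is essentially this shape (a sketch, not the exact setup; `probe` and `watchdog` are hypothetical names, and the commented-out restart is a placeholder for however you restart the ipsec service in your environment):

```shell
# Sketch of a monitor-and-restart check. probe is a placeholder health
# check against a peer's overlay IP; replace the comment below with your
# real restart action (e.g. via the Rancher API or CLI).
probe() {
  ping -c 2 -W 3 "$1" > /dev/null 2>&1
}

watchdog() {  # $1 = peer overlay IP to test
  if probe "$1"; then
    echo "overlay ok"
  else
    echo "overlay down: restarting ipsec"
    # rancher restart <ipsec service>   # placeholder restart action
  fi
}
```

Run it from cron or your alerting pipeline against one overlay IP per host.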

I’d be happy to provide logs or get on Slack the next time it happens.

@kimgust We had similar problems with Ubuntu 14.04.4 LTS. We switched to Ubuntu 16.04 LTS about a month ago and the ipsec issues haven’t reappeared yet.

Thanks @ryanwalls. What version of Docker and Rancher are you running?

@kimgust we are running Rancher 1.5.5 and Docker 1.12.6

@kimgust We have switched to VXLAN. No issues with cross-host communication in two months.

Fwiw, upgrading to Rancher 1.6.0 from 1.4.1 hasn’t made a difference.

@leodotcloud Same here with Rancher 1.6.2, Docker 1.12.6, and Ubuntu 16.04 on 10 servers: after a few days some hosts lose network communication with the others… restarting ipsec brings the whole network up again.

Any suggestions?


@ppiccolo and others, can you please check if you are hitting this: https://github.com/rancher/rancher/issues/9377

docker exec -it $(docker ps | grep ipsec-router | awk '{print $1}') bash
cat /proc/net/xfrm_stat

and check for XfrmInStateSeqError.
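To pull out just that counter (a sketch; `seq_err` is a hypothetical helper, and the `name=ipsec-router` filter is the same assumption as in the command above):

```shell
# Sketch: print only the XfrmInStateSeqError value from xfrm_stat text
# given on stdin; a value above 0 points at issue #9377.
seq_err() {
  awk '$1 == "XfrmInStateSeqError" { print $2 }'
}

# On a live host:
#   docker exec "$(docker ps -qf name=ipsec-router)" cat /proc/net/xfrm_stat | seq_err
```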

Does this apply to Rancher 1.1.4? We’ve been hitting this issue frequently on AWS and have been restarting the Network Agent as suggested by others.

I currently have hosts in this state that we’re leaving broken for troubleshooting. I checked with that command, and XfrmInStateSeqError has a value of 27. I checked several unaffected hosts and they all have a value of 0.