IPSec network fails silently on a host

Since we upgraded to Rancher 1.4, the IPSec network has been failing silently on random hosts at random intervals. As soon as we restart the IPSec stack the network connection comes back, but this worries us quite a bit.

Is this a known issue?


Can you provide a bit more info?

  • How many machines, where are they located, what OS, and what Docker version?
  • Logs from the network-related containers could be handy
  • An estimate of the “random intervals”, so we can try to reproduce it

Hi @svansteenis,

We are currently running 3 environments with approx. 10 machines each. Yesterday and today the IPSec connection dropped without the container crashing. I should have checked the logs on those machines but unfortunately didn’t, and the machines have been cycled by now.

We noticed that IPSec was down because we could neither ping nor otherwise connect to any 10.42.* address on that host from another host within the same network, while the other hosts were reachable.
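For reference, the check boils down to something like this (a sketch; `overlay_up` is a hypothetical helper and the 10.42.* address is a placeholder, not an IP from our setup):

```shell
# Sketch: probe a container's overlay (10.42.*) IP from another host.
# Prints "up" when the peer answers, "down" otherwise.
overlay_up() {
  if ping -c 2 -W 3 "$1" > /dev/null 2>&1; then
    echo up
  else
    echo down
  fi
}

# Example (placeholder IP): overlay_up 10.42.107.5
```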

The hosts are divided over three AWS VPCs with peering connections between each other and with the Rancher servers, which are in yet another VPC. Each VPC peering opens UDP ports 500 and 4500 between the VPCs and port 22 from the Rancher servers to the hosts.

We currently run Rancher in an HA setup with three hosts. All hosts are spread across eu-central-1a and eu-central-1b.

The servers run RancherOS 0.6 with Docker 1.12.3; the hosts run Ubuntu 14.04 with Docker 1.12.3 (no RancherOS because we had recurring I/O-wait issues with the overlay storage driver).

This morning we had the same issue again and I was able to grab some logs from the IPSec router. Based on this log, @svansteenis pointed me to a currently open issue: https://github.com/rancher/rancher/issues/7571, which might be related to or the cause of our troubles.

This might be a job for @leodotcloud.

I will debug this further.

Hi Leo,

If you need any extra information (extra logs, extra debug settings, etc.), just ask. We still have regular problems, and whenever one of our stacks has a problem, the first thing we do now is restart IPsec :confused:

Some extra feedback:

Due to the problems with IPSec, we removed our RabbitMQ cluster and our Event Store cluster from the IPSec network, disabled their hosts in Rancher, and disabled all Rancher services on those machines. Since then we haven’t had any crashes of IPSec or the Rancher environment, and our clusters are running smoothly again.

A small theory: is it possible that constant traffic (cluster chatter) over the IPSec network is causing the problems?

Hi, I have the same problem. Today communication between two EC2 servers went down.
Some feedback:

  • swanctl --list-sas: the connection between the servers had no CHILD_SA and was stuck in the “CONNECTING” state

  • swanctl --log shows a lot of information, but I could see a “delete job” that could not delete a CHILD_SA because it was not found

  • In the other direction: on the other server there is no connection at all, although swanctl --list-conns shows that a connection should exist

  • On one server there were even three connections in the “CONNECTING” state with another server
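When this happens again, a snapshot along these lines may help attach evidence to a bug report (a sketch; it assumes strongSwan’s swanctl is available inside the ipsec container, and `snapshot` is a hypothetical helper name):

```shell
# Sketch: capture strongSwan/xfrm state for a bug report. Run inside the
# ipsec container; each command's output (or its error) gets a header line.
snapshot() {
  for c in "swanctl --list-sas" "swanctl --list-conns" "ip -s xfrm state"; do
    echo "### $c"
    $c 2>&1 || true   # keep going even if a command is missing or fails
  done
}

# Usage: snapshot > ipsec-state.txt
```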

This happens to me:

  • between Kafka servers after 4–7 days (cluster chatter?)
  • between servers that Prometheus fetches data from

After a restart of the ipsec container, the connections are fine again for a few days.


We are trying our best to get to the bottom of this. For IPSec connectivity we use strongSwan, and there seems to be some kind of race condition during rekeying (which happens every few hours for security reasons). If the two ends of a tunnel initiate the process with an offset, duplicate SAs can result, and strongSwan is unable to detect this.

Related mail thread: https://lists.strongswan.org/pipermail/users/2012-October/003765.html
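A rough way to spot the duplicate SAs from the shell (a sketch; `dup_ike_sas` is a hypothetical helper, and the awk pattern assumes swanctl’s usual `--list-sas` layout, where IKE_SA lines are unindented and CHILD_SA lines are indented):

```shell
# Sketch: count IKE_SAs per connection name in `swanctl --list-sas` output
# (read from stdin); print any name that appears more than once.
dup_ike_sas() {
  awk -F: '/^[^ ].*: #/ { n[$1]++ } END { for (c in n) if (n[c] > 1) print c }'
}

# On a live host: swanctl --list-sas | dup_ike_sas
```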


@mvriel @salvax86 @JerryVerhoef if you have the setup in broken state (or next time if it happens), please find me on https://slack.rancher.io, I would like to collect some logs/info from the setup.

Also, if you are using CentOS/RHEL, please share the steps you used to set up the host (Docker installation, storage, etc.).

Curious whether anyone else has run into this since the last post, or whether it has been resolved in newer versions of Rancher? We are on Rancher v1.4.1 with Ubuntu 14.04.4 LTS hosts and Docker 1.12.3, in our own data center. We started hitting this last month when we switched all of our production traffic over to our Rancher cluster. Our production environment has 12 hosts. There appears to be a strong correlation with traffic.

When this does happen, we restart the ipsec service on the affected host. We have set up monitors that test outgoing traffic from each node at regular intervals, so we get alerted immediately and can restart the ipsec service right away. Not ideal on weekends, of course.
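The monitor is essentially this shape (a sketch, not the exact setup; `probe` and `watchdog` are hypothetical names, and the commented-out restart is a placeholder for however you restart the ipsec service in your environment):

```shell
# Sketch of a monitor-and-restart check. probe is a placeholder health
# check against a peer's overlay IP; replace the comment below with your
# real restart action (e.g. via the Rancher API or CLI).
probe() {
  ping -c 2 -W 3 "$1" > /dev/null 2>&1
}

watchdog() {  # $1 = peer overlay IP to test
  if probe "$1"; then
    echo "overlay ok"
  else
    echo "overlay down: restarting ipsec"
    # rancher restart <ipsec service>   # placeholder restart action
  fi
}
```

Run it from cron or your alerting pipeline against one overlay IP per host.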

I’d be happy to provide logs or get on Slack the next time it happens.

@kimgust We had similar problems with Ubuntu 14.04.4 LTS. We switched to Ubuntu 16.04 LTS about a month ago and the ipsec issues haven’t reappeared yet.

Thanks @ryanwalls. What version of Docker and Rancher are you running?

@kimgust we are running Rancher 1.5.5 and Docker 1.12.6

@kimgust We have switched to VXLAN. No issues with cross-host communication in two months.

Fwiw, upgrading to Rancher 1.6.0 from 1.4.1 hasn’t made a difference.

@leodotcloud Same here with Rancher 1.6.2, Docker 1.12.6, and Ubuntu 16.04 on 10 servers: after a few days some hosts lose network communication with the others… restarting ipsec brings the whole network up again.

Any suggestions?


@ppiccolo and others, can you please check if you are hitting this: https://github.com/rancher/rancher/issues/9377

docker exec -it $(docker ps | grep ipsec-router | awk '{print $1}') bash
cat /proc/net/xfrm_stat

and check for XfrmInStateSeqError.
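To pull out just that counter (a sketch; `seq_err` is a hypothetical helper, and the `name=ipsec-router` filter is the same assumption as in the command above):

```shell
# Sketch: print only the XfrmInStateSeqError value from xfrm_stat text
# given on stdin; a value above 0 points at issue #9377.
seq_err() {
  awk '$1 == "XfrmInStateSeqError" { print $2 }'
}

# On a live host:
#   docker exec "$(docker ps -qf name=ipsec-router)" cat /proc/net/xfrm_stat | seq_err
```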

Does this apply to Rancher 1.1.4? We’ve been hitting this issue frequently on AWS and have been restarting the Network Agent as suggested by others.

I currently have hosts in this state that we’re leaving broken for troubleshooting. I checked with that command, and XfrmInStateSeqError has a value of 27. I checked several unaffected hosts and they all have a value of 0.