IPSec network fails silently on a host

This morning we had the same issue again and I was able to grab some logs from the IPSec router. Based on these logs, @svansteenis pointed me to a currently open issue, https://github.com/rancher/rancher/issues/7571, which might be related to or even the cause of our troubles.

This might be a job for @leodotcloud.

I will debug this further.

Hi Leo,

If any extra information (extra logs, extra debug settings, etc.) is needed, you can ask me. We still have regular problems, and when we have a problem with one of our stacks the first thing we do now is reset IPSec :confused:

Some extra feedback:

Due to the problems with IPSec, we removed our RabbitMQ cluster and our Eventstore cluster from the IPSec network, disabled their hosts in Rancher, and disabled all Rancher services on those machines. Since then we haven’t had any crashes of IPSec or the Rancher environment as a whole, and our clusters are running smoothly again.

Small theory: Is it possible that constant traffic (cluster chatter) over the IPSec network is causing the problems?

Hi, I have the same problem. Today communication between two EC2 servers went down.
Some feedback:

  • swanctl --list-sas: the connection between server 10.30.0.185 and 10.30.1.213 had no CHILD_SA and was in the “CONNECTING” state (see the sketch after this list for the exact commands)

  • swanctl --log shows a lot of information, but I could see a “delete” job that could not delete a CHILD_SA because it was not found

  • the other way around: on server 10.30.1.213 there is no connection to 10.30.0.185, even though swanctl --list-conns shows that a connection should exist

  • also, on server 10.30.1.213 there were three connections in the “CONNECTING” state with another server
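
A minimal sketch of these checks, run inside the ipsec (or ipsec-router) container on the affected host; the container name depends on your Rancher version, so adjust as needed:

# List IKE SAs and their CHILD_SAs; look for peers stuck in CONNECTING
# state or IKE SAs that have no CHILD_SA at all.
swanctl --list-sas

# List the loaded connections to confirm a tunnel to the peer is configured.
swanctl --list-conns

# Follow the daemon log while reproducing; watch for delete jobs that
# report a CHILD_SA not found.
swanctl --log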

This happens to me:

  • between Kafka servers after 4-7 days (cluster chatter?)
  • between servers that Prometheus scrapes data from

After restarting the ipsec container, the connections are fine again for a few days.
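
For completeness, “restarting the ipsec container” here just means bouncing the ipsec containers on the affected host. This is only a rough sketch; container naming differs between Rancher versions, so verify with docker ps before running it:

# Restart every container whose name matches "ipsec" on the affected host.
docker restart $(docker ps | grep ipsec | awk '{print $1}')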

We are trying our best to get to the bottom of this. For IPSec connectivity we use strongSwan, and it seems like there is some kind of race condition when rekeying happens (which occurs every few hours for security reasons). If the two ends of a tunnel initiate the rekeying with an offset, there is a possibility of duplicate SAs, and strongSwan is unable to detect this.

Related mail thread: https://lists.strongswan.org/pipermail/users/2012-October/003765.html
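
As far as I know there is no single command that flags this condition, but as a rough sketch (the output format depends on the strongSwan version, so treat the grep pattern as an assumption), you can look for more than one CHILD_SA towards the same peer from inside the ipsec container:

# Dump all SAs and eyeball how many CHILD_SAs exist towards the affected peer.
swanctl --list-sas

# Crude duplicate check: count how often each "remote" line repeats.
# More than one installed CHILD_SA for the same remote peer can indicate
# the duplicate-SA race described above.
swanctl --list-sas | grep -i remote | sort | uniq -c | sort -rn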

@mvriel @salvax86 @JerryVerhoef if you have the setup in a broken state (or the next time it happens), please find me on https://slack.rancher.io; I would like to collect some logs/info from the setup.

Also, if you are using CentOS or RHEL, please share the steps you used to set up the host (Docker installation, storage, etc.).

Curious if anyone else has run into this since the last posting, or if this has been resolved in newer versions of Rancher? We are on Rancher v1.4.1 with Ubuntu 14.04.4 LTS hosts and Docker 1.12.3, in our own data center. We started hitting this last month when we switched over all of our production traffic to our Rancher cluster. Our production environment has 12 hosts. There appears to be a strong correlation with traffic.

When this does happen, we restart the ipsec service on the affected host. We set up monitors that test outgoing traffic from each node at regular intervals and alert us immediately when it happens, so we can go in and restart the ipsec service right away. Not ideal on weekends, of course.
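
For anyone who wants to set up something similar, here is a minimal sketch of this kind of per-node check; the peer overlay IPs, the mail alert, and the commented-out restart are placeholders rather than our exact tooling:

#!/bin/sh
# Run from cron on every host: ping a few containers on other hosts over the
# overlay network and alert when they stop responding.
PEERS="10.42.0.10 10.42.0.11"   # placeholder overlay IPs of containers on other hosts
for peer in $PEERS; do
  if ! ping -c 3 -W 2 "$peer" > /dev/null 2>&1; then
    echo "Overlay ping to $peer failed from $(hostname)" | mail -s "ipsec check failed" ops@example.com
    # Optional heavy-handed remediation:
    # docker restart $(docker ps | grep ipsec | awk '{print $1}')
  fi
done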

I’d be happy to provide logs or get on Slack the next time it happens.

@kimgust We had similar problems with Ubuntu 14.04.4 LTS. We switched to Ubuntu 16 LTS about a month ago and the ipsec issues haven’t re-appeared yet.

Thanks @ryanwalls. What version of Docker and Rancher are you running?

@kimgust we are running Rancher 1.5.5 and Docker 1.12.6

@kimgust we have switched to VXLAN. No issues with cross-host communication for two months now.

Fwiw, upgrading to Rancher 1.6.0 from 1.4.1 hasn’t made a difference.

@leodotcloud Same here with Rancher 1.6.2, Docker 1.12.6, and Ubuntu 16.04, on 10 servers: after some days, some hosts lose network communication with the others… restarting ipsec brings the whole network back up again.

Any suggestions?

Thanks

@ppiccolo and others, can you please check if you are hitting this: https://github.com/rancher/rancher/issues/9377

docker exec -it $(docker ps | grep ipsec-router | awk '{print $1}') bash
cat /proc/net/xfrm_stat

and check for XfrmInStateSeqError.
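
Note that the cat runs inside the shell started by the first command. To pull just that counter non-interactively (assuming grep is available in the container), something like this works:

docker exec $(docker ps | grep ipsec-router | awk '{print $1}') grep XfrmInStateSeqError /proc/net/xfrm_stat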

Does this apply to Rancher 1.1.4? We’ve been hitting this issue frequently on AWS and have been restarting the Network Agent as suggested by others.

I currently have hosts in this state that we’re leaving broken for troubleshooting. I checked with that command, and XfrmInStateSeqError has a value of 27. I checked several unaffected hosts and they all have a value of 0.

@nthomson 1.1 is over a year old, and the way networking works has been completely redone to use CNI drivers since then. Please upgrade to a modern version; we don’t even support 1.1 for paying customers anymore.

@leodotcloud I replied on the GitHub issue: https://github.com/rancher/rancher/issues/9377

thanks

@vincent We previously tried to upgrade to Rancher 1.6.0 and ran into issues that are potentially similar in nature to what’s being outlined here. In the case of 1.6, the IPSec network appears to have connectivity issues, which in turn causes container health checks to constantly fail and containers to be recycled (including the health check containers themselves). I plan to revisit this exercise soon and will post here or create an issue with more information if we continue to see problems.