Partial VPN Failure

Hey All,

I have a set of four hosts and thus any one host would normally have three VPN connections (or three pairs of SAs) to the other hosts. I have a situation where one host now only has two. I can’t see any obvious reason for this so I was hoping someone here could help out. I have a few questions and observations (IPs have been changed).

Q1. Where does the VPN configuration come from and where is it stored? I can’t find any file that seems to contain it on any host.

O1. Using swanctl --list-conns shows a conn-10.11.12.99 that obviously hasn’t been established. Using swanctl --initiate conn-10.11.12.99 just results in an ‘initiate failed: missing configuration name’ error. Perhaps I’m not specifying the correct name or object though.
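In case it matters, I believe swanctl wants the config type spelled out when initiating, so (guessing that the CHILD config shares the connection name) the right form may be something like:

swanctl --initiate --child conn-10.11.12.99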

Q2. How do I get the missing SA back up safely? I know I could restart the agent on each end but I’d rather be more specific and have a lower impact.

O2. I’m seeing this in the /var/log/charon.log file on the host that is no longer a part of the VPN ‘mesh’:

Jul 18 06:00:23 11[KNL] creating delete job for CHILD_SA ESP/0x00000000/10.196.94.48
Jul 18 06:00:23 11[JOB] CHILD_SA ESP/0x00000000/10.11.12.99 not found for delete

O3. I’m seeing this on the other hosts:

Jul 18 06:05:10 06[KNL] <conn-10.11.12.99|1759> querying policy failed: No such file or directory (2)
Jul 18 06:05:10 07[KNL] creating delete job for CHILD_SA ESP/0x00000000/10.11.12.99
Jul 18 06:05:10 07[JOB] CHILD_SA ESP/0x00000000/10.196.94.46 not found for delete
Jul 18 06:05:10 04[KNL] creating acquire job for policy 10.42.13.127/32[tcp/37322] === 10.42.158.78/32[tcp/http] with reqid {1234}
Jul 18 06:05:10 04[CFG] trap not found, unable to acquire reqid 1234
Jul 18 06:05:15 09[KNL] <conn-10.11.12.88|1760> querying policy failed: No such file or directory (2)
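Presumably the ‘trap not found’ line means charon has lost its trap policy for that reqid. If I’ve read the swanctl docs right, the trap policies it currently holds can be listed with:

swanctl --list-pols --trap

(or plain swanctl --list-pols to see all installed policies).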

O4. I’m seeing this in the /var/log/rancher-net.log file on the hosts that are still part of the VPN ‘mesh’:

time="2016-07-18T08:07:42Z" level=info msg="Added policy: {Dst:10.42.197.230/32 Src:10.42.0.0/16 Dir:dir out Priority:0 Index:0 Tmpls:[{Dst:10.11.12.99 Src:172.17.0.2 Proto:esp Mode:tunnel Reqid:1234}]}" 

O5. Output of the swanctl --stats command on the host no longer part of the VPN ‘mesh’:

uptime: 14 days, since Jul 04 01:00:11 2016
worker threads: 16 total, 11 idle, working: 4/0/1/0
job queues: 0/0/0/0
jobs scheduled: 6
IKE_SAs: 2 total, 0 half-open <<<SHOULD BE 3!
mallinfo: sbrk 2433024, mmap 0, used 333888, free 2099136
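To confirm which two peers those IKE_SAs are actually with (and so which one is missing), I believe this will show them by name:

swanctl --list-sas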

O6. Output of ip xfrm state on the same host. Note the first SA, which should be up but isn’t and has some odd values (a null SPI and no keys):

src 172.17.0.2 dst 10.11.12.99
    proto esp spi 0x00000000 reqid 1234 mode tunnel
    replay-window 0 
    sel src 10.42.24.153/32 dst 10.42.204.94/32 proto tcp sport 55510 dport 8080 dev eth0 
src 172.17.0.2 dst 10.11.12.88
    proto esp spi 0xc9bbb136 reqid 1234 mode tunnel
    replay-window 32 flag af-unspec
    auth-trunc hmac(sha1) 0x77d83c0e6bec06fba2795ceb2780ed4620ad3865 96
    enc cbc(aes) 0xc37aa496ce62520966c60a44d81b1483
    encap type espinudp sport 4500 dport 4500 addr 0.0.0.0

Having trawled through all this in order to post it, it now seems clear that it’s a single SA to this one host that has disappeared, rather than all of them - just so you know.

Any ideas anyone please?

Note: Moved to Github here: https://github.com/rancher/rancher/issues/5463

Hey @sjiveson, can you run the following in your network-agent: monit restart charon. Does that resolve the issue?
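If it helps, from the host that usually means exec’ing into the network agent container first, along these lines (the grep pattern and container name are just examples - use whatever docker ps shows for the network agent on your hosts):

docker ps | grep agent-instance
docker exec -it <network-agent-container> monit restart charon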


Answer to your Q1: /var/lib/cattle/etc/cattle/ipsec/config.json

Snip from the output of ps -ef:

/var/lib/cattle/bin/rancher-net --log /var/log/rancher-net.log -f /var/lib/cattle/etc/cattle/ipsec/config.json -c /var/lib/cattle/etc/cattle/ipsec -i 172.17.0.2/16 --pid-file /var/run/rancher-net.pid --gcm=true
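It’s just JSON, so you can eyeball the contents with any pretty-printer, e.g. (assuming python is available where you run it):

python -m json.tool /var/lib/cattle/etc/cattle/ipsec/config.json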

Thanks both, much appreciated.

Yep, that did the trick in far less time than it took me to confirm at both ends.