Rancher IPsec network does not work after router hardware issues

Hello,

whenever the datacenter that hosts our servers reports router issues, the affected server’s Rancher IPsec network never recovers on its own. The three servers we run per environment are in different datacenters, so the two unaffected servers can still communicate with each other, but the affected host cannot reach either of them. Rebooting the affected server returns all services to normal.

We are using Rancher 1.6.7 with the managed network and three hosts per environment using Cattle. The hosts are dedicated (root) servers running Ubuntu 16.04.3 LTS with Docker 17.06.1 CE. We upgraded the infrastructure stacks when we upgraded to Rancher 1.6.7 and are running the rancher/net:v0.11.7 IPsec CNI driver and router.

I hope we can get this resolved without rebooting our servers every time. :slight_smile:

PS: Recreating the IPsec service containers would probably work as well, but I have not tried it yet because the affected host does not need 100% uptime and I would like to investigate this further first.

Things I noticed:
The IPsec infrastructure service on the affected host has the following extra log entries in stderr that the other two hosts do not have:

9/7/2017 11:29:25 AM Refer to router sidekick for logs
9/7/2017 11:29:25 AM mkfifo: cannot create fifo 'f': File exists

The IPsec router has the following entries on the affected host. These entries repeat several times every 2–5 minutes:

9/8/2017 12:54:02 PM 08[KNL] creating acquire job for policy 10.42.32.137/32[6/55476] === 10.42.182.122/32[6/27017] with reqid {1234}
9/8/2017 12:54:02 PM 08[CFG] trap not found, unable to acquire reqid 1234
9/8/2017 12:54:02 PM 05[KNL] creating delete job for CHILD_SA ESP/0x00000000/<unaffected host2's public IP>
9/8/2017 12:54:02 PM 05[JOB] CHILD_SA ESP/0x00000000/<unaffected host2's public IP> not found for delete
9/8/2017 12:54:02 PM 15[KNL] creating acquire job for policy 10.42.201.228/32[6/37098] === 10.42.173.57/32[6/80] with reqid {1234}
9/8/2017 12:54:02 PM 15[CFG] trap not found, unable to acquire reqid 1234
9/8/2017 12:54:02 PM 16[KNL] creating delete job for CHILD_SA ESP/0x00000000/<unaffected host1's public IP>
9/8/2017 12:54:02 PM 06[JOB] CHILD_SA ESP/0x00000000/<unaffected host1's public IP> not found for delete
9/8/2017 12:54:02 PM 04[KNL] creating acquire job for policy 10.42.101.69/32[6/45564] === 10.42.182.122/32[6/27017] with reqid {1234}
9/8/2017 12:54:02 PM 04[CFG] trap not found, unable to acquire reqid 1234

I guess it cannot execute these jobs because the affected server has no IPsec connection to the other two.

ip route get <other host's 10.42.0.0/16 IP in the same environment> returns:

<that other host's 10.42.0.0/16 IP in the same environment> dev eth0  src <this host's 10.42.0.0/16 IP>

So the route looks correct. docker inspect also shows the io.rancher.container.ip label set to this host’s 10.42.0.0/16 IP.

Maybe this is related to https://github.com/rancher/rancher/issues/9377, although the cause seems different. Running docker exec -it $(docker ps | grep ipsec-router | awk '{print $1}') cat /proc/net/xfrm_stat on the affected host results in:

XfrmInError             	0
XfrmInBufferError       	0
XfrmInHdrError          	0
XfrmInNoStates          	0
XfrmInStateProtoError   	0
XfrmInStateModeError    	0
XfrmInStateSeqError     	89
XfrmInStateExpired      	0
XfrmInStateMismatch     	0
XfrmInStateInvalid      	0
XfrmInTmplMismatch      	0
XfrmInNoPols            	0
XfrmInPolBlock          	0
XfrmInPolError          	0
XfrmOutError            	0
XfrmOutBundleGenError   	0
XfrmOutBundleCheckError 	0
XfrmOutNoStates         	233654
XfrmOutStateProtoError  	0
XfrmOutStateModeError   	0
XfrmOutStateSeqError    	0
XfrmOutStateExpired     	0
XfrmOutPolBlock         	0
XfrmOutPolDead          	0
XfrmOutPolError         	0
XfrmFwdHdrError         	0
XfrmOutStateInvalid     	69
XfrmAcquireError        	1
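The large XfrmOutNoStates count stood out, so here is a tiny helper I put together (my own sketch, not part of Rancher) to filter an xfrm_stat dump down to just the nonzero counters:

```shell
# Print only the counters in /proc/net/xfrm_stat that are nonzero.
# Reads from file arguments or stdin, so it can be piped from docker exec.
flag_xfrm_errors() {
  awk '$2 != 0 { printf "%s=%s\n", $1, $2 }' "$@"
}

# Example with a captured sample (values from the affected host above):
printf 'XfrmInError\t0\nXfrmOutNoStates\t233654\nXfrmOutStateInvalid\t69\n' \
  | flag_xfrm_errors
# Prints:
# XfrmOutNoStates=233654
# XfrmOutStateInvalid=69
```

On a host you would pipe the live dump into it, e.g. docker exec $(docker ps | grep ipsec-router | awk '{print $1}') cat /proc/net/xfrm_stat | flag_xfrm_errors.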

Any ideas?

Update:
Restarting the IPsec container (the one using the rancher/net:holder image) restores connectivity. I am now trying to set up a health check for the IPsec container but have so far been unsuccessful.
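In the meantime, something like the following watchdog could be cron'd on each host. This is only a sketch under two assumptions: that the holder container can be found by its rancher/net:holder image, and that a TCP connect to 127.0.0.1:8111 is a reasonable liveness signal for IPsec (adjust both to your setup):

```shell
#!/bin/sh
# Watchdog sketch: if the local IPsec port stops answering, restart the
# rancher/net:holder container. Requires bash (for /dev/tcp) and coreutils
# timeout on the host. Port 8111 and the image filter are assumptions.

PORT=8111

# Return 0 if a TCP connect to 127.0.0.1:$1 succeeds within 2 seconds.
port_open() {
  timeout 2 bash -c ">/dev/tcp/127.0.0.1/$1" 2>/dev/null
}

if ! port_open "$PORT"; then
  cid=$(docker ps --filter "ancestor=rancher/net:holder" --format '{{.ID}}' | head -n1)
  [ -n "$cid" ] && docker restart "$cid"
fi
```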

Update2:
We are now using a health check on the IPsec service (“TCP port 8111 open”), which seems to work for now. This situation is hard to replicate in a testing environment, though, so I guess time will tell.
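For reference, a health check along those lines can be expressed in rancher-compose syntax roughly like this. The service name and the numeric values are just what we chose for our setup, not recommendations, and the key names reflect my understanding of the Rancher 1.6 compose schema:

```yaml
ipsec:
  health_check:
    port: 8111
    interval: 2000          # ms between checks
    response_timeout: 2000  # ms to wait for the TCP connect
    healthy_threshold: 2    # consecutive successes before "healthy"
    unhealthy_threshold: 3  # consecutive failures before "unhealthy"
    strategy: recreate      # recreate the container when unhealthy
```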

We are working on a health check service for IPSec: https://github.com/rancher/rancher/issues/8941

@dpaar The symptoms indicate the problem could be due to this issue: https://github.com/rancher/rancher/issues/9971. That has been fixed and will be available in the 1.6.11 release. Please give it a try and let us know. Feel free to create a GitHub issue next time; closing this thread for now.