Hello,
when the datacenter company that hosts our servers reports router issues, the affected server's Rancher IPsec network never recovers on its own. The three servers we run per environment are in different datacenters, so the two unaffected servers can still communicate with each other, but the affected host cannot reach either of them. Rebooting the affected server brings all services back to normal.
We are using Rancher 1.6.7 with the managed network and three hosts per environment using Cattle. The hosts are root servers running Ubuntu 16.04.3 LTS with Docker 17.06.1 CE. We upgraded the infrastructure stacks when we upgraded to Rancher 1.6.7 and are running rancher/net:v0.11.7 for the IPsec CNI driver and router.
I hope we can get this resolved without rebooting our servers every time.
PS: Recreating the IPsec service containers would probably work as well, but I have not tried it yet because the affected host does not need 100% uptime and I would like to investigate this further first. A rough sketch of what I mean is below.
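This is roughly what I would try instead of a full reboot (untested; the assumption that all relevant container names contain "ipsec" is mine, so adjust the pattern to whatever docker ps shows on the host):

# Untested sketch: restart the Rancher IPsec containers on the affected host
# instead of rebooting. Assumes the container names contain "ipsec"
# (the ipsec, ipsec-router and ipsec-cni-driver containers of the stack).
for id in $(docker ps --format '{{.ID}} {{.Names}}' | grep ipsec | awk '{print $1}'); do
  docker restart "$id"
done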
Things I noticed:
The IPsec infrastructure service on the affected host has the following extra log entries in stderr that the other two hosts do not have:
9/7/2017 11:29:25 AM Refer to router sidekick for logs
9/7/2017 11:29:25 AM mkfifo: cannot create fifo 'f': File exists
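Following that hint, this is roughly how I pull the router sidekick's logs (the "ipsec-router" grep pattern is an assumption about the sidekick container's name on the host; adjust as needed):

# Untested sketch: tail the IPsec router sidekick's logs on the affected host.
docker logs --timestamps --tail 200 $(docker ps --format '{{.Names}}' | grep ipsec-router | head -n1)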
The IPsec router has the following entries on the affected host; they repeat several times every 2-5 minutes:
9/8/2017 12:54:02 PM 08[KNL] creating acquire job for policy 10.42.32.137/32[6/55476] === 10.42.182.122/32[6/27017] with reqid {1234}
9/8/2017 12:54:02 PM 08[CFG] trap not found, unable to acquire reqid 1234
9/8/2017 12:54:02 PM 05[KNL] creating delete job for CHILD_SA ESP/0x00000000/<unaffected host2's public IP>
9/8/2017 12:54:02 PM 05[JOB] CHILD_SA ESP/0x00000000/<unaffected host2's public IP> not found for delete
9/8/2017 12:54:02 PM 15[KNL] creating acquire job for policy 10.42.201.228/32[6/37098] === 10.42.173.57/32[6/80] with reqid {1234}
9/8/2017 12:54:02 PM 15[CFG] trap not found, unable to acquire reqid 1234
9/8/2017 12:54:02 PM 16[KNL] creating delete job for CHILD_SA ESP/0x00000000/<unaffected host1's public IP>
9/8/2017 12:54:02 PM 06[JOB] CHILD_SA ESP/0x00000000/<unaffected host1's public IP> not found for delete
9/8/2017 12:54:02 PM 04[KNL] creating acquire job for policy 10.42.101.69/32[6/45564] === 10.42.182.122/32[6/27017] with reqid {1234}
9/8/2017 12:54:02 PM 04[CFG] trap not found, unable to acquire reqid 1234
My guess is that it cannot execute these jobs because there is no IPsec connection from the affected server to the other two servers.
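To see what the kernel actually has left in terms of IPsec state, I would compare the affected host with a healthy one from inside the router container, roughly like this (a sketch, not verified; ip xfrm is standard iproute2, and the grep pattern is the same assumption as above):

# Untested sketch: dump kernel IPsec state from inside the router container.
ROUTER=$(docker ps --format '{{.Names}}' | grep ipsec-router | head -n1)
docker exec "$ROUTER" ip xfrm state   # security associations (SAs) per peer
docker exec "$ROUTER" ip xfrm policy  # traffic selectors / trap policies

On a healthy host I would expect ESP states towards each peer's public IP; on the affected host the "trap not found" messages suggest that list is empty or stale.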
Running
ip route get <other host's 10.42.0.0/16 IP in the same environment>
returns:
<that other host's 10.42.0.0/16 IP in the same environment> dev eth0 src <this host's 10.42.0.0/16 IP>
so that seems correct. docker inspect also shows the io.rancher.container.ip label with this host's 10.42.0.0/16 IP.
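For completeness, this is roughly how I run those two checks (a sketch; the container name is a placeholder, the peer IP is just an example taken from the router log above, and the --format expression is simply how I query the label):

# Untested sketch: verify the overlay route and the Rancher-assigned container IP.
APP="my-app-container"   # placeholder: any container on the 10.42.0.0/16 managed network
PEER_IP="10.42.182.122"  # example peer container IP, taken from the log output above
docker exec "$APP" ip route get "$PEER_IP"
docker inspect --format '{{ index .Config.Labels "io.rancher.container.ip" }}' "$APP"

(The application container needs the ip binary from iproute2 for the first command to work inside it.)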
Maybe this is related to https://github.com/rancher/rancher/issues/9377, although the cause seems different. Running
docker exec -it $(docker ps | grep ipsec-router | awk '{print $1}') cat /proc/net/xfrm_stat
on the affected host results in:
XfrmInError 0
XfrmInBufferError 0
XfrmInHdrError 0
XfrmInNoStates 0
XfrmInStateProtoError 0
XfrmInStateModeError 0
XfrmInStateSeqError 89
XfrmInStateExpired 0
XfrmInStateMismatch 0
XfrmInStateInvalid 0
XfrmInTmplMismatch 0
XfrmInNoPols 0
XfrmInPolBlock 0
XfrmInPolError 0
XfrmOutError 0
XfrmOutBundleGenError 0
XfrmOutBundleCheckError 0
XfrmOutNoStates 233654
XfrmOutStateProtoError 0
XfrmOutStateModeError 0
XfrmOutStateSeqError 0
XfrmOutStateExpired 0
XfrmOutPolBlock 0
XfrmOutPolDead 0
XfrmOutPolError 0
XfrmFwdHdrError 0
XfrmOutStateInvalid 69
XfrmAcquireError 1
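The large XfrmOutNoStates value looks like the kernel keeps trying to send into the tunnel without any outbound SA. To confirm these counters are still climbing (and are not just leftovers from the original outage), I would sample them periodically, roughly like this (a sketch; same ipsec-router grep assumption as above):

# Untested sketch: sample the interesting xfrm counters every 30 seconds.
ROUTER=$(docker ps --format '{{.Names}}' | grep ipsec-router | head -n1)
while true; do
  date
  docker exec "$ROUTER" grep -E 'XfrmOutNoStates|XfrmOutStateInvalid|XfrmAcquireError|XfrmInStateSeqError' /proc/net/xfrm_stat
  sleep 30
done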
Any ideas?