Intermittent issue with communication between containers across different hosts

We are running Rancher v1.1.4. Most of the time the setup works as expected, but occasionally containers on different hosts are unable to communicate with each other.

When this happens, ping from one container to another fails with 100% packet loss. The only workaround we have found is to restart the network agent on the affected host.

While debugging the issue I also noticed a lot of ICMP redirects, even while containers were communicating normally. Are these something to be concerned about when troubleshooting this issue?

I am out of ideas here: the network agent logs don’t reveal much, and I can’t make any sense of those ICMP redirects.

Thanks much in advance for any help around this matter.

Cheers,
M


Sounds like one or more IPsec tunnels between hosts are failing.

You can check the status with this: swanctl --list-sas. There should always be one fewer SA than you have hosts in the environment (so if you have four hosts, you should see three SAs).
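As a rough sketch of that check, the helpers below compute the expected SA count (hosts minus one) and compare it with what swanctl reports. The function names are made up for illustration, and the counting step assumes that each established IKE SA in the swanctl --list-sas output appears on a line containing the word ESTABLISHED; verify that against the output on your own hosts before relying on it.

```shell
#!/bin/sh
# Expected number of SAs on one host: one per peer, i.e. hosts - 1.
expected_sas() {
    echo $(( $1 - 1 ))
}

# Count established SAs in `swanctl --list-sas` output read from stdin.
# Assumption: each established IKE SA line contains "ESTABLISHED".
count_established() {
    # grep -c exits non-zero when nothing matches; it still prints 0.
    grep -c 'ESTABLISHED' || true
}

# Compare the actual SA count with the expected one.
# Prints "ok" or "missing N".
check_sas() {
    hosts=$1
    actual=$2
    expected=$(expected_sas "$hosts")
    if [ "$actual" -ge "$expected" ]; then
        echo "ok"
    else
        echo "missing $(( expected - actual ))"
    fi
}
```

Inside the network agent container, with four hosts in the environment, this could be run as: check_sas 4 "$(swanctl --list-sas | count_established)". A result of "missing 1" would point at one failed tunnel on that host.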

If that is the issue, you can restart just the Charon daemon (far better than restarting the entire network agent container) by exec’ing into the network agent container and running this command: monit restart charon.
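A minimal sketch of doing that from the host in one step, assuming Docker is the container runtime and that you already know the name or ID of the network agent container on the affected host (find it with docker ps; the restart_charon helper and the DOCKER override are illustrative, not part of Rancher):

```shell
#!/bin/sh
# Restart only the Charon daemon inside the given network agent container.
# $1: name or ID of the network agent container (look it up with `docker ps`).
# DOCKER can be overridden, e.g. DOCKER=echo for a dry run that just prints
# the arguments that would be passed to docker.
restart_charon() {
    container=$1
    ${DOCKER:-docker} exec "$container" monit restart charon
}
```

Calling restart_charon with the container name runs monit restart charon inside it; afterwards, swanctl --list-sas should show the tunnels coming back.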

We never found a cure for this issue (it never happened very often), but we’ve not had it since upgrading to RancherOS v0.7.1 around six weeks ago. Of course, you may not be using RancherOS, and we previously saw the problem less often than once in six weeks, so it may yet come back.