Intermittent Failure of Managed Network causing critical issues for some containers

Hey there,

My org is having a persistent issue with Rancher/Docker’s managed network layer. We have three physical hosts in one datacenter. When the network fails, the affected containers are not able to communicate with each other, and one of the admins has to manually restart the IPSec router container on the affected host to fix it. It usually happens on one specific host, but it has occurred intermittently on the others and has also occurred in our staging environment, which is a mix of physical and virtual hosts in a separate datacenter.

Honestly, Googling the issue seems to indicate that it’s a ‘bug’ and that there isn’t a solution, which is kind of ridiculous. Has anyone experienced this problem? Were you able to correct it without ditching the managed network altogether?

@slemaire can you please find/ping me (leodotcloud) on https://slack.rancher.io the next time this issue happens? I would like to jump on a call to debug this further.