I have a Rancher setup with remote hosts in various parts of the world. I know this is not the typical use case for Rancher, but it should work in theory. I have been putting together a POC with 11 hosts in US, CA, JP, KR, AU, ID, GB, IN, DE, BR, and FR.
As more hosts were added, IPsec and health checks started failing, and containers had trouble initializing. I isolated it to the KR, JP, and AU hosts. It would seem that if I add any one of these hosts the system works fine, but if I add more than one of them, each of their IPsec services becomes unhealthy. Other hosts are unaffected, except that containers appear to struggle to initialize while the IPsec service is degraded anywhere. If I remove all but one of these three hosts, the system stabilizes again. Oddly, these hosts are all in Asia Pacific AWS regions; beyond that, I can’t see anything else they have in common.
Specifically, the IPsec service on the newly added trouble host never finishes initializing, and the IPsec service on the existing trouble host returns a “NOT OK” health check. The IP connectivity check logs show the peer being added but never becoming reachable.
Rancher v1.6.16, with a mix of Ubuntu and RancherOS hosts.
What I’ve tried:
- Constrained the health check services to run on only a few reliable hosts
- Verified UDP 500 and 4500 connectivity between the trouble hosts (roughly the kind of check sketched below this list)
- I suspected that hosts with overlapping private subnet IPs in different regions might be getting confused, so I made those unique. That didn’t help, and I’ve since read that Rancher uses only the registered IPs (all public in this case).
- I tried different OSes for these hosts (RancherOS, Ubuntu) with the same results
- I tried pinging one of the affected IPsec containers from the other and that failed; pinging either of them from other containers works, however.
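To be clear about what I mean by "verified connectivity": since IKE and NAT-T are UDP, a TCP probe doesn't prove much, so the check was along the lines of this minimal Python sketch, run as a listener on one host and a sender on the other, then repeated in the reverse direction and for port 4500. The script itself is just an illustration, not anything Rancher-specific; binding to 500/4500 needs root and only works while nothing else (e.g. the ipsec container) already holds the port.

```python
#!/usr/bin/env python3
"""Rough UDP reachability probe for the IPsec ports (500 = IKE, 4500 = NAT-T).

Usage (placeholder IPs):
  on host A:  sudo python3 probe.py listen 500
  on host B:  python3 probe.py send <host-A-public-ip> 500
Then swap roles, and repeat for 4500.
"""
import socket
import sys


def listen(port: int) -> None:
    # Bind on all interfaces and print whatever arrives.
    # Ports below 1024 require root; fails with EADDRINUSE if the real
    # IPsec daemon is still bound on the host.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.bind(("0.0.0.0", port))
        print(f"listening on udp/{port} ...")
        while True:
            data, addr = s.recvfrom(2048)
            print(f"got {len(data)} bytes from {addr[0]}:{addr[1]}: {data!r}")


def send(host: str, port: int) -> None:
    # Fire a single probe datagram; the listener side confirms arrival.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.sendto(b"rancher-ipsec-probe", (host, port))
        print(f"sent probe to {host}:{port}")


if __name__ == "__main__":
    if sys.argv[1] == "listen":
        listen(int(sys.argv[2]))
    else:
        send(sys.argv[2], int(sys.argv[3]))
```

The probes arrived in both directions on both ports between the trouble hosts, and the AWS security groups allow UDP 500/4500, so the raw UDP path itself looks fine.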
What might be causing this?