Occasionally, a container is unreachable by IP?

Running Rancher w/Cattle v1.1.2 on AWS with Docker 1.10.3 hosts.

On occasion, I’ll run an upgrade, and one or more containers will no longer be accessible by their IP address from some other containers. It doesn’t seem to matter which node: other containers on the same host remain reachable. This results in downtime and/or slow response times from its LB (HAProxy times out a request after 5s, so we notice a lot of ~5.1s requests).

Service (A) upgrades, and when it completes, 2 of the 3 containers can no longer connect to service (B)'s load balancer at http://service-b-lb. Destroying the containers (the scheduler re-creates them to keep the expected scale) restores the connection.
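For anyone hitting this, a quick way to confirm which containers have lost connectivity is to curl the LB from inside each one. This is a minimal sketch; the container names (`service-a-1` etc.) are placeholders for whatever your scheduler named service (A)'s containers, and `http://service-b-lb` is the LB name from this thread:

```shell
# Probe service (B)'s LB from inside each service (A) container.
# Substitute your actual container names (check `docker ps`).
for c in service-a-1 service-a-2 service-a-3; do
  echo "--- $c ---"
  docker exec "$c" curl -s -o /dev/null \
    -w '%{http_code} in %{time_total}s\n' \
    --max-time 5 http://service-b-lb/ \
    || echo "timed out / unreachable"
done
```

Affected containers will hit the 5s `--max-time` and print the fallback message, matching the ~5.1s requests seen at the LB.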

Is this a known issue that an upgrade to Docker daemon or rancher 1.1.3/1.2 would fix?

Still not sure what the cause was, but it seems to be resolved by:

  • Upgrading Rancher to 1.1.3
  • Cycling in new hosts running the Docker 1.11 engine and cycling out the old hosts

I’m still seeing this in 1.1.4. I have to restart the networking agent to remedy the problem.
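For reference, here's roughly how I do the restart on the affected host. This is a sketch with one assumption baked in: that the network agent container on your hosts runs from the `rancher/agent-instance` image (that's what Rancher 1.1.x/Cattle used on mine; verify with `docker ps` before relying on the filter):

```shell
# Find the Rancher network agent container on this host and restart it.
# Assumes the agent runs from the rancher/agent-instance image; confirm
# the image name on your own hosts with `docker ps` first.
agent=$(docker ps --format '{{.ID}} {{.Image}}' \
  | awk '/agent-instance/ {print $1; exit}')
if [ -n "$agent" ]; then
  docker restart "$agent"
else
  echo "no network agent container found on this host" >&2
fi
```

After the restart, the IPSec tunnel re-establishes and cross-host traffic recovers within a few seconds in my experience.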

Yep, seeing the same occasionally on our stuff. The IPSec tunnel breaks for a host; restarting the network agent fixes it. Is something self-healing coming in Rancher 1.2?