Host Split/Partition Behavior in Overlay Network

I’m seeing a problem where containers scheduled onto one particular host start and run just fine but cannot reach containers on any other host in the environment over the overlay network (10.42.x.x).

If I open a shell in one of the containers, I can resolve the names of the other containers and the IP addresses returned are correct. However, I cannot open connections to the ports that those containers expose.
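To make the symptom reproducible, here is a minimal sketch of the check I ran by hand: resolve a container name, then try a TCP connection to it, and report the two steps separately. It assumes Python is available inside the container; the `probe` helper and its result keys are names I made up for illustration, not part of any orchestrator tooling.

```python
import socket


def probe(name, port, timeout=3.0):
    """Resolve `name`, then try a TCP connection to `port`.

    Returns a dict that records the DNS step and the TCP step separately,
    so "name resolves but port is unreachable" is easy to spot.
    """
    result = {
        "name": name,
        "port": port,
        "ip": None,
        "resolved": False,
        "connected": False,
        "error": None,
    }

    # Step 1: name resolution (IPv4 only, which matches a 10.42.x.x overlay).
    try:
        result["ip"] = socket.gethostbyname(name)
        result["resolved"] = True
    except OSError as exc:
        result["error"] = f"DNS: {exc}"
        return result

    # Step 2: TCP connect to the resolved address, with a bounded wait.
    try:
        with socket.create_connection((result["ip"], port), timeout=timeout):
            result["connected"] = True
    except OSError as exc:  # covers refused connections and timeouts
        result["error"] = f"TCP: {exc}"

    return result
```

For example, `print(probe("web", 80))` (where `web` is a hypothetical service name) run from a container on the affected host and again from a container on a healthy host should show whether the failure is tied to that one host. In my case the result has `resolved` set to true and `connected` set to false.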

For the time being, I’ve ‘evacuated’ the host and disabled it so that my containers get scheduled onto others.

One other thing I noticed is that the healthcheck service on this host is stuck in the initializing state (on other hosts it comes up OK), while all other infrastructure services appear to be working just fine.

Here’s a screenshot:

Any idea what could be causing this?

I just came across the following bit from the docs:

From what I can tell, the fact that healthcheck is not successful means that cross-host communication is not working properly.

Are there any more granular steps I can take to narrow down the problem?
What are the recommended solutions?

My first guess is to take the host out of the environment completely and add it back in. Is this the recommended approach to solve this problem?