Host failed but container not failing over

We are running the latest 0.37. We defined a new service, scale=1, created it, launched.

Everything works great; it deploys the container to one of our 3 hosts in the environment.

Then we kill docker on the host, verify that docker is down and no process is running. The hosts part of the UI shows Reconnecting in red, which it should.

However, the single instance of our service still shows green and does not fail over to another host.

Then we actually went to the host in the UI and selected “Deactivate”. The host is now marked as inactive, but the containers running on that host are still marked green and do not fail over.

If there are any logs or anything else we can provide to help, we are happy to do so.

This issue might help explain how host failovers work.

In short, you need to add a health check to a service in order for the container to move to another host. Without one, the service will only move over once the host has been deleted from Rancher.

Huh, well, that is confusing. While I can see how you might want to validate the service independently and not rely on the host, it really does violate the principle of least surprise.

So I need to configure my service with a health check?

@deitch, well, what is a surprise to you may very well be what someone else expects, so in this case, once you think about it, having it any other way would be suboptimal.

Say your agent dies (or communication between the server and the agent is down for some reason, but the containers are still reachable from clients): rancher will be unable to tell whether your service is working properly or not. If it then went on the assumption that the service is not working and spun up another instance on another host, you could end up with two instances instead of one.

Damned if you do, damned if you don’t. Availability management is a beast no matter how you handle it.

Maybe an option on a service? “Assume service is down if host unreachable”? Defaults to no, but can be set to yes?

As I posted in the github discussion @denise referenced, if you have a service with no inbound connections (TCP or HTTP), e.g. a batch job that kicks off every 60 minutes and sends some output somewhere, then it is crucial that it runs, but there is no way to “monitor” it beyond checking the process… which Rancher kind of does via the agent.

Indeed, and I can see the use for such an option.

How about adding a listening socket that simply redirects to /dev/null, for the purpose of an aliveness test?

In theory, yes, although you don’t always have the ability to add software or open a port. What if the service is a single process (as docker recommends for containers, although lots of us run supervisord or s6 or our own multi-process control)? Adding another process, a process supervisor, config files and all of that complexity seems like a lot when just monitoring the process should suffice.

I was under the assumption that you controlled the software running the batch job. I didn’t suggest a new process for the sole purpose of the listening socket, but expected it could be opened by the same (only!?) process running in the container.

Of course, if that’s not the case, it won’t be the best way to go :wink:
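
For the simple case I had in mind, the listener can be tiny. A rough Python sketch, assuming the batch process can spare a background thread (the port number is arbitrary):

```python
import socket
import threading

def aliveness_listener(port=8888):
    """Accept TCP connections and immediately drop them; the mere fact
    that the port answers is all an aliveness check needs."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("0.0.0.0", port))
    srv.listen(5)
    while True:
        conn, _ = srv.accept()
        conn.close()  # discard everything, i.e. the /dev/null behaviour

# started from inside the existing batch-job process, no extra process needed
threading.Thread(target=aliveness_listener, daemon=True).start()
```

A TCP health check against that port then only says “the process is still alive”, which seems to be all the batch-job case needs.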

I have a bigger issue. How does it do the health checks? It has to depend on the IP. But what if the IP is provided by a different source? E.g. if we use a networking plugin, or an event monitor (think pipework), which is what we do, or even DHCP?

How are we supposed to make the health checks work when net=none and we manage it separately?

Haha… oh, I do love corner cases :smiley:

I think for that to work, it would be better to be able to tell rancher that a service is down (so you can get rancher to proceed with its migration logic etc.). That way, you can let some other part of your system do the monitoring and alert rancher whenever a service is not responding properly.

Haven’t looked at the API regarding this, perhaps it’s already in there… @vincent ?
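
Roughly the shape I’m imagining, purely as a Python sketch: the `tell_rancher_service_is_down` call is a placeholder (as I said, I haven’t checked whether the API actually exposes anything like it), and the service id, address and port are made up:

```python
import socket
import time

def service_is_up(ip, port, timeout=2.0):
    """Whatever probe makes sense for the service; a plain TCP connect here."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False

def tell_rancher_service_is_down(service_id):
    """Placeholder only: it just marks where a call into Rancher would slot in,
    if the API turns out to support flagging a service as unhealthy."""
    print(f"would tell Rancher that service {service_id} is down")

SERVICE_ID = "1s23"                      # made-up service id
CONTAINER_IP, PORT = "10.42.0.5", 8888   # made-up address and port

while True:
    if not service_is_up(CONTAINER_IP, PORT):
        tell_rancher_service_is_down(SERVICE_ID)
    time.sleep(30)
```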

Set up another container to monitor it? A lot of extra work, sigh, and it would effectively be another, more lightweight, rancher agent: a single container running on a host to manage other containers.

I am not sure how much of a corner case it is. The very strong desire for alternate networking - including using Rancher-managed networking - indicates that the default bridge is not what many people want.

But, yes, I can see how you would have a hard time doing health checks if you do not know the IP in the orchestrator.

Just had the same issue today: an AWS ASG deleted an instance due to an instance status check failure and created a new one.
The host running the containers was in the Reconnecting state for 1.5 hours, and the load balancer (which runs on all instances) was still redirecting requests to the failed host.
I had assumed that a host would be marked as “unavailable” after some time of reconnecting and its services would be migrated.

So in the meantime… how do I get failover within the group of hosts to work?

OK, so I thought we had an answer… but it doesn’t work.

We added Rancher-managed networking (and then added our own other IPs), added health checks to port 22 (since sshd happens to be running on these, and it is a good test), completely brought down docker (and hence all containers) on one host… and the health checks continued to pass. It didn’t fail them over.

We checked on the containers themselves: eth0 is definitely getting a docker-bridge-managed IP as well as the Rancher-managed IP, but killing docker entirely on the host has no impact on the health checks. We checked that pings fail the moment we stop docker - which means, of course, that port 22 is not answering either.

We need a better understanding of how to explore the behaviour of health checks, so we can understand why they are not doing their job here.
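
To make that concrete, a manual probe along these lines (a Python sketch; the 10.42.x.x address is just a made-up example of a Rancher-assigned IP) shows what a TCP health check against port 22 should be observing, and it does fail the moment docker is stopped:

```python
import socket

def probe_sshd(ip, port=22, timeout=2.0):
    """Connect to the container's IP on port 22 and read the banner,
    i.e. roughly what a TCP health check on that port should see."""
    try:
        with socket.create_connection((ip, port), timeout=timeout) as sock:
            banner = sock.recv(64)
            print(f"{ip}:{port} answered: {banner!r}")
            return True
    except OSError as exc:
        print(f"{ip}:{port} not answering: {exc}")
        return False

probe_sshd("10.42.0.5")
```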

FYI, this is still an issue. I opened an issue at github as well.

Container with Rancher-managed IP and health checks properly set up, bring down host entirely, health checks continue to say “healthy”. network-agent is verified as running on an alternate host.

Is there a solution for this? I had an instance fail and be respawned by an AWS auto scaling group; the old instance stayed in Reconnecting until I noticed it was down. The new instance was sitting idle, and all containers on the old host were still reporting healthy.

Just to clarify, the old containers had healthchecks enabled. They’d ping their own sites to determine health, but somehow, once the host died, they stayed healthy.

Is each host itself responsible for doing the healthchecks? Like, if Host A runs three containers that do healthchecks, is the Host A rancher agent doing the actual healthcheck, or is it distributed among the other hosts / the main rancher server?

How many hosts do you have?

You could be hitting this issue:

It does look similar, but I believe mine is much simpler. I create two hosts, then create a service with 2 instances that have health checks, and create a load balancer. I then kill one instance externally. The services on that host will never move to the healthy one. If I create a new instance, they won’t move either. If I restart them they’ll move, or if I delete their host through rancher they’ll move.