Webhook stopping service on wrong host (another has been downed)

I’m exploring autoscaling with Rancher. I have upgraded to 1.4.1, and really appreciate the simplicity of the webhook functionality.

We are approaching scaling on AWS by starting/stopping hosts, rather than creating/destroying them.

What I have found is that when I stop a host via AWS (boto), Rancher shows the host as “Reconnecting”. If I then scale down via the webhook, Rancher does not take into account that the host is unavailable, and removes the service from one of the active hosts instead.
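For reference, the host-stopping step could look roughly like the sketch below. It assumes boto3 (the current AWS SDK) rather than the older boto library, and the instance ID in the usage comment is a made-up placeholder.

```python
def stop_host(instance_id, ec2=None):
    """Stop an EC2 instance (not terminate it) and block until AWS reports it stopped."""
    if ec2 is None:
        # Imported lazily so the function can be exercised with a stand-in client.
        import boto3
        ec2 = boto3.client("ec2")
    ec2.stop_instances(InstanceIds=[instance_id])
    # The built-in "instance_stopped" waiter polls until the instance state is "stopped".
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])


# Usage (placeholder instance ID):
# stop_host("i-0123456789abcdef0")
```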

Where can I find the algorithm that Rancher uses to decide which host to remove a service from, and is there a way to make it remove the service from the host that has just become unavailable? Without this, I’ll have to write my own webhook, which will be a lot harder than using the new feature.

Thanks!

Disconnected/Reconnecting means that the agent container is not connected to the server. This is not a reliable indication that a container that was on that host is no longer actually running, which is very important for some use cases.

So the service is not unavailable on that host unless it has a healthcheck… And if it did, then it would automatically be rescheduled to an active host once it started failing, according to the default health check action.

Thanks Vincent!

Are you saying that the sequence I’ve described would be (assuming we have four hosts):

Host 3 is stopped at Amazon. The webhook is called, dropping us to 3 services, and Rancher stops the service on Host 2, as Host 3 is only “Reconnecting”. After some time, Rancher’s attempts to connect to the healthcheck for the service on Host 3 fail, Rancher decides that the node is not available, and it reschedules the service from the down host back to an available one, i.e. Host 2. After this time, we are back with three services on our three available hosts.

If this is correct, then it isn’t totally disastrous, but does leave us with one less host than we intended for a period, which isn’t great.

Do you have any suggestions as to how to handle this? I’d like to have the script tell Rancher to ‘stop’ the host, that way Rancher would know to reschedule the service immediately, but the Rancher CLI does not seem to support “stop a host” as an action. Therefore, I’d need to use the HTTP API instead, which I’ve done before, but seems more complex and more fragile.

Any other ideas how I can ensure that the service is removed from the host that I have stopped at AWS?

Thanks again!

I’ve tried this: I downed the host and reduced the number of services. My app stopped Host 3, but the webhook stopped the service on Host 2. Even though Host 3 is now showing as “disconnected”, the service is still registered on Host 3, where it obviously won’t be working.

Am I missing something? Any other ideas? How hard is it to write a webhook? Can I add one to “start”/“disconnect” a host?

@Upayavira I think what you really want to do is add a healthcheck to your service. Once you do that, when you take down the Amazon host, the container for that service running on the downed host will definitely get removed.

But I’m actually a little confused as to what you are trying to do. Once you’ve stopped the host, why are you calling the webhook to scale down? Is it just so that the configured scale of the service matches reality?

Okay, thanks @cjellick, that is worth trying. Yes, you are right, I want the config to match the reality.

So the sequence becomes: stop the host, wait for the healthcheck timeout window to pass, then call the webhook. After that, everything should work as expected.
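That sequence can be sketched in Python as below. This is only an illustration under a few assumptions: the wait is estimated as unhealthy threshold × (check interval + response timeout) plus a safety margin, which is a rough guess rather than Rancher’s documented behaviour; the webhook is triggered with an empty POST to its trigger URL; and the function names, the example numbers and the URL are all hypothetical.

```python
import time
import urllib.request


def healthcheck_window_seconds(interval_ms, unhealthy_threshold,
                               response_timeout_ms, margin_s=10.0):
    """Rough estimate (an assumption, not Rancher's exact rule) of how long
    a dead container takes to be marked unhealthy."""
    return unhealthy_threshold * (interval_ms + response_timeout_ms) / 1000.0 + margin_s


def post_webhook(url):
    """Trigger a webhook with an empty POST body and return the HTTP status."""
    req = urllib.request.Request(url, data=b"", method="POST")
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.status


def scale_down_after_host_stop(stop_host, webhook_url, wait_s,
                               sleep=time.sleep, trigger=post_webhook):
    """Stop the host, wait out the healthcheck window, then call the webhook."""
    stop_host()            # e.g. stop the EC2 instance
    sleep(wait_s)          # let the healthcheck fail and the container be rescheduled
    return trigger(webhook_url)


# Usage (hypothetical values):
# wait_s = healthcheck_window_seconds(2000, 3, 2000)
# scale_down_after_host_stop(my_stop_host, "https://rancher.example/...", wait_s)
```

The `sleep` and `trigger` parameters are injectable only so the ordering can be checked without touching AWS or Rancher.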

I would have thought that the host healthcheck would have covered that need, but your approach is certainly worth trying. Thanks!

@cjellick from my initial tests, that appears to be working beautifully. I just need to wait for the timeout window for the healthcheck to pass before hitting the webhook, and it seems to behave perfectly!

(fingers crossed I’m not speaking too soon!)

Thanks!