I am using AWS autoscaling to bring up hosts that register with Rancher on boot. However, when the machines are shut down after load decreases, they end up in a reconnecting state, and traffic from the Rancher LBs seems to continue to flow to the containers on those instances until I manually remove them in the UI. This caused a period of outage, as some of the incoming requests would time out.
The first thing that needs to change is that as soon as a host goes into reconnecting, all traffic to its containers should cease; all LBs and service aliases should be aware of this.
The second thing, which would be a nice-to-have, is an option in Rancher to deactivate and purge any host that has been disconnected for longer than a configurable timeout.
Please let me know if there are any workarounds or things I could try to mitigate the outages when spinning down instances.
Yeah, for deleting hosts I was going to write a script until the functionality is added to Rancher. In reality, though, the traffic that still seems to go to disconnected hosts is my most immediate problem.
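For what it's worth, here is roughly what I had in mind: a minimal sketch in Python against what I understand of the Rancher v1 API. The URL, key names, host states, and the deactivate/delete flow are assumptions on my part and may differ by Rancher version.

```python
# Minimal sketch: remove hosts stuck in "reconnecting"/"disconnected" for longer
# than a configurable timeout. Assumes Rancher's v1 API and an environment API
# key pair; state names and the deactivate/delete flow may differ by version.
import os
import time

import requests

RANCHER_URL = os.environ["RANCHER_URL"]  # e.g. http://rancher.example.com:8080/v1
AUTH = (os.environ["RANCHER_ACCESS_KEY"], os.environ["RANCHER_SECRET_KEY"])
TIMEOUT_SECONDS = int(os.environ.get("PURGE_TIMEOUT", "300"))

seen_down_since = {}  # host id -> first time we saw it disconnected

def purge_stale_hosts():
    hosts = requests.get(f"{RANCHER_URL}/hosts", auth=AUTH).json()["data"]
    now = time.time()
    for host in hosts:
        if host["state"] not in ("reconnecting", "disconnected"):
            seen_down_since.pop(host["id"], None)
            continue
        first_seen = seen_down_since.setdefault(host["id"], now)
        if now - first_seen < TIMEOUT_SECONDS:
            continue
        # Deactivate first, then delete; Rancher exposes these as
        # actions/links on the host resource.
        actions = host.get("actions", {})
        if "deactivate" in actions:
            requests.post(actions["deactivate"], auth=AUTH)
        requests.delete(host["links"]["self"], auth=AUTH)

while True:
    purge_stale_hosts()
    time.sleep(60)
```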
Is there a health check on the service(s) being balanced to? If there is, the containers on failed hosts should stop getting traffic from HAProxy quickly (unhealthy threshold × check interval).
An HTTP check with 3 failed checks at an interval of 2 seconds, so unhealthy containers should be dropped within roughly 6 seconds.
The behavior I saw earlier today only seemed to resolve after I manually removed the disconnected hosts through the UI; before that I had intermittent failures reaching my app for several minutes. The app and load balancer are both set to be global, in case that makes any difference.
I will be doing some testing in our staging env next week and will hopefully have more info at that time.
Part of what I am seeing might be not so much the load balancers not updating, but that when I lose a host with a load balancer container on it, the list of load balancer IPs is not updated. This results in traffic being sent to load balancer containers that no longer exist.
Ah yes… If you’ve got your own DNS entries listing the host IPs then there’s no way for us to remove them from there. What some people have done before is put something like Amazon ELB as a ‘dumb’ TCP balancer in front of the Rancher one, so it can automatically remove failing IPs.
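For example, with boto3 a classic ELB doing plain TCP with its own health check looks roughly like this; the names, ports, subnets, and instance IDs are placeholders.

```python
# Rough sketch: put a classic ELB in front of the Rancher balancer hosts as a
# dumb TCP pass-through with its own health check. All names and IDs below
# are placeholders.
import boto3

elb = boto3.client("elb")

elb.create_load_balancer(
    LoadBalancerName="rancher-front",
    Listeners=[{"Protocol": "TCP", "LoadBalancerPort": 80,
                "InstanceProtocol": "TCP", "InstancePort": 80}],
    Subnets=["subnet-aaaa1111", "subnet-bbbb2222"],
)

# The ELB health check is what drops a dead Rancher LB host from rotation.
elb.configure_health_check(
    LoadBalancerName="rancher-front",
    HealthCheck={"Target": "TCP:80", "Interval": 10, "Timeout": 5,
                 "UnhealthyThreshold": 2, "HealthyThreshold": 2},
)

# Register the instances that run the Rancher load balancer containers.
elb.register_instances_with_load_balancer(
    LoadBalancerName="rancher-front",
    Instances=[{"InstanceId": "i-0123456789abcdef0"}],
)
```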
In the last release we added the Route53 DNS catalog item which can automatically keep a Route53 zone up to date with the active host IPs for each service (i.e. serviceName.stackName.environmentName.yourzone.com), and then you can CNAME that to the actual names you want to balance.
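If you script that CNAME rather than clicking it together in the console, it is a single call with boto3; the hosted zone ID and names here are placeholders.

```python
# Sketch: CNAME your public name to the record the Route53 catalog item keeps
# up to date for the service. Hosted zone ID and names are placeholders.
import boto3

route53 = boto3.client("route53")

route53.change_resource_record_sets(
    HostedZoneId="Z1EXAMPLE",
    ChangeBatch={
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "app.example.com.",
                "Type": "CNAME",
                "TTL": 60,
                "ResourceRecords": [
                    {"Value": "serviceName.stackName.environmentName.yourzone.com"}
                ],
            },
        }]
    },
)
```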
There are some bugs in that release with handling disconnected hosts, though, which you might run into; those will be fixed in the next release (0.46).
I wrote a Rancher reaper service that periodically purges disconnected Rancher hosts that no longer exist in AWS. You could give it a try while waiting for an officially endorsed solution.
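The gist of it is just a cross-check between the Rancher host list and the instances EC2 still reports as running. A rough Python equivalent is below; matching on agentIpAddress is my assumption and the field you match on may vary by setup.

```python
# Sketch of the reaper idea: delete Rancher hosts whose agent IP no longer
# matches any running EC2 instance. Assumes the v1 API and boto3; the field
# used for matching (agentIpAddress) may differ by setup.
import os

import boto3
import requests

RANCHER_URL = os.environ["RANCHER_URL"]
AUTH = (os.environ["RANCHER_ACCESS_KEY"], os.environ["RANCHER_SECRET_KEY"])

def running_ec2_ips():
    ec2 = boto3.client("ec2")
    ips = set()
    pages = ec2.get_paginator("describe_instances").paginate(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    )
    for page in pages:
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                ips.add(instance.get("PrivateIpAddress"))
    return ips

def reap():
    live_ips = running_ec2_ips()
    hosts = requests.get(f"{RANCHER_URL}/hosts", auth=AUTH).json()["data"]
    for host in hosts:
        if host["state"] in ("reconnecting", "disconnected") \
                and host.get("agentIpAddress") not in live_ips:
            # Same deactivate-then-delete flow as in the earlier sketch.
            actions = host.get("actions", {})
            if "deactivate" in actions:
                requests.post(actions["deactivate"], auth=AUTH)
            requests.delete(host["links"]["self"], auth=AUTH)

if __name__ == "__main__":
    reap()
```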
I wrote wmbutler/rancher-purge. It's a container with configurable environment variables and should take no more than a few minutes to configure and deploy.