I am using AWS autoscaling to bring up hosts that register with Rancher on boot. However, when the machines are shut down after load decreases, they end up in a reconnecting state, and traffic from Rancher LBs seems to continue to flow to the containers on these instances until I manually remove them in the UI. This caused a period of outage, as some of the incoming requests would time out.
The first thing that needs to change is that as soon as a host goes into reconnecting, all traffic to its containers should cease; all LBs and service aliases should be aware of this.
The second thing, which would be a nice-to-have, is an option in Rancher so that any host disconnected for more than a configurable timeout is deactivated and purged.
Please let me know if there are any workarounds or things I could try to mitigate the outages when spinning down instances.
One workaround may be to write a script that periodically hits the API to delete inactive hosts from Rancher.
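A minimal sketch of such a script against the Rancher v1 API, using only the standard library. The URL, API keys, and the exact host state names are placeholders/assumptions — they vary by deployment and Rancher version — and a host generally has to be deactivated before it can be removed:

```python
import base64
import json
import urllib.request

RANCHER_URL = "http://rancher.example.com:8080/v1"  # placeholder
ACCESS_KEY = "your-access-key"                      # placeholder
SECRET_KEY = "your-secret-key"                      # placeholder

# States that indicate a host is no longer reachable (assumption:
# exact state names can differ between Rancher versions).
PURGEABLE_STATES = {"reconnecting", "disconnected", "inactive"}


def select_purgeable(hosts):
    """Return the ids of hosts whose state marks them as unreachable."""
    return [h["id"] for h in hosts if h.get("state") in PURGEABLE_STATES]


def api(path, method="GET"):
    """Minimal authenticated call to the Rancher v1 API."""
    auth = base64.b64encode(f"{ACCESS_KEY}:{SECRET_KEY}".encode()).decode()
    req = urllib.request.Request(
        RANCHER_URL + path,
        method=method,
        headers={"Authorization": "Basic " + auth},
    )
    with urllib.request.urlopen(req) as resp:
        body = resp.read()
        return json.loads(body) if body else None


def purge_dead_hosts():
    hosts = api("/hosts")["data"]
    for host_id in select_purgeable(hosts):
        # Deactivate first, then remove.
        api(f"/hosts/{host_id}/?action=deactivate", method="POST")
        api(f"/hosts/{host_id}/?action=remove", method="POST")

# Run purge_dead_hosts() on a schedule, e.g. from cron every minute.
```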
Integration that lets the host automatically unregister itself in response to a spot instance shutdown event is coming soon…
Yeah, for deleting hosts I was going to write a script until the functionality is added to Rancher. In reality, the issue with traffic seemingly still going to disconnected hosts is my most immediate problem.
Awesome, thanks for letting me know.
Is there a health check on the service(s) being balanced to? If there is, the containers on failed hosts should stop getting traffic from haproxy quickly (unhealthy threshold × timeout).
Yes I do have health checks on the services
HTTP check, configured for 3 failed checks at an interval of 2 seconds.
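For reference, the equivalent health check block in `rancher-compose.yml` looks roughly like this (a sketch — the service name and port are placeholders, and the key names follow the Rancher 1.x health check schema; `interval` is in milliseconds):

```yaml
myservice:
  health_check:
    port: 80
    interval: 2000           # check every 2 seconds
    unhealthy_threshold: 3   # 3 failed checks marks the container down
    healthy_threshold: 2
    response_timeout: 2000
    request_line: GET / HTTP/1.0
```

With these settings, haproxy should stop sending traffic to a failed container after roughly unhealthy_threshold × interval ≈ 6 seconds.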
The behavior I saw earlier today seemed to be resolved only after I manually removed the disconnected hosts through the UI; before that, I had intermittent failures reaching my app for several minutes. The app and load balancer are set to be global, in case that makes any difference.
I will be doing some testing in our staging env next week and will hopefully have more info at that time.
Thanks for the quick responses.
Part of what I am seeing might be not so much the load balancers failing to update, but that when I lose a host with a load balancer container on it, the list of load balancer IPs is not updated. This results in traffic flowing to load balancer containers that no longer exist.
Ah yes… If you’ve got your own DNS entries listing the host IPs then there’s no way for us to remove them from there. What some people have done before is put something like Amazon ELB as a ‘dumb’ TCP balancer in front of the Rancher one, so it can automatically remove failing IPs.
In the last release we added the Route53 DNS catalog item, which can automatically keep a Route53 zone up to date with the active host IPs for each service (i.e. serviceName.stackName.environmentName.yourzone.com); you can then CNAME that to the actual names you want to balance.
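For example, the CNAME setup might look like this in zone-file style (all names here are placeholders following the serviceName.stackName.environmentName.yourzone.com pattern above):

```
; your public name points at the Rancher-managed Route53 record,
; which tracks the active host IPs for the service
www.example.com.  300  IN  CNAME  web.mystack.default.rancher.example.com.
```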
There are some bugs in that release with handling disconnected hosts though which you might run in to, those will be fixed in the next release (0.46).
Could you point me to the GitHub issues for those bugs you mentioned? It would be helpful to know what to expect.
Awesome I am upgrading as I type this.
Was curious if there was an updated timeline for host removal on spot instance shutdown?
It exists, but we haven’t actually declared support or documentation for it.
Could you add an issue in Github so that we could track it to make it more official?
Interested in this functionality as well. Is there an issue to follow for this or somewhere to go look at examples?
Any updates on unregistering a host? I’m experiencing lots of networking issues when we scale in hosts in my k8s cluster. https://github.com/rancher/rancher/issues/8086
I wrote a Rancher reaper service that periodically purges Rancher disconnected hosts that no longer exist in AWS. You could give that a try while waiting for an officially endorsed solution.
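The core of such a reaper can be reduced to one comparison: which Rancher hosts reference an EC2 instance that AWS no longer reports as running. A sketch of that logic (the `instanceId` field name is an assumption — real host payloads differ depending on how the host was registered — and in the real service the live instance ids would come from something like EC2 `DescribeInstances` via boto3):

```python
def hosts_to_reap(rancher_hosts, live_instance_ids):
    """Pick Rancher hosts whose backing EC2 instance no longer exists.

    rancher_hosts: list of dicts with an 'id' and the EC2 instance id
    under 'instanceId' (field name is an assumption).
    live_instance_ids: set of instance ids currently reported by EC2,
    e.g. gathered via boto3 DescribeInstances.
    """
    return [
        h["id"]
        for h in rancher_hosts
        if h.get("instanceId") not in live_instance_ids
    ]
```

The hosts this returns would then be deactivated and removed through the Rancher API on each pass of the reaper.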
I wrote wmbutler/rancher-purge. It's a container with configurable environment variables; it should take no more than a few minutes to configure and deploy.