I am using AWS autoscaling to bring up hosts that register with Rancher on boot. However, when the machines are shut down after load decreases, they end up in a reconnecting state, and traffic from the Rancher LBs seems to continue to flow to the containers on those instances until I manually remove them in the UI. This caused a period of outage, as some of the incoming requests would time out.
The first thing that needs to change is that as soon as a host goes into reconnecting, all traffic to its containers should cease; all LBs and service aliases should be aware of this.
The second thing, which would be a nice-to-have, is an option in Rancher to deactivate and purge any host that has been disconnected for longer than a configurable timeout.
Please let me know if there are any workarounds or things I could try to mitigate the outages when spinning down instances.
Yeah, for deleting hosts I was going to write a script until the functionality is added to Rancher. In reality, though, the traffic that still seems to go to disconnected hosts is my most immediate problem.
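For what it's worth, here is roughly what I had in mind: a minimal sketch in Python against what I understand of the Rancher v1 API. The URL, key names, host states, and the deactivate/delete flow are assumptions on my part and may differ by Rancher version.

```python
# Minimal sketch: remove hosts stuck in "reconnecting"/"disconnected" for longer
# than a configurable timeout. Assumes Rancher's v1 API and an environment API
# key pair; state names and the deactivate/delete flow may differ by version.
import os
import time

import requests

RANCHER_URL = os.environ["RANCHER_URL"]  # e.g. http://rancher.example.com:8080/v1
AUTH = (os.environ["RANCHER_ACCESS_KEY"], os.environ["RANCHER_SECRET_KEY"])
TIMEOUT_SECONDS = int(os.environ.get("PURGE_TIMEOUT", "300"))

seen_down_since = {}  # host id -> first time we saw it disconnected

def purge_stale_hosts():
    hosts = requests.get(f"{RANCHER_URL}/hosts", auth=AUTH).json()["data"]
    now = time.time()
    for host in hosts:
        if host["state"] not in ("reconnecting", "disconnected"):
            seen_down_since.pop(host["id"], None)
            continue
        first_seen = seen_down_since.setdefault(host["id"], now)
        if now - first_seen < TIMEOUT_SECONDS:
            continue
        # Deactivate first, then delete; Rancher exposes these as
        # actions/links on the host resource.
        actions = host.get("actions", {})
        if "deactivate" in actions:
            requests.post(actions["deactivate"], auth=AUTH)
        requests.delete(host["links"]["self"], auth=AUTH)

while True:
    purge_stale_hosts()
    time.sleep(60)
```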
Is there a health check on the service(s) being balanced to? If there is, the containers on failed hosts should stop getting traffic from HAProxy quickly (unhealthy threshold × check interval).
An HTTP check with 3 failed checks at an interval of 2 seconds, so unhealthy containers should be dropped within roughly 6 seconds.
The behavior I saw earlier today only seemed to resolve after I manually removed the disconnected hosts through the UI; before that I had intermittent failures reaching my app for several minutes. The app and load balancer are both set to be global, in case that makes any difference.
I will be doing some testing in our staging env next week and will hopefully have more info at that time.
Part of what I am seeing might be not so much the load balancers not updating, but that when I lose a host with a load balancer container on it, the list of load balancer IPs is not updated. This results in traffic being sent to load balancer containers that no longer exist.
Ah yes… If you’ve got your own DNS entries listing the host IPs then there’s no way for us to remove them from there. What some people have done before is put something like Amazon ELB as a ‘dumb’ TCP balancer in front of the Rancher one, so it can automatically remove failing IPs.
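For example, with boto3 a classic ELB doing plain TCP with its own health check looks roughly like this; the names, ports, subnets, and instance IDs are placeholders.

```python
# Rough sketch: put a classic ELB in front of the Rancher balancer hosts as a
# dumb TCP pass-through with its own health check. All names and IDs below
# are placeholders.
import boto3

elb = boto3.client("elb")

elb.create_load_balancer(
    LoadBalancerName="rancher-front",
    Listeners=[{"Protocol": "TCP", "LoadBalancerPort": 80,
                "InstanceProtocol": "TCP", "InstancePort": 80}],
    Subnets=["subnet-aaaa1111", "subnet-bbbb2222"],
)

# The ELB health check is what drops a dead Rancher LB host from rotation.
elb.configure_health_check(
    LoadBalancerName="rancher-front",
    HealthCheck={"Target": "TCP:80", "Interval": 10, "Timeout": 5,
                 "UnhealthyThreshold": 2, "HealthyThreshold": 2},
)

# Register the instances that run the Rancher load balancer containers.
elb.register_instances_with_load_balancer(
    LoadBalancerName="rancher-front",
    Instances=[{"InstanceId": "i-0123456789abcdef0"}],
)
```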
In the last release we added the Route53 DNS catalog item which can automatically keep a Route53 zone up to date with the active host IPs for each service (i.e. serviceName.stackName.environmentName.yourzone.com), and then you can CNAME that to the actual names you want to balance.
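If you script that CNAME rather than clicking it together in the console, it is a single call with boto3; the hosted zone ID and names here are placeholders.

```python
# Sketch: CNAME your public name to the record the Route53 catalog item keeps
# up to date for the service. Hosted zone ID and names are placeholders.
import boto3

route53 = boto3.client("route53")

route53.change_resource_record_sets(
    HostedZoneId="Z1EXAMPLE",
    ChangeBatch={
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "app.example.com.",
                "Type": "CNAME",
                "TTL": 60,
                "ResourceRecords": [
                    {"Value": "serviceName.stackName.environmentName.yourzone.com"}
                ],
            },
        }]
    },
)
```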
There are some bugs in that release with handling disconnected hosts, though, which you might run into; those will be fixed in the next release (0.46).
I wrote a Rancher reaper service that periodically purges disconnected Rancher hosts that no longer exist in AWS. You could give it a try while waiting for an officially endorsed solution.
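The gist of it is just a cross-check between the Rancher host list and the instances EC2 still reports as running. A rough Python equivalent is below; matching on agentIpAddress is my assumption and the field you match on may vary by setup.

```python
# Sketch of the reaper idea: delete Rancher hosts whose agent IP no longer
# matches any running EC2 instance. Assumes the v1 API and boto3; the field
# used for matching (agentIpAddress) may differ by setup.
import os

import boto3
import requests

RANCHER_URL = os.environ["RANCHER_URL"]
AUTH = (os.environ["RANCHER_ACCESS_KEY"], os.environ["RANCHER_SECRET_KEY"])

def running_ec2_ips():
    ec2 = boto3.client("ec2")
    ips = set()
    pages = ec2.get_paginator("describe_instances").paginate(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    )
    for page in pages:
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                ips.add(instance.get("PrivateIpAddress"))
    return ips

def reap():
    live_ips = running_ec2_ips()
    hosts = requests.get(f"{RANCHER_URL}/hosts", auth=AUTH).json()["data"]
    for host in hosts:
        if host["state"] in ("reconnecting", "disconnected") \
                and host.get("agentIpAddress") not in live_ips:
            # Same deactivate-then-delete flow as in the earlier sketch.
            actions = host.get("actions", {})
            if "deactivate" in actions:
                requests.post(actions["deactivate"], auth=AUTH)
            requests.delete(host["links"]["self"], auth=AUTH)

if __name__ == "__main__":
    reap()
```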
I wrote wmbutler/rancher-purge. It's a container with configurable environment variables and should take no more than a few minutes to configure and deploy.