'True HA': Best way to get HA across Rancher hosts?

Hi there, all:

I’m new to Rancher, passably well versed in Docker. I’ve read most of the Rancher documentation, and I’m still unable to answer a nagging question that I have about HA.

I understand that you can configure a service to be global, meaning that it runs on all hosts. I also understand that Rancher will automatically integrate with Route 53 to publish host names for services that are exposed on each host.

What I do not understand, though, is how people achieve multi-host (multi-AZ) HA with this setup.

Consider, for example, that I have 3 hosts running, all of them running the same stack. Each host might be reachable on its host IP via myapp.my-host.mydomain.com. My goal in doing this is to get multi-AZ reliability.

But how do I present one face to the customer, since I now have three hosts? It seems to me there are really only two options:

(1) Use an external DNS server to list all three hosts. This works, but it’s kind of annoying because this external DNS doesn’t know about the Rancher lifecycle: I have to manually update these entries as I add copies of my stack (or shut them down). A rough sketch of that manual bookkeeping is below, after option 2.

(2) Point all users to one host running a load balancer, which sends traffic to the three hosts running the stack. This has the benefit that it can be Rancher-managed, but the drawback that the balancer is a single point of failure.
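
For concreteness, here is roughly what the manual bookkeeping in option 1 would look like if I scripted it with boto3 against Route 53. The hosted zone ID, record name, and host IPs are all placeholders, and the point is just that I have to rerun something like this by hand whenever hosts come or go:

```python
# Hypothetical sketch: manually keeping a round-robin A record in sync
# with the hosts that currently run the stack (what option 1 implies).
import boto3

route53 = boto3.client("route53")

HOSTED_ZONE_ID = "ZXXXXXXXXXXXXX"               # placeholder hosted zone
HOST_IPS = ["1.1.1.1", "2.2.2.2", "3.3.3.3"]    # placeholder host IPs

route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={
        "Changes": [{
            "Action": "UPSERT",  # rerun by hand every time a host is added or removed
            "ResourceRecordSet": {
                "Name": "myapp.mydomain.com.",   # placeholder public name
                "Type": "A",
                "TTL": 60,
                "ResourceRecords": [{"Value": ip} for ip in HOST_IPS],
            },
        }]
    },
)
```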

Are there other ways I’m not considering? Were I to go with option 1, can Rancher manage a single Route 53 DNS alias that collectively points to all copies of my stack across the hosts? Or, with option 2, is there a way to mitigate the single-point-of-failure problem for the entry-point router/LB?

Any help would be appreciated! I feel like I must be missing something: Rancher has a ton of features, but I have not seen a treatment of this particular topic.

Thanks in advance for the help!
Dave

The Route 53 integration (external-dns; Route 53 is just one of several providers it supports) creates an A record which contains all the IPs of the hosts that are running the service. So if your hosts are 1.1.1.1, 2.2.2.2, and 3.3.3.3, it creates a record for <service>.<stack>.<environment>.<domain> with all 3 of those, and the client picks one. (And then you can point www.yourdomain.com or whatever at that name with a CNAME.) If a container or the host it’s on fails, the appropriate IP is removed from the entry. The default TTL is 60 seconds, so a failure can result in traffic going to a bad host for up to a minute. This works the same if the service happens to be a load balancer.
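
If you want to see what clients actually get back, a quick check like this lists the A records behind the name (the hostname below is just a stand-in for whatever <service>.<stack>.<environment>.<domain> ends up being in your setup):

```python
# List the A records a client sees for the external-dns managed name.
import socket

name = "myapp.mystack.myenv.mydomain.com"  # placeholder for your real record
infos = socket.getaddrinfo(name, 80, proto=socket.IPPROTO_TCP)
ips = sorted({info[4][0] for info in infos})
print(ips)  # e.g. ['1.1.1.1', '2.2.2.2', '3.3.3.3'] while all three hosts are healthy
```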

If you’re using AWS anyway, you can improve on this by using an ELB to balance across the Rancher services/balancers, because it can detect failed backends and remove them without having to wait for the DNS TTL.
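
As a rough sketch of that (names, region, AZs, and instance IDs are all placeholders, and in practice you might just click this together in the console), a classic ELB with a health check in front of the hosts looks something like this with boto3:

```python
# Hypothetical sketch: a classic ELB in front of the Rancher hosts, with a
# health check so failed backends drop out without waiting on the DNS TTL.
import boto3

elb = boto3.client("elb", region_name="us-east-1")  # placeholder region

elb.create_load_balancer(
    LoadBalancerName="myapp-frontend",  # placeholder name
    Listeners=[{
        "Protocol": "HTTP",
        "LoadBalancerPort": 80,
        "InstanceProtocol": "HTTP",
        "InstancePort": 80,
    }],
    AvailabilityZones=["us-east-1a", "us-east-1b", "us-east-1c"],  # one host per AZ
)

elb.configure_health_check(
    LoadBalancerName="myapp-frontend",
    HealthCheck={
        "Target": "HTTP:80/",     # whatever path your stack/balancer answers on
        "Interval": 10,           # seconds between checks
        "Timeout": 5,
        "UnhealthyThreshold": 2,  # ~20s to eject a dead host vs. waiting on a 60s TTL
        "HealthyThreshold": 3,
    },
)

elb.register_instances_with_load_balancer(
    LoadBalancerName="myapp-frontend",
    Instances=[{"InstanceId": i} for i in ("i-aaaa1111", "i-bbbb2222", "i-cccc3333")],  # placeholders
)
```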

You can also use a Virtual IP solution to implement something similar to your option 2, but with automatic failover.

We use this in our deployments without Rancher, and I have been testing a Rancher-compatible version developed mainly by a fellow forum member. Basically, one or more hosts share one or more external IPs, and if the active host goes down, another one is elected as the “answering”/“active” host.
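
Just to illustrate the shape of it, here is a toy sketch of the failover idea. This is not how keepalived actually works (it speaks VRRP and handles priorities, gratuitous ARP, releasing the IP, etc. for you), and the IPs and interface name are made up; it only shows the “standby watches the active host and claims the shared IP if it dies” pattern:

```python
# Toy illustration of the virtual-IP failover idea (standby node side).
# Requires root on Linux; real deployments should use keepalived/VRRP instead.
import socket
import subprocess
import time

VIP = "203.0.113.10/32"   # hypothetical shared external IP
IFACE = "eth0"            # hypothetical interface the VIP lives on
ACTIVE_PEER = "10.0.0.11" # hypothetical address of the currently active host

def peer_alive(host, port=80, timeout=2):
    """Crude liveness probe of the currently active host."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

holding_vip = False
while True:
    if not holding_vip and not peer_alive(ACTIVE_PEER):
        # Active host looks dead: claim the shared IP on this box.
        subprocess.run(["ip", "addr", "add", VIP, "dev", IFACE], check=False)
        holding_vip = True
    time.sleep(5)
```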

If you are on Amazon it may be easiest to just use an ELB, but in our case it’s a colo, so we like this approach better…

Here is the discussion with links to the GitHub repo, etc.: Rancher + Keepalived