Multi-AZ HA for Rancher Hosts in Amazon (AWS EC2)

We are attempting to achieve true multi-AZ HA in one AWS region. I believe I have achieved HA for my rancher-server, using a multi-AZ MySQL RDS instance to preserve the rancher-server state and an AutoScalingGroup to keep a rancher-server EC2 instance running at all times.

I have also added several hosts to the rancher-server, making sure there are EC2 instances in both of our AZs. Deploying a number of EC2s in that manner isn’t sufficient to achieve HA. Something would need to terminate/restart an individual host that isn’t healthy. If there is an availability zone outage, something would need to start a bank of hosts in the remaining AZ.

Is there a way to make rancher do those terminates/starts automatically? If not, how do others implement multi-AZ HA of the rancher host nodes?

It seems that an AWS AutoScalingGroup could be used, but I would want assurance that this is the correct approach before expending the effort. A LaunchConfiguration could automate the steps for adding custom hosts (https://docs.rancher.com/rancher/v1.4/en/hosts/custom/). However, a <registrationToken> is required, and I’m not sure that a single token can be used repeatedly. When I go into the rancher-server web console to add a custom host, I do see a lengthy token embedded in the full ‘sudo docker run’ command that a new EC2 instance needs to run to join the rancher cluster.
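Roughly, the setup I have in mind would be something like the following, sketched with the AWS CLI rather than CloudFormation. The AMI ID, subnet IDs, instance type, instance counts, and user-data script name are all placeholders, and the user-data script would contain the custom-host registration steps:

# Placeholder values throughout; the user-data file would hold the host registration commands
aws autoscaling create-launch-configuration \
    --launch-configuration-name rancher-host-lc \
    --image-id ami-xxxxxxxx \
    --instance-type t2.medium \
    --user-data file://register-rancher-host.sh

# One ASG spanning both AZs keeps the host count constant and replaces unhealthy instances
aws autoscaling create-auto-scaling-group \
    --auto-scaling-group-name rancher-hosts \
    --launch-configuration-name rancher-host-lc \
    --min-size 4 --max-size 4 --desired-capacity 4 \
    --vpc-zone-identifier "subnet-aaaaaaaa,subnet-bbbbbbbb"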

The closest forum post I found regarding that AWS ASG approach was from October 2016 (AWS Autoscaling Group hosts registering at Rancher Server). They seemed to have difficulty reusing the <registrationToken> as I described. They got no responses.

There were a couple other posts that seemed relevant. These were discussing auto-scaling, rather than simply providing HA at the same scale. These were discussions of extending rancher features, and it wasn’t made clear whether my use case would work. (See Rancher Host Autoscaling and https://github.com/rancher/rancher/issues/3893.)

Not sure if it helps, but please take a look at:

Thanks. That is the exact approach I want to take. It will be very helpful for fleshing out the details (but with CloudFormation).

I did figure out the answer to my central question about the registration token: I am pretty sure one must be generated for each new host, which can be done using the rancher API (through HTTP requests).

I found a script that will generate the registration token. The script is at https://yayprogramming.com/auto-connect-rancher-hosts/ and uses the Rancher API documented at https://docs.rancher.com/rancher/v1.4/en/cli/ and https://docs.rancher.com/rancher/v1.4/en/hosts/custom/.

Roughly, this is executed on the new host node by its ASG LaunchConfiguration as it bootstraps itself:

RANCHER_URL="http://rancher..........donorschoose.org"
PROJECT_ID="$(curl -s $RANCHER_URL/v1/projects | jq -r ".data[0].id")"
curl -s -X POST $RANCHER_URL/v1/registrationtokens?projectId=$PROJECT_ID
TOKEN="$(curl -s $RANCHER_URL/v1/registrationtokens?projectId=$PROJECT_ID | jq -r '.data[0].token')"

We’re running in a private subnet without access controls yet, so YMMV. Something like this would run the rancher/agent on the new host, taking the newly generated registration token:

AGENT_VERSION="v1.2.0"
docker run -d --privileged -v /var/run/docker.sock:/var/run/docker.sock \
    -v /var/lib/rancher:/var/lib/rancher rancher/agent:$AGENT_VERSION \
    $RANCHER_URL/v1/scripts/$TOKEN

The registration token can be reused indefinitely (for that project/environment). Unless you’re dynamically spinning up multiple installations and projects, you can just put the registration command in directly instead of looking a token up through the API, which would require authentication if access control is enabled on the server.
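In other words, the user-data can be reduced to just the registration command, with the token pasted in from the Add Custom Host screen; the server URL and token below are placeholders:

#!/bin/bash
# Registration command copied from the UI; <registrationToken> stands in for the real token
sudo docker run -d --privileged -v /var/run/docker.sock:/var/run/docker.sock \
    -v /var/lib/rancher:/var/lib/rancher rancher/agent:v1.2.0 \
    http://<rancher-server>/v1/scripts/<registrationToken>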

There is also a setting in 1.5+ to auto-remove hosts that are disconnected for longer than X seconds, which is useful with an ASG. It’s not in the UI yet, but it’s available at /v2-beta/settings/host.remove.delay.seconds.
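Something along these lines should set it, assuming the usual settings PUT with a JSON value; it’s shown without authentication, and 300 seconds is just an example:

# Example only: auto-remove hosts after they've been disconnected for 300 seconds
# (add API-key credentials with -u if access control is enabled)
curl -s -X PUT \
    -H 'Content-Type: application/json' \
    -d '{"value":"300"}' \
    "$RANCHER_URL/v2-beta/settings/host.remove.delay.seconds"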

Vincent is correct. Indeed, the registration token doesn’t change.

I now have an AMI captured from a snapshot of a host node that was spun up using the Rancher console. The problem is that every time I spin up a new EC2 instance and run the registration commands, it replaces one of the existing host nodes rather than joining the cluster. The older node is still running; it just no longer appears on the Infrastructure > Hosts screen.

Is there some ID or Name that I’ve inadvertently captured into the AMI? That’s all I can think of that would cause the problem.

This is my command sequence on the new EC2:

RANCHER_URL=http://rancher.dctest2.aws.donorschoose.org
PROJECT_ID=$(wget $RANCHER_URL/v1/projects -O - | jq -r ".data[0].id")
wget --post-data="" $RANCHER_URL/v1/registrationtokens?projectId=$PROJECT_ID
TOKEN="$(wget -O - $RANCHER_URL/v1/registrationtokens?projectId=$PROJECT_ID | jq -r ".data[0].token")"
sudo docker run -d --privileged -v /var/run/docker.sock:/var/run/docker.sock -v /var/lib/rancher:/var/lib/rancher rancher/agent:v1.2.0 $RANCHER_URL/v1/scripts/$TOKEN

Yes, the host has an ‘identity’ that you’ve captured in the image, so each new host looks like the same one and takes it over. Run rm -rf /var/lib/rancher/state and (on newer versions) docker volume rm rancher-agent-state before imaging. You might as well also remove the agent container if it’s still there, since it will get replaced during registration anyway.
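The pre-imaging cleanup on the node would then be something like this (the rancher-agent container name is what I’d expect, but check docker ps -a on your node):

# Run on the node just before capturing the AMI, so each new instance registers with a fresh identity
sudo docker rm -f rancher-agent             # remove the agent container if present (name may differ; check docker ps -a)
sudo rm -rf /var/lib/rancher/state          # on-disk agent state
sudo docker volume rm rancher-agent-state   # newer versions keep state in this named volume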

Thanks! By removing the agent container, you don’t mean removing the image from the Docker image cache, correct? For several hours this morning we were getting timeout errors downloading images from docker.io, so one reason I am building out this AMI and ASG approach is to avoid that single point of failure.

No, I mean the defined container shown in docker ps -a. The cached image is fine, though you’ll have to update it for each release to keep it being the one that will actually be run.
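In other words, when building each AMI you’d refresh the cached agent image so the host doesn’t depend on docker.io at boot, along these lines:

# Pre-pull the agent image for the release you're running before capturing the AMI
# (v1.2.0 matches the agent version used above; bump it on each Rancher upgrade)
sudo docker pull rancher/agent:v1.2.0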