Hi guys, I just went through setting up an HA Rancher cluster with 3 nodes twice, and I have to say the process was quite unpleasant and could use more documentation.
I am writing this post to get some feedback on where I might have gone wrong, or where I could have looked to debug issues faster the next time I set up rancher-ha.
First of all, it is not at all clear that you have to start the rancher-ha script on all nodes before anything happens. I tried for almost an hour to access the admin UI, thinking that starting one node should at least show me something along the lines of "cluster not available". Only once I had started the script on multiple nodes, and cattle apparently found its peers, did anything come up. Before that it was just error messages in the logs and no connectivity at all.
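For the record, what finally worked for me was kicking the script off on all three nodes in quick succession. A minimal sketch of that (the node IPs and ssh user are placeholders, and the `echo` makes it a dry run - drop it to actually execute):

```shell
#!/bin/sh
# Dry-run sketch: print the start command for every HA node.
# Replace the placeholder IPs with your three server nodes and
# remove the 'echo' to actually run the script over ssh.
start_all_nodes() {
  for node in 10.0.0.1 10.0.0.2 10.0.0.3; do
    echo "ssh rancher@$node sudo ./rancher-ha.sh"
  done
}

start_all_nodes
```

The point is simply that a single node on its own will sit there logging errors until its peers appear.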
After this we got the system up and running, did some HA tests (shutting down nodes, upgrading from Rancher 1.1.3 to 1.1.4, etc.), and were satisfied with the results, so we decided to move our existing Rancher 1.1.2 setup to the new HA cluster.
We decided to wait a week before migrating, and I shut down all instances running the HA cluster. Today I fired them back up and expected everything to come back to life and run.
No such luck - Rancher did not start at all at first. I spent almost an hour rebooting the instances one after another until docker logs rancher-ha showed me something besides failed connection attempts. At some point I could clearly see that Rancher was somewhat up, but I could not access it through the web, as the proxy on port 81 was apparently not running on any node.
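A quick way to see which endpoints are actually serving on a given node (a sketch - I am assuming Rancher 1.x's /ping health endpoint here; in a working cluster both ports should answer):

```shell
#!/bin/sh
# Probe the HA proxy port (81) and the direct server port (18080)
# on the local node. Assumes Rancher 1.x answers GET /ping when
# healthy; a port that doesn't respond is reported as down.
probe_ports() {
  for port in 81 18080; do
    if curl -sf --max-time 2 "http://localhost:$port/ping" >/dev/null 2>&1; then
      echo "port $port: serving"
    else
      echo "port $port: not responding"
    fi
  done
}

probe_ports
```

Running this on each node would have told me much sooner that 81 was dead everywhere while 18080 was fine.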
So I changed my ELB to point to port 18080 and got in - only to find I could not log into the system with GitHub OAuth, for some unknown reason.
At this point I tried disabling authentication through the database and could see that only 1 of 3 nodes was actually part of the cluster. So I rebooted everything once again and re-enabled the OAuth githubconfig in the database.
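For anyone else who needs that auth escape hatch: in Rancher 1.x the switch lives in the setting table of the cattle database. Roughly this is what I mean (verify the row name against your own schema first - it can differ between versions):

```sql
-- Turn API auth off so the UI is reachable without GitHub OAuth.
-- Check the exact setting name in your schema before running this.
UPDATE setting SET value = 'false' WHERE name = 'api.security.enabled';
```

Setting it back to 'true' re-enables auth once OAuth works again.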
Without any changes I could log in with GitHub this time (no idea why it worked on instance reboot number 6).
Once again only 1 of 3 instances came back up, so I completely killed Rancher on the machines that apparently failed to reconnect to the cluster:
sudo docker rm -f $(docker ps -a -q) && sudo rm -rf /var/lib/rancher && sudo ./rancher-ha.sh
That has worked now and my cluster is back up and running. But I am not sure I ever want to shut down the nodes again.
Now that everything seems to be running again I noticed some things that confuse me:
Since I can access the admin UI on 18080 - why is there a proxy service on port 81 that serves the same UI? Also, if this proxy service is supposed to provide HA, why is it running twice on one host and zero times on the other 2 hosts in the cluster?
Another thing I noticed while I was rebuilding the cluster: the services status dashboard was completely wrong. It showed services and stacks as up and running although the host machines running those services were down. I had previously stopped all 5 instances (3 servers and 2 hosts) through the AWS console, and even though the 2 hosts were still in "reconnecting" state in the Infrastructure tab, the services dashboard showed all green, even with running service instances (green circles). At least a yellow warning sign or something would be great here to indicate that a host is missing - at first glance I thought everything was running, although I had totally forgotten to start up the host machines.
The setup we are running is Rancher 1.1.4 on AWS. The database is MySQL on AWS RDS, and we are running Docker.
Is this normal? Is everyone experiencing issues like these with HA?
Thanks for your feedback,