I am working with a brilliant team of sysadmins and engineers who are saying, with great conviction and after weeks of effort, that Rancher HA is NOT STABLE right now, and they do not recommend that we use it. We are using an uncomplicated 3-node setup with external Cisco firewalls.
I am pushing back strongly on this with my team, but I would love any feedback from those who have done Rancher HA in an enterprise production environment.
Let me know what you think.
Why are they saying it’s not stable?
HA ALWAYS depends on multiple components in your setup. If you're using a MySQL database that isn't highly available, you don't have HA, simple as that. The same goes for the external load balancer and everything else that can affect your setup (network connections, storage, etc.).
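To make that point concrete, here is a minimal sketch of auditing every tier the HA setup depends on, by probing TCP reachability. The hostnames, ports, and the `DEPENDENCIES` inventory are all hypothetical placeholders, not anything Rancher ships; the idea is just that if any one entry fails, the setup as a whole is not HA.

```python
import socket

def reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Hypothetical inventory: every tier this HA setup depends on.
# Replace with your real database, load balancer, and node endpoints.
DEPENDENCIES = {
    "mysql":          ("db.example.internal", 3306),
    "external-lb":    ("lb.example.internal", 443),
    "rancher-node-1": ("rancher1.example.internal", 8080),
    "rancher-node-2": ("rancher2.example.internal", 8080),
    "rancher-node-3": ("rancher3.example.internal", 8080),
}

def weakest_links(deps=DEPENDENCIES):
    """Names of dependencies that are currently unreachable (any hit = no HA)."""
    return [name for name, (host, port) in deps.items()
            if not reachable(host, port)]
```

A cron job running `weakest_links()` won't make anything more available, but it does tell you which tier to blame before you blame Rancher itself.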
Personally, I've tried the HA features and think they work great. But I don't really need them, because my applications (containers) will keep running without Rancher.
I have to totally agree with them.
The instances essentially stop forwarding requests (clients get 503s), and I have to restart the network daemon on every node before traffic flows again, even though the nodes themselves are perfectly healthy. On top of that nastiness, HAProxy has crashed on me several times, taking the entire API down with it. I will log in to Rancher, find the load balancer red or yellow with no informative logs, and have to stop and restart it to get it working again.
The really troubling part is that it's random and leaves no logs. That is a recipe for "not happening" in production.
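Since the failures above surface as 503s with nothing in the logs, one stopgap is to probe the load balancer from the outside and record the status yourself, so you at least have a timeline before the manual restart. This is a sketch, not a Rancher feature; the URL is a placeholder.

```python
import urllib.request
import urllib.error

def lb_status(url, timeout=3.0):
    """Return the HTTP status the load balancer answers with,
    or None if it is unreachable at the TCP/DNS level."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code   # e.g. 503 when backends stop forwarding
    except (urllib.error.URLError, OSError):
        return None       # crashed outright, nothing answering

# Hypothetical usage: run this every minute and log the result,
# so the "random" failures at least get timestamps.
# status = lb_status("https://rancher.example.internal/ping")
```

Graphing those statuses over time is often the only evidence you get when the balancer itself writes nothing before it goes red.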
I don't recall any issues at all until I started using certificates. My original architecture used a fixed, non-Docker NGINX reverse proxy that forwarded everything to HAProxy for load balancing. Once I switched the cert endpoints to HAProxy and had the proxies serving TLS directly, that's when issues started cropping up once in a while.
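For reference, the original tier described above looks roughly like this as an NGINX config fragment. All hostnames, ports, and cert paths here are hypothetical placeholders; the point is only the shape: TLS terminates at a host-level NGINX, and plain HTTP is forwarded on to the HAProxy layer.

```nginx
# Sketch of the pre-certificate-move architecture (hypothetical names).
server {
    listen 443 ssl;
    server_name rancher.example.internal;

    ssl_certificate     /etc/nginx/certs/rancher.crt;
    ssl_certificate_key /etc/nginx/certs/rancher.key;

    location / {
        # Forward everything to the HAProxy load-balancing tier.
        proxy_pass http://haproxy.example.internal:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-Proto https;

        # Rancher's UI and agents use websockets; without the
        # upgrade headers those connections silently fail.
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}
```

One design note: keeping TLS at a fixed, non-containerized proxy means a crashing HAProxy container never takes your certificates or external endpoint down with it.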
I really like the security of having all of my containers linked without exposing ports to the outside world or maintaining a firewall on each instance. I have made up my mind, though: if HAProxy craps out on us again in the next week, I'm pulling it and moving everything back to NGINX.
Essentially, what I hear is that one of the services will die and bring down the whole environment. In addition, there are few, if any, clues as to why something died. Recovery is essentially a complete restart of all services.
I know some people are using it in production, and I'm just not sure what makes it work for some and not for others. Some parts must be less than reliable. But which ones?