Need for Rancher HA / Impact if down?

I’m wondering at what point it becomes necessary to set up Rancher 2.x in HA. I have enough experience with Rancher 1.x and Cattle, but with Kubernetes I don’t really know.

From what I understand, all the workloads will continue operating as expected if the Rancher server is down, but you wouldn’t be able to deploy, upgrade or make changes to the various services/pods/whatever while it is down.

What about monitoring/restarting failed containers (i.e. health checking and handling)? Would that continue with the Rancher server down?

In other words, for a small/medium installation, if I have, let’s say, the capability to restore the service within a few minutes, would HA be overkill?

Which brings up the question: how much data does the Rancher container store, and does it change much?

Next there is the cluster. When setting up a cluster, you must run a number of nodes with etcd and the control plane; 1, 3 or 5 is recommended. Obviously 1 is fine for dev, and more are recommended for production.

But what’s the impact if etcd and the control plane are not up? In other words, what is the need versus the cost? Running them on 3 or 5 hosts may be overkill in some circumstances, but one would need to understand the risk and what would be impacted.

Also, is it necessary to run etcd and the control plane on their own dedicated hosts? Would it be reasonable to have, say, 5 nodes that all run workloads, the control plane and etcd?
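(For context, in my case the node roles would just be the flags on the registration command Rancher generates for a custom cluster, so dedicating hosts versus mixing roles is only a matter of which flags each node gets. The sketch below uses placeholder values for the agent tag, server URL, token and checksum, not a real setup:)

```bash
# Registration command for a node that takes all three roles; drop
# --worker (or --etcd / --controlplane) on nodes you want dedicated.
sudo docker run -d --privileged --restart=unless-stopped --net=host \
  -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run \
  rancher/rancher-agent:v2.0.1 \
  --server https://rancher.example.com \
  --token <registration-token> --ca-checksum <checksum> \
  --etcd --controlplane --worker
```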


I am in no way an expert :slight_smile: … but I just wanted to share my thoughts on the same question.

We are currently running a 1.6 cluster and are going to upgrade to 2.0 soon (both K8s). Our 1.6 runs in single-instance mode, and I am most likely going to do the same for 2.0, mostly because in my experience, if Rancher goes down it only affects the API, which of course is critical, but manageable, at least for us.

Rancher has gone down a couple of times for both our 1.6 and 2.0 clusters, and it’s quite easy to get it up again, especially because it runs in a container. The 2.0 cluster is set up with a host-mounted volume, making it even easier :smiley:
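For reference, the host mount is just a bind mount of Rancher’s data directory in the single-node install; something roughly like this (the path and image tag are examples, not necessarily what we run):

```bash
# Single-node Rancher with its state bind-mounted on the host, so the
# container can be recreated or upgraded without losing data.
docker run -d --restart=unless-stopped \
  -p 80:80 -p 443:443 \
  -v /opt/rancher:/var/lib/rancher \
  rancher/rancher:latest
```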

(This is of course only for Rancher itself; in each cluster I will have HA etcd and Kubernetes control plane.)

Again, no expert, but those are my thoughts at least :slight_smile:

Thanks for sharing your experience and thoughts!

It is obviously all a matter of scale… using Rancher and Docker has made the deployment of our internal systems much easier. So in my case, we’re talking about a relatively small deployment.

From all that I read so far of Rancher’s documentation and Kubernetes documentation, books and videos, it seems like the need for running the Rancher container in HA isn’t that great for me.

What I wonder about now is etcd and the control plane. From what I gather, the orchestration is really handled by Kubernetes itself, so if I run 3 x etcd and 3 x control plane, then in theory, even if Rancher itself is down, the pods would still be managed (i.e. if a pod’s container failed, a new one would be started).

I have to test that a bit more…
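A throwaway Deployment with more than one replica should be enough to exercise that; something like the following (names and image are just examples):

```bash
# The ReplicaSet controller (part of the Kubernetes control plane, not of
# Rancher) is what recreates pods from this Deployment if they die.
cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-test
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx-test
  template:
    metadata:
      labels:
        app: nginx-test
    spec:
      containers:
      - name: nginx
        image: nginx:alpine
EOF
```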

Did some smoke tests (rough commands below)… I shut down the Rancher 2.0 container and:

  1. I could no longer use kubectl :frowning:
  2. I killed the main process inside the containers of a deployment with more than one replica, and the containers/pods were recreated.
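For reference, the test was roughly along these lines (container and deployment names are from my setup, adjust to yours):

```bash
# On the host running the Rancher server (container name is an assumption)
docker stop rancher

# From a workstation: kubectl goes through Rancher's API endpoint, so this
# now hangs or fails
kubectl get pods

# On the worker node running one of the test pods: kill the pod's main
# container at the Docker level
docker ps | grep nginx-test
docker kill <container-id>

# The kubelet restarts the container on its own; once Rancher is started
# again (docker start rancher), the restart shows up here
kubectl get pods
```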

So the impact of not having Rancher in HA is that you won’t be able to manage your things very well while it’s down, but your workload is still up and still being supervised (at least from the results of this smoke test).

So I would extrapolate from this test that it is best to have multiple etcd/control plane nodes running (3 or 5), as this is what seems to keep your workloads up.

Well, I’m finding some very strange behavior in some very simple tests.

I have:

1 x VM running Rancher 2.0.1
3 x VM nodes, running all roles (worker, etcd, control plane)

I have a few workloads deployed - mostly some NGINX web servers. All those workloads are running on node1.

Everything is working well, the Web UI is responsive, kubectl is responding nicely, the workloads are working. kubectl and the web UI are going to the API on the VM hosting the Rancher server.

I then disconnect the network of one of the two nodes that currently have no workloads - say node2. This takes down one etcd member and one control plane instance.

I would expect the API to remain available and everything to be all nice and working… But no, that’s not what happens. Instead, kubectl and the Web UI hang.

After a bit, kubectl gets an answer, and a bit later the Web UI is back.

In one of my tests, at some point just after the Web UI came back, the pods were “updating” and the ingress responded with a 503 Service Unavailable.

Is that the expected behavior?
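If it helps anyone dig into this, this is the kind of thing worth checking from the surviving nodes (assuming an RKE-built cluster, where etcd runs in a container named etcd with etcdctl preconfigured; treat the exact commands as an assumption):

```bash
# Once the API answers again, see which node the cluster considers gone
kubectl get nodes

# On a surviving etcd node: list members and look for leader elections
# around the time of the outage
docker exec etcd etcdctl member list
docker logs --tail 20 etcd
```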

Well, seems similar:

And here is some more data on the first point, about Rancher itself.

Also related: I configured the alerts to go to a Slack channel, and I randomly get alerts that etcd is down, yet the Web UI shows everything is hunky-dory.

Related:

I have updated https://github.com/rancher/rancher/issues/13698 with a comment on making this better in a not-too-distant release before 2.1.

https://github.com/rancher/rancher/issues/13830 is something we could not reproduce and have asked the bug creator to help us out. If we can reproduce, we will fix it.

Thank you!

These answers go a long way toward building confidence! I totally understand that such a major release may be a bit overwhelming - been there!

Maybe it will be covered in the updated documentation - if not, please make a note to document in more detail when Rancher needs to run in HA - there is obviously a resource cost and added complexity to running Rancher in HA. The same goes for having 1, 3 or 5 nodes with etcd/control plane.

Users will need to weigh the resource cost against the benefits and the risks.

For example, if Rancher is down or not reachable by the nodes of a cluster, then I can’t deploy new stuff. OK, that obviously makes sense, as Rancher is the central point of authentication.

But it’s good to know that even when Rancher is not accessible, Kubernetes still monitors the workloads of the cluster and will restart a pod if it crashes. That’s what I care about the most.

I want to know that the cluster running in Australia will keep running its currently deployed containers when the Rancher instance in the US is down.
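One nuance I picked up along the way (worth double-checking): crashed containers are restarted by the kubelet on the node itself, since restart policy and liveness probes are evaluated locally, so that part survives even a control plane outage; replacing pods from a node that is lost entirely does need the controllers and etcd. A minimal liveness probe, just as an illustration (the probe path and port are made up):

```bash
# Liveness checks run on the kubelet of each node, so a failing container
# is restarted locally even while the Rancher server is unreachable.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: nginx-live
spec:
  containers:
  - name: nginx
    image: nginx:alpine
    livenessProbe:
      httpGet:
        path: /
        port: 80
      initialDelaySeconds: 5
      periodSeconds: 10
EOF
```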

Will do. That is in our plans.