Need for Rancher HA / Impact if down?

I’m wondering at what point it becomes necessary to set up Rancher 2.x in HA. I have enough experience with Rancher 1.x and Cattle, but with Kubernetes I don’t really know.

From what I understand, all the workloads will continue operating as expected if the Rancher server is down, but you wouldn’t be able to deploy, upgrade or make changes to the various services/pods/whatever while it is down.

What about monitoring/restarting failed containers (i.e. health checking and handling)? Would that continue with the Rancher server down?

In other words, for a small/medium installation, if I have, let’s say, the capability to restore the service within a few minutes, would HA be overkill?

Which brings up the question: how much data does the Rancher container store, and does it change much?

Next there is the cluster. When setting up a cluster, you must run a number of nodes with etcd and the control plane; 1, 3 or 5 is recommended. Obviously 1 is fine for dev, and more are recommended for production.

But what’s the impact if etcd and the control plane are not up? In other words, what is the need versus the cost? Running them on 3 or 5 hosts may be overkill in some circumstances, but one would need to understand the risk and what would be impacted.

Also, is it necessary to run etcd and the control plane on their own dedicated hosts? Would it be reasonable to have, say, 5 nodes that all run workloads, the control plane and etcd?
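(For context, in my case the node roles would just be the flags on the registration command Rancher generates for a custom cluster, so dedicating hosts versus mixing roles is only a matter of which flags each node gets. The sketch below uses placeholder values for the agent tag, server URL, token and checksum, not a real setup:)

```bash
# Registration command for a node that takes all three roles; drop
# --worker (or --etcd / --controlplane) on nodes you want dedicated.
sudo docker run -d --privileged --restart=unless-stopped --net=host \
  -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run \
  rancher/rancher-agent:v2.0.1 \
  --server https://rancher.example.com \
  --token <registration-token> --ca-checksum <checksum> \
  --etcd --controlplane --worker
```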


I am in no way an expert :slight_smile: … but I just wanted to share my thoughts on the same question.

We are currently running a 1.6 cluster and are going to upgrade to 2.0 soon (both K8s). Our 1.6 runs in single-instance mode, and I am most likely going to do the same for 2.0, mostly because in my experience, if Rancher goes down it only affects the API, which of course is critical, but manageable, at least for us.

Rancher has gone down a couple of times for both our 1.6 and 2.0 clusters, and it’s quite easy to get it up again, especially because it runs in a container. The 2.0 cluster is set up with a host-mounted volume, making it even easier :smiley:
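For reference, the host mount is just a bind mount of Rancher’s data directory in the single-node install; something roughly like this (the path and image tag are examples, not necessarily what we run):

```bash
# Single-node Rancher with its state bind-mounted on the host, so the
# container can be recreated or upgraded without losing data.
docker run -d --restart=unless-stopped \
  -p 80:80 -p 443:443 \
  -v /opt/rancher:/var/lib/rancher \
  rancher/rancher:latest
```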

(This is of course only for Rancher itself; in each cluster I will have HA etcd and Kubernetes control plane.)

Again, no expert, but those are my thoughts at least :slight_smile:

Thanks for sharing your experience and thoughts!

It is obviously all a matter of scale… using Rancher and Docker has made the deployment of our internal systems much easier. So in my case, we’re talking about a relatively small deployment.

From all that I read so far of Rancher’s documentation and Kubernetes documentation, books and videos, it seems like the need for running the Rancher container in HA isn’t that great for me.

What I wonder about now is etcd and the control plane. From what I gather, the orchestration is really handled by Kubernetes itself, so if I run 3 x etcd and 3 x control plane, then in theory, even if Rancher itself is down, the pods would still be managed (i.e. if a pod’s container failed, a new one would be started).

I have to test that a bit more…
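A throwaway Deployment with more than one replica should be enough to exercise that; something like the following (names and image are just examples):

```bash
# The ReplicaSet controller (part of the Kubernetes control plane, not of
# Rancher) is what recreates pods from this Deployment if they die.
cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-test
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx-test
  template:
    metadata:
      labels:
        app: nginx-test
    spec:
      containers:
      - name: nginx
        image: nginx:alpine
EOF
```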

Did some smoke tests (rough commands below)… I shut down the Rancher 2.0 container and:

  1. I could no longer use kubectl :frowning:
  2. I killed the main process inside the containers of a deployment with more than one replica, and the containers/pods were recreated.
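For reference, the test was roughly along these lines (container and deployment names are from my setup, adjust to yours):

```bash
# On the host running the Rancher server (container name is an assumption)
docker stop rancher

# From a workstation: kubectl goes through Rancher's API endpoint, so this
# now hangs or fails
kubectl get pods

# On the worker node running one of the test pods: kill the pod's main
# container at the Docker level
docker ps | grep nginx-test
docker kill <container-id>

# The kubelet restarts the container on its own; once Rancher is started
# again (docker start rancher), the restart shows up here
kubectl get pods
```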

So the impact of not having Rancher in HA is that you won’t be able to manage your things very well while it’s down, but your workload is still up and still being supervised (at least from the results of this smoke test).

So I would extrapolate from this test that it is best to have multiple etcd/control plane nodes running (3 or 5), as this is what seems to keep your workloads up.

Well, I’m finding some very strange behavior in some very simple tests.

I have:

1 x VM running Rancher 2.0.1
3 x VM nodes, running all roles (worker, etcd, control plane)

I have a few workloads deployed - mostly some NGINX web servers. All those workloads are running on node1.

Everything is working well, the Web UI is responsive, kubectl is responding nicely, the workloads are working. kubectl and the web UI are going to the API on the VM hosting the Rancher server.

I then disconnect the network of one of the two nodes that currently have no workloads - say node2. This takes down one etcd member and one control plane instance.

I would expect the API to remain available and everything to be all nice and working… But no, that’s not what happens. Instead, kubectl and the Web UI hang.

After a bit, kubectl gets an answer, and a bit later the Web UI is back.

In one of my tests, at some point just after the Web UI came back, the pods were “updating” and the ingress responded with a 503 Service Unavailable.

Is that the expected behavior?
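If it helps anyone dig into this, this is the kind of thing worth checking from the surviving nodes (assuming an RKE-built cluster, where etcd runs in a container named etcd with etcdctl preconfigured; treat the exact commands as an assumption):

```bash
# Once the API answers again, see which node the cluster considers gone
kubectl get nodes

# On a surviving etcd node: list members and look for leader elections
# around the time of the outage
docker exec etcd etcdctl member list
docker logs --tail 20 etcd
```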

Well, seems similar:

And here is some more data on the first point, about Rancher itself.

Also related: I configured the alerts to go to a Slack channel, and I randomly get alerts that etcd is down, yet the Web UI shows everything is hunky-dory.

Related:

I have updated https://github.com/rancher/rancher/issues/13698 with a comment on making this better in a not-too-distant release before 2.1.

https://github.com/rancher/rancher/issues/13830 is something we could not reproduce and have asked the bug creator to help us out. If we can reproduce, we will fix it.

Thank you!

These answers go a long way toward building confidence! I totally understand that such a major release may be a bit overwhelming - been there!

Maybe it will be covered in the updated documentation - if not, please make a note to document in more detail when Rancher needs to run in HA - there is obviously a resource cost and added complexity to running Rancher in HA. The same goes for having 1, 3 or 5 nodes with etcd/control plane.

Users will need to weigh the resource cost against the benefits and the risks.

For example, if Rancher is down or not reachable by the nodes of a cluster, then I can’t deploy new stuff. OK, that obviously makes sense, as Rancher is the central point of authentication.

But it’s good to know that even when Rancher is not accessible, Kubernetes still monitors the workloads of the cluster and will restart a pod if it crashes. That’s what I care about the most.

I want to know that the cluster running in Australia will keep running its currently deployed containers when the Rancher instance in the US is down.
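One nuance I picked up along the way (worth double-checking): crashed containers are restarted by the kubelet on the node itself, since restart policy and liveness probes are evaluated locally, so that part survives even a control plane outage; replacing pods from a node that is lost entirely does need the controllers and etcd. A minimal liveness probe, just as an illustration (the probe path and port are made up):

```bash
# Liveness checks run on the kubelet of each node, so a failing container
# is restarted locally even while the Rancher server is unreachable.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: nginx-live
spec:
  containers:
  - name: nginx
    image: nginx:alpine
    livenessProbe:
      httpGet:
        path: /
        port: 80
      initialDelaySeconds: 5
      periodSeconds: 10
EOF
```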

Will do. That is in our plans.