No redundancy of etcd/control plane in 5-node kubernetes cluster

vinci · February 12, 2019, 4:31pm

I have installed a single-node Rancher 2.1.6 with a 5-node Kubernetes Cluster. Two of the nodes are etcd and Control Plane, the other 3 are workers.
All nodes (including rancher server) run Ubuntu 16.04.5 with Docker 17.03.2.
If I shut down either of the nodes on which etcd/Control Plane run, the Kubernetes cluster becomes unreachable altogether. Accessing the cluster leads to:

This cluster is currently Unavailable ; areas that interact directly with it will not be available until the API is ready.

Failed to communicate with API server: Get https://public_ip:6443/api/v1/componentstatuses?timeout=30s: dial tcp public_ip:6443: connect: no route to host

My question is, how do I achieve the necessary redundancy, so that, if one of the two etcd/control plane nodes fails, the other one can take over?
Should I have distributed the roles differently? It does seem consistent with the documentation, but I know this isn’t fixed at all. Someone recommended that etcd could reside on the same nodes as the workers.

vincent · February 12, 2019, 4:43pm

etcd requires a strict majority of nodes (i.e. GREATER than 50%) to be available to function. So you want an odd number of nodes with that role; in practice, 3 or 5, which can survive 1 or 2 nodes failing, respectively.

Having 2 is actually worse than 1: you get the same (non-)tolerance for failure (a max of 0 can fail before the cluster is down, because 1/2 is not > 50%), but twice as many machines are now involved, each of which can fail.
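
To make the majority arithmetic above concrete, here is a minimal sketch in Python. The function names `quorum` and `failure_tolerance` are made up for illustration and are not part of etcd or Rancher; the only assumption is the strict-majority rule described above.

```python
def quorum(members: int) -> int:
    """Smallest strict majority (> 50%) of a cluster with `members` etcd nodes."""
    return members // 2 + 1


def failure_tolerance(members: int) -> int:
    """How many etcd nodes can be lost while a strict majority is still up."""
    return members - quorum(members)


if __name__ == "__main__":
    # Print the table for 1 to 7 etcd nodes.
    print("members  quorum  tolerated failures")
    for n in range(1, 8):
        print(f"{n:>7}  {quorum(n):>6}  {failure_tolerance(n):>18}")
```

Under that rule, 1 and 2 members both tolerate 0 failures, 3 and 4 both tolerate 1, and 5 and 6 both tolerate 2, which is why the extra node in an even-sized cluster buys nothing.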

vinci · February 13, 2019, 9:54am

So what you’re saying is that, basically, instead of having redundancy, the information is spread over the two nodes - so part of the information is on one node and the rest is on the other node. As in, let’s say, RAID 0. Do I understand correctly?
If so, then how was this “decision” made when creating the cluster? Why isn’t it redundant just as it is, with the risk of having corrupt information? I’m guessing there’s a mechanism that makes sure it works like that?

vincent · February 13, 2019, 7:31pm

No, they both contain a copy of the whole database, and whichever one came up first is the leader. But if one node fails and you only had 2, there is no longer a quorum of members available, so the cluster is unusable. You need at least 3 hosts running etcd, and even numbers are a waste because the extra node does not increase the number of failures the cluster can tolerate. So 3, 5, or if you really want, 7.

vinci · February 15, 2019, 12:16pm

Thank you. That’s very helpful information.
So maybe a 6-node cluster with 3 etcd/control plane nodes and another 3 worker/minion nodes would be better? In a 5-node cluster the other option would be to share etcd with one worker, but if that worker is overloaded, then I might get into trouble managing the cluster (not being able to remove nodes, things like that).
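
As a rough way to compare the layouts mentioned in this thread, the sketch below counts the etcd members in a role assignment and reports how many of them can fail. The node names and the `etcd_failure_tolerance` helper are hypothetical, and the role strings are only illustrative labels; the sketch looks at etcd quorum only and says nothing about load on a shared node.

```python
def etcd_failure_tolerance(layout: dict[str, set[str]]) -> int:
    """Number of etcd nodes that can fail before quorum is lost."""
    members = [name for name, roles in layout.items() if "etcd" in roles]
    if not members:
        raise ValueError("layout has no etcd nodes")
    quorum = len(members) // 2 + 1  # strict majority
    return len(members) - quorum


# Hypothetical layouts; node names are placeholders.
original = {  # 2 etcd/control plane + 3 workers
    "n1": {"etcd", "controlplane"}, "n2": {"etcd", "controlplane"},
    "n3": {"worker"}, "n4": {"worker"}, "n5": {"worker"},
}
six_nodes = {  # 3 etcd/control plane + 3 workers
    "n1": {"etcd", "controlplane"}, "n2": {"etcd", "controlplane"},
    "n3": {"etcd", "controlplane"},
    "n4": {"worker"}, "n5": {"worker"}, "n6": {"worker"},
}
five_shared = {  # 5 nodes, one worker also runs etcd
    "n1": {"etcd", "controlplane"}, "n2": {"etcd", "controlplane"},
    "n3": {"etcd", "worker"}, "n4": {"worker"}, "n5": {"worker"},
}

if __name__ == "__main__":
    for name, layout in [("original", original),
                         ("six_nodes", six_nodes),
                         ("five_shared", five_shared)]:
        print(name, "tolerates", etcd_failure_tolerance(layout), "etcd failure(s)")
```

From the etcd point of view, the 6-node layout and the shared 5-node layout are equivalent: both have 3 members and survive the loss of 1.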

superseb · February 15, 2019, 1:20pm

Designing production ready clusters is described at https://rancher.com/docs/rancher/v2.x/en/cluster-provisioning/production/

Vlado_Djerek · February 18, 2019, 11:45am

Don’t go with even numbers; as vincent explained, even numbers don’t add any failure tolerance.

vinci · February 18, 2019, 12:20pm

Aren’t you supposed to contextualize a little bit? There are 3 etcd/control plane nodes - that’s not even. And there are 3 workers - that’s also not even. Maybe I’m missing something and there’s a relationship between workers and etcd as far as high availability is concerned (and I’m not talking about the load).

Vlado_Djerek · February 18, 2019, 12:45pm

I saw 6 and counted etcd as 6, sorry

Fraser_Goffin · February 22, 2019, 8:56am

Odd numbers for etcd, definitely, but as I understand it that is not a requirement for the control plane nodes, so you can have 2 of those and still have a level of resilience. Likewise for worker nodes, although of course it’s much more likely you will have many of those.
