I created my first RKE cluster on AWS using Rancher, with only 3 nodes (each with both master and worker roles). To do some testing (and avoid the cost of running 3 nodes all the time), I stopped one of the nodes. At that point, the cluster appeared as “Unavailable” in Rancher, and I could no longer access it. I had to start the node again. Why? I do not understand the reason: being a cluster, it should stay active all the time, no matter whether all 3 nodes or just 1 are running…
K3s allows 2 master nodes. My understanding is that RKE and RKE2 require an odd number, so maybe it’s trying to get that fixed before allowing anything new? I’d think it should keep what’s already there, though.
So with RKE, we need an odd number of master nodes because etcd needs quorum in order to work. With k3s, we added kine (https://github.com/k3s-io/kine), an etcd adapter that lets you use other databases such as dqlite, Postgres, or MySQL in place of etcd; these are externally managed. Because of this, k3s only needs two master nodes: they handle only the control-plane roles (kube-apiserver, kube-scheduler, kube-controller-manager, etc.), and most of these services are only active on a single node at a time. kube-apiserver is active on all nodes at all times, but the other services, like kube-scheduler, use a leader election process so that only one node is active for that service at a time.
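To make the quorum point concrete, here is a minimal sketch of the majority arithmetic etcd relies on. The function names are my own for illustration and are not part of any etcd, RKE, or Rancher tooling; the only assumption is the standard majority rule (more than half of the voting members must be reachable).

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting etcd nodes."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def failure_tolerance(members: int) -> int:
    """How many members can be down while a majority is still reachable."""
    return members - quorum(members)


if __name__ == "__main__":
    # Print the quorum size and tolerated failures for small clusters.
    for n in range(1, 6):
        print(f"{n} members: quorum={quorum(n)}, "
              f"tolerates {failure_tolerance(n)} failure(s)")
```

By this arithmetic, going from 3 to 4 members raises the quorum from 2 to 3 without tolerating any extra failure, which is why odd member counts are the usual recommendation. A 2-member etcd cluster tolerates no failures at all.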
RKE2 uses etcd, so the same rule applies: you need an odd number of master nodes. Note: RKE2 currently doesn’t have kine support (see Feature/Question Consolidated etcd · Issue #453 · rancher/rke2 · GitHub for more details).