We run what we believe is a highly available (HA) Rancher 2.4 installation on-premises, hosting a production Kubernetes (K8s) cluster.
Bitbucket Cloud (which hosts our source repository) accesses this cluster via Bitbucket Pipelines to run continuous integration (CI) builds in Pods on the cluster.
A load balancer sits between Bitbucket Cloud and the Rancher cluster.
Our on-premises K8s cluster currently consists of:
- 3 x (4 CPU / 8 GB RAM) control nodes with the etcd + controlplane roles
- 4 x (8 CPU / 24 GB RAM) storage nodes with Rook-Ceph installed
- 19 x (8 CPU / 32 GB RAM) worker nodes
The K8s cluster occasionally becomes inaccessible, e.g. once or twice every couple of days. During these outages, all CI builds (Pods) running in the cluster crash and, due to their nature, have to be restarted from scratch.
Have we got the sizing right, especially the number of control plane nodes (we cannot reduce the number of worker/storage nodes)? If not, roughly how many control nodes should we have?
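For context, our understanding of the standard etcd quorum arithmetic (a quick sketch, not specific to our setup): an etcd cluster of n members needs a majority to stay writable, so the failure tolerance follows directly from the member count.

```python
def etcd_fault_tolerance(members: int) -> int:
    """Number of simultaneous member failures an etcd cluster survives.

    etcd needs a majority quorum (members // 2 + 1) to accept writes,
    so the tolerance is whatever is left over above that majority.
    """
    quorum = members // 2 + 1
    return members - quorum

for n in (1, 3, 5, 7):
    print(f"{n} members -> quorum {n // 2 + 1}, tolerates {etcd_fault_tolerance(n)} failure(s)")
```

By this arithmetic, our 3 control nodes already tolerate one member failure, and an even member count adds no extra tolerance, which is partly why we suspect the outages may not be a simple node-count problem.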
If it’s not the control plane nodes, what else could be causing the cluster to become inaccessible?
Some advice/tips would be very much appreciated!