We have what we would like to call an HA Rancher 2.4 installation on-premises, hosting a production K8s cluster.
This cluster is accessed from Bitbucket Cloud (which hosts our source repository) via Bitbucket Pipelines, which run Continuous Integration builds in Pods hosted in the K8s cluster.
We have a load balancer between Bitbucket Cloud and the Rancher cluster.
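For context, each CI build effectively reaches the API server through that load balancer and runs as a Pod in the cluster, roughly like the sketch below (the kubeconfig path, namespace, Pod name and image are illustrative placeholders, not our actual configuration):

```python
# Sketch only: a pipeline step talking to the API server through the load
# balancer address in the kubeconfig and starting a build Pod.
from kubernetes import client, config

# Kubeconfig whose server URL points at the load balancer in front of the control plane
config.load_kube_config(config_file="kubeconfig-via-lb.yaml")  # placeholder path
core = client.CoreV1Api()

build_pod = client.V1Pod(
    api_version="v1",
    kind="Pod",
    metadata=client.V1ObjectMeta(name="ci-build-example", namespace="ci"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="build",
                image="example-build-image:latest",  # placeholder image
                command=["sh", "-c", "make test"],
            )
        ],
    ),
)
core.create_namespaced_pod(namespace="ci", body=build_pod)
```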
Our on-premises K8s cluster currently consists of:
- 3 x (4 CPU / 8 GB RAM) control plane nodes with the etcd + controlplane roles
- 4 x (8 CPU / 24 GB RAM) storage nodes with Rook-Ceph installed
- 19 x (8 CPU / 32 GB RAM) worker nodes
Problem:
The K8s cluster occasionally becomes inaccessible, e.g. once or twice within a couple of days.
During these periods, all CI builds (Pods) hosted within the K8s cluster crash and, due to their nature, need to be restarted.
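To capture exactly when the API server behind the load balancer stops responding, something like the probe below could be run from the Bitbucket side. This is only a sketch: it assumes a kubeconfig pointing at the load balancer address, and the path is a placeholder.

```python
# Rough availability probe: hit the API server through the load balancer every
# 30 seconds and log failures plus any NotReady nodes. Purely illustrative.
import time
from datetime import datetime, timezone

from kubernetes import client, config

config.load_kube_config(config_file="kubeconfig-via-lb.yaml")  # placeholder path
core = client.CoreV1Api()

while True:
    stamp = datetime.now(timezone.utc).isoformat()
    try:
        nodes = core.list_node(_request_timeout=5)
        not_ready = [
            n.metadata.name
            for n in nodes.items
            if not any(c.type == "Ready" and c.status == "True" for c in n.status.conditions)
        ]
        print(f"{stamp} API reachable, {len(nodes.items)} nodes, NotReady: {not_ready or 'none'}")
    except Exception as exc:  # LB/API server unreachable, timeouts, auth errors, ...
        print(f"{stamp} API NOT reachable: {exc}")
    time.sleep(30)
```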
Question:
Have we got the sizing right, especially regarding the number of control plane nodes (we cannot reduce the number of worker/storage nodes)? If not, roughly how many control plane nodes should we have?
If it's not the control plane nodes, what else could be causing the cluster to become inaccessible?
Some advice/tips would be very much appreciated!