Kube-apiserver, etcd problems with Rancher and Harbor HA

Hi,

We are writing a script to build a K8s management cluster but have hit a problem. We have been trying to fix it for a couple of weeks and have almost tracked down the cause, but we're not quite there yet. I was wondering if someone else has had this problem or has a fix.

Build = OpenStack, Ubuntu 20.04, Rancher 2.6 installed onto a standard K8s cluster, plus the Harbor HA operator build (standard stack). 3 masters, 3 workers. etcd on the masters.
Harbor install = harbor-operator/kustomization-all-in-one.md at master · goharbor/harbor-operator · GitHub

When - We install Rancher and everything is OK, but when Harbor is installed next, the problems start.

Fault 1 - On the master servers, disk write (from logging) sits at around 60% utilisation all the time and kube-apiserver CPU at about 50% all the time.
Fault 2 - Due to the high IO, etcd misses sequential updates to the revision number, so it can't compact. etcd then breaks its size limit and stops, and we have to manually recompact to recover (rough steps in the sketch after this list).
Fault 3 - Random pods restart on the masters due to the load.
Fault 4 - Masters run out of disk space due to 5 GB of logs per day.
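
In case it helps anyone else, this is roughly the manual etcd recovery we run when it trips the space quota (NOSPACE alarm). The endpoint and certificate paths are assumptions for a kubeadm-style layout, so adjust them to wherever your etcd certs actually live:

```bash
# run on one master; endpoint/cert paths below are assumptions, adjust for your cluster
ec() {
  ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key "$@"
}

# grab the current revision from the endpoint status
REV=$(ec endpoint status --write-out=json | egrep -o '"revision":[0-9]*' | egrep -o '[0-9].*')

# compact away old revisions, defragment to reclaim the space on disk,
# then clear the NOSPACE alarm so etcd accepts writes again
ec compaction "$REV"
ec defrag          # repeat the defrag against each etcd member
ec alarm disarm
```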

Possible cause - We think Harbor overrides the annotations that Rancher (or cert-manager) creates on the mutating-webhook-configuration object, so the two end up fighting over the same webhooks, which creates the massive log volume and disk hits.
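
If you want to see the fight happening, re-running something like this and watching the resourceVersion climb is a quick check (the generic object name in the second command is just an assumption; use whatever names your install actually creates):

```bash
# list webhook configurations with their resourceVersion; if a version jumps
# every few seconds, two controllers are re-writing the same object in a loop
kubectl get mutatingwebhookconfigurations,validatingwebhookconfigurations \
  -o custom-columns='NAME:.metadata.name,RESOURCEVERSION:.metadata.resourceVersion'

# inspect the annotations on the suspect object (generic name assumed here)
kubectl get mutatingwebhookconfiguration mutating-webhook-configuration -o yaml
```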

Anyone have a fix? :slight_smile:

Log:
```
Nov 23 03:06:28 ef8206-server-1-2 kube-apiserver.daemon[181328]: I1123 03:06:28.788737 181328 get.go:260] "Starting watch" path="/api/v1/namespaces/cattle-system/secrets" resourceVersion="2495730" labels="" fields="metadata.name=rancher-webhook-tls" timeout="6m30s"
Nov 23 03:06:28 ef8206-server-1-2 kube-apiserver.daemon[181328]: I1123 03:06:28.811962 181328 httplog.go:109] "HTTP" verb="GET" URI="/api/v1/namespaces/cattle-system/secrets?allowWatchBookmarks=true&fieldSelector=metadata.name%3Drancher-webhook-tls&resourceVersion=2495730&timeout=6m30s&timeoutSeconds=390&watch=true" latency="23.460232ms" userAgent="kubelet/v1.22.16 (linux/amd64) kubernetes/b28e1f3" audit-ID="3ee14e8c-fda4-46e7-8849-939e2b66d412" srcIP="10.2.123.173:49470" resp=0
```

We fixed the problem by renaming the mutating/validating-webhook-configuration lines in the deployment.yaml. I think the Helm version already has this fix in it.
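
For reference, the change amounted to something like the snippet below, plus the same rename on the ValidatingWebhookConfiguration. The harbor-operator- prefix and the cert-manager annotation value are only illustrative; keep whatever names and certificate references your generated manifest actually uses:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  # a unique name instead of the generic "mutating-webhook-configuration",
  # so Harbor and Rancher/cert-manager stop rewriting the same object
  name: harbor-operator-mutating-webhook-configuration
  annotations:
    # cert-manager CA injection should point at Harbor's own serving cert
    # (the namespace/certificate below are placeholders, not the real names)
    cert-manager.io/inject-ca-from: harbor-operator-ns/harbor-operator-serving-cert
# the webhooks: list itself stays exactly as generated by the operator manifest
```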