Kube-apiserver, etcd problems with Rancher and Harbor HA

Hi,

We are writing a script to build a K8s management cluster but have hit a problem. We have been trying to fix it for a couple of weeks and have almost tracked down the cause, but we're not quite there yet. I was wondering if someone else has had this problem or has a fix.

Build = OpenStack, Ubuntu 20.04, Rancher 2.6 installed onto a standard K8s cluster, plus the Harbor HA operator build (standard stack). 3 masters, 3 workers. etcd on the masters.
Harbor install = harbor-operator/kustomization-all-in-one.md at master · goharbor/harbor-operator · GitHub

When - We install Rancher and everything is OK, but when Harbor is installed next, the problems start.

Fault 1 - On the master servers, disk write (from logging) sits at around 60% utilisation all the time and kube-apiserver CPU at about 50% all the time.
Fault 2 - Due to the high IO, etcd misses sequential updates to the revision number, so it can't compact. etcd then breaks its size limit and stops, and we have to manually recompact to recover (rough steps in the sketch after this list).
Fault 3 - Random pods restart on the masters due to the load.
Fault 4 - Masters run out of disk space due to 5 GB of logs per day.
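
In case it helps anyone else, this is roughly the manual etcd recovery we run when it trips the space quota (NOSPACE alarm). The endpoint and certificate paths are assumptions for a kubeadm-style layout, so adjust them to wherever your etcd certs actually live:

```bash
# run on one master; endpoint/cert paths below are assumptions, adjust for your cluster
ec() {
  ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key "$@"
}

# grab the current revision from the endpoint status
REV=$(ec endpoint status --write-out=json | egrep -o '"revision":[0-9]*' | egrep -o '[0-9].*')

# compact away old revisions, defragment to reclaim the space on disk,
# then clear the NOSPACE alarm so etcd accepts writes again
ec compaction "$REV"
ec defrag          # repeat the defrag against each etcd member
ec alarm disarm
```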

Possible cause - We think Harbor overrides the annotations that Rancher (or cert-manager) creates on the mutating-webhook-configuration object, so the two end up fighting over the same webhooks, which creates the massive log volume and disk hits.
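
If you want to see the fight happening, re-running something like this and watching the resourceVersion climb is a quick check (the generic object name in the second command is just an assumption; use whatever names your install actually creates):

```bash
# list webhook configurations with their resourceVersion; if a version jumps
# every few seconds, two controllers are re-writing the same object in a loop
kubectl get mutatingwebhookconfigurations,validatingwebhookconfigurations \
  -o custom-columns='NAME:.metadata.name,RESOURCEVERSION:.metadata.resourceVersion'

# inspect the annotations on the suspect object (generic name assumed here)
kubectl get mutatingwebhookconfiguration mutating-webhook-configuration -o yaml
```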

Anyone have a fix? :slight_smile:

Log:
```
Nov 23 03:06:28 ef8206-server-1-2 kube-apiserver.daemon[181328]: I1123 03:06:28.788737 181328 get.go:260] "Starting watch" path="/api/v1/namespaces/cattle-system/secrets" resourceVersion="2495730" labels="" fields="metadata.name=rancher-webhook-tls" timeout="6m30s"
Nov 23 03:06:28 ef8206-server-1-2 kube-apiserver.daemon[181328]: I1123 03:06:28.811962 181328 httplog.go:109] "HTTP" verb="GET" URI="/api/v1/namespaces/cattle-system/secrets?allowWatchBookmarks=true&fieldSelector=metadata.name%3Drancher-webhook-tls&resourceVersion=2495730&timeout=6m30s&timeoutSeconds=390&watch=true" latency="23.460232ms" userAgent="kubelet/v1.22.16 (linux/amd64) kubernetes/b28e1f3" audit-ID="3ee14e8c-fda4-46e7-8849-939e2b66d412" srcIP="10.2.123.173:49470" resp=0
```

We fixed the problem by renaming the mutating/validating-webhook-configuration lines in the deployment.yaml. I think the Helm version already has this fix in it.
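
For reference, the change amounted to something like the snippet below, plus the same rename on the ValidatingWebhookConfiguration. The harbor-operator- prefix and the cert-manager annotation value are only illustrative; keep whatever names and certificate references your generated manifest actually uses:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  # a unique name instead of the generic "mutating-webhook-configuration",
  # so Harbor and Rancher/cert-manager stop rewriting the same object
  name: harbor-operator-mutating-webhook-configuration
  annotations:
    # cert-manager CA injection should point at Harbor's own serving cert
    # (the namespace/certificate below are placeholders, not the real names)
    cert-manager.io/inject-ca-from: harbor-operator-ns/harbor-operator-serving-cert
# the webhooks: list itself stays exactly as generated by the operator manifest
```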