Metrics-server with CrashLoopBackOff

This is a new RKE install.

I am wondering whether this is happening because of a secure/insecure certificate. Can anyone shed some light on this, and suggest what to check to see why it’s failing?

Everything seems OK, but metrics-server is in CrashLoopBackOff:

kubectl get pods metrics-server-68f5f9b7df-v4f7v -n kube-system
NAME                              READY   STATUS             RESTARTS   AGE
metrics-server-68f5f9b7df-v4f7v   0/1     CrashLoopBackOff   71         3h25m

In the logs, I only see this:

kubectl logs metrics-server-68f5f9b7df-v4f7v -n kube-system
I0816 10:39:58.263020       1 secure_serving.go:116] Serving securely on [::]:4443
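Since the pod keeps restarting, the current log only shows the startup line. It may be worth pulling the log from the last crashed instance with --previous, and inspecting the recorded termination state (pod name taken from above):

kubectl logs metrics-server-68f5f9b7df-v4f7v -n kube-system --previous
kubectl get pod metrics-server-68f5f9b7df-v4f7v -n kube-system \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'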

Describe pod:

Name:                 metrics-server-68f5f9b7df-v4f7v
Namespace:            kube-system
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Node:                 192.168.0.21/10.172.140.21
Start Time:           Mon, 16 Aug 2021 07:12:28 +0000
Labels:               k8s-app=metrics-server
                      pod-template-hash=68f5f9b7df
Annotations:          cni.projectcalico.org/podIP: 10.42.5.2/32
                      cni.projectcalico.org/podIPs: 10.42.5.2/32
Status:               Running
IP:                   10.42.5.2
IPs:
  IP:           10.42.5.2
Controlled By:  ReplicaSet/metrics-server-68f5f9b7df
Containers:
  metrics-server:
    Container ID:  docker://693be882b880f65e0ec485f5246300fedfc3f6e68399ece19848730fb5e2eecd
    Image:         192.168.0.39:10004/rancher/metrics-server:v0.3.6
    Image ID:      docker-pullable://192.168.0.39:10004/rancher/metrics-server@sha256:c9c4e95068b51d6b33a9dccc61875df07dc650abbf4ac1a19d58b4628f89288b
    Port:          4443/TCP
    Host Port:     0/TCP
    Args:
      --cert-dir=/tmp
      --secure-port=4443
      --kubelet-insecure-tls
      --kubelet-preferred-address-types=InternalIP
      --logtostderr
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Mon, 16 Aug 2021 10:39:57 +0000
      Finished:     Mon, 16 Aug 2021 10:40:27 +0000
    Ready:          False
    Restart Count:  73
    Liveness:       http-get https://:https/livez delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:      http-get https://:https/readyz delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:    <none>
    Mounts:
      /tmp from tmp-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from metrics-server-token-qt466 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  tmp-dir:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  metrics-server-token-qt466:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  metrics-server-token-qt466
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     :NoExecute op=Exists
                 :NoSchedule op=Exists
Events:
  Type     Reason     Age                      From     Message
  ----     ------     ----                     ----     -------
  Warning  Unhealthy  44m (x154 over 3h29m)    kubelet  Readiness probe failed: HTTP probe failed with statuscode: 404
  Normal   Pulled     14m (x68 over 3h29m)     kubelet  Container image "192.168.0.39:10004/rancher/metrics-server:v0.3.6" already present on machine
  Warning  BackOff    4m27s (x829 over 3h27m)  kubelet  Back-off restarting failed container
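The events show the readiness probe failing with HTTP 404, so one thing to check is what the probe endpoints actually return. A quick sketch, using the pod IP 10.42.5.2 from the describe output above (-k skips verification of the self-signed serving cert, -i prints the HTTP status code):

curl -ki https://10.42.5.2:4443/readyz
curl -ki https://10.42.5.2:4443/livez
curl -ki https://10.42.5.2:4443/healthz   # generic health endpoint, for comparison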

I’ve found the probe timeouts on metrics-server to be too aggressive:

        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /readyz
            port: https
            scheme: HTTPS
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /livez
            port: https
            scheme: HTTPS
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1

What happens is that metrics-server returns “ok” on both livez and readyz, but the requests take more than one second to process:

$ time curl -k https://SNIPPED:4443/livez
ok
real    0m3.081s
user    0m0.031s
sys     0m0.005s
$ time curl -k https://SNIPPED:4443/readyz
ok
real    0m3.206s
user    0m0.020s
sys     0m0.013s

Since 3 seconds is greater than the 1-second timeoutSeconds, the probes fail, so the pod is considered neither “live” nor “ready”. With periodSeconds: 10 and failureThreshold: 3, the failing liveness probe makes the kubelet kill the container after roughly 30 seconds, which matches the Started/Finished timestamps in the describe output above.

I’ve no idea why it’s taking 3 seconds to respond, but this is the core issue behind the CrashLoopBackOff.
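As a workaround, raising timeoutSeconds above the observed ~3-second response time stops the kubelet from restarting the container. A minimal sketch, assuming the deployment is named metrics-server in kube-system (it is, per the ReplicaSet name above); note that RKE manages this deployment as an addon, so a manual patch may be reverted on the next rke up:

kubectl -n kube-system patch deployment metrics-server --type=json -p='[
  {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/timeoutSeconds", "value": 5},
  {"op": "replace", "path": "/spec/template/spec/containers/0/readinessProbe/timeoutSeconds", "value": 5}
]'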