Kubelet timeout generating ImagePullBackOff error

TL;DR

Where is the kubelet config file on Rancher 2.6.9 - RKE1, like the one described in Set Kubelet parameters via a config file | Kubernetes?
Can I manage it? Does this file even exist?
I didn't find it in /var/lib/kubelet:

# pwd
/var/lib/kubelet
# ls -lha
total 16K
drwxr-xr-x   9 root root  185 Sep  5 13:20 .
drwxr-xr-x. 42 root root 4.0K Sep 22 15:50 ..
-rw-------   1 root root   62 Sep  5 13:20 cpu_manager_state
drwxr-xr-x   2 root root   45 Nov  1 11:27 device-plugins
-rw-------   1 root root   61 Sep  5 13:20 memory_manager_state
drwxr-xr-x   2 root root   44 Sep  5 13:20 pki
drwxr-x---   2 root root    6 Sep  5 13:20 plugins
drwxr-x---   2 root root    6 Sep  5 13:20 plugins_registry
drwxr-x---   2 root root   26 Nov  1 11:27 pod-resources
drwxr-x---  11 root root 4.0K Oct 24 23:57 pods
drwxr-xr-x   2 root root    6 Sep  5 13:20 volumeplugins

Explanation

Recently we upgraded Kubernetes to v1.24.4-rancher1-1 and Rancher to 2.6.9. Everything worked fine at first, but we've since noticed a new behavior: if an image is too big, or takes more than 2 minutes to download, Kubernetes raises an ErrImagePull.
To work around the error, I have to log in to the node and run a docker pull <image> manually.

Error: ImagePullBackOff

~ ❯ kubectl get pods -n mobile test-imagepullback-c7fc59d86-gwtc7                                                                                                           
NAME                                 READY   STATUS              RESTARTS   AGE
test-imagepullback-c7fc59d86-gwtc7   0/1     ContainerCreating   0          2m

~ ❯ kubectl get pods -n mobile test-imagepullback-c7fc59d86-gwtc7                                                                                                            
NAME                                 READY   STATUS         RESTARTS   AGE
test-imagepullback-c7fc59d86-gwtc7   0/1     ErrImagePull   0          2m1s
                                                                                                                                                      
~ ❯ kubectl get pods -n mobile test-imagepullback-c7fc59d86-gwtc7                                                                                                           
NAME                                 READY   STATUS             RESTARTS   AGE
test-imagepullback-c7fc59d86-gwtc7   0/1     ImagePullBackOff   0          2m12s

Searching for the cause, we discovered that the error comes from a timeout in the kubelet's runtime request (2 minutes by default, according to the doc kubelet | Kubernetes), which can be raised with the flag --runtime-request-timeout duration. However, after changing cluster.yaml with the parameters below, nothing happens:

[...]
    kubelet:
      extra_args:
        runtime-request-timeout: 10m
      fail_swap_on: false
[...]

The process is running with the flag, showing that the parameter does reach the kubelet command line:

# ps -ef | grep runtime-request-timeout
root      7286  7267  0 Nov01 ?        00:00:00 /bin/bash /opt/rke-tools/entrypoint.sh kubelet {...} --runtime-request-timeout=10m {...}

According to the official reference, that flag is deprecated, which may explain this behavior; to change the value I'm supposed to set a parameter named runtimeRequestTimeout inside a kubelet config file.
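
For reference, the file described there is a KubeletConfiguration YAML; the config-file equivalent of my extra_args would look something like this sketch (field names taken from the KubeletConfiguration reference; the file itself is what I cannot find on my nodes):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
runtimeRequestTimeout: "10m"
failSwapOn: false
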
So I have some questions:

  • Where do I change it?
  • Does this file exist in Rancher, or do I need to create it?
  • Is there a way to work around this with another parameter in extra_args? (A sketch of what I have in mind is below.)
  • Why is this happening only now? Is it because of the deprecation of dockershim?
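
To frame that third question: what I imagine (but have not been able to verify) is creating the config file myself on each node and pointing the kubelet at it from cluster.yaml, along these lines. The path is hypothetical, and I don't know whether RKE's kubelet container would actually pick it up:

[...]
    kubelet:
      extra_args:
        # Hypothetical: point the kubelet at a config file I create on the host
        config: /etc/kubernetes/kubelet-config.yaml
      extra_binds:
        # Make the host file visible inside the kubelet container
        - "/etc/kubernetes/kubelet-config.yaml:/etc/kubernetes/kubelet-config.yaml"
      fail_swap_on: false
[...]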

Docker and Kubernetes versions

# docker version
Client: Docker Engine - Community
 Version:           20.10.12
 API version:       1.41
 Go version:        go1.16.12
 Git commit:        e91ed57
 Built:             Mon Dec 13 11:45:41 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.21
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.18.7
  Git commit:       3056208
  Built:            Tue Oct 25 18:02:38 2022
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.9
  GitCommit:        1c90a442489720eec95342e1789ee8a5e1b9536f
 runc:
  Version:          1.1.4
  GitCommit:        v1.1.4-0-g5fd4c4d
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

#  kubectl version --short
Client Version: v1.25.0
Kustomize Version: v4.5.7
Server Version: v1.24.4

I would be grateful for any help, and hopefully this thread helps others solve this annoying issue too.

You haven't said where the delay is. Are you pulling images from a public registry or your own internal private registry? If the former, you might want to consider setting up your own (I would strongly advise that from a security perspective regardless) and/or setting up a mirror cache. If it is your own internal registry, where is the bottleneck (network latency, I/O, …)?
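
If you do go the internal-registry route, a rough sketch of wiring it into RKE1 via cluster.yaml would be something like the below (this assumes the private_registries block; the URL and credentials are placeholders):

[...]
private_registries:
  # Placeholder registry; nodes will authenticate here for pulls
  - url: registry.internal.example.com
    user: pull-user
    password: "<secret>"
    is_default: true
[...]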

You might want to consider pre-pulling images onto hosts. There are a lot of different (automated) ways of achieving that. It also doesn’t matter greatly even if the pulled image isn’t the exact version you need since it is likely to share layers with other versions. We did something like this for Windows hosts where some images were several GB and we didn’t want to suffer the launch cost when a pod was created on a node that hadn’t seen it before (sometimes caused by the scheduler).
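
One common way to automate the pre-pull is a DaemonSet whose init container pulls the big image on every node and exits, leaving a tiny pause container behind. A minimal sketch (image names are placeholders, and it assumes the big image contains a shell):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: image-prepuller
spec:
  selector:
    matchLabels:
      app: image-prepuller
  template:
    metadata:
      labels:
        app: image-prepuller
    spec:
      initContainers:
        # Pulling the image is the only goal; the command exits immediately
        - name: prepull
          image: registry.example.com/big-image:latest
          command: ["sh", "-c", "true"]
      containers:
        # Minimal long-running container so the DaemonSet stays healthy
        - name: pause
          image: registry.k8s.io/pause:3.8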

I appreciate this doesn't resolve the question you are asking, but TBH I would question (a) the need for images that large (I understand it is sometimes unavoidable, but if they are your own you might want to look at ways of reducing the size, such as multi-stage builds, dockerslim, factoring out binaries into separate images and using shared volumes to mount them, etc.); (b) whether just upping the value of the timeout is really a good solution (what new value would you use, and what happens when that isn't enough?); and (c) the ripple effect: as you will be aware, timeouts tend to cascade, so a long launch time for a container may very well cause subsequent timeout problems elsewhere. Personally I would focus on reducing the image load time rather than extending it!

HtHs

Fraser.

Thanks for your response. I've been trying to solve this issue since that day, but with no success indeed.
I've also opened an issue in the RKE GitHub repo, which has a discussion about the problem (you can see it here: Kubelet timeout generating ImagePullBackOff error · Issue #3084 · rancher/rke · GitHub), and we have probably found what is going on.
It's most likely related to the version of cri-dockerd inside the rke-tools image (Docker Hub), which can't be upgraded unless a new version of that image is released with support for the newer CRI.
For now, I'm pulling the images manually whenever the 2m timeout is exceeded.
And thank you for the tips, I'll analyze them and try to implement them if possible!