Rancher Kubernetes upgrade stuck on a node (failed to upgrade control plane)

A Kubernetes upgrade of a Rancher-managed cluster is stuck. It started with the first node (All roles), and it seems the kubelet is unable to create a specific container, but the logs do not indicate why.

Error message in UI:

[controlPlane] Failed to upgrade Control Plane: [[host node1 not ready]]

Logs from kubelet:

I0223 06:58:25.201695   24906 server.go:1126] Started kubelet
I0223 06:58:25.201808   24906 server.go:145] Starting to listen on 0.0.0.0:10250
E0223 06:58:25.201816   24906 kubelet.go:1308] Image garbage collection failed once. Stats initialization may not have completed yet: failed to get imageFs info: unable to find data in memory cache
I0223 06:58:25.202573   24906 server.go:393] Adding debug handlers to kubelet server.
I0223 06:58:25.203445   24906 fs_resource_analyzer.go:64] Starting FS ResourceAnalyzer
I0223 06:58:25.204065   24906 volume_manager.go:263] The desired_state_of_world populator starts
I0223 06:58:25.204072   24906 volume_manager.go:265] Starting Kubelet Volume Manager
I0223 06:58:25.204222   24906 reflector.go:175] Starting reflector *v1.CSIDriver (0s) from k8s.io/client-go/informers/factory.go:135
I0223 06:58:25.204460   24906 desired_state_of_world_populator.go:139] Desired state populator starts to run
I0223 06:58:25.281104   24906 factory.go:356] Registering Docker factory
I0223 06:58:25.281127   24906 factory.go:54] Registering systemd factory
I0223 06:58:25.281533   24906 clientconn.go:106] parsed scheme: "unix"
I0223 06:58:25.281588   24906 clientconn.go:106] scheme "unix" not registered, fallback to default scheme
I0223 06:58:25.281785   24906 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{unix:///run/containerd/containerd.sock  <nil> 0 <nil>}] <nil> <nil>}
I0223 06:58:25.281798   24906 clientconn.go:933] ClientConn switching balancer to "pick_first"
I0223 06:58:25.282277   24906 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc000af6590, {CONNECTING <nil>}
I0223 06:58:25.282336   24906 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc000af6590, {READY <nil>}
I0223 06:58:25.282924   24906 factory.go:137] Registering containerd factory
I0223 06:58:25.283264   24906 factory.go:101] Registering Raw factory
I0223 06:58:25.283436   24906 manager.go:1158] Started watching for new ooms in manager
I0223 06:58:25.284974   24906 manager.go:272] Starting recovery of all containers
I0223 06:58:25.287829   24906 status_manager.go:158] Starting to sync pod status with apiserver
I0223 06:58:25.287861   24906 kubelet.go:1824] Starting kubelet main sync loop.
E0223 06:58:25.287920   24906 kubelet.go:1848] skipping pod synchronization - [container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]
I0223 06:58:25.297282   24906 reflector.go:175] Starting reflector *v1beta1.RuntimeClass (0s) from k8s.io/client-go/informers/factory.go:135
I0223 06:58:25.310490   24906 kuberuntime_manager.go:995] updating runtime config through cri with podcidr 10.42.3.0/24
I0223 06:58:25.310619   24906 kubelet_node_status.go:294] Setting node annotation to enable volume controller attach/detach
I0223 06:58:25.312015   24906 docker_service.go:354] docker cri received runtime config &RuntimeConfig{NetworkConfig:&NetworkConfig{PodCidr:10.42.3.0/24,},}
I0223 06:58:25.312196   24906 kubelet_network.go:77] Setting Pod CIDR:  -> 10.42.3.0/24
I0223 06:58:25.381145   24906 kubelet_node_status.go:486] Recording NodeHasSufficientMemory event message for node et-radoi01-p
I0223 06:58:25.381168   24906 kubelet_node_status.go:486] Recording NodeHasNoDiskPressure event message for node et-radoi01-p
I0223 06:58:25.381176   24906 kubelet_node_status.go:486] Recording NodeHasSufficientPID event message for node et-radoi01-p
I0223 06:58:25.381199   24906 kubelet_node_status.go:70] Attempting to register node et-radoi01-p
E0223 06:58:25.388147   24906 kubelet.go:1848] skipping pod synchronization - container runtime status check may not have completed yet
I0223 06:58:25.412151   24906 kubelet_node_status.go:112] Node et-radoi01-p was previously registered
I0223 06:58:25.412223   24906 kubelet_node_status.go:73] Successfully registered node et-radoi01-p
E0223 06:58:25.443308   24906 manager.go:1086] Failed to create existing container: /kubepods/besteffort/podc5295e14-cb35-11ea-b69a-0050568ddbc3/7175b48db7554760181c851bdcda39437b95b4a0717d2670997f3a5b825e9e5b: failed to inspect container "7175b48db7554760181c851bdcda39437b95b4a0717d2670997f3a5b825e9e5b": Error: No such container: 4d3b49e609022d772e54cacb5cb3f448783bf84489382a586801afcc7e7ad6ee
I0223 06:58:25.480701   24906 kubelet_node_status.go:486] Recording NodeNotSchedulable event message for node et-radoi01-p
E0223 06:58:25.589543   24906 kubelet.go:1848] skipping pod synchronization - container runtime status check may not have completed yet
I0223 06:58:25.823506   24906 manager.go:277] Recovery completed
E0223 06:58:25.991342   24906 kubelet.go:1848] skipping pod synchronization - container runtime status check may not have completed yet
W0223 06:58:26.058286   24906 docker_container.go:372] ignore error image "sha256:ee374eabb8a1a8396b007ee897e949083604a01c5795783357e362f188083c61" not found while inspecting docker container "7175b48db7554760181c851bdcda39437b95b4a0717d2670997f3a5b825e9e5b": no such image: "sha256:ee374eabb8a1a8396b007ee897e949083604a01c5795783357e362f188083c61"
E0223 06:58:26.062916   24906 manager.go:1086] Failed to create existing container: /kubepods/besteffort/podc5295e14-cb35-11ea-b69a-0050568ddbc3/7175b48db7554760181c851bdcda39437b95b4a0717d2670997f3a5b825e9e5b: failed to inspect container "7175b48db7554760181c851bdcda39437b95b4a0717d2670997f3a5b825e9e5b": Error: No such container: 4d3b49e609022d772e54cacb5cb3f448783bf84489382a586801afcc7e7ad6ee
W0223 06:58:26.064997   24906 manager.go:1131] Failed to process watch event {EventType:0 Name:/kubepods/besteffort/podc5295e14-cb35-11ea-b69a-0050568ddbc3/7175b48db7554760181c851bdcda39437b95b4a0717d2670997f3a5b825e9e5b WatchSource:0}: failed to inspect container "7175b48db7554760181c851bdcda39437b95b4a0717d2670997f3a5b825e9e5b": Error: No such container: 4d3b49e609022d772e54cacb5cb3f448783bf84489382a586801afcc7e7ad6ee
F0223 06:58:26.119027   24906 kubelet.go:1386] Failed to start ContainerManager failed to build map of initial containers from runtime: no PodsandBox found with Id '4d3b49e609022d772e54cacb5cb3f448783bf84489382a586801afcc7e7ad6ee'

This basically happens in a loop: Rancher simply restarts the kubelet container every time it crashes.
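The fatal line names the sandbox ID the kubelet can no longer find. As a small sketch (not part of the original troubleshooting), the ID can be pulled out of the kubelet container's logs on the node — this assumes the RKE-style setup where the kubelet itself runs as a Docker container named `kubelet`, so the input would come from `docker logs kubelet 2>&1`:

```shell
# Extract the missing PodSandbox ID from the kubelet's fatal log line.
# The pattern matches the kubelet's actual (misspelled) "PodsandBox" message.
extract_sandbox_id() {
  sed -n "s/.*no PodsandBox found with Id '\([0-9a-f]*\)'.*/\1/p"
}

# Example with the fatal line from the logs above:
echo "F0223 06:58:26.119027 24906 kubelet.go:1386] Failed to start ContainerManager failed to build map of initial containers from runtime: no PodsandBox found with Id '4d3b49e609022d772e54cacb5cb3f448783bf84489382a586801afcc7e7ad6ee'" | extract_sandbox_id
```

The extracted ID can then be cross-checked against `docker ps -a` on the node to confirm the sandbox container really is gone.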

I’ve been through dozens of these upgrades; this is the first time a node has gotten stuck.

This is on Rancher 2.4.13.

Any ideas what the reason is and how to solve this?

I managed to get the stuck upgrade moving again by deleting the affected node from the cluster. The upgrade then continued, but got stuck again on the next node. That is where I found the actual cause: Prometheus, which had meanwhile been rescheduled from the removed node1 to node2.
Prometheus was doing heavy work, slowing the node down so much that the health checks the upgrade depends on timed out.
Once I manually deleted the Prometheus containers on that node, the upgrade continued.
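For the last step, a hedged sketch of what "manually deleting the Prometheus containers" could look like on the affected node. It assumes Docker CLI access and relies on the `k8s_<container>_<pod>_<namespace>_…` naming scheme the kubelet uses for pod containers; the `DRY_RUN` guard is my own addition for safety and only prints the commands by default:

```shell
#!/bin/sh
# Sketch: force-remove the Prometheus pod containers on the affected node
# so the upgrade health checks can pass again. The kubelet will recreate
# anything the scheduler still wants on this node.
DRY_RUN=${DRY_RUN:-1}   # default: print commands instead of executing them

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

# Show which containers would match first:
run docker ps --filter name=k8s_prometheus --format "{{.ID}} {{.Names}}"
# Then force-remove them:
run sh -c 'docker ps -q --filter name=k8s_prometheus | xargs -r docker rm -f'
```

Run with `DRY_RUN=0` once the listing looks right. Draining or cordoning the node via Rancher first would be the gentler alternative, but during a stuck upgrade that is not always possible.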