A Kubernetes upgrade of a Rancher-managed cluster is stuck mid-upgrade. It started with the first node (all roles), and it seems the kubelet is unable to create a specific container, but it doesn't indicate why.
Error message in the UI:
[controlPlane] Failed to upgrade Control Plane: [[host node1 not ready]]
Logs from kubelet:
I0223 06:58:25.201695 24906 server.go:1126] Started kubelet
I0223 06:58:25.201808 24906 server.go:145] Starting to listen on 0.0.0.0:10250
E0223 06:58:25.201816 24906 kubelet.go:1308] Image garbage collection failed once. Stats initialization may not have completed yet: failed to get imageFs info: unable to find data in memory cache
I0223 06:58:25.202573 24906 server.go:393] Adding debug handlers to kubelet server.
I0223 06:58:25.203445 24906 fs_resource_analyzer.go:64] Starting FS ResourceAnalyzer
I0223 06:58:25.204065 24906 volume_manager.go:263] The desired_state_of_world populator starts
I0223 06:58:25.204072 24906 volume_manager.go:265] Starting Kubelet Volume Manager
I0223 06:58:25.204222 24906 reflector.go:175] Starting reflector *v1.CSIDriver (0s) from k8s.io/client-go/informers/factory.go:135
I0223 06:58:25.204460 24906 desired_state_of_world_populator.go:139] Desired state populator starts to run
I0223 06:58:25.281104 24906 factory.go:356] Registering Docker factory
I0223 06:58:25.281127 24906 factory.go:54] Registering systemd factory
I0223 06:58:25.281533 24906 clientconn.go:106] parsed scheme: "unix"
I0223 06:58:25.281588 24906 clientconn.go:106] scheme "unix" not registered, fallback to default scheme
I0223 06:58:25.281785 24906 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{unix:///run/containerd/containerd.sock <nil> 0 <nil>}] <nil> <nil>}
I0223 06:58:25.281798 24906 clientconn.go:933] ClientConn switching balancer to "pick_first"
I0223 06:58:25.282277 24906 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc000af6590, {CONNECTING <nil>}
I0223 06:58:25.282336 24906 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc000af6590, {READY <nil>}
I0223 06:58:25.282924 24906 factory.go:137] Registering containerd factory
I0223 06:58:25.283264 24906 factory.go:101] Registering Raw factory
I0223 06:58:25.283436 24906 manager.go:1158] Started watching for new ooms in manager
I0223 06:58:25.284974 24906 manager.go:272] Starting recovery of all containers
I0223 06:58:25.287829 24906 status_manager.go:158] Starting to sync pod status with apiserver
I0223 06:58:25.287861 24906 kubelet.go:1824] Starting kubelet main sync loop.
E0223 06:58:25.287920 24906 kubelet.go:1848] skipping pod synchronization - [container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]
I0223 06:58:25.297282 24906 reflector.go:175] Starting reflector *v1beta1.RuntimeClass (0s) from k8s.io/client-go/informers/factory.go:135
I0223 06:58:25.310490 24906 kuberuntime_manager.go:995] updating runtime config through cri with podcidr 10.42.3.0/24
I0223 06:58:25.310619 24906 kubelet_node_status.go:294] Setting node annotation to enable volume controller attach/detach
I0223 06:58:25.312015 24906 docker_service.go:354] docker cri received runtime config &RuntimeConfig{NetworkConfig:&NetworkConfig{PodCidr:10.42.3.0/24,},}
I0223 06:58:25.312196 24906 kubelet_network.go:77] Setting Pod CIDR: -> 10.42.3.0/24
I0223 06:58:25.381145 24906 kubelet_node_status.go:486] Recording NodeHasSufficientMemory event message for node et-radoi01-p
I0223 06:58:25.381168 24906 kubelet_node_status.go:486] Recording NodeHasNoDiskPressure event message for node et-radoi01-p
I0223 06:58:25.381176 24906 kubelet_node_status.go:486] Recording NodeHasSufficientPID event message for node et-radoi01-p
I0223 06:58:25.381199 24906 kubelet_node_status.go:70] Attempting to register node et-radoi01-p
E0223 06:58:25.388147 24906 kubelet.go:1848] skipping pod synchronization - container runtime status check may not have completed yet
I0223 06:58:25.412151 24906 kubelet_node_status.go:112] Node et-radoi01-p was previously registered
I0223 06:58:25.412223 24906 kubelet_node_status.go:73] Successfully registered node et-radoi01-p
E0223 06:58:25.443308 24906 manager.go:1086] Failed to create existing container: /kubepods/besteffort/podc5295e14-cb35-11ea-b69a-0050568ddbc3/7175b48db7554760181c851bdcda39437b95b4a0717d2670997f3a5b825e9e5b: failed to inspect container "7175b48db7554760181c851bdcda39437b95b4a0717d2670997f3a5b825e9e5b": Error: No such container: 4d3b49e609022d772e54cacb5cb3f448783bf84489382a586801afcc7e7ad6ee
I0223 06:58:25.480701 24906 kubelet_node_status.go:486] Recording NodeNotSchedulable event message for node et-radoi01-p
E0223 06:58:25.589543 24906 kubelet.go:1848] skipping pod synchronization - container runtime status check may not have completed yet
I0223 06:58:25.823506 24906 manager.go:277] Recovery completed
E0223 06:58:25.991342 24906 kubelet.go:1848] skipping pod synchronization - container runtime status check may not have completed yet
W0223 06:58:26.058286 24906 docker_container.go:372] ignore error image "sha256:ee374eabb8a1a8396b007ee897e949083604a01c5795783357e362f188083c61" not found while inspecting docker container "7175b48db7554760181c851bdcda39437b95b4a0717d2670997f3a5b825e9e5b": no such image: "sha256:ee374eabb8a1a8396b007ee897e949083604a01c5795783357e362f188083c61"
E0223 06:58:26.062916 24906 manager.go:1086] Failed to create existing container: /kubepods/besteffort/podc5295e14-cb35-11ea-b69a-0050568ddbc3/7175b48db7554760181c851bdcda39437b95b4a0717d2670997f3a5b825e9e5b: failed to inspect container "7175b48db7554760181c851bdcda39437b95b4a0717d2670997f3a5b825e9e5b": Error: No such container: 4d3b49e609022d772e54cacb5cb3f448783bf84489382a586801afcc7e7ad6ee
W0223 06:58:26.064997 24906 manager.go:1131] Failed to process watch event {EventType:0 Name:/kubepods/besteffort/podc5295e14-cb35-11ea-b69a-0050568ddbc3/7175b48db7554760181c851bdcda39437b95b4a0717d2670997f3a5b825e9e5b WatchSource:0}: failed to inspect container "7175b48db7554760181c851bdcda39437b95b4a0717d2670997f3a5b825e9e5b": Error: No such container: 4d3b49e609022d772e54cacb5cb3f448783bf84489382a586801afcc7e7ad6ee
F0223 06:58:26.119027 24906 kubelet.go:1386] Failed to start ContainerManager failed to build map of initial containers from runtime: no PodsandBox found with Id '4d3b49e609022d772e54cacb5cb3f448783bf84489382a586801afcc7e7ad6ee'
This basically happens in a loop; Rancher simply restarts the kubelet container every time it crashes.
I’ve been through dozens of these upgrades, and this is the first time a node has gotten stuck.
This is on Rancher 2.4.13.
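In case it helps narrow things down: a quick check along these lines (just a rough Python sketch wrapping the Docker CLI on the node; the two IDs are copied from the failed-inspect and fatal log lines above) should show whether either ID still exists in Docker's state:

#!/usr/bin/env python3
# Rough check: do the container/sandbox IDs from the kubelet errors still exist in Docker?
# IDs copied from the kubelet log above; "docker inspect" exits non-zero for unknown IDs.
import subprocess

ids = {
    "container kubelet fails to inspect": "7175b48db7554760181c851bdcda39437b95b4a0717d2670997f3a5b825e9e5b",
    "sandbox from the fatal error": "4d3b49e609022d772e54cacb5cb3f448783bf84489382a586801afcc7e7ad6ee",
}

for label, cid in ids.items():
    result = subprocess.run(
        ["docker", "inspect", "--format", "{{.Id}} {{.State.Status}}", cid],
        capture_output=True, text=True,
    )
    if result.returncode == 0:
        print(f"{label}: exists -> {result.stdout.strip()}")
    else:
        print(f"{label}: not found -> {result.stderr.strip()}")

If the sandbox is genuinely gone from Docker while the pod's cgroup entry still references it, that would at least suggest the kubelet is tripping over stale container state rather than a runtime outage, though I haven't confirmed that yet.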
Any ideas what the reason is and how to solve this?