Hi All,
I was running Rancher 2.2.8 with K8s v1.15.3. I upgraded to Rancher 2.3.5 and upgraded the K8s clusters to version 1.16.8. This is a custom in-house cluster running on CoreOS. The upgrade went bad, and now I have some nodes on the new version and some on the old, and nothing works correctly.
> kubectl get nodes
NAME            STATUS   ROLES               AGE     VERSION
us2-k8smgmt01   Ready    controlplane,etcd   7d23h   v1.16.8
us2-k8smgmt02   Ready    controlplane,etcd   8d      v1.16.8
us2-k8smgmt03   Ready    controlplane,etcd   8d      v1.16.8
us2-k8swkr01    Ready    <none>              8d      v1.15.3
us2-k8swkr02    Ready    <none>              8d      v1.15.3
> kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
cattle-system cattle-cluster-agent-69b78db68c-zblhf 0/1 ContainerCreating 0 71m
cattle-system cattle-cluster-agent-6d479b7989-hcn8w 0/1 ContainerCreating 0 70m
cattle-system cattle-node-agent-99z8c 1/1 Running 2 7d22h
cattle-system cattle-node-agent-kdvlg 1/1 Running 2 7d22h
cattle-system cattle-node-agent-pxf9w 1/1 Running 0 108m
cattle-system cattle-node-agent-sjq6n 1/1 Running 1 7d19h
cattle-system cattle-node-agent-xxhv5 1/1 Running 0 121m
cattle-system kube-api-auth-jmpts 1/1 Running 2 7d22h
cattle-system kube-api-auth-x97qh 1/1 Running 1 7d19h
cattle-system kube-api-auth-zr5xz 1/1 Running 2 7d22h
ingress-nginx default-http-backend-55c845698b-rzbpv 0/1 Pending 0 7d21h
ingress-nginx default-http-backend-6b9ff64bb8-5cpdn 0/1 Pending 0 101m
kube-system canal-n6b84 0/2 Init:CrashLoopBackOff 8 18m
kube-system canal-rs6h8 0/2 Init:CrashLoopBackOff 8 18m
kube-system canal-sxddd 0/2 Init:CrashLoopBackOff 8 18m
kube-system canal-vbv5w 0/2 Init:CrashLoopBackOff 8 18m
kube-system canal-wzgpt 0/2 Init:CrashLoopBackOff 8 18m
kube-system coredns-84f569cb6d-cgrbt 0/1 Pending 0 7d21h
kube-system coredns-autoscaler-579dd56944-7nrj5 0/1 Pending 0 7d21h
kube-system coredns-autoscaler-f78bc4f7d-9r68p 0/1 Pending 0 7d21h
kube-system metrics-server-676c489dc7-hz4wj 0/1 Pending 0 7d21h
kube-system metrics-server-c9cfdd487-w8dvg 0/1 Pending 0 7d21h
kube-system rke-coredns-addon-deploy-job-l2b2c 0/1 Completed 0 8d
kube-system rke-ingress-controller-deploy-job-7xx5h 0/1 Completed 0 8d
kube-system rke-metrics-addon-deploy-job-pbsjz 0/1 Completed 0 8d
kube-system rke-network-plugin-deploy-job-gwssp 0/1 Completed 0 7d22h
> kubectl -n kube-system logs canal-n6b84 --all-containers
ls: cannot access '/calico-secrets': No such file or directory
Wrote Calico CNI binaries to /host/opt/cni/bin
CNI plugin version: v3.13.0
/host/secondary-bin-dir is non-writeable, skipping
Using CNI config template from CNI_NETWORK_CONFIG environment variable.
CNI config: {
  "name": "k8s-pod-network",
  "cniVersion": "0.3.1",
  "plugins": [
    {
      "type": "calico",
      "log_level": "WARNING",
      "datastore_type": "kubernetes",
      "nodename": "us2-k8smgmt02",
      "mtu": 1450,
      "ipam": {
        "type": "host-local",
        "subnet": "usePodCidr"
      },
      "policy": {
        "type": "k8s"
      },
      "kubernetes": {
        "kubeconfig": "/etc/kubernetes/ssl/kubecfg-kube-node.yaml"
      }
    },
    {
      "type": "portmap",
      "snat": true,
      "capabilities": {"portMappings": true}
    },
    {
      "type": "bandwidth",
      "capabilities": {"bandwidth": true}
    }
  ]
}
Created CNI config 10-canal.conflist
Done configuring CNI. Sleep=false
failed to try resolving symlinks in path "/var/log/pods/kube-system_canal-n6b84_7c0ceb58-ca3d-41e8-b111-a46729c19837/flexvol-driver/8.log": lstat /var/log/pods/kube-system_canal-n6b84_7c0ceb58-ca3d-41e8-b111-a46729c19837/flexvol-driver/8.log: no such file or directory
Error from server (BadRequest): container "calico-node" in pod "canal-n6b84" is waiting to start: PodInitializing
> kubectl describe pod -n kube-system canal-n6b84
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 24m default-scheduler Successfully assigned kube-system/canal-n6b84 to us2-k8smgmt02
Normal Pulled 24m kubelet, us2-k8smgmt02 Container image "rancher/calico-cni:v3.13.0" already present on machine
Normal Created 24m kubelet, us2-k8smgmt02 Created container install-cni
Normal Started 24m kubelet, us2-k8smgmt02 Started container install-cni
Normal Pulled 23m (x5 over 24m) kubelet, us2-k8smgmt02 Container image "rancher/calico-pod2daemon-flexvol:v3.13.0" already present on machine
Normal Created 23m (x5 over 24m) kubelet, us2-k8smgmt02 Created container flexvol-driver
Warning Failed 23m (x5 over 24m) kubelet, us2-k8smgmt02 Error: failed to start container "flexvol-driver": Error response from daemon: error while creating mount source path '/usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds': mkdir /usr/libexec/kubernetes: read-only file system
Warning BackOff 4m27s (x91 over 24m) kubelet, us2-k8smgmt02 Back-off restarting failed container
Any idea where to start?
UPDATE:
Updated the path as suggested here:
Also upgraded Rancher to v2.3.6.
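Roughly, the kind of change involved looks like the sketch below in the cluster's RKE options (edited via the cluster YAML in the Rancher UI). On CoreOS /usr is mounted read-only, so both the kubelet and the canal flexvolume driver have to point at a writable path. The option name canal_flex_volume_plugin_dir and the exact paths are assumptions here; check the Rancher/RKE docs for the version in use:

rancher_kubernetes_engine_config:
  network:
    plugin: canal
    options:
      # Assumed option name: install the flexvol driver into a writable
      # directory instead of /usr/libexec/kubernetes/... (read-only on CoreOS).
      canal_flex_volume_plugin_dir: /opt/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds
  services:
    kubelet:
      extra_args:
        # Point the kubelet's flexvolume plugin directory at the same writable location.
        volume-plugin-dir: /opt/kubernetes/kubelet-plugins/volume/exec
      extra_binds:
        # Bind-mount the host directory into the kubelet container.
        - /opt/kubernetes/kubelet-plugins/volume/exec:/opt/kubernetes/kubelet-plugins/volume/exec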
This did fix canal, but nothing else worked, as coredns and the other pods could not be scheduled. When running kubectl get nodes, the worker nodes no longer had any role label, so I ran:
kubectl label node us2-k8swkr02 node-role.kubernetes.io/worker=worker
kubectl label node us2-k8swkr01 node-role.kubernetes.io/worker=worker
Suddenly the pods started to schedule.
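(To double-check what is set on each node, including the node-role.kubernetes.io/* keys that feed the ROLES column, the full label set can be listed with:

> kubectl get nodes --show-labels
)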
Now I’m facing 3 other issues:
- namespaces are gone
- the two worker nodes are still on the old K8s version
- the worker nodes can't be deleted from the UI; the delete button is grayed out.
Maybe related: Deleted node resurrected and delete button is then disabled · Issue #25242 · rancher/rancher · GitHub
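(One quick check on the first issue, just to see whether the namespaces are actually deleted in the cluster or only missing from the Rancher UI:

> kubectl get namespaces
)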