K8s upgrade from v1.15 to v1.16 gone bad

Hi All,

I was running Rancher 2.2.8 with K8s v1.15.3. I upgraded to Rancher 2.3.5 and upgraded the K8s clusters to version 1.16.8. This is a custom in-house cluster running on CoreOS. The upgrade went bad, and now I have some nodes on the new version and some on the old, and nothing works correctly.

> kubectl get nodes
NAME            STATUS   ROLES               AGE     VERSION
us2-k8smgmt01   Ready    controlplane,etcd   7d23h   v1.16.8
us2-k8smgmt02   Ready    controlplane,etcd   8d      v1.16.8
us2-k8smgmt03   Ready    controlplane,etcd   8d      v1.16.8
us2-k8swkr01    Ready    <none>              8d      v1.15.3
us2-k8swkr02    Ready    <none>              8d      v1.15.3
> kubectl get pods --all-namespaces
NAMESPACE       NAME                                      READY   STATUS                  RESTARTS   AGE
cattle-system   cattle-cluster-agent-69b78db68c-zblhf     0/1     ContainerCreating       0          71m
cattle-system   cattle-cluster-agent-6d479b7989-hcn8w     0/1     ContainerCreating       0          70m
cattle-system   cattle-node-agent-99z8c                   1/1     Running                 2          7d22h
cattle-system   cattle-node-agent-kdvlg                   1/1     Running                 2          7d22h
cattle-system   cattle-node-agent-pxf9w                   1/1     Running                 0          108m
cattle-system   cattle-node-agent-sjq6n                   1/1     Running                 1          7d19h
cattle-system   cattle-node-agent-xxhv5                   1/1     Running                 0          121m
cattle-system   kube-api-auth-jmpts                       1/1     Running                 2          7d22h
cattle-system   kube-api-auth-x97qh                       1/1     Running                 1          7d19h
cattle-system   kube-api-auth-zr5xz                       1/1     Running                 2          7d22h
ingress-nginx   default-http-backend-55c845698b-rzbpv     0/1     Pending                 0          7d21h
ingress-nginx   default-http-backend-6b9ff64bb8-5cpdn     0/1     Pending                 0          101m
kube-system     canal-n6b84                               0/2     Init:CrashLoopBackOff   8          18m
kube-system     canal-rs6h8                               0/2     Init:CrashLoopBackOff   8          18m
kube-system     canal-sxddd                               0/2     Init:CrashLoopBackOff   8          18m
kube-system     canal-vbv5w                               0/2     Init:CrashLoopBackOff   8          18m
kube-system     canal-wzgpt                               0/2     Init:CrashLoopBackOff   8          18m
kube-system     coredns-84f569cb6d-cgrbt                  0/1     Pending                 0          7d21h
kube-system     coredns-autoscaler-579dd56944-7nrj5       0/1     Pending                 0          7d21h
kube-system     coredns-autoscaler-f78bc4f7d-9r68p        0/1     Pending                 0          7d21h
kube-system     metrics-server-676c489dc7-hz4wj           0/1     Pending                 0          7d21h
kube-system     metrics-server-c9cfdd487-w8dvg            0/1     Pending                 0          7d21h
kube-system     rke-coredns-addon-deploy-job-l2b2c        0/1     Completed               0          8d
kube-system     rke-ingress-controller-deploy-job-7xx5h   0/1     Completed               0          8d
kube-system     rke-metrics-addon-deploy-job-pbsjz        0/1     Completed               0          8d
kube-system     rke-network-plugin-deploy-job-gwssp       0/1     Completed               0          7d22h
> kubectl -n kube-system logs canal-n6b84 --all-containers
ls: cannot access '/calico-secrets': No such file or directory
Wrote Calico CNI binaries to /host/opt/cni/bin
CNI plugin version: v3.13.0
/host/secondary-bin-dir is non-writeable, skipping
Using CNI config template from CNI_NETWORK_CONFIG environment variable.
CNI config: {
  "name": "k8s-pod-network",
  "cniVersion": "0.3.1",
  "plugins": [
    {
      "type": "calico",
      "log_level": "WARNING",
      "datastore_type": "kubernetes",
      "nodename": "us2-k8smgmt02",
      "mtu": 1450,
      "ipam": {
          "type": "host-local",
          "subnet": "usePodCidr"
      },
      "policy": {
          "type": "k8s"
      },
      "kubernetes": {
          "kubeconfig": "/etc/kubernetes/ssl/kubecfg-kube-node.yaml"
      }
    },
    {
      "type": "portmap",
      "snat": true,
      "capabilities": {"portMappings": true}
    },
    {
      "type": "bandwidth",
      "capabilities": {"bandwidth": true}
    }
  ]
}
Created CNI config 10-canal.conflist
Done configuring CNI.  Sleep=false
failed to try resolving symlinks in path "/var/log/pods/kube-system_canal-n6b84_7c0ceb58-ca3d-41e8-b111-a46729c19837/flexvol-driver/8.log": lstat /var/log/pods/kube-system_canal-n6b84_7c0ceb58-ca3d-41e8-b111-a46729c19837/flexvol-driver/8.log: no such file or directory
Error from server (BadRequest): container "calico-node" in pod "canal-n6b84" is waiting to start: PodInitializing
> kubectl describe pod -n kube-system canal-n6b84
Events:
  Type     Reason     Age                   From                    Message
  ----     ------     ----                  ----                    -------
  Normal   Scheduled  24m                   default-scheduler       Successfully assigned kube-system/canal-n6b84 to us2-k8smgmt02
  Normal   Pulled     24m                   kubelet, us2-k8smgmt02  Container image "rancher/calico-cni:v3.13.0" already present on machine
  Normal   Created    24m                   kubelet, us2-k8smgmt02  Created container install-cni
  Normal   Started    24m                   kubelet, us2-k8smgmt02  Started container install-cni
  Normal   Pulled     23m (x5 over 24m)     kubelet, us2-k8smgmt02  Container image "rancher/calico-pod2daemon-flexvol:v3.13.0" already present on machine
  Normal   Created    23m (x5 over 24m)     kubelet, us2-k8smgmt02  Created container flexvol-driver
  Warning  Failed     23m (x5 over 24m)     kubelet, us2-k8smgmt02  Error: failed to start container "flexvol-driver": Error response from daemon: error while creating mount source path '/usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds': mkdir /usr/libexec/kubernetes: read-only file system
  Warning  BackOff    4m27s (x91 over 24m)  kubelet, us2-k8smgmt02  Back-off restarting failed container
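
That last Warning looks like the actual blocker: the kubelet is trying to create the flexvolume plugin directory under /usr/libexec, and on CoreOS /usr is mounted read-only, so the mkdir can never succeed. If anyone wants to sanity-check that on a node, something like this (path copied from the error above) should show which mount the directory lives on and whether it is ro:

> findmnt -T /usr/libexec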

Any idea where to start?

UPDATE:
Updated the path as suggested here:
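
For reference, the change in the cluster YAML was roughly along these lines (quoting from memory, so double-check the exact option names against the Rancher/RKE docs for CoreOS before copying):

services:
  kubelet:
    extra_args:
      # kubelet --volume-plugin-dir, moved off the read-only /usr/libexec
      volume-plugin-dir: /opt/kubernetes/kubelet-plugins/volume/exec
    extra_binds:
      - /opt/kubernetes/kubelet-plugins/volume/exec:/opt/kubernetes/kubelet-plugins/volume/exec
network:
  plugin: canal
  options:
    # option name from memory - points the canal flexvol container at the same writable path
    canal_flex_volume_plugin_dir: /opt/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds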

Also upgraded Rancher to v2.3.6.

This did fix canal, but nothing else worked, as coredns and the other pods could not be scheduled. When running kubectl get nodes, the worker nodes no longer had the worker role label, so I ran:

kubectl label node us2-k8swkr02  node-role.kubernetes.io/worker=worker
kubectl label node us2-k8swkr01  node-role.kubernetes.io/worker=worker

and suddenly the pods started to schedule.
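
To double-check that the labels stuck and to watch the pending pods get picked up, something like this works:

> kubectl get nodes --show-labels
> kubectl -n kube-system get pods -o wide -w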

Now I’m facing 3 other issues: