Hi All,
I was running Rancher 2.2.8 with K8s v1.15.3. I upgraded to Rancher 2.3.5 and upgraded the K8s clusters to version 1.16.8. This is a custom in-house cluster running on CoreOS. The upgrade went bad, and now I have some nodes on the new version and some on the old, and nothing works correctly.
> kubectl get nodes
NAME            STATUS   ROLES               AGE     VERSION
us2-k8smgmt01   Ready    controlplane,etcd   7d23h   v1.16.8
us2-k8smgmt02   Ready    controlplane,etcd   8d      v1.16.8
us2-k8smgmt03   Ready    controlplane,etcd   8d      v1.16.8
us2-k8swkr01    Ready    <none>              8d      v1.15.3
us2-k8swkr02    Ready    <none>              8d      v1.15.3
> kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
cattle-system cattle-cluster-agent-69b78db68c-zblhf 0/1 ContainerCreating 0 71m
cattle-system cattle-cluster-agent-6d479b7989-hcn8w 0/1 ContainerCreating 0 70m
cattle-system cattle-node-agent-99z8c 1/1 Running 2 7d22h
cattle-system cattle-node-agent-kdvlg 1/1 Running 2 7d22h
cattle-system cattle-node-agent-pxf9w 1/1 Running 0 108m
cattle-system cattle-node-agent-sjq6n 1/1 Running 1 7d19h
cattle-system cattle-node-agent-xxhv5 1/1 Running 0 121m
cattle-system kube-api-auth-jmpts 1/1 Running 2 7d22h
cattle-system kube-api-auth-x97qh 1/1 Running 1 7d19h
cattle-system kube-api-auth-zr5xz 1/1 Running 2 7d22h
ingress-nginx default-http-backend-55c845698b-rzbpv 0/1 Pending 0 7d21h
ingress-nginx default-http-backend-6b9ff64bb8-5cpdn 0/1 Pending 0 101m
kube-system canal-n6b84 0/2 Init:CrashLoopBackOff 8 18m
kube-system canal-rs6h8 0/2 Init:CrashLoopBackOff 8 18m
kube-system canal-sxddd 0/2 Init:CrashLoopBackOff 8 18m
kube-system canal-vbv5w 0/2 Init:CrashLoopBackOff 8 18m
kube-system canal-wzgpt 0/2 Init:CrashLoopBackOff 8 18m
kube-system coredns-84f569cb6d-cgrbt 0/1 Pending 0 7d21h
kube-system coredns-autoscaler-579dd56944-7nrj5 0/1 Pending 0 7d21h
kube-system coredns-autoscaler-f78bc4f7d-9r68p 0/1 Pending 0 7d21h
kube-system metrics-server-676c489dc7-hz4wj 0/1 Pending 0 7d21h
kube-system metrics-server-c9cfdd487-w8dvg 0/1 Pending 0 7d21h
kube-system rke-coredns-addon-deploy-job-l2b2c 0/1 Completed 0 8d
kube-system rke-ingress-controller-deploy-job-7xx5h 0/1 Completed 0 8d
kube-system rke-metrics-addon-deploy-job-pbsjz 0/1 Completed 0 8d
kube-system rke-network-plugin-deploy-job-gwssp 0/1 Completed 0 7d22h
> kubectl -n kube-system logs canal-n6b84 --all-containers
ls: cannot access '/calico-secrets': No such file or directory
Wrote Calico CNI binaries to /host/opt/cni/bin
CNI plugin version: v3.13.0
/host/secondary-bin-dir is non-writeable, skipping
Using CNI config template from CNI_NETWORK_CONFIG environment variable.
CNI config: {
  "name": "k8s-pod-network",
  "cniVersion": "0.3.1",
  "plugins": [
    {
      "type": "calico",
      "log_level": "WARNING",
      "datastore_type": "kubernetes",
      "nodename": "us2-k8smgmt02",
      "mtu": 1450,
      "ipam": {
        "type": "host-local",
        "subnet": "usePodCidr"
      },
      "policy": {
        "type": "k8s"
      },
      "kubernetes": {
        "kubeconfig": "/etc/kubernetes/ssl/kubecfg-kube-node.yaml"
      }
    },
    {
      "type": "portmap",
      "snat": true,
      "capabilities": {"portMappings": true}
    },
    {
      "type": "bandwidth",
      "capabilities": {"bandwidth": true}
    }
  ]
}
Created CNI config 10-canal.conflist
Done configuring CNI. Sleep=false
failed to try resolving symlinks in path "/var/log/pods/kube-system_canal-n6b84_7c0ceb58-ca3d-41e8-b111-a46729c19837/flexvol-driver/8.log": lstat /var/log/pods/kube-system_canal-n6b84_7c0ceb58-ca3d-41e8-b111-a46729c19837/flexvol-driver/8.log: no such file or directory
Error from server (BadRequest): container "calico-node" in pod "canal-n6b84" is waiting to start: PodInitializing
> kubectl describe pod -n kube-system canal-n6b84
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 24m default-scheduler Successfully assigned kube-system/canal-n6b84 to us2-k8smgmt02
Normal Pulled 24m kubelet, us2-k8smgmt02 Container image "rancher/calico-cni:v3.13.0" already present on machine
Normal Created 24m kubelet, us2-k8smgmt02 Created container install-cni
Normal Started 24m kubelet, us2-k8smgmt02 Started container install-cni
Normal Pulled 23m (x5 over 24m) kubelet, us2-k8smgmt02 Container image "rancher/calico-pod2daemon-flexvol:v3.13.0" already present on machine
Normal Created 23m (x5 over 24m) kubelet, us2-k8smgmt02 Created container flexvol-driver
Warning Failed 23m (x5 over 24m) kubelet, us2-k8smgmt02 Error: failed to start container "flexvol-driver": Error response from daemon: error while creating mount source path '/usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds': mkdir /usr/libexec/kubernetes: read-only file system
Warning BackOff 4m27s (x91 over 24m) kubelet, us2-k8smgmt02 Back-off restarting failed container
Any idea where to start?
UPDATE:
Updated the path as suggested here:
Also upgraded Rancher to v2.3.6.
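Roughly, the kind of change involved looks like the sketch below in the cluster's RKE options (edited via the cluster YAML in the Rancher UI). On CoreOS /usr is mounted read-only, so both the kubelet and the canal flexvolume driver have to point at a writable path. The option name canal_flex_volume_plugin_dir and the exact paths are assumptions here; check the Rancher/RKE docs for the version in use:

rancher_kubernetes_engine_config:
  network:
    plugin: canal
    options:
      # Assumed option name: install the flexvol driver into a writable
      # directory instead of /usr/libexec/kubernetes/... (read-only on CoreOS).
      canal_flex_volume_plugin_dir: /opt/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds
  services:
    kubelet:
      extra_args:
        # Point the kubelet's flexvolume plugin directory at the same writable location.
        volume-plugin-dir: /opt/kubernetes/kubelet-plugins/volume/exec
      extra_binds:
        # Bind-mount the host directory into the kubelet container.
        - /opt/kubernetes/kubelet-plugins/volume/exec:/opt/kubernetes/kubelet-plugins/volume/exec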
This did fix canal, but nothing else worked, as coredns and the other pods could not be scheduled. When running kubectl get nodes, the worker nodes no longer had any role label, so I ran:
kubectl label node us2-k8swkr02 node-role.kubernetes.io/worker=worker
kubectl label node us2-k8swkr01 node-role.kubernetes.io/worker=worker
Suddenly the pods started to schedule.
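(To double-check what is set on each node, including the node-role.kubernetes.io/* keys that feed the ROLES column, the full label set can be listed with:

> kubectl get nodes --show-labels
)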
Now I’m facing 3 other issues:
- namespaces are gone
- the two worker nodes are still on the old K8s version
- the worker nodes can't be deleted from the UI; the delete button is grayed out.
Maybe related: Deleted node resurrected and delete button is then disabled · Issue #25242 · rancher/rancher · GitHub
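(One quick check on the first issue, just to see whether the namespaces are actually deleted in the cluster or only missing from the Rancher UI:

> kubectl get namespaces
)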