I had a lot of trouble upgrading Kubernetes to v1.19.6 in Rancher v2.5.5 with Canal as CNI. I tried updating three different Kubernetes versions straight to v1.19.6. The versions were v1.13.5, v1.15.5 and v1.16.15. I know this is quite the jump, but I didn’t find any major breaking changes for my setup when checking the changelog.
When upgrading from v1.13.5, kube-dns wasn’t removed and kept running next to CoreDNS. This happened even though the removal job was deployed and finished successfully. I tried removing the kube-dns deployments manually, but CoreDNS still didn’t deploy properly.
When upgrading from v1.15.5 and v1.16.15 I ran into networking issues. At first nginx-ingress failed to deploy because it couldn’t bind to port 80, which was already in use. When I checked the port on the server, I could see that nginx-proxy (the RKE component responsible for connecting non-control-plane nodes) was actually bound to port 80, according to netstat -tulpn | grep 80 and pstree -sg PID. At first I thought this was an underlying issue on the host, but after some investigation it seemed to be an issue with the CNI.
The CoreDNS deployment returned an error message when it failed to run:
Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "8e361(...)" network for pod "coredns-(...)2": networkPlugin cni failed to set up pod "coredns-6(...)2_kube-system" network: error getting ClusterInformation: connection is unauthorized: clusterinformations.crd.projectcalico.org "default" is forbidden: User "system:node" cannot get resource "clusterinformations" in API group "crd.projectcalico.org" at the cluster scope: RBAC: clusterrole.rbac.authorization.k8s.io "calico-node" not found failed to clean up sandbox container "8e361(...)" network for pod "coredns-6(...)2": networkPlugin cni failed to teardown pod "coredns-6(...)2_kube-system" network: error getting ClusterInformation: connection is unauthorized: clusterinformations.crd.projectcalico.org "default" is forbidden: User "system:node" cannot get resource "clusterinformations" in API group "crd.projectcalico.org" at the cluster scope: RBAC: clusterrole.rbac.authorization.k8s.io "calico-node" not found]
After some digging I realized that there was indeed no “calico-node” ClusterRole. After some Googling I found the proper “calico-node” ClusterRole (calico/templates/calico-node-rbac.yaml) and deployed it manually, after which the networking issues disappeared.
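For reference, this is roughly the part of that ClusterRole that matters for the RBAC error above. This is an abbreviated sketch, not the full file; the actual calico-node-rbac.yaml grants many more permissions:

```yaml
# Abbreviated sketch; the real calico/templates/calico-node-rbac.yaml
# contains many more rules than this one.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: calico-node
rules:
  # The permission the CNI plugin was denied in the error above:
  # getting ClusterInformation in the crd.projectcalico.org API group.
  - apiGroups: ["crd.projectcalico.org"]
    resources:
      - clusterinformations
    verbs: ["get", "list", "watch"]
```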
I did find an old comment on GitHub saying that the ClusterRoleBinding canal-calico should be modified to point to calico instead of calico-node. I checked, and I do indeed have a calico ClusterRole which seems identical to the ClusterRole I deployed manually.
What went wrong here? Is an upgrade spanning this many minor versions not recommended? Is changing the ClusterRole manually a sustainable approach for the future? Am I missing something?
Any input is much appreciated!
Edit: After digging some more, I found a comment in the config of the rke-network-plugin-deploy-job: “Rancher-specific: Change the calico-node ClusterRole name to calico for backwards compatibility.” So as far as I can tell, the ClusterRoles and ClusterRoleBindings are correct.
But I still have both a ClusterRoleBinding/calico-node and a ClusterRoleBinding/canal-calico. The former tries to bind the missing ClusterRole/calico-node to the missing ServiceAccount/calico-node and the Group system:nodes. The latter binds ClusterRole/calico to the ServiceAccount/canal.
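To make the comparison concrete, this is a sketch of the working binding as I understand it (not a verbatim dump from the cluster, and the namespace is my assumption based on canal running in kube-system):

```yaml
# Sketch of the canal-calico binding as I understand it; not copied
# verbatim from the cluster.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: canal-calico
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: calico  # renamed from calico-node by Rancher for backwards compatibility
subjects:
  - kind: ServiceAccount
    name: canal
    namespace: kube-system  # assumption: canal runs in kube-system
```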
When looking at the RBAC error I started wondering why it is using the User system:node instead of the ServiceAccount canal that is the subject of the ClusterRoleBinding. I then found another event: MountVolume.SetUp failed for volume canal-token-(...): failed to sync secret cache: timed out waiting for the condition. Why mounting the secret failed I haven’t figured out yet.