I had a lot of trouble upgrading Kubernetes to v1.19.6 in Rancher v2.5.5 with Canal as the CNI. I tried upgrading from three different Kubernetes versions straight to v1.19.6: v1.13.5, v1.15.5 and v1.16.15. I know this is quite the jump, but when checking the changelogs I didn't find any major breaking changes for my setup.
When upgrading from v1.13.5, kube-dns wasn't removed and kept running next to CoreDNS, even though the removal job was deployed and finished successfully. I tried removing the kube-dns deployment manually, but CoreDNS still didn't deploy properly.
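For reference, this is roughly how I checked for and removed the leftover kube-dns resources (the resource names are the kube-system defaults in my cluster, so they may differ in other setups):

    kubectl -n kube-system get deployments,services | grep -i dns
    kubectl -n kube-system delete deployment kube-dns
    kubectl -n kube-system rollout status deployment coredns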
When upgrading from v1.15.5 and v1.16.15 I ran into networking issues. At first nginx-ingress failed to deploy because it couldn't bind to port 80 (Port 80 is already in use). When I checked the port on the server, I could see that nginx-proxy, the RKE component responsible for connecting non-control-plane nodes to the control plane, was actually bound to port 80 (according to netstat -tulpn | grep 80 and pstree -sg PID). At first I thought this was an underlying issue on the host, but after some investigation it seemed to be an issue with the CNI.
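In case anyone wants to reproduce the checks, this is roughly what I ran on the affected node (nginx-proxy is the container name RKE uses by default on non-control-plane nodes):

    netstat -tulpn | grep ':80 '
    pstree -sg <PID from the netstat output>
    docker ps | grep nginx-proxy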
The CoreDNS deployment did return an error message when it failed to run:
Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "8e361(...)" network for pod "coredns-(...)2": networkPlugin cni failed to set up pod "coredns-6(...)2_kube-system" network: error getting ClusterInformation: connection is unauthorized: clusterinformations.crd.projectcalico.org "default" is forbidden: User "system:node" cannot get resource "clusterinformations" in API group "crd.projectcalico.org" at the cluster scope: RBAC: clusterrole.rbac.authorization.k8s.io "calico-node" not found failed to clean up sandbox container "8e361(...)" network for pod "coredns-6(...)2": networkPlugin cni failed to teardown pod "coredns-6(...)2_kube-system" network: error getting ClusterInformation: connection is unauthorized: clusterinformations.crd.projectcalico.org "default" is forbidden: User "system:node" cannot get resource "clusterinformations" in API group "crd.projectcalico.org" at the cluster scope: RBAC: clusterrole.rbac.authorization.k8s.io "calico-node" not found]
After some digging I realized that there was indeed no calico-node ClusterRole in the cluster. After some googling I found the upstream calico-node ClusterRole (calico/templates/calico-node-rbac.yaml) and deployed it manually, after which the networking issues disappeared.
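Roughly what I did, in case someone hits the same RBAC error (I saved the contents of calico/templates/calico-node-rbac.yaml locally as calico-node-rbac.yaml, and k8s-app=kube-dns is the label my CoreDNS pods carry, so adjust if yours differ):

    kubectl get clusterrole calico-node
    kubectl apply -f calico-node-rbac.yaml
    kubectl -n kube-system delete pod -l k8s-app=kube-dns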
I also found an old comment on GitHub saying that the ClusterRoleBinding canal-calico should be modified to point to calico instead of calico-node. I checked, and I do indeed have a calico ClusterRole that seems identical to the one I deployed manually.
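Comparing the two roles was nothing fancy, just dumping both to files and diffing them:

    kubectl get clusterrole calico -o yaml > calico.yaml
    kubectl get clusterrole calico-node -o yaml > calico-node.yaml
    diff calico.yaml calico-node.yaml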
What went wrong here? Is an upgrade that spans this many minor versions simply not recommended? Is changing the ClusterRole manually a sustainable approach for the future? Am I missing something?
Any input is much appreciated!
Edit: After digging some more, I found a comment in the config of the rke-network-plugin-deploy-job: "Rancher-specific: Change the calico-node ClusterRole name to calico for backwards compatibility". So as far as I can tell the ClusterRoles and ClusterRoleBindings are correct.
But I still have both a ClusterRoleBinding/calico-node and a ClusterRoleBinding/canal-calico. The former tries to bind the missing ClusterRole/calico-node to the missing ServiceAccount/calico-node and the Group system:nodes. The latter binds ClusterRole/calico-node to the ServiceAccount/calico.
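This is how I read the roleRef and subjects of the two bindings, in case I'm misinterpreting them:

    kubectl get clusterrolebinding calico-node -o yaml
    kubectl get clusterrolebinding canal-calico -o yaml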
Looking at the RBAC error, I started wondering why it uses the User system:node instead of the ServiceAccount canal that is the subject of the ClusterRoleBinding. I then found another event on the canal Deployment: MountVolume.SetUp failed for volume canal-token-(...): failed to sync secret cache: timed out waiting for the condition. Why it failed to mount the secret I haven't figured out yet.
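What I still plan to check is whether the canal ServiceAccount and its token secret actually exist and are picked up by the pods, something along these lines (the canal-token- prefix is taken from the event above, and k8s-app=canal is the label my canal pods have):

    kubectl -n kube-system get serviceaccount canal -o yaml
    kubectl -n kube-system get secrets | grep canal-token
    kubectl -n kube-system describe pod -l k8s-app=canal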