Canal/Calico RBAC issues with Kubernetes Upgrade to v1.19.6 in Rancher v2.5.5

I had a lot of trouble upgrading Kubernetes to v1.19.6 in Rancher v2.5.5 with Canal as the CNI. I tried upgrading straight to v1.19.6 from three different Kubernetes versions: v1.13.5, v1.15.5 and v1.16.15. I know this is quite a jump, but I didn’t find any major breaking changes for my setup when checking the changelogs.

When upgrading from v1.13.5, kube-dns wasn’t removed and kept running next to CoreDNS, even though the removal job was deployed and finished successfully. I tried removing the kube-dns deployments manually, but CoreDNS still didn’t deploy properly.
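For reference, this is roughly how one might check for and remove the leftover kube-dns workloads. The deployment names below (kube-dns, kube-dns-autoscaler) are an assumption based on what RKE’s kube-dns addon usually creates, so check what actually exists in kube-system first:

```
# List the DNS workloads currently deployed in kube-system
kubectl -n kube-system get deployments | grep -E 'kube-dns|coredns'

# Remove the leftover kube-dns deployments (names are assumptions; adjust as needed).
# Note: do NOT delete the kube-dns Service - CoreDNS typically reuses it for cluster DNS.
kubectl -n kube-system delete deployment kube-dns kube-dns-autoscaler
```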

When upgrading from v1.15.5 and v1.16.15 I ran into networking issues. At first nginx-ingress failed to deploy because it couldn’t bind to port 80, which was already in use. When I checked the port on the server, I could see that nginx-proxy, the RKE component responsible for connecting non-control-plane nodes to the control plane, was actually bound to port 80 (according to netstat -tulpn | grep 80 and pstree -sg PID). At first I thought this was an underlying issue on the host, but after some investigation it seemed to be an issue with the CNI.
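For reference, the checks mentioned above looked roughly like this (<PID> is the PID netstat reports for the listener on port 80):

```
# Show which process is listening on port 80
netstat -tulpn | grep ':80'

# Walk up the process tree of that PID to see which service/container owns it
pstree -sg <PID>
```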

The CoreDNS deployment did return an error message when it failed to run:

Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "8e361(...)" network for pod "coredns-(...)2": networkPlugin cni failed to set up pod "coredns-6(...)2_kube-system" network: error getting ClusterInformation: connection is unauthorized: clusterinformations.crd.projectcalico.org "default" is forbidden: User "system:node" cannot get resource "clusterinformations" in API group "crd.projectcalico.org" at the cluster scope: RBAC: clusterrole.rbac.authorization.k8s.io "calico-node" not found failed to clean up sandbox container "8e361(...)" network for pod "coredns-6(...)2": networkPlugin cni failed to teardown pod "coredns-6(...)2_kube-system" network: error getting ClusterInformation: connection is unauthorized: clusterinformations.crd.projectcalico.org "default" is forbidden: User "system:node" cannot get resource "clusterinformations" in API group "crd.projectcalico.org" at the cluster scope: RBAC: clusterrole.rbac.authorization.k8s.io "calico-node" not found]

After some digging I realized that there was indeed no “calico-node” ClusterRole. After some Googling I found the proper “calico-node” ClusterRole (calico/templates/calico-node-rbac.yaml) and deployed it manually, after which the networking issues disappeared.
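For illustration, this is roughly the workflow I mean; it assumes you copy the calico-node RBAC objects out of the upstream canal manifest into a separate file (/tmp/calico-node-rbac.yaml is just an example path):

```
# Fetch the upstream canal manifest containing the calico-node RBAC definitions
curl -sL https://docs.projectcalico.org/manifests/canal.yaml -o /tmp/canal.yaml

# Manually copy the ClusterRole "calico-node" (and, if needed, its binding)
# into /tmp/calico-node-rbac.yaml, then apply only those objects
kubectl apply -f /tmp/calico-node-rbac.yaml
```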

I also found an old comment on GitHub saying that the ClusterRoleBinding canal-calico should be modified to point to calico instead of calico-node. I checked, and I do indeed have a calico ClusterRole which seems identical to the manually deployed one.

What went wrong here? Is an upgrade spanning this many minor versions not recommended? Is changing the ClusterRole manually a sustainable approach for the future? Am I missing something?

Any input is much appreciated!

Edit: After digging some more, I found a comment in the config of the rke-network-plugin-deploy-job: “Rancher-specific: Change the calico-node ClusterRole name to calico for backwards compatibility.” So as far as I can tell the ClusterRoles and ClusterRoleBindings are correct.

But I still have both a ClusterRoleBinding/calico-node and a ClusterRoleBinding/canal-calico. The former tries to bind the missing ClusterRole/calico-node to the (also missing) ServiceAccount/calico-node and the Group system:nodes. The latter binds ClusterRole/calico-node to the ServiceAccount/calico.
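To compare the two bindings and see which roles and subjects they actually reference, plain kubectl inspection is enough (nothing Rancher-specific assumed here):

```
# Show which ClusterRole and which subjects each binding references
kubectl get clusterrolebinding calico-node canal-calico -o wide

# List the calico-related ClusterRoles that actually exist
kubectl get clusterrole | grep -i calico
```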

When looking at the RBAC error I started wondering why it uses the User system:node instead of the ServiceAccount canal that is the subject of the ClusterRoleBinding. I then found another event on the canal deployment: MountVolume.SetUp failed for volume canal-token-(...): failed to sync secret cache: timed out waiting for the condition. Why mounting the secret failed I haven’t figured out yet.
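To dig into the token mount failure, these are the places I would look; the label selector k8s-app=canal is an assumption, so adjust it to whatever labels your canal pods actually carry:

```
# Check that the canal ServiceAccount and its token secret exist
kubectl -n kube-system get serviceaccount canal -o yaml
kubectl -n kube-system get secrets | grep canal-token

# Inspect the events on the canal pods for the MountVolume.SetUp error
kubectl -n kube-system describe pod -l k8s-app=canal
```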


A big THANK YOU!

I also upgraded an old Rancher 2.2 installation to 2.5.5, then upgraded to k8s v1.19.6, and experienced RBAC errors in calico-kube-controllers which prevented other services from starting, too.

Failed to create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "9432925e5ab168c63b37d973e3ebb77d4768dda0ff9019b46826828c2b7d5304" network for pod "memcached-vdqs7-0": networkPlugin cni failed to set up pod "memcached-vdqs7-0_memcached-pmlbw" network: error getting ClusterInformation: connection is unauthorized: clusterinformations.crd.projectcalico.org "default" is forbidden: User "system:node" cannot get resource "clusterinformations" in API group "crd.projectcalico.org" at the cluster scope: RBAC: clusterrole.rbac.authorization.k8s.io "calico-node" not found, failed to clean up sandbox container "9432925e5ab168c63b37d973e3ebb77d4768dda0ff9019b46826828c2b7d5304" network for pod "memcached-vdqs7-0": networkPlugin cni failed to teardown pod "memcached-vdqs7-0_memcached-pmlbw" network: error getting ClusterInformation: connection is unauthorized: clusterinformations.crd.projectcalico.org "default" is forbidden: User "system:node" cannot get resource "clusterinformations" in API group "crd.projectcalico.org" at the cluster scope: RBAC: clusterrole.rbac.authorization.k8s.io "calico-node" not found]

After trying the old comment from GitHub (2), which did not help, I finally found your comment.

Your last tip, applying the default calico-node-rbac.yaml from https://docs.projectcalico.org/manifests/canal.yaml by calling

kubectl auth reconcile -f /tmp/calico-node-rbac.yaml

solved the problem for me.
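For anyone following along, a quick way to check whether the reconciled RBAC is in place afterwards (<node-name> is a placeholder for one of your nodes):

```
# The ClusterRole should now exist
kubectl get clusterrole calico-node

# Check whether a node identity may now read the Calico ClusterInformation
kubectl auth can-i get clusterinformations.crd.projectcalico.org \
  --as=system:node:<node-name> --as-group=system:nodes
```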

Now I can deploy new apps from the catalog, like memcached, which wasn’t possible before.


This fixed the problem for me as well, thank you!


I’m glad my comment could help. Just for completeness’ sake: I have since updated multiple clusters using this method and haven’t run into any further issues. However, I’m concerned that this might lead to problems further down the line.

The clusters we’re running are quite old (about 3 years), which might have led to the RBAC issue in the first place, since they have gone through multiple upgrades since their creation.

Have any of you also had issues with kube-dns not being removed?

Hi,
this solved the problem for me:
kubectl edit clusterrole --context=******** --namespace=kube-system system:node
Add the following to the end of the file (i.e. append it to the rules list):

```
- apiGroups:
  - crd.projectcalico.org
  resources:
  - clusterinformations
  verbs:
  - get
- apiGroups:
  - ""
  resources:
  - namespaces
  verbs:
  - get
```

Then save the file.
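If you prefer a non-interactive way to make the same change (for example to script it across clusters), the same two rules can be appended with kubectl patch; this is just a sketch equivalent to the edit above:

```
# Append the two rules from above to the system:node ClusterRole
kubectl patch clusterrole system:node --type=json -p='[
  {"op":"add","path":"/rules/-","value":{"apiGroups":["crd.projectcalico.org"],"resources":["clusterinformations"],"verbs":["get"]}},
  {"op":"add","path":"/rules/-","value":{"apiGroups":[""],"resources":["namespaces"],"verbs":["get"]}}
]'
```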
About upgrading: only upgrade one Kubernetes minor version at a time. That said, this problem isn’t related to the upgrade span anyway.