Rancher flooding logs with errors

We are currently using Rancher version 2.6.9. While debugging some OIDC errors, we noticed that our error logs are being flooded with error messages whose source we are unable to identify.

You may find the logs of the last 5 minutes in the gist.

We identified several types of errors:

error syncing 'p-n7k7r/creator-project-owner': handler mgmt-auth-prtb-controller: clusters.management.cattle.io "c-fvg4w" not found, requeuing

error syncing 'p-pqzmm/creator-project-owner': handler auth-prov-v2-prtb: failed to update fleet-local/r-cluster-local-view-p-pqzmm-creator-project-owner-nk3rmcfzaj rbac.authorization.k8s.io/v1, Kind=RoleBinding for auth-prov-v2-prtb-rolebinding p-pqzmm/creator-project-owner: RoleBinding.rbac.authorization.k8s.io "r-cluster-local-view-p-pqzmm-creator-project-owner-nk3rmcfzaj" is invalid: [metadata.ownerReferences.apiVersion: Invalid value: "": version must not be empty, metadata.ownerReferences.kind: Invalid value: "": kind must not be empty, metadata.ownerReferences.name: Invalid value: "": name must not be empty], requeuing

error syncing 'grb-ftw5p': handler grb-cluster-sync: Index with name by-cluster does not exist, requeuing

error syncing 'c-p6msc/p-jq749': handler system-image-upgrade-controller: upgrade cluster c-p6msc system service alerting failed: template system-library-rancher-monitoring incompatible with rancher version or cluster's [c-p6msc] kubernetes version, requeuing
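
For reference, checking whether the objects named in these errors still exist looks roughly like this (the IDs are taken from the errors above; this assumes kubectl access to the Rancher management/local cluster):

# Does the cluster referenced in the first error still exist?
kubectl get clusters.management.cattle.io c-fvg4w

# Are there leftover ProjectRoleTemplateBindings in the affected project namespace?
kubectl -n p-n7k7r get projectroletemplatebindings.v3.management.cattle.io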

From our understanding, there could be some jobs running in the background that reference already-deleted objects. Any suggestions on how we could clean this up?

Kind regards

Moritz

We have the same issue. Any suggestions will be appreciated.

I have something vaguely similar spamming my logs with 2.6.11, prtb-related:

2023/03/28 18:35:21 [ERROR] error syncing 'p-89qlv/prtb-z88r4': handler mgmt-auth-prtb-controller: cannot determine project and cluster from p-89qlv, requeuing
2023/03/28 18:35:21 [ERROR] error syncing 'p-89qlv/prtb-l4lfp': handler mgmt-auth-prtb-controller: cannot determine project and cluster from p-89qlv, requeuing
2023/03/28 18:35:21 [ERROR] error syncing 'p-89qlv/prtb-2rqq7': handler mgmt-auth-prtb-controller: cannot determine project and cluster from p-89qlv, requeuing
2023/03/28 18:35:21 [ERROR] error syncing 'p-89qlv/prtb-plm5c': handler mgmt-auth-prtb-controller: cannot determine project and cluster from p-89qlv, requeuing

Fwiw, it looks like these prtbs are from a project that was deleted, but the project namespace (in the Rancher-hosting RKE cluster) is still in a "Terminating" state. Guess this is a remnant of an old bug. I am going to try deleting the remaining resources in that namespace.
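
Listing what is actually left behind in a namespace like that looks something like this (namespace ID from my case, adjust as needed; the api-resources/xargs combination is just a generic way to sweep every namespaced type):

NS="p-89qlv"
# See what the namespace is still waiting on (status.conditions usually names the leftover types)
kubectl get namespace "$NS" -o yaml

# List every namespaced object remaining in the stuck namespace
kubectl api-resources --verbs=list --namespaced -o name \
  | xargs -n1 kubectl get -n "$NS" --ignore-not-found --show-kind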

Well, if anyone else encounters the specific errors I found: it turns out someone deleted some projects in the cluster, and the project namespaces (IN THE CLUSTER HOSTING RANCHER) were hanging around because the mgmt-auth-prtb-controller finalizer couldn't complete. The "projectName:" field in the prtb objects had the project ID, but not the cluster ID.

Editing the YAML for each prtb and prepending c-(clusterID): to the project ID in the projectName: field cleared the backlog; all those project namespaces have now terminated successfully, and the logs quit spamming Rancher's pods.

E.g.:

apiVersion: management.cattle.io/v3
kind: ProjectRoleTemplateBinding
metadata:
  annotations:
    field.cattle.io/creatorId: user-hkvlr
  creationTimestamp: "2020-01-30T15:58:43Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2020-03-16T13:51:30Z"
  finalizers:
  - controller.cattle.io/mgmt-auth-prtb-controller
  generation: 3
  labels:
    cattle.io/creator: norman
  name: prtb-z8vjh
  namespace: p-r7hr6
  resourceVersion: "50750630"
  uid: a18bd8e2-f936-468a-ab63-e4d04e757472
projectName: p-r7hr6
roleTemplateName: project-owner
userName: u-x9pbj
userPrincipalName: local://u-x9pbj

Changing that "projectName" field to:

projectName: c-c4hlm:p-r7hr6

did the trick.
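
If you'd rather not hand-edit the YAML, the same change can be applied as a merge patch; a sketch using the prtb name, namespace and cluster ID from the example above (adjust all three for your own objects):

kubectl -n p-r7hr6 patch projectroletemplatebindings.v3.management.cattle.io prtb-z8vjh \
  --type merge -p '{"projectName":"c-c4hlm:p-r7hr6"}'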

Quick for/awk/sed script to rip through a terminating namespace's PRTBs to fix:

K8SNS="p-r7hr6"   # the terminating project namespace
# Re-point every PRTB in that namespace by prefixing its projectName with the cluster ID
for prtb in $(kubectl -n "$K8SNS" get projectroletemplatebindings.v3.management.cattle.io | grep '^prtb-' | awk '{print $1}'); do
    kubectl -n "$K8SNS" get projectroletemplatebindings.v3.management.cattle.io "$prtb" -o yaml > tmp.yml
    # Only rewrite bare project IDs, so re-running the loop doesn't double-prefix anything
    sed -i 's/^projectName: \(p-.*\)$/projectName: c-c4hlm:\1/' tmp.yml
    kubectl -n "$K8SNS" apply -f tmp.yml
done

Replace the "c-c4hlm" with the correct cluster ID as needed.
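
If you're not sure which cluster ID a project belonged to, the management cluster objects carry both the ID and the display name; something like this should list them (assuming spec.displayName is populated, which it normally is on management.cattle.io/v3 clusters):

kubectl get clusters.management.cattle.io -o custom-columns=ID:.metadata.name,NAME:.spec.displayName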