Is there better documentation for installing Rancher onto an RKE2 cluster?

I’m at wit’s end with Rancher, and I am seven attempts into trying to get a functional install just for a basic proof of concept. I am attempting to set up a three-node RKE2 cluster in a VMware environment using RHEL boxes, and then install Rancher. The intention is to use our own certificates from a private CA. I can get the RKE2 cluster built every time, and it results in a functional Kubernetes cluster.
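(For context, each server node gets the standard RKE2 quick-start install, roughly like the sketch below; on the second and third nodes, the config file pointing at the first server is written before the service is started.)

curl -sfL https://get.rke2.io | sh -
systemctl enable --now rke2-server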

Every time I install Rancher, it fails, and not always in the same way. Most of the time the installation “fails” with the following message:

Error: INSTALLATION FAILED: 1 error occurred:
* Internal error occurred: failed calling webhook "validate.nginx.ingress.kubernetes.io": failed to call webhook: Post "https://rke2-ingress-nginx-controller-admission.kube-system.svc:443/networking/v1/ingresses?timeout=10s": context deadline exceeded

However, the install continues anyway and results in a non-functional Rancher environment. I thought I had found the solution to the error above when I configured RHEL’s NetworkManager to ignore the cali and flannel interfaces, but on that attempt, while the Rancher install “succeeded,” the Rancher UI was incredibly slow and basically didn’t work. On this latest attempt, following the same steps as before, I’m back to the ingress error above and a failed installation.
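(For reference, the NetworkManager change follows the RKE2 known-issue guidance: tell NetworkManager to leave the CNI-created interfaces unmanaged, then reload it. The file name below is my own choice; only the [keyfile] contents matter.)

cat <<'EOF' > /etc/NetworkManager/conf.d/rke2-canal.conf
[keyfile]
unmanaged-devices=interface-name:cali*;interface-name:flannel*
EOF
systemctl reload NetworkManager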

I’ve spent many, many hours Googling these issues and either find unrelated problems, or people with very similar issues that got no replies. Is there better documentation somewhere that actually results in a functional implementation of Rancher? If it is this difficult to spin up a simple demo environment, I’m really struggling to entertain the thought of ever running this in production.

Please share the configuration file for your master node.

Can you please share the output of the following commands on one of your RKE2 control plane nodes?

kubectl get po -n kube-system
kubectl get po -n calico-system

More specifically, I am looking at the health of ingress-nginx, kube-proxy, and all Calico pods (calico-node and the controller).

My apologies, as I’m relatively new to k8s and Rancher; which file would this be? The YAML file used during the installation of RKE2 to create/join the cluster, or something else?

Certainly.

kubectl get po -n kube-system

NAME                                                      READY   STATUS      RESTARTS        AGE
cloud-controller-manager-xvmexrnch0001.localdomain.local  1/1     Running     2 (4d22h ago)   5d
cloud-controller-manager-xvmexrnch0002.localdomain.local  1/1     Running     3 (4d22h ago)   5d
cloud-controller-manager-xvmexrnch0003.localdomain.local  1/1     Running     2 (4d22h ago)   5d
etcd-xvmexrnch0001.localdomain.local                      1/1     Running     1               5d
etcd-xvmexrnch0002.localdomain.local                      1/1     Running     2               5d
etcd-xvmexrnch0003.localdomain.local                      1/1     Running     2               5d
helm-install-rke2-canal-wff85                             0/1     Completed   0               5d
helm-install-rke2-coredns-8sqjn                           0/1     Completed   0               5d
helm-install-rke2-ingress-nginx-2dcgl                     0/1     Completed   0               5d
helm-install-rke2-metrics-server-2nv8f                    0/1     Completed   0               5d
helm-install-rke2-snapshot-controller-65vgr               0/1     Completed   1               5d
helm-install-rke2-snapshot-controller-crd-2rx27           0/1     Completed   0               5d
helm-install-rke2-snapshot-validation-webhook-qgfl8       0/1     Completed   0               5d
kube-apiserver-xvmexrnch0001.localdomain.local            1/1     Running     1               5d
kube-apiserver-xvmexrnch0002.localdomain.local            1/1     Running     1               5d
kube-apiserver-xvmexrnch0003.localdomain.local            1/1     Running     1               5d
kube-controller-manager-xvmexrnch0001.localdomain.local   1/1     Running     2 (4d22h ago)   5d
kube-controller-manager-xvmexrnch0002.localdomain.local   1/1     Running     3 (4d22h ago)   5d
kube-controller-manager-xvmexrnch0003.localdomain.local   1/1     Running     2 (4d22h ago)   5d
kube-proxy-xvmexrnch0001.localdomain.local                1/1     Running     2 (4d22h ago)   4d22h
kube-proxy-xvmexrnch0002.localdomain.local                1/1     Running     2 (4d22h ago)   4d22h
kube-proxy-xvmexrnch0003.localdomain.local                1/1     Running     2 (4d22h ago)   4d22h
kube-scheduler-xvmexrnch0001.localdomain.local            1/1     Running     1 (4d22h ago)   5d
kube-scheduler-xvmexrnch0002.localdomain.local            1/1     Running     1 (4d22h ago)   5d
kube-scheduler-xvmexrnch0003.localdomain.local            1/1     Running     1 (4d22h ago)   5d
rke2-canal-96v99                                          2/2     Running     2 (4d22h ago)   5d
rke2-canal-cgv67                                          2/2     Running     2 (4d22h ago)   5d
rke2-canal-klh9q                                          2/2     Running     2 (4d22h ago)   5d
rke2-coredns-rke2-coredns-565dfc7d75-mnq9l                1/1     Running     1 (4d22h ago)   5d
rke2-coredns-rke2-coredns-565dfc7d75-rs5tx                1/1     Running     1 (4d22h ago)   5d
rke2-coredns-rke2-coredns-autoscaler-6c48c95bf9-lxznf     1/1     Running     1 (4d22h ago)   5d
rke2-ingress-nginx-controller-24hcv                       1/1     Running     1 (4d22h ago)   5d
rke2-ingress-nginx-controller-rmsmj                       1/1     Running     1 (4d22h ago)   5d
rke2-ingress-nginx-controller-w5l8l                       1/1     Running     1 (4d22h ago)   5d
rke2-metrics-server-c9c78bd66-zf69j                       1/1     Running     2 (4d22h ago)   5d
rke2-snapshot-controller-6f7bbb497d-vdxs7                 1/1     Running     1 (4d22h ago)   5d
rke2-snapshot-validation-webhook-65b5675d5c-dkvn6         1/1     Running     2 (4d22h ago)   5d
kubectl get po -n calico-system

No resources found in calico-system namespace.
kubectl get ns

NAME              STATUS   AGE
default           Active   5d
kube-node-lease   Active   5d
kube-public       Active   5d
kube-system       Active   5d

I’ve never seen that calico-system namespace, even on the Rancher install that technically reached the “success screen” (but didn’t end up working well). If it’s related at all, I did not install cert-manager, as we are using our own certs from our on-prem CA and the documentation indicated it isn’t needed.

Yes, the YAML file which you are using to install RKE2 (typically /etc/rancher/rke2/config.yaml).
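For a cluster like yours, the file on the second and third servers usually looks something like this (every value below is a placeholder, not your actual config):

server: https://<first-server-hostname>:9345
token: <shared-cluster-token>
tls-san:
  - <extra-hostname-or-vip-for-the-kubernetes-api>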

@OUberLord Thank you for the previous output.

Here are some additional questions:

  • Can you show me the helm install command you used to install Rancher?
  • Can you show the output of the command kubectl get secret -n cattle-system 3-4 minutes after the helm install command?
  • Please also share the output of kubectl get ing rancher -n cattle-system -o yaml

Some observations:

  • Don’t worry about Calico: you are deploying Canal instead, and I do see those pods are active.
  • kube-proxy, CoreDNS, and ingress-nginx seem to be healthy, so that part should work.
  • The fact that cert-manager is not installed is an important piece of information, because it means you have to configure things in a very specific way to make Rancher work. However, your error message does not make much sense to me in that context.

The goal will be to (see the command sketch after this list):

  • Check that you are using the right options to install Rancher without cert-manager.
  • Verify that the Rancher ingress object is well configured.
  • Check that there is a certificate secret in the right namespace.
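A minimal set of checks for all three, assuming the default release and namespace names, would be something like:

helm get values rancher -n cattle-system
kubectl get ingress -n cattle-system
kubectl get secret tls-rancher-ingress tls-ca -n cattle-system

The first shows the options the release was actually installed with, the second should list a rancher ingress, and the third confirms both certificate secrets exist where the chart expects them.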

@belgaied2 Absolutely, and thank you for taking the time to help me out.

One thing I did notice: when I showed you the namespace listing earlier, I had forgotten that I had reverted the servers and rebuilt the RKE2 cluster, but had NOT yet proceeded to the Rancher install in that iteration of the environment. That is why cattle-system was “missing” from that output.

The helm install command I’ve been using is:

helm install rancher rancher-stable/rancher \
  --namespace cattle-system \
  --set hostname=rancher-demo.localdomain.local \
  --set bootstrapPassword=admin \
  --set ingress.tls.source=secret \
  --set privateCA=true \
  --set tls=external

Regarding the secrets, this is where I’m wondering if I could be doing something incorrectly. During one of my earliest install attempts, I was seeing errors after the helm install about missing secrets (the ones that the documentation has you create manually). Since then I have been creating the namespace and the secrets first, and then running the helm install. The commands look like the following:

kubectl create namespace cattle-system
kubectl -n cattle-system create secret tls tls-rancher-ingress --cert=tls.crt --key=tls.key
kubectl -n cattle-system create secret generic tls-ca --from-file=cacerts.pem=cacerts.pem
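(A quick sanity check that can be run on those files before creating the secrets, in case the certificates themselves are a factor: the first command prints the subject and SANs, which should cover the Rancher hostname, and the second verifies the chain against the CA bundle.)

openssl x509 -in tls.crt -noout -subject -ext subjectAltName
openssl verify -CAfile cacerts.pem tls.crt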

For clarity’s sake, should I be creating these secrets first and then running the helm install, or is that incorrect? If I create them first and then run the install, here is the output after 3-4 minutes:

kubectl get secret -n cattle-system

NAME                                    TYPE                                  DATA   AGE
bootstrap-secret                        Opaque                                1      3m59s
git-webhook-api-service-token-qvpcj     kubernetes.io/service-account-token   3      2m17s
helm-operation-7t9lp                    Opaque                                3      21s
helm-operation-dhl29                    Opaque                                3      83s
pod-impersonation-helm-op-nss67-token   kubernetes.io/service-account-token   3      86s
pod-impersonation-helm-op-zfqxn-token   kubernetes.io/service-account-token   3      23s
rancher-token-lwzm4                     kubernetes.io/service-account-token   3      2m59s
serving-cert                            kubernetes.io/tls                     2      2m8s
sh.helm.release.v1.rancher.v1           helm.sh/release.v1                    1      3m59s
tls-ca                                  Opaque                                1      4m47s
tls-rancher                             kubernetes.io/tls                     2      2m8s
tls-rancher-ingress                     kubernetes.io/tls                     2      4m52s
tls-rancher-internal                    kubernetes.io/tls                     2      2m7s
tls-rancher-internal-ca                 kubernetes.io/tls                     2      2m8s

As for the ingress output:

kubectl get ing rancher -n cattle-system -o yaml

Error from server (NotFound): ingresses.networking.k8s.io "rancher" not found

Also, at the same point in time after the install attempt, this is what the namespaces look like:

kubectl get ns

NAME                          STATUS   AGE
cattle-fleet-system           Active   3m12s
cattle-global-data            Active   3m55s
cattle-global-nt              Active   3m54s
cattle-impersonation-system   Active   3m40s
cattle-system                 Active   6m43s
default                       Active   6d6h
fleet-default                 Active   4m3s
fleet-local                   Active   4m44s
kube-node-lease               Active   6d6h
kube-public                   Active   6d6h
kube-system                   Active   6d6h
local                         Active   4m8s
p-6tkpk                       Active   3m49s
p-g2z6x                       Active   3m49s

It may be noteworthy that, approximately 15 minutes after the install attempt, the secrets list looks like this, with a lot of what seem to be duplicates being created:

kubectl get secret -n cattle-system

NAME                                    TYPE                                  DATA   AGE
bootstrap-secret                        Opaque                                1      13m
cattle-webhook-ca                       kubernetes.io/tls                     2      3m30s
cattle-webhook-tls                      kubernetes.io/tls                     2      3m30s
git-webhook-api-service-token-qvpcj     kubernetes.io/service-account-token   3      12m
helm-operation-7t9lp                    Opaque                                3      10m
helm-operation-dhl29                    Opaque                                3      11m
helm-operation-l968s                    Opaque                                3      3m56s
helm-operation-ljs8h                    Opaque                                3      7m3s
helm-operation-m4w2h                    Opaque                                3      8m5s
helm-operation-p67qj                    Opaque                                3      9m8s
helm-operation-tdqtr                    Opaque                                3      6m1s
helm-operation-wd7hz                    Opaque                                3      4m58s
pod-impersonation-helm-op-4mhj6-token   kubernetes.io/service-account-token   3      8m7s
pod-impersonation-helm-op-hz4kd-token   kubernetes.io/service-account-token   3      9m10s
pod-impersonation-helm-op-k42x8-token   kubernetes.io/service-account-token   3      7m5s
pod-impersonation-helm-op-nj8cs-token   kubernetes.io/service-account-token   3      3m58s
pod-impersonation-helm-op-nss67-token   kubernetes.io/service-account-token   3      11m
pod-impersonation-helm-op-rvtxb-token   kubernetes.io/service-account-token   3      6m3s
pod-impersonation-helm-op-w6nww-token   kubernetes.io/service-account-token   3      5m
pod-impersonation-helm-op-zfqxn-token   kubernetes.io/service-account-token   3      10m
rancher-token-lwzm4                     kubernetes.io/service-account-token   3      12m
serving-cert                            kubernetes.io/tls                     2      11m
sh.helm.release.v1.rancher-webhook.v1   helm.sh/release.v1                    1      3m35s
sh.helm.release.v1.rancher.v1           helm.sh/release.v1                    1      13m
tls-ca                                  Opaque                                1      14m
tls-rancher                             kubernetes.io/tls                     2      11m
tls-rancher-ingress                     kubernetes.io/tls                     2      14m
tls-rancher-internal                    kubernetes.io/tls                     2      11m
tls-rancher-internal-ca                 kubernetes.io/tls                     2      11m