**RKE2 version:**
rke2 version v1.24.12+rke2r1 (1cbcfe3c873df5a7555cde3211a144055312b2a5)
Node CPU architecture, operating system and version (identical on all 3 nodes):
Ubuntu 20.04.2 LTS
Linux Cube-1 5.4.0-81-generic #91-Ubuntu SMP Thu Jul 15 19:09:17 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Cluster configuration:
3 servers and 0 agents
cube-1 Ready control-plane,etcd,master 25d v1.24.12+rke2r1
cube-2 Ready control-plane,etcd,master 25d v1.24.12+rke2r1
cube-3 Ready control-plane,etcd,master 25d v1.24.12+rke2r1
(The 3 machines form one cluster. The domain name points to an nginx instance on a separate machine, which proxies upstream to the 3 nodes. All connections are over the LAN.)
Problem description
After RKE2 HA was deployed successfully on three machines, Rancher HA was installed through Helm, but the installation did not fully succeed and the Rancher service cannot be accessed normally.
Repro steps:
- kubectl create namespace cattle-system
- kubectl -n cattle-system create secret tls tls-rancher-ingress \
  --cert=tls.crt \
  --key=tls.key
(The certificate was issued by GoDaddy.com, Inc., so the certificate itself should not be the problem.)
- helm install rancher rancher-stable/rancher \
  --namespace cattle-system \
  --set hostname=dev.*****.com \
  --set bootstrapPassword=**** \
  --set ingress.tls.source=secret \
  --set ingress.ingressClassName=nginx
- kubectl -n cattle-system rollout status deploy/rancher
kubectl -n cattle-system get deploy rancher
Both commands complete normally.
- Two helm-operation-***** containers failed (see logs below).
The Rancher service is not accessible, and the Rancher server logs on all three nodes show errors (see below).
Expected results:
The Rancher UI can be accessed normally, and Rancher takes over management of the RKE2 cluster.
Actual results:
The Rancher UI cannot be accessed.
RKE2 itself appears to be healthy.
kubectl works normally.
The Rancher server cannot be accessed.
Supplementary explanation
An error is reported during the Helm installation:
Error: INSTALLATION FAILED: failed to create resource: Internal error occurred: failed calling webhook "validate.nginx.ingress.kubernetes.io": failed to call webhook: Post "https://rke2-ingress-nginx-controller-admission.kube-system.svc:443/networking/v1/ingresses?timeout=10s": context deadline exceeded
From inside the Rancher server pod's container, external access is slow and the other Rancher pods cannot be reached. From the host itself, access to other hosts and to external addresses is very fast.
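The webhook timeout above names an in-cluster Service. A small sketch, not part of the original report, that pulls the namespace and Service name out of such an error message so the matching endpoints can be inspected. The function name `webhook_service_from_error` is hypothetical.

```shell
# Read a Helm/kubectl error message on stdin and print "<namespace> <service>"
# for the first in-cluster webhook URL of the form https://<svc>.<ns>.svc...
# Prints nothing if no such URL is present.
webhook_service_from_error() {
  sed -nE 's#.*https://([a-z0-9-]+)\.([a-z0-9-]+)\.svc[:/].*#\2 \1#p' | head -n 1
}
```

For the error in this report the function prints `kube-system rke2-ingress-nginx-controller-admission`, which can then be checked with `kubectl -n kube-system get endpoints rke2-ingress-nginx-controller-admission`. If the Service has endpoints but the call still times out, the API server probably cannot reach the pod network, which would match the pod-to-pod symptoms described here.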
helm-operation-****** log (the pod kept running and finally ended in Error)
Defaulted container "helm" out of: helm, proxy
Waiting for Kubernetes API to be available
Waiting for Kubernetes API to be available
Waiting for Kubernetes API to be available
Waiting for Kubernetes API to be available
Waiting for Kubernetes API to be available
Waiting for Kubernetes API to be available
Waiting for Kubernetes API to be available
Waiting for Kubernetes API to be available
Waiting for Kubernetes API to be available
Waiting for Kubernetes API to be available
Waiting for Kubernetes API to be available
Waiting for Kubernetes API to be available
Waiting for Kubernetes API to be available
rancher-webhook-****** log
time=“2023-05-17T13:37:43Z” level=info msg=“Rancher-webhook version v0.3.3 (1b9d829) is starting”
time=“2023-05-17T13:37:44Z” level=info msg=“generated self-signed CA certificate CN=dynamiclistener-ca@1684330664,O=dynamiclistener-org: notBefore=2023-05-17 13:37:44.002591079 +0000 UTC notAfter=2033-05-14 13:37:44.002591079 +0000 UTC”
time=“2023-05-17T13:37:44Z” level=info msg=“Listening on :9443”
time=“2023-05-17T13:37:44Z” level=info msg=“certificate CN=dynamic,O=dynamic signed by CN=dynamiclistener-ca@1684330664,O=dynamiclistener-org: notBefore=2023-05-17 13:37:44 +0000 UTC notAfter=2033-05-14 13:37:44 +0000 UTC”
time=“2023-05-17T13:37:44Z” level=warning msg=“dynamiclistener [::]:9443: no cached certificate available for preload - deferring certificate load until storage initialization or first client request”
time=“2023-05-17T13:37:44Z” level=info msg=“Creating new TLS secret for cattle-system/cattle-webhook-tls (count: 1): map[listener.cattle.io/cn-rancher-webhook.cattle-system.svc:rancher-webhook.cattle-system.svc listener.cattle.io/fingerprint:SHA1=51A487101660C38F49AA8B5EDB4CA80D6CBA20FD]”
time=“2023-05-17T13:37:44Z” level=info msg=“Active TLS secret cattle-system/cattle-webhook-tls (ver=11899619) (count 1): map[listener.cattle.io/cn-rancher-webhook.cattle-system.svc:rancher-webhook.cattle-system.svc listener.cattle.io/fingerprint:SHA1=51A487101660C38F49AA8B5EDB4CA80D6CBA20FD]”
time=“2023-05-17T13:37:44Z” level=info msg=“Starting rbac.authorization.k8s.io/v1, Kind=Role controller”
time=“2023-05-17T13:37:44Z” level=info msg=“Starting rbac.authorization.k8s.io/v1, Kind=RoleBinding controller”
time=“2023-05-17T13:37:44Z” level=info msg=“Starting management.cattle.io/v3, Kind=ClusterRoleTemplateBinding controller”
time=“2023-05-17T13:37:44Z” level=info msg=“Starting management.cattle.io/v3, Kind=Cluster controller”
time=“2023-05-17T13:37:44Z” level=info msg=“Starting apiregistration.k8s.io/v1, Kind=APIService controller”
time=“2023-05-17T13:37:44Z” level=info msg=“Starting rbac.authorization.k8s.io/v1, Kind=ClusterRole controller”
time=“2023-05-17T13:37:44Z” level=info msg=“Starting /v1, Kind=Secret controller”
time=“2023-05-17T13:37:44Z” level=info msg=“Starting management.cattle.io/v3, Kind=GlobalRole controller”
time=“2023-05-17T13:37:44Z” level=info msg=“Sleeping for 15 seconds then applying webhook config”
time=“2023-05-17T13:37:44Z” level=info msg=“Starting apiextensions.k8s.io/v1, Kind=CustomResourceDefinition controller”
time=“2023-05-17T13:37:44Z” level=info msg=“Starting management.cattle.io/v3, Kind=ProjectRoleTemplateBinding controller”
time=“2023-05-17T13:37:44Z” level=info msg=“Starting management.cattle.io/v3, Kind=PodSecurityAdmissionConfigurationTemplate controller”
time=“2023-05-17T13:37:44Z” level=info msg=“Starting rbac.authorization.k8s.io/v1, Kind=ClusterRoleBinding controller”
time=“2023-05-17T13:37:44Z” level=info msg=“Starting provisioning.cattle.io/v1, Kind=Cluster controller”
time=“2023-05-17T13:37:44Z” level=info msg=“Starting management.cattle.io/v3, Kind=RoleTemplate controller”
time=“2023-05-17T13:37:44Z” level=info msg=“Updating TLS secret for cattle-system/cattle-webhook-tls (count: 1): map[listener.cattle.io/cn-rancher-webhook.cattle-system.svc:rancher-webhook.cattle-system.svc listener.cattle.io/fingerprint:SHA1=51A487101660C38F49AA8B5EDB4CA80D6CBA20FD]”
rancher-****** log (excerpt)
2023/05/17 13:36:15 [ERROR] Failed to connect to peer wss://10.42.0.162/v3/connect [local ID=10.42.1.181]: dial tcp 10.42.0.162:443: i/o timeout
2023/05/17 13:36:19 [ERROR] Failed to connect to peer wss://10.42.2.236/v3/connect [local ID=10.42.1.181]: dial tcp 10.42.2.236:443: i/o timeout
2023/05/17 13:36:30 [ERROR] Failed to connect to peer wss://10.42.0.162/v3/connect [local ID=10.42.1.181]: dial tcp 10.42.0.162:443: i/o timeout
2023/05/17 13:36:34 [ERROR] Failed to connect to peer wss://10.42.2.236/v3/connect [local ID=10.42.1.181]: dial tcp 10.42.2.236:443: i/o timeout
2023/05/17 13:36:45 [ERROR] Failed to connect to peer wss://10.42.0.162/v3/connect [local ID=10.42.1.181]: dial tcp 10.42.0.162:443: i/o timeout
2023/05/17 13:36:49 [ERROR] Failed to connect to peer wss://10.42.2.236/v3/connect [local ID=10.42.1.181]: dial tcp 10.42.2.236:443: i/o timeout
2023/05/17 13:37:00 [ERROR] Failed to connect to peer wss://10.42.0.162/v3/connect [local ID=10.42.1.181]: dial tcp 10.42.0.162:443: i/o timeout
2023/05/17 13:37:04 [ERROR] Failed to connect to peer wss://10.42.2.236/v3/connect [local ID=10.42.1.181]: dial tcp 10.42.2.236:443: i/o timeout
2023/05/17 13:37:15 [ERROR] Failed to connect to peer wss://10.42.0.162/v3/connect [local ID=10.42.1.181]: dial tcp 10.42.0.162:443: i/o timeout
2023/05/17 13:37:19 [ERROR] Failed to connect to peer wss://10.42.2.236/v3/connect [local ID=10.42.1.181]: dial tcp 10.42.2.236:443: i/o timeout
W0517 13:37:29.577586 33 warnings.go:80] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
2023/05/17 13:37:29 [INFO] Watching metadata for fleet.cattle.io/v1alpha1, Kind=BundleNamespaceMapping
2023/05/17 13:37:29 [INFO] Watching metadata for gitjob.cattle.io/v1, Kind=GitJob
2023/05/17 13:37:29 [INFO] Watching metadata for fleet.cattle.io/v1alpha1, Kind=ClusterRegistration
2023/05/17 13:37:29 [INFO] Watching metadata for fleet.cattle.io/v1alpha1, Kind=BundleDeployment
2023/05/17 13:37:29 [INFO] Watching metadata for fleet.cattle.io/v1alpha1, Kind=ImageScan
2023/05/17 13:37:29 [INFO] Watching metadata for fleet.cattle.io/v1alpha1, Kind=ClusterRegistrationToken
2023/05/17 13:37:29 [INFO] Watching metadata for fleet.cattle.io/v1alpha1, Kind=GitRepo
2023/05/17 13:37:29 [INFO] Watching metadata for fleet.cattle.io/v1alpha1, Kind=GitRepoRestriction
2023/05/17 13:37:29 [INFO] Watching metadata for fleet.cattle.io/v1alpha1, Kind=Content
2023/05/17 13:37:29 [INFO] Watching metadata for fleet.cattle.io/v1alpha1, Kind=ClusterGroup
2023/05/17 13:37:30 [ERROR] Failed to connect to peer wss://10.42.0.162/v3/connect [local ID=10.42.1.181]: dial tcp 10.42.0.162:443: i/o timeout
2023/05/17 13:37:34 [ERROR] Failed to connect to peer wss://10.42.2.236/v3/connect [local ID=10.42.1.181]: dial tcp 10.42.2.236:443: i/o timeout
2023/05/17 13:37:45 [ERROR] Failed to connect to peer wss://10.42.0.162/v3/connect [local ID=10.42.1.181]: dial tcp 10.42.0.162:443: i/o timeout
2023/05/17 13:37:49 [ERROR] Failed to connect to peer wss://10.42.2.236/v3/connect [local ID=10.42.1.181]: dial tcp 10.42.2.236:443: i/o timeout
2023/05/17 13:38:00 [ERROR] Failed to connect to peer wss://10.42.0.162/v3/connect [local ID=10.42.1.181]: dial tcp 10.42.0.162:443: i/o timeout
2023/05/17 13:38:04 [ERROR] Failed to connect to peer wss://10.42.2.236/v3/connect [local ID=10.42.1.181]: dial tcp 10.42.2.236:443: i/o timeout
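To see at a glance which peers a Rancher pod cannot reach, the repeated errors can be summarised per peer IP. This is a minimal sketch that assumes the log format shown above; the function name `unreachable_peers` is hypothetical.

```shell
# Summarise "Failed to connect to peer" errors from a Rancher pod log on stdin.
# Prints one line per peer in the form "<count> <peer-ip>", sorted by IP.
unreachable_peers() {
  grep -F 'Failed to connect to peer' \
    | sed -E 's#.*wss://([0-9.]+)/v3/connect.*#\1#' \
    | sort | uniq -c | awk '{print $1, $2}'
}
```

Usage would look like `kubectl -n cattle-system logs <rancher-pod> | unreachable_peers`, where `<rancher-pod>` is a placeholder. In this log the local pod is 10.42.1.181 and the failing peers are 10.42.0.162 and 10.42.2.236. Assuming the usual one-subnet-per-node allocation, those addresses sit on the other two nodes, so cross-node pod traffic looks blocked. With the default Canal CNI that would typically mean checking that VXLAN traffic (UDP 8472) is allowed between the nodes, though the report does not state which CNI is in use.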
rke2-ingress-nginx-controller-****** log
W0517 13:19:12.260018 7 controller.go:1112] Service “cattle-system/rancher” does not have any active Endpoint.
W0517 13:19:15.594027 7 controller.go:1112] Service “cattle-system/rancher” does not have any active Endpoint.
W0517 13:21:57.720394 7 controller.go:1112] Service “cattle-system/rancher” does not have any active Endpoint.
W0517 13:22:17.744991 7 controller.go:1112] Service “cattle-system/rancher” does not have any active Endpoint.
W0517 13:24:51.212331 7 controller.go:1112] Service “cattle-system/rancher” does not have any active Endpoint.
W0517 13:24:54.545758 7 controller.go:1112] Service “cattle-system/rancher” does not have any active Endpoint.
W0517 13:24:57.880022 7 controller.go:1112] Service “cattle-system/rancher” does not have any active Endpoint.
I0517 13:26:38.720027 7 store.go:658] “secret was deleted and it is used in ingress annotations. Parsing” secret=“cattle-system/tls-rancher-ingress”
W0517 13:26:38.720409 7 controller.go:1112] Service “cattle-system/rancher” does not have any active Endpoint.
W0517 13:26:38.720447 7 controller.go:1333] Error getting SSL certificate “cattle-system/tls-rancher-ingress”: local SSL certificate cattle-system/tls-rancher-ingress was not found. Using default certificate
W0517 13:26:42.053846 7 controller.go:1018] Error obtaining Endpoints for Service “cattle-system/rancher”: no object matching key “cattle-system/rancher” in local store
W0517 13:26:42.053896 7 controller.go:1333] Error getting SSL certificate “cattle-system/tls-rancher-ingress”: local SSL certificate cattle-system/tls-rancher-ingress was not found. Using default certificate
I0517 13:26:42.053972 7 controller.go:168] “Configuration changes detected, backend reload required”
I0517 13:26:42.115223 7 controller.go:185] “Backend successfully reloaded”
I0517 13:26:42.115463 7 event.go:285] Event(v1.ObjectReference{Kind:“Pod”, Namespace:“kube-system”, Name:“rke2-ingress-nginx-controller-k4fh5”, UID:“71b2cfab-c167-4dd4-99a0-1780f31754b4”, APIVersion:“v1”, ResourceVersion:“10834628”, FieldPath:""}): type: ‘Normal’ reason: ‘RELOAD’ NGINX reload triggered due to a change in configuration
I0517 13:26:45.387983 7 controller.go:168] “Configuration changes detected, backend reload required”
I0517 13:26:45.440137 7 controller.go:185] “Backend successfully reloaded”
I0517 13:26:45.440307 7 event.go:285] Event(v1.ObjectReference{Kind:“Pod”, Namespace:“kube-system”, Name:“rke2-ingress-nginx-controller-k4fh5”, UID:“71b2cfab-c167-4dd4-99a0-1780f31754b4”, APIVersion:“v1”, ResourceVersion:“10834628”, FieldPath:""}): type: ‘Normal’ reason: ‘RELOAD’ NGINX reload triggered due to a change in configuration
You must allow the WebSocket protocol through your load balancer or firewall.
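One way to check this from outside the cluster is to send a WebSocket upgrade request through the load balancer and look at the status line. The sketch below is illustrative only: the function names are hypothetical, and the URL must point at an endpoint that really accepts WebSocket upgrades, otherwise a non-101 answer says nothing about the load balancer.

```shell
# Succeeds only if the HTTP response head on stdin starts with status 101.
is_ws_upgrade() {
  head -n 1 | grep -Eq '^HTTP/[0-9.]+ 101( |$)'
}

# Send a WebSocket handshake to the given URL and test for "101 Switching Protocols".
# The Sec-WebSocket-Key is the sample nonce from RFC 6455; -k skips TLS verification.
check_ws_upgrade() {
  curl -sk -i -N --http1.1 --max-time 5 \
    -H 'Connection: Upgrade' \
    -H 'Upgrade: websocket' \
    -H 'Sec-WebSocket-Version: 13' \
    -H 'Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==' \
    "$1" | is_ws_upgrade
}
```

For example, `check_ws_upgrade "https://<rancher-hostname>/<websocket-path>" && echo ok`, with both placeholders filled in for the environment. If nginx on the separate machine proxies at the HTTP level, it usually needs to forward the `Upgrade` and `Connection` headers and use HTTP/1.1 toward the upstream for the handshake to pass.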