[SOLVED] failed calling webhook "webhook.cert-manager.io": Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": dial tcp 10.#.#.#:443: connect: no route to host

This was hassling me for a while. I eventually got it solved with some help from @brandond on Slack. Posting here to help others and to make it easier to find from Google later.

TLDR version - you have to disable firewalld.
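
For anyone who wants the commands, here is a minimal sketch of what disabling it can look like on a systemd host such as CentOS 7. The function name disable_firewalld is just for illustration, and it assumes you run it as root on every RKE2 node.

```shell
# Sketch only: run as root on each RKE2 node.
disable_firewalld() {
  # Stop firewalld right away and keep it from starting at boot.
  systemctl disable --now firewalld || return 1
  # Print the resulting state. is-active exits non-zero when the unit
  # is not running, so ignore its exit status here.
  systemctl is-active firewalld || true
}
```

After defining it, call disable_firewalld on each node. Once firewalld is stopped, the second command should report that the unit is no longer active.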

Longer explanation:

I was trying to install Rancher 2.6.2 on CentOS 7 with as secure a config as I could, so I was using RKE2 with SELinux enabled and firewalld on, with the ports opened per Rancher Docs: Port Requirements.

I have three Rancher nodes for RKE2 (Kubernetes 1.21) named rch001 through rch003, plus a DNS alias rch101 that points to all three of them for access. I installed the rke2-selinux policies (GitHub - rancher/rke2-selinux: RKE2 selinux + RPM packaging for selinux) and set “selinux: true” in /etc/rancher/rke2/config.yaml (from SELinux - RKE2 - Rancher's Next Generation Kubernetes Distribution) before starting the RKE2 install.

I followed the normal HA Rancher on RKE2 install instructions, using Helm to install Rancher. Things went OK, but cert-manager seemed flaky: sometimes fine, sometimes not, with nothing obviously wrong. Then the Rancher install via Helm always failed with "no route to host" while trying to reach the running cert-manager-webhook pod, even though tracing the IPs through the service and the pod showed that everything lined up.
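
In case it helps, this is roughly how the service-to-pod tracing can be done with kubectl. It is a sketch, not exactly what I typed: the function name trace_webhook is made up, and the namespace and service name are taken from the URL in the error message (cert-manager-webhook.cert-manager.svc).

```shell
# Sketch: follow the webhook from Service to Endpoints to Pod.
trace_webhook() {
  # The Service and its ClusterIP.
  kubectl -n cert-manager get service cert-manager-webhook -o wide
  # The pod IP(s) currently backing that Service.
  kubectl -n cert-manager get endpoints cert-manager-webhook
  # The cert-manager pods, with their IPs and the node each one runs on.
  kubectl -n cert-manager get pods -o wide
}
```

If the endpoint IP matches the webhook pod's IP and the pod is Running, the Kubernetes side lines up and the problem is more likely in the network path between nodes.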

Looking at cert-manager’s documentation, I tried their troubleshooting steps, which start with installing the cmctl tool and running cmctl check api. That failed with the exact same error.

While messing around and trying anything, I ended up disabling firewalld and walked away for a while. When I came back and tried to demonstrate the failure to a coworker, cmctl was succeeding, so I tried installing Rancher and that worked too.

I redid the install with firewalld enabled and started running cmctl on all nodes. I noticed that 0 to 2 of them would succeed (usually 1) while the others failed, and which host succeeded would rotate around. I swore at the apparently nondeterministic behavior for a while.
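
A loop like the following makes the per-node pattern easier to see. This is a sketch under some assumptions that are not part of my setup description: SSH access to each node, and cmctl plus a working kubeconfig on each of them. The function name check_api_on_nodes is made up.

```shell
# Sketch: run "cmctl check api" on each node and print one line per node.
check_api_on_nodes() {
  for node in "$@"; do
    # Only the exit status matters here, so discard the command output.
    if ssh "$node" cmctl check api </dev/null >/dev/null 2>&1; then
      echo "$node: ok"
    else
      echo "$node: FAILED"
    fi
  done
}
```

For example, check_api_on_nodes rch001 rch002 rch003, repeated a few times, shows whether the set of succeeding nodes changes between runs.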

The next day I remembered the DNS alias: each node was reaching any one of the three RKE2 nodes at random, depending on what DNS returned. I pointed the alias at just one of the hosts, and then two nodes failed and the only one that succeeded was the node the cert-manager-webhook pod was running on.
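
To check whether an alias fans out to several nodes, something like this works on a glibc-based host such as CentOS 7. It is only a sketch, and the function name alias_addresses is made up.

```shell
# Sketch: list the distinct IPv4 addresses a hostname resolves to.
alias_addresses() {
  # getent prints one line per address and socket type,
  # so keep the first column and de-duplicate it.
  getent ahostsv4 "$1" | awk '{print $1}' | sort -u
}
```

Running alias_addresses rch101 and getting more than one address back means a client using the alias can land on a different node from one connection to the next, which would explain results that seem to rotate.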

I found Compatibility with Kubernetes Platform Providers | cert-manager, which mentions that on GKE extra rules need to be set up or cert-manager won't be able to talk to the control plane. That looked like what was happening here.

I posted the question on Slack and @brandond responded with a pointer to Known Issues and Limitations - RKE2 - Rancher's Next Generation Kubernetes Distribution, which I’d read right past (I was doing the step below it, adding the interfaces to NetworkManager). It also reminded me that the Calico docs state that firewalld has to be disabled, and RKE2's default CNI is Canal, which is Calico and Flannel combined. He also mentioned that Rancher can’t support running with firewalld enabled if the CNIs don’t support it, so it’s not exactly on their roadmap.

Just posting this to help others later, since it helped me now.