Rancher 2.x HA CentOS not Accessible

I have been working through a Rancher RKE install on CentOS and have reached the end of the Helm-based install.

In step 7, I see that Rancher was successfully rolled out.

I have an Nginx load balancer and 3 RKE nodes.
The load balancer says it is getting connection refused from the upstream nodes on port 80.

Trying to hit ports 80 and 443 directly on the RKE nodes also fails.
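
For context, the load balancer is a plain TCP (layer 4) passthrough in front of the three nodes, along the lines of the example in the Rancher HA docs (the IPs below are placeholders, not my real ones):

# /etc/nginx/nginx.conf on the load balancer (needs the nginx stream module)
worker_processes 4;

events {
    worker_connections 8192;
}

stream {
    upstream rancher_servers_http {
        least_conn;
        server 10.0.0.11:80 max_fails=3 fail_timeout=5s;
        server 10.0.0.12:80 max_fails=3 fail_timeout=5s;
        server 10.0.0.13:80 max_fails=3 fail_timeout=5s;
    }
    server {
        listen 80;
        proxy_pass rancher_servers_http;
    }

    upstream rancher_servers_https {
        least_conn;
        server 10.0.0.11:443 max_fails=3 fail_timeout=5s;
        server 10.0.0.12:443 max_fails=3 fail_timeout=5s;
        server 10.0.0.13:443 max_fails=3 fail_timeout=5s;
    }
    server {
        listen 443;
        proxy_pass rancher_servers_https;
    }
}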

Based on the troubleshooting page:

I see rancher is running:
kubectl -n cattle-system get pods
NAME                       READY   STATUS    RESTARTS   AGE
rancher-756b996499-2vxwt   1/1     Running   1          31m
rancher-756b996499-94whm   1/1     Running   1          31m
rancher-756b996499-thk5g   1/1     Running   0          31m

One pod had a readiness probe failure, but nothing else remarkable:
Events:
Type     Reason     Age                From               Message
----     ------     ----               ----               -------
Normal   Scheduled                     default-scheduler  Successfully assigned cattle-system/rancher-756b996499-94whm to ranchm02
Normal   Pulling    33m                kubelet, ranchm02  Pulling image "rancher/rancher:v2.4.4"
Normal   Pulled     32m                kubelet, ranchm02  Successfully pulled image "rancher/rancher:v2.4.4"
Normal   Created    32m (x2 over 32m)  kubelet, ranchm02  Created container rancher
Normal   Started    32m (x2 over 32m)  kubelet, ranchm02  Started container rancher
Normal   Pulled     32m                kubelet, ranchm02  Container image "rancher/rancher:v2.4.4" already present on machine
Warning  Unhealthy  32m                kubelet, ranchm02  Readiness probe failed: Get http://10.42.0.3:80/healthz: dial tcp 10.42.0.3:80: connect: connection refused

I see many instances of this error in the logs:
2020/06/03 05:04:20 [ERROR] ProjectController local/p-4qvls [system-image-upgrade-controller] failed with : upgrade cluster local system service alerting failed: get template cattle-global-data:system-library-rancher-monitoring failed, catalogTemplate.management.cattle.io “cattle-global-data/system-library-rancher-monitoring” not found

The RKE nodes have direct internet access.

Is the ERROR in the logs fatal?
How can I diagnose the issue further?

Thanks

Reaching port 80/443 on each of the nodes should get a response from the NGINX ingress controller. If this is failing, there is a network firewall or host firewall actively blocking the connection. What is the response when you use curl -v http://<node_ip> from the node itself, between nodes, and from the PC you are working on? What CentOS version are you using, and can you share the output of docker info?
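
For example, run something like this from the node itself, from another node, and from your PC (substitute a real node IP):

# a working ingress controller normally answers, even if only with a 404 from its default backend
curl -v http://<node_ip>
curl -vk https://<node_ip>

# "connection refused" means nothing is listening on the host port at all,
# which points at the ingress controller or a firewall rather than Rancher itself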

In all cases, curl says Connection refused.
CentOS version 7.8.2003.

Docker info from one of the RKE nodes:
[root@ranchm01 azahra]# docker info
Client:
Debug Mode: false

Server:
Containers: 21
Running: 14
Paused: 0
Stopped: 7
Images: 10
Server Version: 19.03.11
Storage Driver: overlay2
Backing Filesystem: xfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 7ad184331fa3e55e52b890ea95e65ba581ae3429
runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
init version: fec3683
Security Options:
seccomp
Profile: default
Kernel Version: 3.10.0-1127.8.2.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 7.638GiB
Name: ranchm01
ID: BODW:MTRA:3ECP:F7QT:B6QL:74QA:PTP7:42EC:HTHI:MQBN:N27P:I7LL
Docker Root Dir: /var/lib/docker
Debug Mode: false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false

Firewall:
[root@ranchm01 azahra]# firewall-cmd --list-all
public (active)
target: default
icmp-block-inversion: no
interfaces: ens192
sources:
services: dhcpv6-client ssh
ports: 22/tcp 80/tcp 443/tcp 2376/tcp 2379/tcp 2380/tcp 6443/tcp 8472/udp 9099/tcp 10250/tcp 10254/tcp 30000-32767/tcp 30000-32767/udp
protocols:
masquerade: no
forward-ports:
source-ports:
icmp-blocks:
rich rules:

So it's either the firewall or the NGINX ingress controller not running. Can you share the output from the commands shown on https://rancher.com/docs/rancher/v2.x/en/troubleshooting/kubernetes-resources/#ingress-controller? For the firewall, did you reload the rules after adding them? Does it work when you turn the firewall off momentarily, to confirm whether it is the cause?
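
From memory, that page boils down to roughly the following (label selectors can vary slightly between versions):

kubectl -n ingress-nginx get pods -o wide
kubectl -n ingress-nginx logs -l app=ingress-nginx

# to rule out the host firewall for a moment (re-enable it afterwards):
firewall-cmd --reload
systemctl stop firewalld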

Looks like there is no ingress controller:
kubectl -n ingress-nginx get pods -o wide
No resources found in ingress-nginx namespace.

But from the last step of the Rancher install:
kubectl -n cattle-system rollout status deploy/rancher
deployment "rancher" successfully rolled out

How do I find out where things went wrong with the ingress?

It’s controlled by the ingress parameter in the cluster.yml as shown in https://rancher.com/docs/rke/latest/en/config-options/add-ons/ingress-controllers/
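
The relevant section of cluster.yml looks roughly like this (nginx is deployed by default when the section is omitted):

ingress:
  provider: nginx    # set to "none" to skip deploying the ingress controller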

If it's not disabled (set to none), then the job that deploys it should show what happened:

kubectl -n kube-system logs -l job-name=rke-ingress-controller-deploy-job
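
If that returns nothing, it's also worth checking whether the job exists and completed at all, for example:

kubectl -n kube-system get jobs
kubectl -n kube-system describe job rke-ingress-controller-deploy-job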

The logs yielded nothing, so I blew it all away and started again.
This time I generated the cluster.yml with rke config rather than using the minimal one.
I’m pleased to say it worked!
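
For anyone who hits this later, the rebuild was roughly the following (the file name is the default; node details come from the interactive prompts):

rke config --name cluster.yml     # answer the prompts for the three nodes
rke up --config cluster.yml
kubectl -n cattle-system rollout status deploy/rancher
kubectl -n ingress-nginx get pods -o wide    # the ingress controller pods show up this time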

Thanks so much for your help @superseb!