Explore button for RKE Cluster created in Rancher stays Disabled

New to using Rancher. Deployed the Docker image and when I went to the UI, I see a local named cluster already created.

Ignoring that, I created a new cluster using the option "Use existing nodes and create a cluster using RKE", then added my Ubuntu VM with all three roles (etcd, Control Plane, worker).

After that, the cluster is shown in the Active state, but unlike the local cluster, the Explore option is disabled for my created cluster. As a result, there is not much I can do, such as viewing Nodes and Pods or deploying anything via the UI.

This is the view (see the cluster named k81-lemieux):

Note: the logged-in admin user has Admin privileges, with all possible privileges shown.

I need input on where I am going wrong and how I can control this cluster the same way as the local one.

Any input would be helpful :-). I'm kind of stuck at the moment.

I have this exact problem. Did you find a solution?

Have the exact same issue on my newly deployed Rancher 2.6.0, any pointers would be very much appreciated!

There is no active cluster agent connection, you can check the logs of the pod on the node(s) to see why it cannot connect and if that is fixed, it should let you explore the cluster.


I’m having the same issue. It makes the Cluster Explorer unavailable for the cluster on the Rancher server.

Rancher version: v2.6.0 (RKE1).
Kubernetes version: v1.20.11-rancher1-2.
Downstream: custom via Docker on new Ubuntu 20.04 QEMU/KVM VM nodes.

All 3 of my masters and all 3 of my workers can resolve the domain just fine. The CoreDNS pods just show some i/o timeout errors:

linux/amd64, go1.15.3, 054c9ae
[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/ready: Still waiting on: "kubernetes"
I0924 16:47:56.859902       1 trace.go:205] Trace[1427131847]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156 (24-Sep-2021 16:47:26.859) (total time: 30000ms):
Trace[1427131847]: [30.000324176s] [30.000324176s] END
I0924 16:47:56.859923       1 trace.go:205] Trace[911902081]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156 (24-Sep-2021 16:47:26.859) (total time: 30000ms):
Trace[911902081]: [30.000158123s] [30.000158123s] END
E0924 16:47:56.859925       1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156: Failed to watch *v1.Namespace: failed to list *v1.Namespace: Get "https://10.43.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0": dial tcp 10.43.0.1:443: i/o timeout
E0924 16:47:56.859933       1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156: Failed to watch *v1.Service: failed to list *v1.Service: Get "https://10.43.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 10.43.0.1:443: i/o timeout
I0924 16:47:56.859937       1 trace.go:205] Trace[939984059]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156 (24-Sep-2021 16:47:26.859) (total time: 30000ms):
Trace[939984059]: [30.000156648s] [30.000156648s] END
E0924 16:47:56.859941       1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156: Failed to watch *v1.Endpoints: failed to list *v1.Endpoints: Get "https://10.43.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0": dial tcp 10.43.0.1:443: i/o timeout

Does CoreDNS forward its DNS requests to the DNS servers that the host has configured? It would appear so judging by the log output, but it’s not behaving that way.
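For what it’s worth, the CoreDNS config that RKE deploys typically forwards non-cluster queries to the node’s /etc/resolv.conf via the forward plugin. A sketch of the usual default Corefile (exact contents vary by RKE/CoreDNS version, so check the coredns ConfigMap in kube-system on your own cluster):

```
.:53 {
    errors
    health
    # Answers cluster.local and reverse lookups from cluster state
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    prometheus :9153
    # Everything else goes to the node's configured resolvers
    forward . "/etc/resolv.conf"
    cache 30
    loop
    reload
    loadbalance
}
```

So external names should follow the host’s resolvers — unless the pod cannot even reach the kube-apiserver service IP, which is what the i/o timeouts to 10.43.0.1 above suggest.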

I don’t have any wildcards involved in my DNS setup yet; however, I am doing split-horizon DNS, where rancher.mydomain.com resolves to the public- or private-facing IP depending on which side of the network you’re on. It resolves correctly when tested with nslookup and curl from the host:

ubuntu@k8s-master01:~$ curl -k https://rancher.mydomain.com/ping
pong

EDIT: I just worked around this issue by editing the cattle-cluster-agent deployment and adding a hostAlias for rancher.mydomain.com to the pod spec. See Adding entries to Pod /etc/hosts with HostAliases | Kubernetes
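For reference, the shape of that workaround is a hostAliases entry under the deployment’s pod template spec. A minimal sketch — the IP and hostname here are placeholders you would replace with your own Rancher server’s values:

```yaml
# Fragment for: kubectl -n cattle-system edit deployment cattle-cluster-agent
# The hostAliases block goes under spec.template.spec, next to "containers:".
spec:
  template:
    spec:
      hostAliases:
      - ip: "203.0.113.10"           # placeholder: reachable IP of your Rancher server
        hostnames:
        - "rancher.mydomain.com"     # placeholder: your Rancher server hostname
```

This injects the mapping into the pod’s /etc/hosts, bypassing whatever in-cluster DNS resolution is failing for that name.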


I’m new to Rancher, so please excuse if I’m asking obvious questions, I have tried searching both these forums and Google.

What’s the name of the image the cluster agent is spawned from? Same question for the DNS service. The only name that leads me to think of either of these is rancher-agent:v2.6.0, a container that exits gracefully (exit code 0) and whose log doesn’t indicate any problems. It is the one used for originally registering my Kubernetes nodes with Rancher, but was it supposed to keep running?

Edit:
I think I found it: the container was spawned from the same image, but called k8s_cluster-register_cattle-cluster-agent-something, and it did indeed show signs of not being able to resolve the name of the Rancher server. So I suspect I can use the same workaround as @TheRealAlexV while I try to figure out the root cause of this failure.

I have this exact problem. Did you find a solution?

Worked well for me!

Adding detailed steps to fix this and get started:

  • Make sure you have kubectl installed on your machine to access the RKE-installed Kubernetes cluster.

  • Find the kubeconfig file installed by RKE: find / -name kube_config_cluster.yml

  • export KUBECONFIG=$PWD/kube_config_cluster.yml

  • kubectl -n cattle-system get pods -l app=cattle-agent -o wide

  • Check the logs: kubectl logs cattle-cluster-agent- -n cattle-system

In my case the error was:

[root@kubeserver1 ~]# kubectl logs cattle-cluster-agent-6f6584c74c-9krm5 -n cattle-system

INFO: Environment: CATTLE_ADDRESS=10.42.0.12 CATTLE_CA_CHECKSUM=b07578fffc861bbbcce0d6180e72dcbbe61db9b01d98844271d45ec19676a40a CATTLE_CLUSTER=true CATTLE_CLUSTER_AGENT_PORT=tcp://10.43.240.166:80 CATTLE_CLUSTER_AGENT_PORT_443_TCP=tcp://10.43.240.166:443 CATTLE_CLUSTER_AGENT_PORT_443_TCP_ADDR=10.43.240.166 CATTLE_CLUSTER_AGENT_PORT_443_TCP_PORT=443 CATTLE_CLUSTER_AGENT_PORT_443_TCP_PROTO=tcp CATTLE_CLUSTER_AGENT_PORT_80_TCP=tcp://10.43.240.166:80 CATTLE_CLUSTER_AGENT_PORT_80_TCP_ADDR=10.43.240.166 CATTLE_CLUSTER_AGENT_PORT_80_TCP_PORT=80 CATTLE_CLUSTER_AGENT_PORT_80_TCP_PROTO=tcp CATTLE_CLUSTER_AGENT_SERVICE_HOST=10.43.240.166 CATTLE_CLUSTER_AGENT_SERVICE_PORT=80 CATTLE_CLUSTER_AGENT_SERVICE_PORT_HTTP=80 CATTLE_CLUSTER_AGENT_SERVICE_PORT_HTTPS_INTERNAL=443 CATTLE_CLUSTER_REGISTRY= CATTLE_FEATURES=embedded-cluster-api=false,fleet=false,monitoringv1=false,multi-cluster-management=false,multi-cluster-management-agent=true,provisioningv2=false,rke2=false CATTLE_INTERNAL_ADDRESS= CATTLE_IS_RKE=true CATTLE_K8S_MANAGED=true CATTLE_NODE_NAME=cattle-cluster-agent-6f6584c74c-9krm5 CATTLE_SERVER=https://host.rancher.com CATTLE_SERVER_VERSION=v2.6.1

INFO: Using resolv.conf: nameserver 192.168.18.1 nameserver fe80::1%eth2 search rancher.com

ERROR: https://kubeserver1.rancher.com/ping is not accessible (Could not resolve host: host.rancher.com)

Update the Deployment as below:

  • kubectl edit deployment cattle-cluster-agent -n cattle-system
    (Please make sure the deployment update has proper indentation.)

    spec:
      hostAliases:
      - ip: "127.0.0.1"
        hostnames:
        - "kubeserver1"
        - "kubeserver1.rancher.com"
      - ip: "192.168.18.39"
        hostnames:
        - "kubeserver1.rancher.com"
        - "kubeserver1"
      containers:
      - env:
        - name: CATTLE_FEATURES
          value: embedded-cluster-api=false,fleet=false,monitoringv1=false,multi-cluster-management=false,multi-cluster-management-agent=true,provisioningv2=false,rke2=false
        - name: CATTLE_IS_RKE
          value: "true"
        - name: CATTLE_SERVER
          value: https://kubeserver1.rancher.com
        - name: CATTLE_CA_CHECKSUM
          value: b07578fffc861bbbcce0d6180e72dcbbe61db9b01d98844271d45ec19676a40a
        - name: CATTLE_CLUSTER
          value: "true"
        - name: CATTLE_K8S_MANAGED
          value: "true"
        - name: CATTLE_CLUSTER_REGISTRY
        - name: CATTLE_SERVER_VERSION
          value: v2.6.1
        image: rancher/rancher-agent:v2.6.1
        imagePullPolicy: IfNotPresent
        name: cluster-register
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /cattle-credentials
          name: cattle-credentials
          readOnly: true
      dnsPolicy: ClusterFirst

  • kubectl apply -f /tmp/kubectl-edit-1556071999.yaml

  • [root@kubeserver1 ~]# kubectl get pods --all-namespaces
    NAMESPACE NAME READY STATUS RESTARTS AGE
    cattle-fleet-system fleet-agent-8c9786db5-pc7gt 1/1 Running 0 101m
    cattle-system cattle-cluster-agent-5f84d5bbfd-t6zfj 1/1 Running 0 48m
    cattle-system cattle-node-agent-nhph7 1/1 Running 0 101m
    cattle-system kube-api-auth-zwskv 1/1 Running 0 101m
    ingress-nginx ingress-nginx-admission-create-nf69q 0/1 Completed 0 101m
    ingress-nginx ingress-nginx-admission-patch-n6jst 0/1 Completed 0 101m
    ingress-nginx nginx-ingress-controller-62vdg 1/1 Running 0 101m
    kube-system calico-kube-controllers-6c977d77bc-twg5q 1/1 Running 0 101m
    kube-system canal-zmbsd 2/2 Running 0 101m
    kube-system coredns-685d6d555d-c5bgp 1/1 Running 0 101m
    kube-system coredns-autoscaler-57fd5c9bd5-th2bb 1/1 Running 0 101m
    kube-system metrics-server-7bf4b68b78-89kgl 1/1 Running 0 101m
    kube-system rke-coredns-addon-deploy-job-pgcmc 0/1 Completed 0 101m
    kube-system rke-ingress-controller-deploy-job-kj9mm 0/1 Completed 0 101m
    kube-system rke-metrics-addon-deploy-job-9lm2p 0/1 Completed 0 101m
    kube-system rke-network-plugin-deploy-job-67h2d 0/1 Completed 0 102m

Cluster Explorer will be up once the agent is up!

Same issue here. After replacing the control plane nodes of an existing cluster, it became unavailable in Cluster Explorer.
The cattle-system/cattle-cluster-agent can no longer resolve the name of the Rancher host.

Same problem for me. I just took down my working 2.5 cluster and created a new cluster with 2.6.2. The system says all the nodes are active, but I am unable to use the Explore function.

Is this a bug in Rancher 2.6, or is everyone suddenly doing something wrong? Is anyone from Rancher investigating?

Same issue: after upgrading to the new Helm chart version, this happened. We deleted and re-imported the clusters; it worked for a few days and then it happened again. We don’t see any errors anywhere. Is anyone investigating this?

@josesolis2201 @Simon_Carr Is your cattle-cluster-agent running? Please check its status and pod logs.

@Arivoli_Murugan

INFO: Using resolv.conf: nameserver 10.43.0.10 search cattle-system.svc.cluster.local svc.cluster.local cluster.local redacted.net options ndots:5
INFO: https://rancher.redacted.com/ping is accessible
INFO: rancher.redacted.com resolves to 10.50.94.128
INFO: Value from https://rancher.redacted.com/v3/settings/cacerts is an x509 certificate
time="2021-11-10T15:56:25Z" level=info msg="Listening on /tmp/log.sock"
time="2021-11-10T15:56:25Z" level=info msg="Rancher agent version v2.6.2 is starting"
time="2021-11-10T15:56:25Z" level=info msg="Connecting to wss://rancher.redacted.com/v3/connect/register with token starting with q76dtp85nzjpc6grqvhj5kfmp2k"
time="2021-11-10T15:56:25Z" level=info msg="Connecting to proxy" url="wss://rancher.redacted.com/v3/connect/register"

These are the logs from the cattle-cluster-agent pod in the cluster, and yet I can’t explore it in the UI.
I can see the machines that are part of the cluster, but the Explore button appears greyed out. This started happening only after the recent upgrade to 2.6.

To dive deeper into our infrastructure and the issue at hand: we are importing RKE clusters into Rancher, not provisioning them with Rancher. As soon as I delete the cluster and re-register it, the cluster is fine and I can explore it; it is only after some time (no pattern here) that I can no longer explore it. The cattle pods never fail or hint at any errors.

I did notice that deleting the cluster in Rancher does not correctly schedule the Rancher resources for deletion. I have no idea what is going on, but I am intrigued that not many people seem to have this issue.

I’ve imported the clusters again and they are working for the moment.
The issue is that after some time they just stop working. They don’t appear unhealthy in any way; the Explore button just goes gray.

Same issue here, since I upgraded Rancher to v2.6. It was working fine on Rancher v2.5.

I don’t provision clusters with Rancher, I only import them, and I noticed that when Rancher generates the k8s manifests to import the desired cluster and we run that manifest against the cluster more than once, it causes this issue. Although we can continue accessing the cluster through kubectl, I cannot explore it.

This is not good when we want to put those manifests in a pipeline or into ArgoCD, for example, because they will apply the manifests over and over; since they are declarative, that shouldn’t cause any issue.

Of course, recreating the cluster in Rancher fixes the issue, but I cannot keep recreating it all the time.

Logs coming from the cattle-agent on the imported cluster:

INFO: Environment: CATTLE_ADDRESS=10.27.155.90 CATTLE_CA_CHECKSUM= CATTLE_CLUSTER=true CATTLE_CLUSTER_AGENT_PORT=tcp://172.20.34.246:80 CATTLE_CLUSTER_AGENT_PORT_443_TCP=tcp://172.20.34.246:443 CATTLE_CLUSTER_AGENT_PORT_443_TCP_ADDR=172.20.34.246 CATTLE_CLUSTER_AGENT_PORT_443_TCP_PORT=443 CATTLE_CLUSTER_AGENT_PORT_443_TCP_PROTO=tcp CATTLE_CLUSTER_AGENT_PORT_80_TCP=tcp://172.20.34.246:80 CATTLE_CLUSTER_AGENT_PORT_80_TCP_ADDR=172.20.34.246 CATTLE_CLUSTER_AGENT_PORT_80_TCP_PORT=80 CATTLE_CLUSTER_AGENT_PORT_80_TCP_PROTO=tcp CATTLE_CLUSTER_AGENT_SERVICE_HOST=172.20.34.246 CATTLE_CLUSTER_AGENT_SERVICE_PORT=80 CATTLE_CLUSTER_AGENT_SERVICE_PORT_HTTP=80 CATTLE_CLUSTER_AGENT_SERVICE_PORT_HTTPS_INTERNAL=443 CATTLE_CLUSTER_REGISTRY= CATTLE_INTERNAL_ADDRESS= CATTLE_IS_RKE=false CATTLE_K8S_MANAGED=true CATTLE_NODE_NAME=cattle-cluster-agent-667bb4f9fd-r9pdr CATTLE_SERVER=https://new-rancher.redacted CATTLE_SERVER_VERSION=v2.6.2
INFO: Using resolv.conf: nameserver 172.20.0.10 search cattle-system.svc.cluster.local svc.cluster.local cluster.local ec2.internal options ndots:5
INFO: https://new-rancher.redacted/ping is accessible
INFO: new-rancher.redacted resolves to redacted
time="2021-12-10T22:03:46Z" level=info msg="Listening on /tmp/log.sock"
time="2021-12-10T22:03:46Z" level=info msg="Rancher agent version v2.6.2 is starting"
time="2021-12-10T22:03:46Z" level=info msg="Connecting to wss://new-rancher.redacted/v3/connect/register with token starting with redacted"
time="2021-12-10T22:03:46Z" level=info msg="Connecting to proxy" url="wss://new-rancher.redacted/v3/connect/register"

I have the same problem because the fleet-agent doesn’t recognize a certificate signed by a corporate CA. When I created the cluster, I started the rancher-agent with SSL_CERT_DIR pointing at the corporate CA certificate, and everything works fine except fleet-agent. It seems to be a bug.