I'm new to Rancher. I deployed the Docker image, and when I opened the UI I saw a cluster named local already created.
Ignoring that, I created a new cluster using the option "Use existing nodes and create a cluster using RKE", and then added my Ubuntu VM with all three roles (etcd, control plane, worker).
After that, the cluster shows as Active, but unlike the local cluster, the Explore option is disabled for the cluster I created. As a result, there is not much I can do, such as looking at Nodes or Pods, or deploying anything via the UI.
This is the view (see the cluster named k81-lemieux):
There is no active cluster agent connection. You can check the logs of the cluster agent pod on the node(s) to see why it cannot connect; once that is fixed, you should be able to explore the cluster.
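If you have kubectl access to the downstream cluster, something like the following should surface the agent's errors (a sketch assuming the default cattle-system namespace, deployment name, and app label):

# Check whether the cluster agent pod is running at all
kubectl -n cattle-system get pods -l app=cattle-cluster-agent
# Tail its logs to see why it cannot reach the Rancher server
kubectl -n cattle-system logs deploy/cattle-cluster-agent --tail=100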
I'm having the same issue. It makes the Explorer unavailable for the cluster on the Rancher server.
Rancher version: 2.6.0 (RKE1)
Kubernetes version: v1.20.11-rancher1-2
Downstream cluster: Custom (via docker) on new Ubuntu 20.04 QEMU/KVM VM nodes.
All 3 of my masters and all 3 of my workers can resolve the domain just fine. The CoreDNS pods just show some i/o timeout errors:
linux/amd64, go1.15.3, 054c9ae
[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/ready: Still waiting on: "kubernetes"
I0924 16:47:56.859902 1 trace.go:205] Trace[1427131847]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156 (24-Sep-2021 16:47:26.859) (total time: 30000ms):
Trace[1427131847]: [30.000324176s] [30.000324176s] END
I0924 16:47:56.859923 1 trace.go:205] Trace[911902081]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156 (24-Sep-2021 16:47:26.859) (total time: 30000ms):
Trace[911902081]: [30.000158123s] [30.000158123s] END
E0924 16:47:56.859925 1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156: Failed to watch *v1.Namespace: failed to list *v1.Namespace: Get "https://10.43.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0": dial tcp 10.43.0.1:443: i/o timeout
E0924 16:47:56.859933 1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156: Failed to watch *v1.Service: failed to list *v1.Service: Get "https://10.43.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 10.43.0.1:443: i/o timeout
I0924 16:47:56.859937 1 trace.go:205] Trace[939984059]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156 (24-Sep-2021 16:47:26.859) (total time: 30000ms):
Trace[939984059]: [30.000156648s] [30.000156648s] END
E0924 16:47:56.859941 1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.2/tools/cache/reflector.go:156: Failed to watch *v1.Endpoints: failed to list *v1.Endpoints: Get "https://10.43.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0": dial tcp 10.43.0.1:443: i/o timeout
Does CoreDNS forward its DNS requests to the DNS servers that the host has configured? It would appear so judging by the log output, but it's not behaving that way.
I don't have any wildcards involved in my DNS setup yet; however, I am doing split-horizon DNS, where rancher.mydomain.com resolves to the public- or private-facing IP depending on which side of the network you're on. It resolves correctly when tested with nslookup and curl from the host.
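One way to confirm what CoreDNS actually forwards to (a sketch that assumes the stock coredns configmap RKE deploys in kube-system) is to dump the Corefile and check the forward line:

kubectl -n kube-system get configmap coredns -o jsonpath='{.data.Corefile}'
# The default Corefile contains "forward . /etc/resolv.conf", meaning CoreDNS
# forwards non-cluster names to whatever resolv.conf the CoreDNS pod sees,
# which normally points at the node's upstream DNS servers.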
I'm new to Rancher, so please excuse me if I'm asking obvious questions; I have tried searching both these forums and Google.
What's the name of the image the cluster agent is spawned from? Same question for the DNS service. The only name that makes me think of either of these is rancher-agent:v2.6.0, a container that exits gracefully (exit code 0) and whose log doesn't indicate any problems. It is the one used for originally registering my Kubernetes nodes with Rancher, but was it supposed to keep running?
Edit:
I think I found it: the container was spawned from the same image but is called k8s_cluster-register_cattle-cluster-agent-something, and it did indeed show signs of not being able to resolve the name of the Rancher server. So I suspect I can use the same workaround as @TheRealAlexV while I try to figure out the root cause of this failure.
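For anyone else hunting for it, this is roughly how I located it on the node (the container name suffix will differ on your node):

# List the cluster-register container(s) the kubelet started on this node
docker ps -a --filter name=cluster-register --format '{{.ID}} {{.Names}} {{.Status}}'
# Read its logs to see the DNS resolution errors
docker logs --tail 100 <container-id>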
Same issue here. After replacing the control plane nodes of an existing cluster, it became unavailable in Cluster Explorer.
cattle-system/cattle-cluster-agent cannot resolve the name of the Rancher host anymore.
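You can see the failure from inside the agent pod with something like this (a sketch; it assumes curl is present in the agent image, which its own /ping startup check suggests, and <rancher-hostname> is a placeholder):

# Run the same /ping check the agent entrypoint does, from inside the pod
kubectl -n cattle-system exec deploy/cattle-cluster-agent -- curl -sv https://<rancher-hostname>/ping
# A "Could not resolve host" error here confirms the problem is pod-level DNS,
# even when the node itself resolves the name fine.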
Same problem for me. Just took down my working 2.5 cluster and created a new cluster with 2.6.2. The system says all the nodes are active, but I am unable to use the explore function.
Is this a bug in Rancher 2.6, or is everyone suddenly doing something wrong? Is anyone from Rancher investigating?
Same issue; this happened after upgrading to the new Helm chart version. We deleted and re-imported the clusters, and it worked for a few days before it happened again. We don't see any errors anywhere. Is anyone investigating this?
INFO: Using resolv.conf: nameserver 10.43.0.10 search cattle-system.svc.cluster.local svc.cluster.local cluster.local redacted.net options ndots:5
INFO: https://rancher.redacted.com/ping is accessible
INFO: rancher.redacted.com resolves to 10.50.94.128
INFO: Value from https://rancher.redacted.com/v3/settings/cacerts is an x509 certificate
time="2021-11-10T15:56:25Z" level=info msg="Listening on /tmp/log.sock"
time="2021-11-10T15:56:25Z" level=info msg="Rancher agent version v2.6.2 is starting"
time="2021-11-10T15:56:25Z" level=info msg="Connecting to wss://rancher.redacted.com/v3/connect/register with token starting with q76dtp85nzjpc6grqvhj5kfmp2k"
time="2021-11-10T15:56:25Z" level=info msg="Connecting to proxy" url="wss://rancher.redacted.com/v3/connect/register"
These are the logs from the cattle-cluster-agent pod in the cluster, and yet I can't explore it in the UI.
I can see the machines that are part of the cluster, but the Explore button appears greyed out. This started happening only after the recent upgrade to 2.6.
To dive deeper into our infrastructure and the issue at hand: we are importing RKE clusters into Rancher, not provisioning them with Rancher. As soon as I delete the cluster and re-register it, the cluster is fine and I can explore it; it is only after some time (no pattern here) that I can no longer explore it. The cattle pods never fail or hint at any errors.
I did notice that deleting the cluster in Rancher does not correctly schedule the Rancher resources for deletion. I have no idea what is going on, but I am intrigued that not many people seem to have this issue.
I've imported the clusters again and they are working for the moment.
The issue is that after some time they just stop working. They don't appear unhealthy in any way; the Explore button just goes gray.
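One thing worth checking on the Rancher (local) cluster side when this happens is the downstream cluster object itself, since the UI shows nothing unhealthy (a sketch; it assumes kubectl access to the local cluster where Rancher runs and the management.cattle.io CRDs):

# List the downstream clusters Rancher knows about (IDs look like c-xxxxx)
kubectl get clusters.management.cattle.io
# Inspect the affected cluster and look for any status condition that has gone False
kubectl get clusters.management.cattle.io <cluster-id> -o yaml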
EDIT:
Like this actually:
Same issue here, since I upgraded Rancher to v2.6. It was working fine on Rancher v2.5.
I don't provision clusters with Rancher, I only import them, and I noticed that when Rancher generates the k8s manifest to import the desired cluster and we run that manifest against the cluster more than once, it causes this issue. Although we can continue accessing the cluster through kubectl, I cannot explore it.
This is not good when we want to put those manifests in a pipeline or into ArgoCD, for example, because they will apply the manifests over and over; since they are declarative, that shouldn't cause any issue.
Of course, recreating the cluster in Rancher fixes the issue, but I cannot keep recreating it all the time.
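For reference, what our pipeline re-runs is essentially the registration command Rancher shows when you import a cluster (a sketch; the server URL below is a placeholder and the exact manifest path comes from the registration screen in the Rancher UI):

# Apply the import/registration manifest Rancher generated for this cluster
curl -sfL https://new-rancher.example.com/v3/import/<registration-token>.yaml | kubectl apply -f -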
Logs coming from the cattle-agent on the imported cluster:
INFO: Environment: CATTLE_ADDRESS=10.27.155.90 CATTLE_CA_CHECKSUM= CATTLE_CLUSTER=true CATTLE_CLUSTER_AGENT_PORT=tcp://172.20.34.246:80 CATTLE_CLUSTER_AGENT_PORT_443_TCP=tcp://172.20.34.246:443 CATTLE_CLUSTER_AGENT_PORT_443_TCP_ADDR=172.20.34.246 CATTLE_CLUSTER_AGENT_PORT_443_TCP_PORT=443 CATTLE_CLUSTER_AGENT_PORT_443_TCP_PROTO=tcp CATTLE_CLUSTER_AGENT_PORT_80_TCP=tcp://172.20.34.246:80 CATTLE_CLUSTER_AGENT_PORT_80_TCP_ADDR=172.20.34.246 CATTLE_CLUSTER_AGENT_PORT_80_TCP_PORT=80 CATTLE_CLUSTER_AGENT_PORT_80_TCP_PROTO=tcp CATTLE_CLUSTER_AGENT_SERVICE_HOST=172.20.34.246 CATTLE_CLUSTER_AGENT_SERVICE_PORT=80 CATTLE_CLUSTER_AGENT_SERVICE_PORT_HTTP=80 CATTLE_CLUSTER_AGENT_SERVICE_PORT_HTTPS_INTERNAL=443 CATTLE_CLUSTER_REGISTRY= CATTLE_INTERNAL_ADDRESS= CATTLE_IS_RKE=false CATTLE_K8S_MANAGED=true CATTLE_NODE_NAME=cattle-cluster-agent-667bb4f9fd-r9pdr CATTLE_SERVER=https://new-rancher.redacted CATTLE_SERVER_VERSION=v2.6.2
INFO: Using resolv.conf: nameserver 172.20.0.10 search cattle-system.svc.cluster.local svc.cluster.local cluster.local ec2.internal options ndots:5
INFO: https://new-rancher.redacted/ping is accessible
INFO: new-rancher.redacted resolves to redacted
time="2021-12-10T22:03:46Z" level=info msg="Listening on /tmp/log.sock"
time="2021-12-10T22:03:46Z" level=info msg="Rancher agent version v2.6.2 is starting"
time="2021-12-10T22:03:46Z" level=info msg="Connecting to wss://new-rancher.redacted/v3/connect/register with token starting with redacted"
time="2021-12-10T22:03:46Z" level=info msg="Connecting to proxy" url="wss://new-rancher.redacted/v3/connect/register"
I have the same problem because fleet-agent doesn't recognize the certificate signed by our corporate CA. When I created the cluster, I started rancher-agent with SSL_CERT_DIR pointing at the corporate CA certificate, and everything works fine except fleet-agent. It seems like a bug.
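For context, the node registration looked roughly like this (a sketch based on the standard custom-cluster command; the CA path, server URL, token, and checksum are placeholders for our real values):

# Mount the corporate CA into the agent and point SSL_CERT_DIR at it so the agent trusts it
sudo docker run -d --privileged --restart=unless-stopped --net=host \
  -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run \
  -v /opt/corp-ca:/etc/corp-ca:ro \
  -e SSL_CERT_DIR=/etc/corp-ca \
  rancher/rancher-agent:v2.6.2 \
  --server https://rancher.example.com --token <token> --ca-checksum <checksum> \
  --etcd --controlplane --worker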