Cluster agent in endless crash loop


Any idea how to investigate the root cause of the following error?
It’s a user cluster registered as a “Custom” RKE cluster in Rancher HA 2.3.5.
The nodes are created externally; no provider such as AWS is used.

After this message nothing else follows and the cluster agent crashes.


Rancher HA cluster: v2.3.5
Nginx as external load balancer (in the same network)
User cluster: rancher/rancher-agent:v2.3.5
Both clusters: Kubernetes v1.16.6-rancher1-2
Everything is in the same network

Any help appreciated.

Kind regards,

I installed the user cluster again, this time without changing the CIDRs… now all pods in subnet and services in…
The error remains the same, only with a different IP:

level=fatal msg="Get dial tcp i/o timeout"

I ran the DNS checks from here:
There are no errors in CoreDNS, and the upstream DNS is also reachable from all nodes (in this example, only one control plane node and one worker).

cattle-node-agents work fine.
Can’t find the reason. :grimacing:
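
Since DNS resolves and the failure is a TCP dial timeout, one further check is to attempt a plain TCP connection to the address from the log line, run from inside a pod on the affected cluster. The following is a minimal Python sketch under that assumption; `tcp_reachable` is a hypothetical helper name, and host and port have to be filled in by hand because the address is not shown in the log line above.

```python
import socket


def tcp_reachable(host, port, timeout=5.0):
    """Try a plain TCP connect and report how it ended.

    Returns (True, "connected") on success, (False, "timeout") if the
    connect did not complete within `timeout` seconds, and
    (False, "error: ...") for any other socket error (refused, no route, ...).
    """
    try:
        # create_connection resolves the host and applies the timeout
        # to the connect attempt itself.
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except socket.timeout:
        return False, "timeout"
    except OSError as exc:
        return False, f"error: {exc}"
```

If the call returns `(False, "timeout")` from inside a pod but succeeds when run directly on the node, that would suggest the problem sits in the pod network rather than in DNS; this is only a hint, not a diagnosis.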

OK, I replaced the Weave CNI with Canal and it works… but I would prefer Weave because of its encryption.

This fails as mentioned above:

    network:
      options:
        flannel_backend_type: vxlan
      plugin: weave
      weave_network_provider:
        password: ...

Canal works right away:

    network:
      options:
        flannel_backend_type: vxlan
      plugin: canal

It’s a very basic setup with one control plane node and one worker, used to evaluate the whole automated setup.

Kind regards,