Cannot create a cluster

Hi,

I successfully installed Rancher (2.5.2) on an Ubuntu 20.04 VM (4 cores, 16 GB RAM) but failed to create a cluster after several attempts. I am not sure how to troubleshoot. This is what I am doing:

  • Click the button to Add cluster
  • Define a name for the cluster, then click Next without changing any default settings
  • Select all roles: etcd, Control Plane, and Worker. Copy the generated command and execute it on the same VM.
  • Wait for several hours watching never-ending provisioning. Based on what is displayed in the web UI, the process appears to enter an infinite loop of creating and removing containers. This is also supported by repeated docker ps -a outputs on the console/ssh.
  • It may be worth mentioning that I am behind a router/NAT, if that makes any difference. I forwarded ports 80 and 443 to the VM at some point, which seemed to help the provisioning progress slightly further, but things entered the infinite loop again after about 30-45 minutes.

All of these steps were reproduced with Rancher 2.4.9. I also tried them on CentOS 7, CentOS 8, and Ubuntu 18.04, with the same results and no success. I am clearly doing something wrong, but I'm not sure where to start troubleshooting beyond what I have already done. I would appreciate any pointers.

If you are running on the same machine, did you follow https://rancher.com/docs/rancher/v2.x/en/installation/other-installation-methods/single-node-docker/advanced/#running-rancher-rancher-and-rancher-rancher-agent-on-the-same-node? I also don’t see how the router/NAT situation is relevant if you run everything on the same machine, as that shouldn’t matter (unless you are using addresses/DNS that leave the local network and traverse the router/NAT).

The container running the rancher/rancher:v2.5.2 image logs the provisioning process. Can you share that log? It will indicate what the issue is. The output of docker ps -a from the node also helps to see what is and is not being created.
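For reference, something like this captures the full server log (assuming the standard single-node Docker install from this thread; the filter matches on the image tag you mentioned):

```shell
# Find the ID of the Rancher server container by its image tag
# (rancher/rancher:v2.5.2 is the version used in this thread).
RANCHER_ID=$(docker ps --filter ancestor=rancher/rancher:v2.5.2 --format '{{.ID}}')

# Dump the complete log, stdout and stderr combined, to a file so the
# whole provisioning history is captured rather than just the tail.
docker logs "$RANCHER_ID" > rancher-server.log 2>&1
```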

Thank you so much for your reply.

Yes, I did. I tried several combinations such as -p 8080:80 -p 8443:443 as described in that link, but also -p 9090:80 -p 9091:443 options with no change in behavior.

Would you be so kind as to point me to those logs? Please forgive my ignorance on this.

Yeah, I was doing that. It shows port-listener containers spawning and being removed one after another from roughly the 30-minute mark onward. The etcd container is usually created without issues and stays alive.
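In case it helps, this is roughly how I was recording the churn (a simple polling loop; the 30-second interval and the log filename are arbitrary choices of mine):

```shell
# Append a timestamped container listing every 30 seconds so the
# create/remove loop is visible in one file instead of in repeated
# manual docker ps -a checks. Stop with Ctrl-C.
while true; do
  date
  docker ps -a --format '{{.Names}}\t{{.Status}}'
  sleep 30
done >> container-churn.log
```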

Today, I tried creating a cluster of two nodes instead of one. Both are Ubuntu 20.04 servers on KVM/QEMU VMs with 2 vCPUs each; 16 GB RAM is allocated to the VM that runs the rancher container and 8 GB to the VM that runs rancher-agent. Rancher v2.5.2 was used. I was able to get the tail of the rancher container’s logs. Please see the link below:

pastebin link

Part of the problem is that the process does not terminate, it keeps trying and failing forever; therefore capturing the useful part of the log seems difficult for me.

In the end the cluster failed to be created (although the Rancher web UI still reported provisioning). The web UI also reported that etcd was unhealthy and that its logs should be checked on each machine. I did that and found the message below repeated several times:


2020-11-25 23:44:54.365304 I | embed: rejected connection from "192.168.1.137:52492" (error "tls: failed to verify client's certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"kube-ca\")", ServerName "")
2020-11-25 23:44:59.379546 I | embed: rejected connection from "192.168.1.137:52494" (error "tls: failed to verify client's certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"kube-ca\")", ServerName "")
2020-11-25 23:45:04.393720 I | embed: rejected connection from "192.168.1.137:52496" (error "tls: failed to verify client's certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"kube-ca\")", ServerName "")
2020-11-25 23:45:09.404556 I | embed: rejected connection from "192.168.1.137:52498" (error "tls: failed to verify client's certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"kube-ca\")", ServerName "")

I deleted the provisioning cluster, did some reading about this error message, and executed sudo rm -rf * in the /etc/kubernetes directory. It is very strange, because both VMs were brand new, with never-before-used host names. If a certificate collision is happening here, it is not due to previous installations or attempts, because there were none.
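For reference, the cleanup I ran between attempts was roughly this (a minimal sketch; the Rancher docs list a fuller set of state paths to remove, these are just the main ones):

```shell
# Wipe the main Kubernetes/etcd state directories so stale
# certificates cannot survive into the next provisioning attempt.
for d in /etc/kubernetes /var/lib/etcd /var/lib/rancher /var/lib/cni; do
  sudo rm -rf "$d"
done

# Also remove any leftover containers from the failed attempt;
# ignore the error if there are none.
docker rm -f $(docker ps -aq) 2>/dev/null || true
```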

I attempted again and saw an error message about the control plane not being able to be created, which quickly disappeared. The tail of the rancher-agent logs now shows this:

INFO: Arguments: --server https://192.168.1.136 --token REDACTED --ca-checksum aa03826fc55fbd9cd74489383eeb3f10d0ba53b6556a81af0c4d63c4b034c2c2 --etcd --controlplane --worker
INFO: Environment: CATTLE_ADDRESS=192.168.1.137 CATTLE_INTERNAL_ADDRESS= CATTLE_NODE_NAME=clustervm CATTLE_ROLE=,etcd,worker,controlplane CATTLE_SERVER=https://192.168.1.136 CATTLE_TOKEN=REDACTED
INFO: Using resolv.conf: nameserver 192.168.1.133 nameserver 192.168.1.134 nameserver 2607:fdc8:c::2 nameserver 2607:fdc8:c::3
INFO: https://192.168.1.136/ping is accessible
INFO: Value from https://192.168.1.136/v3/settings/cacerts is an x509 certificate
time="2020-11-26T00:06:11Z" level=info msg="Rancher agent version v2.5.2 is starting"
time="2020-11-26T00:06:11Z" level=info msg="Option customConfig=map[address:192.168.1.137 internalAddress: label:map[] roles:[etcd worker controlplane] taints:[]]"
time="2020-11-26T00:06:11Z" level=info msg="Option etcd=true"
time="2020-11-26T00:06:11Z" level=info msg="Option controlPlane=true"
time="2020-11-26T00:06:11Z" level=info msg="Option worker=true"
time="2020-11-26T00:06:11Z" level=info msg="Option requestedHostname=clustervm"
time="2020-11-26T00:06:11Z" level=info msg="Listening on /tmp/log.sock"
time="2020-11-26T00:06:11Z" level=info msg="Connecting to wss://192.168.1.136/v3/connect/register with token f56cp29s6x2nx65tl5x5v2486qp4z44t5mjcqr8cpdbst6qbbfvr7g"
time="2020-11-26T00:06:11Z" level=info msg="Connecting to proxy" url="wss://192.168.1.136/v3/connect/register"
time="2020-11-26T00:06:12Z" level=info msg="Waiting for node to register. Either cluster is not ready for registering or etcd and controlplane node have to be registered first"
time="2020-11-26T00:06:14Z" level=info msg="Starting plan monitor, checking every 15 seconds"

Not sure where to go from here…

The tail won’t show the failing provisioning process; we’ll need the complete logs, which include it, to see what is being reported. Based on this logging, it seems that your disk IO is very slow, and etcd (both the one embedded in Rancher and the one for the cluster) needs decent disk IO (an SSD) to perform. How to check/investigate this is described in https://github.com/rancher/rke/issues/2295
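As a quick check, the usual etcd-style disk benchmark with fio looks like this (assuming fio is installed; the target directory is an assumption, point it at the disk that backs etcd's data):

```shell
# Benchmark write latency with an fdatasync after every write, which
# mimics how etcd commits its write-ahead log. Small block size and
# a ~22 MiB total give thousands of samples for the percentiles.
fio --rw=write --ioengine=sync --fdatasync=1 \
    --directory=/var/lib/etcd \
    --size=22m --bs=2300 \
    --name=etcd-disk-check
```

In the output, look at the fdatasync latency percentiles; etcd's guidance is that the 99th percentile should stay below roughly 10 ms, which a shared HDD will typically fail.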


Thank you @superseb for figuring out the problem. I truly appreciate it. Both VMs are physically on the same HDD (not an SSD), and I don’t need benchmarks to confirm that it doesn’t have top-notch IO. Do the worker nodes have similar IO requirements, or would a similar HDD work for them?

No, there won’t be anything on a worker that requires high IOPS. More IOPS will obviously help, but a worker is mainly a kubelet that executes whatever it receives, so the IOPS will only affect the pulling and starting of images/containers.
