Today, I tried creating a cluster of two nodes instead of one. Both are Ubuntu 20.04 servers running as KVM/QEMU VMs with 2 processors each; 16 GB of RAM is allocated to the VM that runs the Rancher container and 8 GB to the VM that runs the rancher-agent. This is with Rancher v2.5.2. I was able to capture the tail of the Rancher container’s logs. Please see the link below:
pastebin link
Part of the problem is that the process never terminates; it keeps retrying and failing forever, so capturing the useful part of the log is difficult for me.
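To get more than just the tail next time, I figure I can dump the whole server log to a file and search it afterwards. A rough sketch (the ancestor filter assumes the standard rancher/rancher:v2.5.2 image from the single-node Docker install; adjust to however the container was actually started):

# Find the Rancher server container ID (assumes the stock rancher/rancher image)
RANCHER_ID=$(docker ps --filter ancestor=rancher/rancher:v2.5.2 --format '{{.ID}}')

# Dump everything logged so far, with timestamps, into a file for later searching
docker logs -t "$RANCHER_ID" > rancher-server.log 2>&1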
In the end the cluster failed to come up (although the Rancher web UI still reported it as provisioning). The web UI also reported that etcd was not healthy and that its logs should be checked on each machine. I did that and found the message below repeating several times:
…
2020-11-25 23:44:54.365304 I | embed: rejected connection from "192.168.1.137:52492" (error "tls: failed to verify client's certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kube-ca")", ServerName "")
2020-11-25 23:44:59.379546 I | embed: rejected connection from "192.168.1.137:52494" (error "tls: failed to verify client's certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kube-ca")", ServerName "")
2020-11-25 23:45:04.393720 I | embed: rejected connection from "192.168.1.137:52496" (error "tls: failed to verify client's certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kube-ca")", ServerName "")
2020-11-25 23:45:09.404556 I | embed: rejected connection from "192.168.1.137:52498" (error "tls: failed to verify client's certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kube-ca")", ServerName "")
I deleted the provisioning cluster, did some reading about this error message, and executed sudo rm -rf *
in the /etc/kubernetes directory. It is very strange, because both VMs were brand new, with never-before-used host names. If a certificate collision is happening here, it is not happening because of previous installations or attempts, since there were none.
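In case wiping /etc/kubernetes alone was not enough, next time I plan to do the fuller node cleanup described in Rancher's node cleanup docs before re-registering (on the agent VM only, not the VM running the Rancher server container; the directory list below is from memory, so double-check it against the docs):

# Remove the containers and volumes the agent created (etcd data lives in a volume)
docker rm -f $(docker ps -qa)
docker volume rm $(docker volume ls -q)

# Remove leftover state directories so old certificates cannot be reused
sudo rm -rf /etc/kubernetes /etc/cni /opt/cni /opt/rke \
            /var/lib/etcd /var/lib/cni /var/lib/kubelet /var/lib/rancher \
            /var/lib/calico /var/run/calico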
I attempted again and briefly saw an error message about the control-plane failing to be created, which quickly disappeared. The tail of the rancher-agent logs now shows this:
INFO: Arguments: --server https://192.168.1.136 --token REDACTED --ca-checksum aa03826fc55fbd9cd74489383eeb3f10d0ba53b6556a81af0c4d63c4b034c2c2 --etcd --controlplane --worker
INFO: Environment: CATTLE_ADDRESS=192.168.1.137 CATTLE_INTERNAL_ADDRESS= CATTLE_NODE_NAME=clustervm CATTLE_ROLE=,etcd,worker,controlplane CATTLE_SERVER=https://192.168.1.136 CATTLE_TOKEN=REDACTED
INFO: Using resolv.conf: nameserver 192.168.1.133 nameserver 192.168.1.134 nameserver 2607:fdc8:c::2 nameserver 2607:fdc8:c::3
INFO: https://192.168.1.136/ping is accessible
INFO: Value from https://192.168.1.136/v3/settings/cacerts is an x509 certificate
time="2020-11-26T00:06:11Z" level=info msg="Rancher agent version v2.5.2 is starting"
time="2020-11-26T00:06:11Z" level=info msg="Option customConfig=map[address:192.168.1.137 internalAddress: label:map[] roles:[etcd worker controlplane] taints:[]]"
time="2020-11-26T00:06:11Z" level=info msg="Option etcd=true"
time="2020-11-26T00:06:11Z" level=info msg="Option controlPlane=true"
time="2020-11-26T00:06:11Z" level=info msg="Option worker=true"
time="2020-11-26T00:06:11Z" level=info msg="Option requestedHostname=clustervm"
time="2020-11-26T00:06:11Z" level=info msg="Listening on /tmp/log.sock"
time="2020-11-26T00:06:11Z" level=info msg="Connecting to wss://192.168.1.136/v3/connect/register with token f56cp29s6x2nx65tl5x5v2486qp4z44t5mjcqr8cpdbst6qbbfvr7g"
time="2020-11-26T00:06:11Z" level=info msg="Connecting to proxy" url="wss://192.168.1.136/v3/connect/register"
time="2020-11-26T00:06:12Z" level=info msg="Waiting for node to register. Either cluster is not ready for registering or etcd and controlplane node have to be registered first"
time="2020-11-26T00:06:14Z" level=info msg="Starting plan monitor, checking every 15 seconds"
Not sure where to go from here…