New cluster create, stuck on [etcd] Building up etcd plane, cert issues

Hi folks, I’m brand new to Rancher and trying it in my homelab. The setup is as follows:
4x VMs running Alpine Linux, hostnames rancher1–rancher4 (virt host is Proxmox).
I installed Docker and ran the following to create the mgmt/cluster:
docker run -d --restart=unless-stopped -p 80:80 -p 443:443 --privileged rancher/rancher

That worked; I went into the GUI and all looked fine.

I then used the GUI to create a new custom cluster, selected all roles (etcd, controlplane, worker), and got the nice long docker command generated for me in the GUI:

docker run -d --privileged --restart=unless-stopped --net=host \
  -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run \
  rancher/rancher-agent:v2.6.1 --server https://192.168.2.116 \
  --token x4sfmrsflnkh4lsdrk4pb7s9mk9zh6sq44k8g9sjs5fhbmlrnhc292 \
  --ca-checksum eb6ee92f0ae26c031544403aa91beaccbe4440bca72ab157e85f50a3ea9ff49c \
  --etcd --controlplane --worker

I ran this command on all 4 rancher hosts, and it got stuck at the "[etcd] Building up etcd plane" status. The etcd log message is:
2021-10-13 18:56:56.937082 I | embed: rejected connection from "192.168.2.118:56496" (error "remote error: tls: bad certificate", ServerName "")

At this point I am stuck. Should I wait for it to fail, wipe all 4 VMs and reinstall everything from scratch, or are there fix-it or clean-up steps I can try?

All feedback appreciated

k

Still not able to get a functioning cluster created; it just sits at:

2021-10-14 16:24:59.062878 W | rafthttp: health check for peer afdb501080a14e32 could not connect: x509: certificate signed by unknown authority (possibly because [...] authority certificate "kube-ca")
2021-10-14 16:24:59.090358 E | etcdserver: publish error: etcdserver: request timed out
2021-10-14 16:24:59.127466 I | embed: rejected connection from "192.168.2.118:35420" (error "remote error: tls: bad certificate", ServerName "")
2021-10-14 16:24:59.127927 I | embed: rejected connection from "192.168.2.118:35422" (error "remote error: tls: bad certificate", ServerName "")

Cleaning nodes properly is described in Rancher Docs: Removing Kubernetes Components from Nodes. Please supply the full debug log of rke up (rke --debug up) after cleaning the existing nodes or using new nodes.

I have no rke executable…
/var/lib/rancher/rke is a directory with logs

Right, you are using Rancher. OK, the provisioning log is also printed in the Rancher container, so that will help. Cleaning still applies to the nodes you are trying to add to the cluster using the docker run command.
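As a rough idea of what that cleanup involves, here is a minimal dry-run sketch. The directory list is a partial one I am assuming from memory of the docs, not the full list (the documented procedure also unmounts kubelet mounts first, which is left out here), and the helper names run and clean_node are made up for illustration. It only prints what it would do unless APPLY=1 is set.

```shell
#!/bin/sh
# Partial list of directories an RKE-provisioned node tends to leave behind
# (assumption: check the Rancher docs for the complete list).
RKE_DIRS="/etc/kubernetes /etc/cni /opt/cni /var/lib/etcd /var/lib/cni /var/lib/kubelet /var/lib/rancher"

# Print the command by default; only execute it when APPLY=1 is set.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        sh -c "$1"
    else
        echo "would run: $1"
    fi
}

# WARNING: this removes ALL containers and volumes on the host, including
# a Rancher server container if one happens to run on the same machine.
clean_node() {
    run 'docker ps -qa | xargs -r docker rm -f'
    run 'docker volume ls -q | xargs -r docker volume rm'
    for dir in $RKE_DIRS; do
        run "rm -rf $dir"
    done
}
```

Calling clean_node as root on a node shows the plan; APPLY=1 clean_node actually runs it. For the certificate errors in this thread, the leftover /etc/kubernetes directory is the part that most likely matters.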

You can change log levels as described on Rancher Docs: Logging
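For reference, a small sketch of how that could look on the host running the Rancher container. As far as I recall the docs describe a loglevel helper inside the container; treat that as an assumption and check the Logging page if it fails. The function names and the rancher-debug.log file name are made up for illustration.

```shell
#!/bin/sh
# Keep only log lines that mention certificate or TLS trouble.
filter_cert_errors() {
    grep -E 'x509|tls:|certificate'
}

# Raise the Rancher server log level, save the full log to a file for
# sharing, and print the certificate-related lines for a quick look.
# $1 is the ID or name of the rancher/rancher container (see: docker ps).
rancher_debug_logs() {
    docker exec "$1" loglevel --set debug
    docker logs "$1" > rancher-debug.log 2>&1
    filter_cert_errors < rancher-debug.log
}
```

Run rancher_debug_logs with the container ID while the cluster is provisioning, then attach rancher-debug.log in full; the filtered output is only a convenience.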

I can re-provision quicker than the cleaning… is that more/less helpful?

It is usually leftover certificates on the nodes or a date/time mismatch, but we need full logs from Rancher to diagnose further.
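Both of those can be checked quickly from one machine. The sketch below assumes SSH access to each node, that openssl is installed there, and that the cluster CA lives at /etc/kubernetes/ssl/kube-ca.pem (the usual RKE location, but verify it on your nodes; reading it may need root). The function names are made up for illustration.

```shell
#!/bin/sh
# Absolute difference between two epoch timestamps, in seconds.
abs_skew() {
    diff=$(( $1 - $2 ))
    if [ "$diff" -lt 0 ]; then
        diff=$(( 0 - diff ))
    fi
    echo "$diff"
}

# For each host: report clock skew against this machine and print the
# SHA-256 fingerprint of the node's kube-ca certificate, if present.
check_nodes() {
    for host in "$@"; do
        local_now=$(date -u +%s)
        remote_now=$(ssh "$host" date -u +%s) || continue
        echo "$host: clock skew $(abs_skew "$local_now" "$remote_now")s"
        ssh "$host" openssl x509 -noout -fingerprint -sha256 \
            -in /etc/kubernetes/ssl/kube-ca.pem
    done
}
```

For example, check_nodes 192.168.2.116 192.168.2.117 192.168.2.118 192.168.2.119. A skew of more than a few seconds, or fingerprints that differ between nodes, would point at one of the two usual causes.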

Yes, creating a new cluster and adding newly created nodes is the best way to rule that out (except date/time, obviously).

These are Alpine VMs on a Proxmox cluster. The Proxmox nodes have NTP working.
I also added chrony to the Alpine VMs. Hosts and guests are all set to the CST6CDT timezone.
The VMs have static IPs (192.168.2.116-119), and a DNS entry exists for rancher1.homelab.

I am very confused why the “all-in-one” docker command would get tripped up on a clean set of VMs. The command consistently fails to automagically handle the cert stuff.
I can try other distros if Alpine is troublesome on some level.

Rebuilt all VMs with Ubuntu, and the cluster creation still doesn’t work.

I. Simply. Cannot. Make. This. Work.

Please provide the info needed to reproduce (exact OS, docker info, other settings) and the debug logging shown in the Rancher container while the cluster is being provisioned, so people can look into a possible root cause.

docker@rancher1:~$ docker info
Client:
Context: default
Debug Mode: false

Server:
Containers: 4
Running: 2
Paused: 0
Stopped: 2
Images: 5
Server Version: 20.10.8
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Cgroup Version: 1
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: io.containerd.runtime.v1.linux runc io.containerd.runc.v2
Default Runtime: runc
Init Binary: docker-init
containerd version: e25210fe30a0a703442421b0f60afac609f950a3
runc version:
init version: de40ad0
Security Options:
apparmor
seccomp
Profile: default
Kernel Version: 5.4.0-88-generic
Operating System: Ubuntu Core 18
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 15.64GiB
Name: rancher1
ID: V5AS:HUHS:FRBX:YOST:4L4E:XK3D:GBCG:JF37:OBQA:ZQBA:657W:4ZLK
Docker Root Dir: /var/snap/docker/common/var-lib-docker
Debug Mode: false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false

WARNING: No swap limit support
docker@rancher1:~$

Ubuntu 20.04.3 (On Proxmox 7.0.13)

On the web gui, it’s popping up “[etcd] Failed to bring up Etcd Plane: Failed to start [etcd-fix-perm] container on host [192.168.2.116]: Error response from daemon: error while creating mount source path ‘/var/lib/etcd’: mkdir /var/lib/etcd: read-only file system”

docker@rancher1:~$ ls -ld /var/lib/etcd/
drwxr-xr-x 2 root root 4096 Oct 18 14:26 /var/lib/etcd/
docker@rancher1:~$

Could this be apparmor related?

I think this means you installed Docker using snap, which has been badly broken before (or was never fixed). Please install from upstream sources (Install Docker Engine on Ubuntu | Docker Documentation).
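The giveaway in the docker info output above is the Docker Root Dir under /var/snap. A small sketch to check for that on any node; the function name docker_install_kind is made up for illustration, and the only thing it tests is the path prefix.

```shell
#!/bin/sh
# Classify a Docker root directory: snap installs keep data under /var/snap.
docker_install_kind() {
    case "$1" in
        /var/snap/*) echo "snap" ;;
        *)           echo "other" ;;
    esac
}

# Only query the daemon when the docker CLI is actually available.
if command -v docker >/dev/null 2>&1; then
    root_dir=$(docker info --format '{{.DockerRootDir}}' 2>/dev/null) || true
    echo "Docker root dir: ${root_dir:-unknown} ($(docker_install_kind "$root_dir"))"
fi
```

If it reports snap, remove that package and reinstall Docker from the upstream repository before re-running the agent command.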


Leave it to Ubuntu, eh…
I selected Docker during the Ubuntu install, so yeah, that could be it. Goodbye Ubuntu, hello regular Debian.

Gave up. Will try again after a new rancher/server/something version.
An infrastructure guy like myself has no chance when things aren’t working.