Creating new clusters is very flaky - resolved

Greetings,

I am unable to create a new cluster. Every time I try, I get different error messages.

Rancher version: 2.1.4
All three nodes are Ubuntu 16.10 VMs, up to date with the latest packages.

I decided to “clean slate” everything:

  • Ran ‘docker stop’ on every container on each node (see the sketch after this list).
  • Ran ‘docker system prune -a’ on each node.
  • In Rancher, deleted each node and then the cluster.
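
For reference, this is roughly what that amounted to on each node. A minimal sketch, not the exact commands I typed:

    # Stop every running container on the node.
    docker stop $(docker ps -q)
    # Remove stopped containers, unused networks, all unused images, and build cache.
    # Note: this does NOT touch named volumes or host directories.
    docker system prune -a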

I rebuilt the cluster:

  • In Rancher, created a new cluster
  • “custom” node type
  • “Canal” network provider
  • “custom” cloud provider
  • I chose each node to serve all three roles: etcd, control plane, and worker
  • I copied and pasted the generated docker run command into each VM
  • Rancher now reports three nodes, but I am getting this error:
This cluster is currently Provisioning; areas that interact directly with it will not be available until the API is ready.

[network] Host [10.10.55.223] is not able to connect to the following ports: [10.10.55.224:2379, 10.10.55.224:2380]. Please check network policies and firewall rules
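
Those are the etcd client and peer ports. A quick way to check whether they are actually reachable from the complaining node (assuming netcat is installed; the IPs are just the ones from the error above):

    # Run on 10.10.55.223: test TCP connectivity to etcd on 10.10.55.224.
    nc -zv 10.10.55.224 2379    # etcd client port
    nc -zv 10.10.55.224 2380    # etcd peer port
    # If these fail, check the host firewall on the target node.
    sudo ufw status verbose
    sudo iptables -L -n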

I have deleted everything and tried this a few times; every attempt ends with an error. What am I doing wrong?

Frustrated, I decided to completely start over with all new installs:

  • I built a new Rancher server (2.1.15; I saw that some cluster bugs had been fixed)
  • I built 3 new Ubuntu 16.04 server VMs and installed Docker (v17.03.2) - see the note after this list
  • I built a new cluster in Rancher using “Custom” nodes. Each node serves all three roles.
  • I copied and pasted the Rancher provisioning command into each terminal and left it to run.
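
On the Docker install: if I recall correctly, Rancher publishes version-pinned Docker install scripts, so putting a supported version on each Ubuntu VM should be roughly this (treat the URL as an assumption and check it against the Rancher node-requirements docs first):

    # Install a Rancher-supported Docker version (17.03.x) on each node.
    # URL assumed from Rancher's node-requirements docs; review before piping to sh.
    curl https://releases.rancher.com/install-docker/17.03.sh | sh
    # Optional: let the login user run docker without sudo.
    sudo usermod -aG docker $USER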

Six hours later, the new cluster was in an error state:
[controlPlane] Failed to bring up Control Plane: Failed to verify healthcheck: Failed to check https://localhost:6443/healthz for service [kube-apiserver] on host [10.10.55.222]: Get https://localhost:6443/healthz: read tcp [::1]:43414->[::1]:6443: read: connection reset by peer, log: I0129 15:07:03.469022 1 plugins.go:161] Loaded 6 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,Priority,PersistentVolumeClaimResize,ValidatingAdmissionWebhook,ResourceQuota.
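
When it gets stuck like this, it helps to look at the apiserver directly on the affected node. A rough sketch of what I would check (RKE runs kube-apiserver as a plain Docker container on the node):

    # Is the kube-apiserver container running, and what does it log?
    docker ps -a --filter name=kube-apiserver
    docker logs --tail 100 kube-apiserver
    # Probe the same endpoint Rancher checks. Without a client cert this may return
    # 401/403, but the "connection reset by peer" above is a lower-level failure,
    # so even an HTTP error response would be useful information.
    curl -kv https://localhost:6443/healthz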

I rebooted each node and waited, watching Rancher cycle through the standard cluster provisioning messages.

I came back to the exact same error.

I resolved this issue.

Re-using RKE-provisioned Kubernetes nodes is not as simple as just stopping all the containers. I found this script from superseb that cleans the nodes up and makes them usable again:

https://gist.githubusercontent.com/superseb/2cf186726807a012af59a027cb41270d/raw/7cfbce916809e7b2474a73a3da367b1a7f4ac9cf/cleanup.sh
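
For anyone else hitting this, this is roughly how I ran it on each node. Read the script before running it; it cleans up the containers along with the leftover Kubernetes/Rancher state:

    # Download the cleanup script, review it, then run it as root on each node.
    curl -LO https://gist.githubusercontent.com/superseb/2cf186726807a012af59a027cb41270d/raw/7cfbce916809e7b2474a73a3da367b1a7f4ac9cf/cleanup.sh
    less cleanup.sh
    sudo bash cleanup.sh
    # Optionally reboot to clear any leftover mounts and network interfaces.
    sudo reboot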

This seems like something that needs to go into the docs!

You should follow these steps to remove Rancher cleanly: https://rancher.com/docs/rancher/v2.x/en/admin-settings/removing-rancher/user-cluster-nodes/