Creating new clusters is very flaky - resolved

Greetings,

I am unable to create a new cluster. Every time I try, I get different error messages.

Rancher version: 2.1.4
All three nodes are Ubuntu 16.10 VMs, up to date with the latest packages.

I decided to “clean slate” everything:

  • Ran ‘docker stop’ on every container on each node (see the sketch after this list).
  • Ran ‘docker system prune -a’ on each node.
  • In Rancher, deleted each node and then the cluster.
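
For reference, this is roughly what that amounted to on each node. A minimal sketch, not the exact commands I typed:

    # Stop every running container on the node.
    docker stop $(docker ps -q)
    # Remove stopped containers, unused networks, all unused images, and build cache.
    # Note: this does NOT touch named volumes or host directories.
    docker system prune -a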

I rebuilt the cluster:

  • In Rancher, created a new cluster
  • “custom” node type
  • “Canal” network provider
  • “custom” cloud provider
  • I chose each node to serve all three roles: etcd, control plane, and worker
  • I copied and pasted the generated docker run command into each VM
  • Rancher now reports three nodes, but I am getting this error:
This cluster is currently Provisioning; areas that interact directly with it will not be available until the API is ready.

[network] Host [10.10.55.223] is not able to connect to the following ports: [10.10.55.224:2379, 10.10.55.224:2380]. Please check network policies and firewall rules
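
Those are the etcd client and peer ports. A quick way to check whether they are actually reachable from the complaining node (assuming netcat is installed; the IPs are just the ones from the error above):

    # Run on 10.10.55.223: test TCP connectivity to etcd on 10.10.55.224.
    nc -zv 10.10.55.224 2379    # etcd client port
    nc -zv 10.10.55.224 2380    # etcd peer port
    # If these fail, check the host firewall on the target node.
    sudo ufw status verbose
    sudo iptables -L -n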

I have deleted everything and tried this a few times; every attempt ends with an error. What am I doing wrong?

Frustrated, I decided to completely start over with all new installs:

  • I built a new Rancher server (2.1.15; I saw that some cluster bugs had been fixed)
  • I built 3 new Ubuntu 16.04 server VMs and installed Docker (v17.03.2) - see the note after this list
  • I built a new cluster in Rancher using “Custom” nodes. Each node serves all three roles.
  • I copied and pasted the Rancher provisioning command into each terminal and left it to run.
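
On the Docker install: if I recall correctly, Rancher publishes version-pinned Docker install scripts, so putting a supported version on each Ubuntu VM should be roughly this (treat the URL as an assumption and check it against the Rancher node-requirements docs first):

    # Install a Rancher-supported Docker version (17.03.x) on each node.
    # URL assumed from Rancher's node-requirements docs; review before piping to sh.
    curl https://releases.rancher.com/install-docker/17.03.sh | sh
    # Optional: let the login user run docker without sudo.
    sudo usermod -aG docker $USER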

Six hours later, the new cluster was in an error state:
[controlPlane] Failed to bring up Control Plane: Failed to verify healthcheck: Failed to check https://localhost:6443/healthz for service [kube-apiserver] on host [10.10.55.222]: Get https://localhost:6443/healthz: read tcp [::1]:43414->[::1]:6443: read: connection reset by peer, log: I0129 15:07:03.469022 1 plugins.go:161] Loaded 6 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,Priority,PersistentVolumeClaimResize,ValidatingAdmissionWebhook,ResourceQuota.
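
When it gets stuck like this, it helps to look at the apiserver directly on the affected node. A rough sketch of what I would check (RKE runs kube-apiserver as a plain Docker container on the node):

    # Is the kube-apiserver container running, and what does it log?
    docker ps -a --filter name=kube-apiserver
    docker logs --tail 100 kube-apiserver
    # Probe the same endpoint Rancher checks. Without a client cert this may return
    # 401/403, but the "connection reset by peer" above is a lower-level failure,
    # so even an HTTP error response would be useful information.
    curl -kv https://localhost:6443/healthz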

I rebooted each node and waited, watching Rancher cycle through the standard cluster provisioning messages.

I came back to the exact same error.

I resolved this issue.

Re-using RKE-provisioned Kubernetes nodes is not as simple as just stopping all the containers. I found this script from superseb that cleans the nodes up and makes them usable again:

https://gist.githubusercontent.com/superseb/2cf186726807a012af59a027cb41270d/raw/7cfbce916809e7b2474a73a3da367b1a7f4ac9cf/cleanup.sh
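
For anyone else hitting this, this is roughly how I ran it on each node. Read the script before running it; it cleans up the containers along with the leftover Kubernetes/Rancher state:

    # Download the cleanup script, review it, then run it as root on each node.
    curl -LO https://gist.githubusercontent.com/superseb/2cf186726807a012af59a027cb41270d/raw/7cfbce916809e7b2474a73a3da367b1a7f4ac9cf/cleanup.sh
    less cleanup.sh
    sudo bash cleanup.sh
    # Optionally reboot to clear any leftover mounts and network interfaces.
    sudo reboot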

This seems like something that needs to go into the docs!

You should follow these steps to remove Rancher cleanly: https://rancher.com/docs/rancher/v2.x/en/admin-settings/removing-rancher/user-cluster-nodes/