RancherOS on VMware failing to provision a cluster

Since upgrading to Rancher v2.1.1, I have been unable to deploy a cluster successfully on the first attempt.

Each time, the node VMs that get brought up may or may not get an IP address via DHCP. A reboot of the server doesn’t help; you have to power the server down and back up to get an IP.

Even if they do get an IP, Rancher doesn’t successfully deploy everything to the nodes, leaving the cluster down.

I’m seeing errors such as:

[network] Host [10.85.175.29] is not able to connect to the following ports: [10.85.175.36:2379]. Please check network policies and firewall rules
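
For what it’s worth, that port check is easy to reproduce by hand. Here’s a minimal sketch (plain Python 3, nothing Rancher-specific; the address and port are just the ones from the error above, so substitute your own) that attempts the same TCP connection from one node, or any machine on that network, to another:

import socket

# Address and port taken from the Rancher error above; substitute your own nodes.
targets = [("10.85.175.36", 2379)]

for host, port in targets:
    try:
        # Plain TCP connect with a timeout, roughly what Rancher's reachability check does.
        with socket.create_connection((host, port), timeout=5):
            print(f"{host}:{port} reachable")
    except OSError as exc:
        print(f"{host}:{port} NOT reachable: {exc}")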

Or, when looking at the etcd logs:
rafthttp: request cluster ID mismatch (got c8ba020dc536a627 want 6d31c0a3ea602366)

There shouldn’t be a cluster ID mismatch when these are brand-new VMs created by Rancher’s cluster creation process using RancherOS; that error usually means a member is joining with stale etcd data left over from a previous cluster, which shouldn’t exist here.

Rancher Agent logs:
Failed to connect to proxy. websocket: bad handshake

Error: failed to start containers: kubelet

  • sleep 2
  • docker start kubelet
    Error response from daemon: {"message":"No such container: kubelet"}

It’s all been wildly inconsistent, leaving me unable to bring up new clusters.

On the machines that don’t get a DHCP IP, there’s no network adapter (eth0) present at all, even though vSphere shows the adapter as connected and active.
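
In case it helps anyone reproduce this, here’s a minimal sketch (plain Python 3; assuming an interpreter is available on or alongside the node, since the default RancherOS console may not ship one) that lists the interfaces the kernel actually sees, confirming the adapter is truly absent rather than just failing DHCP:

import socket

# Enumerate the network interfaces the kernel has registered. An affected
# node prints only "lo", with no "eth0", despite vSphere reporting the
# virtual NIC as connected and active.
for index, name in socket.if_nameindex():
    print(index, name)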

Interesting… all the new K8s nodes I’m bringing up are registering their hostname in DNS as rancher.internal.mydomain.com, which happens to be the name of my Rancher server itself.

RancherOS starts with the default hostname rancher if you don’t set the hostname parameter or add the DHCP parameter force_hostname=true. Since every node comes up as rancher, they all register the same name in DNS, which would explain the collision you’re seeing. I think you can try adding this to your RancherOS nodes:

#cloud-config
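# give each node a unique hostname so nodes don't all register as "rancher"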
hostname: xxx