Install v2.2.2 on vsphere crashing rancher docker container

#1

We have been running Rancher for about 6 months.

We recently updated to the latest Rancher Docker image, and it will no longer install on vSphere 6.5. Everything we have tried has ended with the Rancher Docker container crashing.

docker logs -f shows where it fails. It’s the same spot every time:

2019/05/12 19:30:50 [INFO] (k8s-fm-20190512-master3) Waiting for VMware Tools to come online…
2019/05/12 19:30:57 [INFO] stdout: (k8s-fm-20190512-worker3) adding network: Private Network
2019/05/12 19:30:57 [INFO] (k8s-fm-20190512-worker3) adding network: Private Network
2019/05/12 19:30:58 [INFO] stdout: (k8s-fm-20190512-worker2) adding network: Private Network
2019/05/12 19:30:58 [INFO] (k8s-fm-20190512-worker2) adding network: Private Network
2019/05/12 19:30:59 [INFO] stdout: (k8s-fm-20190512-master1) adding network: Private Network
2019/05/12 19:30:59 [INFO] (k8s-fm-20190512-master1) adding network: Private Network
2019-05-12 19:31:00.503793 W | wal: sync duration of 2.715061271s, expected less than 1s
2019-05-12 19:31:01.795896 W | etcdserver: apply entries took too long [1.291324389s for 6 entries]
2019-05-12 19:31:01.795959 W | etcdserver: avoid queries with large range/delete range!
I0512 19:31:01.797276 6 trace.go:76] Trace[1764331517]: “GuaranteedUpdate etcd3: *unstructured.Unstructured” (started: 2019-05-12 19:30:59.8600067 +0000 UTC m=+229.067950235) (total time: 1.937155923s):
Trace[1764331517]: [1.936691888s] [1.935787s] Transaction committed
I0512 19:31:01.797279 6 trace.go:76] Trace[428383260]: “GuaranteedUpdate etcd3: *unstructured.Unstructured” (started: 2019-05-12 19:30:57.884785944 +0000 UTC m=+227.092729469) (total time: 3.912399887s):
Trace[428383260]: [3.91120186s] [3.91003126s] Transaction committed
I0512 19:31:01.797799 6 trace.go:76] Trace[1726044098]: “Update /apis/management.cattle.io/v3/namespaces/c-2m7xj/nodes/m-qmssh” (started: 2019-05-12 19:30:59.859130557 +0000 UTC m=+229.067074078) (total time: 1.938621682s):
Trace[1726044098]: [1.938206584s] [1.937449922s] Object stored in database
I0512 19:31:01.798475 6 trace.go:76] Trace[946890212]: “Update /apis/management.cattle.io/v3/namespaces/c-2m7xj/nodes/m-n9tkl” (started: 2019-05-12 19:30:57.883437464+0000 UTC m=+227.091380919) (total time: 3.91495619s):
Trace[946890212]: [3.914357412s] [3.913165632s] Object stored in database
I0512 19:31:01.799254 6 trace.go:76] Trace[504662872]: “GuaranteedUpdate etcd3: *unstructured.Unstructured” (started: 2019-05-12 19:30:58.545715951 +0000 UTC m=+227.753659460) (total time: 3.253454418s):
Trace[504662872]: [3.250158453s] [3.247995087s] Transaction committed
I0512 19:31:01.799817 6 trace.go:76] Trace[2110925815]: “Update /apis/management.cattle.io/v3/namespaces/c-2m7xj/nodes/m-s5jmr” (started: 2019-05-12 19:30:58.543620243 +0000 UTC m=+227.751563798) (total time: 3.256156852s):
Trace[2110925815]: [3.255670122s] [3.253850816s] Object stored in database
2019/05/12 19:31:01 [INFO] stdout: (k8s-fm-20190512-master1) Reconfiguring VM
2019/05/12 19:31:01 [INFO] (k8s-fm-20190512-master1) Reconfiguring VM
2019/05/12 19:31:01 [INFO] stdout: (k8s-fm-20190512-worker3) Reconfiguring VM
2019/05/12 19:31:01 [INFO] (k8s-fm-20190512-worker3) Reconfiguring VM
2019/05/12 19:31:01 [INFO] stdout: (k8s-fm-20190512-worker2) Reconfiguring VM
2019/05/12 19:31:01 [INFO] (k8s-fm-20190512-worker2) Reconfiguring VM
2019/05/12 19:31:02 [INFO] stdout: (k8s-fm-20190512-master2) adding network: Private Network
2019/05/12 19:31:02 [INFO] (k8s-fm-20190512-master2) adding network: Private Network
I0512 19:31:08.037062 6 leaderelection.go:231] failed to renew lease kube-system/kube-scheduler: failed to tryAcquireOrRenew context deadline exceeded
E0512 19:31:08.037167 6 server.go:207] lost master
lost lease

Once the Docker container restarts, it retries a few times, then deletes the VMs in vSphere and crashes at the same spot. It restarts and does the same thing again until I delete the cluster.

rancher/rancher:latest and rancher/rancher:v2.2.2-patch1-rc2 behave the same.

Any ideas? Anything I should try?

Thanks

Tim

#2

I have rolled back as far as v2.2.0 and also tried the :latest image from yesterday. I have set up a completely new VM and repeated the setup TWICE. This occurs every time. I’ve read it has something to do with the TLS certs. We are literally stuck here, as we cannot deploy ANYTHING now. Is this even being worked on? I’ve seen quite a few posts about it here and on GitHub.

Is there anyone I can contact on this? We are being forced to look for alternatives now as everything is dead in the water on creating new clusters on vsphere.

#3

I’m providing an update on this. We have gotten this to work, and I wanted to document the fix for others, as I’ve seen several people with similar issues.

  1. Make CERTAIN that the hosts file is correct on the server running the Rancher Docker container.

An entry for 127.0.0.1 localhost is a must.
We also had an incorrect entry for the server's own IP and hostname in the file. I personally believe this was the issue the entire time.

So in your /etc/hosts file you should have something like this (with your server's real IP and hostname in place of the placeholders):
127.0.0.1 localhost
1.1.1.1 myhostname

Make sure that the server's IP and its correct name are both there. I guess an FQDN with working reverse DNS might also solve it. In our case the IP was correct but the name was not. I think this was causing the TLS issues.
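For anyone who wants to check this without eyeballing the file, here is a minimal Python sketch of the two checks described above: that 127.0.0.1 maps to localhost, and that the server's hostname maps to the IP you expect. The function names parse_hosts and check_hosts are made up for this example, and 1.1.1.1 / myhostname are the same placeholders as above. It only inspects the text of a hosts file; it does not touch Rancher or DNS, and it is not an official diagnostic.

```python
import ipaddress


def parse_hosts(text):
    """Map each hostname or alias in hosts-file text to the list of IPs given for it."""
    mapping = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # strip comments and surrounding whitespace
        if not line:
            continue
        fields = line.split()
        if len(fields) < 2:
            continue  # an IP with no name is not a usable entry
        ip, names = fields[0], fields[1:]
        try:
            ipaddress.ip_address(ip)
        except ValueError:
            continue  # skip lines that do not start with a valid IP
        for name in names:
            mapping.setdefault(name.lower(), []).append(ip)
    return mapping


def check_hosts(text, hostname, expected_ip):
    """Return a list of problems found; an empty list means both checks passed."""
    mapping = parse_hosts(text)
    problems = []
    if "127.0.0.1" not in mapping.get("localhost", []):
        problems.append("missing '127.0.0.1 localhost' entry")
    ips = mapping.get(hostname.lower(), [])
    if not ips:
        problems.append(f"no entry for hostname {hostname!r}")
    elif expected_ip not in ips:
        problems.append(f"{hostname!r} maps to {ips}, expected {expected_ip}")
    return problems


# Placeholder data mirroring the example entries above.
sample = "127.0.0.1 localhost\n1.1.1.1 myhostname\n"
print(check_hosts(sample, "myhostname", "1.1.1.1"))
```

On a real server you would pass in the contents of /etc/hosts, the value of socket.gethostname(), and the IP you expect the server to have, then fix whatever the returned list reports.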

I am available to discuss if needed.