Rancher 2.3.5 no longer able to provision nodes using VMware on-prem plugin

I was having the same issue with 2.3.4. It happens when provisioning control plane, etcd, and worker nodes. Below is the error.
[workerPlane] Failed to bring up Worker Plane: [Failed to start [kubelet] container on host [x.x.x.x]: Error response from daemon: path /var/lib/rancher is mounted on / but it is not a shared mount]

My node template is set to Install from boot2docker ISO.

I’m also on the latest version of vCenter 6.7, and both ESXi and vCenter are fully up to date.

Same issue here; I posted a few days ago: On-Premise VSphere provisioning with Rancher 2.3.3 and 2.3.4. My workaround was to downgrade to 2.3.2, which still talks to my vCenter environment reliably.

I am currently stuck on 2.3.2 because this is STILL not fixed. How is this not a much bigger, more urgent issue instead of something being left behind?

THAT is a question I can speculate on.

Kubernetes in particular, and server software in general, have been moving in the direction of “Cloud First” for a while. People running on-premise infrastructure are definitely in the position of second-class operators today when it comes to stuff working out-of-the-box. We wouldn’t be having this conversation if we were all living our best lives on GKE or AWS, where provisioning, load balancers, and network infrastructure were too pedestrian for us to worry our pretty little heads about.

In the land of servers you can touch, servers that your boss buys for you and that live in a room in your office you can lock and unplug and generally control access to, you had better be prepared to research and adapt. An excellent example is MetalLB, which, as VMware-provider compadres, we are both probably running. The official K8s and Rancher documentation, when I had to figure this out, basically said that if you’re running on-prem you’re screwed, and good luck trying to run something like Jupyter. It wasn’t a huge imposition to hunt down and research MetalLB, but having to deal with it certainly sent a message that our needs would be served second.

I hasten to add that I’m tremendously grateful for Rancher. It’s an excellent piece of software that basically eliminated the barriers to entry for running Kubernetes in my organization. I loves it so; really I do. However, the attitude in the general containerized-infrastructure community remains, and I don’t expect it will get more inclusive as the rush toward ever more abstract infrastructure continues.

I also rolled back to 2.3.2 and all is working again. Please advise when a fix is available.

@areed Brought this into Slack; I’m following up there.

While this is yet to be tested in his environment, it should be corrected either by updating the Docker daemon’s service file to run the following command before dockerd itself, or by creating a new service that runs said command as a prerequisite to starting the Docker daemon:
mount --make-rshared /
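
For example, on a systemd host, a drop-in for docker.service along these lines should do it (the file name here is illustrative):

    # /etc/systemd/system/docker.service.d/make-rshared.conf (illustrative name)
    [Service]
    ExecStartPre=/bin/mount --make-rshared /

Run systemctl daemon-reload and restart docker afterwards for it to take effect.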

For “testing”, you can stop your Docker daemon, run that command, and restart the Docker daemon. If everything works, make the change permanent by editing your service files as described above.
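
On a systemd host, for instance, that quick test looks like this:

    systemctl stop docker
    mount --make-rshared /
    systemctl start docker

    # verify that / now has shared propagation
    findmnt -o TARGET,PROPAGATION /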

Example from my Alpine host (OpenRC):
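Something along these lines; the service name and contents here are an illustrative sketch, not a verbatim copy:

    #!/sbin/openrc-run
    # /etc/init.d/make-rshared (illustrative name)

    description="Make / a shared mount before docker starts"

    depend() {
        before docker
    }

    start() {
        mount --make-rshared /
    }

Then chmod +x the script and rc-update add make-rshared default so it runs before docker.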

This is ultimately an “issue” with your environment combined with this becoming a requirement upstream (in Kubernetes). It obviously isn’t set up to be self-correcting as-is, which has been discussed upstream. So it’s not a Rancher issue per se, but you’re feeling the effects of upstream changes.

That is, if your issue is exactly this and is fixed as mentioned.

Feel free to seek additional help here or (likely more quickly) on Slack.

You are reading far too much into tea leaves that don’t exist; there’s no vast conspiracy against on-premise people. The cloud providers are easier because they’re a concrete solution in a box. On your own hardware there are a hundred decisions you have to make that Google has already made for you.

The docs mention MetalLB. There are many alternatives, including hardware devices that are popular among people who are already buying physical hardware to run VMware on.

In fact you’ve chosen a particularly apt example, as we as a company are currently trying to save MetalLB by transitioning maintenance of it to us or friends, since the primary author is burned out and basically abandoning it (as is his right).

The reasons your issue is not a “bigger issue” are:

  1. It’s not an “issue” at all; you’re talking about it in the forums instead of in a GitHub issue, where actual bugs are documented with steps that can be reproduced, confirmed by QA, and prioritized for fixing in a future release.

  2. It’s not clear at all that any two of you actually have the same root problem. There are infinite ways to make a cluster that doesn’t work, and the two posts here that actually document the error they’re getting back seem entirely unrelated.

  3. Your problem description essentially takes the form of “X is broken for me, therefore it must be broken for everyone”, which is rarely true. The entire feature of using VMware to deploy VMs and make a cluster is not broken. There are lots of people using it just fine in >=2.3.3.

  4. But a lot did change at that time, including the version of the libraries used to talk to vSphere and the way we allow/encourage you to use templates for the VM instead of boot2docker. There were issues for VMware in each of those patches. So I don’t doubt you guys have a problem…

  5. …but the relevant question is: what is special about your requests to make the cluster/VM, or about the external environment being used (e.g. the version or configuration of VMware), that is making it fail in what specific way for your specific setup?

I’m happy to hear you’re taking MetalLB under your wing, and also that there is not a vast cloud-first conspiracy to deprive us of doing things the hard way. :slight_smile:

I suppose the next logical step would be to file a bug report on GitHub. To be honest, I kind of expected it was a problem with my specific implementation of vSphere, and in posting to the forums I thought I would receive wisdom that I had merely failed to configure the Java Mcdoodle 3 clicks to the left, and then all would be well. Now we know there is a class of on-prem vSphere users who are experiencing this in common, which means it might actually be a bug.

Thanks for your thoughtful response. I will try to gather facts to do a proper report as I’m able.