On-Premise vSphere provisioning with Rancher 2.3.3 and 2.3.4

Greetings, Rancher friends,

I’ve been using Rancher for about a year and I’ve been loving it. Recently, I updated one of my controllers to Rancher 2.3.3, and then attempted to scale my cluster with a new node. Instead of the normal “yes sir, right away sir” with a new chunk of compute being carved out, I got back:

Error creating machine: Error in driver during machine creation: 500 Internal Server Error

How puzzling! I fiddled with it for a while, but being pressed for time, I downgraded my cluster back to 2.3.2, at which time I was able to provision my new node and go about my business.

Today, I’ve been setting up an entirely new Rancher controller in a new isolated on-premise network segment (our requirements are that each client have their own fully independent data platform) and saw that the rancher/rancher Docker tag v2.3.4 had been released. “How exciting!” I exclaimed, figuring that the provisioning weirdness would be gone. But, alas, it’s not.

The Problem

When I set up my cloud credentials and node template for vCenter and then launch a cluster, it makes no difference whether I use the old-school Boot2Docker image or clone from my local RancherOS 1.5.5 template: the VM comes online and then, when it attempts to load the user-data.iso file into my main NAS volume, the web interface returns the 500 Internal Server Error message. Looking deeper, in my vCenter logs, I find these messages:

[NFC ERROR] NfcFssrvrProcessErrorMsg: received NFC error 4 from server: NfcFssrvrOpen: Failed to open '[Argus VMFS 1]VMFS/user-data.iso'
(followed by)
[NFC ERROR] NfcFssrvr_FileOpen: Failed to open file '[Argus VMFS 1]VMFS/user-data.iso': A file error was encountered (NFC_FILE_ERROR)

Further, when I tried to fine-tune my configuration with the (very nice) new vSphere configuration options, I inevitably had to specify custom cloud_provider YAML options for vsphereCloudProvider. However, when I finished the configuration and launched the new cluster, I saw the same problem. When I then went in to edit the ever-provisioning-and-dying cluster, I discovered that all my custom cloud provider parameters had been removed from the configuration. They weren’t there!
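For readers unfamiliar with it, the block in question has roughly the following shape. This is only a sketch based on the RKE vSphere cloud provider layout, not the exact configuration from this cluster: the vCenter hostname, user, password, datacenter, and folder are placeholders, and only the datastore name is taken from the log lines above. In the Rancher cluster YAML editor this section may sit under rancher_kubernetes_engine_config rather than at the top level.

```yaml
# Sketch only: every value below is a placeholder unless noted otherwise.
cloud_provider:
  name: vsphere
  vsphereCloudProvider:
    global:
      # Skip TLS verification, e.g. for a self-signed vCenter certificate.
      insecure-flag: true
    virtual_center:
      vcenter.example.local:            # placeholder vCenter hostname
        user: provisioner@vsphere.local # placeholder account
        password: "changeme"            # placeholder
        port: 443
        datacenters: /ExampleDC         # placeholder datacenter path
    workspace:
      server: vcenter.example.local     # must match a virtual_center key
      datacenter: /ExampleDC
      folder: /ExampleDC/vm/kubernetes  # placeholder VM folder
      default-datastore: "Argus VMFS 1" # datastore name from the NFC errors
```

If a block like this disappears after saving, comparing the YAML shown in the edit view against what was submitted is a quick way to confirm whether it was dropped at save time or later.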

So, the questions are: Is Rancher ignoring my cloud provider directives, causing it to not know how to talk to my VMFS storage, even though I have cloud credentials and storage options defined elsewhere? Is there some issue with YAML validation on the cloud provider config that would cause it to be thrown out? Is this because my vCenter is using a self-signed certificate and my global.insecure-flag true directive is being ignored? Am I being dumb in some other obvious way concerning network topology, credentials, or certificate management? It’s vexing to me that this has only been an issue in the last two point releases. Please advise on the proper way to do this!

Bump. Looks like another user has run into the same issue: Rancher 2.3.5 no longer able to provision nodes using vmware on prem plugin

Likewise, I attempted to upgrade my development environment to see whether the issue had been resolved. It shows the same problem, so we’re still stuck on 2.3.2.