We got an RKE2 cluster provisioned, but we have to intervene in the deployment process to get it to work. Once we provision the cluster via the Rancher UI, it gets stuck at “Waiting for agent to check in and apply initial plan”.
When we SSH onto the node, we can see that the agent service we checked for does not exist:
xxxxx:~ # systemctl status rancher-agent-service
Unit rancher-agent-service.service could not be found.
We can then see in the cloud-init user data (/var/lib/cloud/instance/user-data.txt) the path to an install script: /usr/local/custom_script/install.sh
In the cloud init logs we see
2022-11-30 14:49:22,724 - util.py[DEBUG]: Writing to /usr/local/custom_script/install.sh - wb:  29921 bytes
2022-11-30 14:49:22,725 - util.py[DEBUG]: Changing the ownership of /usr/local/custom_script/install.sh to 0:0
We see these permissions on the script
-rw-r--r-- 1 root root 29921 Nov 30 09:49 install.sh
And we are unable to run the script manually
xxxxxx:/usr/local/custom_script # ./install.sh
-bash: ./install.sh: Permission denied
Once we change the permissions using chmod +x install.sh, we can run the script
xxxxxx:/usr/local/custom_script # ./install.sh
[INFO] --no-roles flag passed, unsetting all other requested roles
[INFO] Using default agent configuration directory /etc/rancher/agent
[INFO] Using default agent var directory /var/lib/rancher/agent
[WARN] /usr/local is read-only or a mount point; installing to /opt/rancher-system-agent
[INFO] Determined CA is necessary to connect to Rancher
[INFO] Successfully downloaded CA certificate
[INFO] Value from https://ranche-xxxx-.com/cacerts is an x509 certificate
[INFO] Successfully tested Rancher connection
[INFO] Downloading rancher-system-agent binary from https://rancherxxxx.com/assets/rancher-system-agent-amd64
[INFO] Successfully downloaded the rancher-system-agent binary.
[INFO] Downloading rancher-system-agent-uninstall.sh script from https://rancher-xxxx.com/assets/system-agent-uninstall.sh
[INFO] Successfully downloaded the rancher-system-agent-uninstall.sh script.
[INFO] Generating Cattle ID
curl: (28) Operation timed out after 60001 milliseconds with 0 bytes received
[ERROR] 000 received while downloading Rancher connection information. Sleeping for 5 seconds and trying again
[INFO] Successfully downloaded Rancher connection information
[INFO] systemd: Creating service file
[INFO] Creating environment file /etc/systemd/system/rancher-system-agent.env
[INFO] Enabling rancher-system-agent.service
Created symlink /etc/systemd/system/multi-user.target.wants/rancher-system-agent.service → /etc/systemd/system/rancher-system-agent.service.
[INFO] Starting/restarting rancher-system-agent.service
This starts rancher-system-agent.service, and from there the cluster provisions successfully.
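For reference, our manual workaround boils down to something like the sketch below. The ensure_executable helper name is our own, made up for illustration (it is not part of Rancher or cloud-init), and the path in the usage comment is the one from our user-data. It only works around the symptom; it does not explain why the script is written without the execute bit.

```shell
#!/bin/sh
# Hypothetical helper: add the execute bit to a script if it is missing,
# and report what was done. The path is passed in, nothing is hard-coded.
ensure_executable() {
    script="$1"
    if [ ! -f "$script" ]; then
        echo "missing: $script" >&2
        return 1
    fi
    if [ -x "$script" ]; then
        echo "already executable: $script"
    else
        chmod +x "$script"
        echo "made executable: $script"
    fi
}

# Usage on the node, as root (path taken from our user-data):
#   ensure_executable /usr/local/custom_script/install.sh \
#       && /usr/local/custom_script/install.sh
```

After this, systemctl status rancher-system-agent should show the service that install.sh creates.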
The issue we need help with is: why does this install.sh script not run on its own?