Rancher is the central point for RKE provisioning. When using the RKE CLI standalone, you configure a YAML file (cluster.yml) with the nodes and the cluster configuration. In Rancher, the nodes are registered in Rancher (custom nodes are registered by running a command on the node; node drivers create machines in a provider and run that command automatically), and Rancher then triggers the RKE provisioning process with the registered nodes and the configuration that is present for that cluster in Rancher.
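For comparison, a minimal standalone cluster.yml for the RKE CLI might look like this (addresses, users, and roles are placeholders):

```yaml
# Illustrative standalone RKE cluster.yml (values are placeholders)
nodes:
  - address: 10.0.0.10
    user: ubuntu
    role: [controlplane, etcd]
  - address: 10.0.0.11
    user: ubuntu
    role: [worker]
addon_job_timeout: 30
```

With Rancher, you never maintain this file by hand; the equivalent node list and options live in the cluster object inside Rancher.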
If you want to troubleshoot this, the starting point is Rancher and its logging. It logs the full provisioning process (similar to the RKE CLI's `rke up`), and you can raise the log level to get more info. The second part is the node itself: the containers that are created and started there will log the startup process and can indicate why the node is unable to start.
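As a rough sketch of where to look (the container names are the standard ones RKE creates; the `loglevel` command is for an HA Rancher install, so check the docs for your version):

```shell
# On the failing node: inspect the RKE-managed containers
docker ps -a
docker logs kubelet 2>&1 | tail -n 50
docker logs kube-apiserver 2>&1 | tail -n 50   # control plane nodes only

# Single-container Rancher install: follow the provisioning log
docker logs -f <rancher-container-id>

# HA Rancher install: raise the log level to debug (hedged, verify against your version)
kubectl -n cattle-system exec -it <rancher-pod> -- loglevel --set debug
```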
To trigger cluster provisioning, you need to modify the cluster in Rancher. There is no button to trigger provisioning, so if you have nothing to actually change but just want to kick off provisioning again, you can modify something that is irrelevant to you, for example changing addon_job_timeout from 30 to 31.
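In the cluster's "Edit as YAML" view in Rancher, that no-op change would look something like this (only the relevant key shown; the new value is arbitrary, it just has to differ):

```yaml
# Fragment of the cluster YAML in Rancher's "Edit as YAML" view
rancher_kubernetes_engine_config:
  addon_job_timeout: 31   # was 30; any changed value re-triggers provisioning
```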
Let me know if this helps, and if you have specific or more technical questions, I can try to answer that as well.
The scenario involves a failed control plane node; the cluster changes status to something like "failed", and most options are greyed out at that point.
This usually happens on AWS, where it's very easy to replace a node (we are running some disaster tests).
With a plain `rke up` using the RKE binary, I can edit the YAML file to remove the damaged node; with Rancher Manager, I see the reconciliation task trying to reach the deleted node via SSH (because the node info is still present in the Rancher Manager custom cluster).
I already tried patching the node's finalizers to remove it from the cluster as shown in the Rancher Manager UI, but the issue is the same: the reconciliation task keeps trying to SSH to the node and hangs with a message similar to "reconciliation".
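For reference, the finalizer patch I attempted was along these lines (the `nodes.management.cattle.io` kind and the c-xxxxx cluster namespace are what Rancher uses for custom clusters; the IDs below are placeholders):

```shell
# List Rancher's node objects for the cluster (namespace = cluster ID, e.g. c-xxxxx)
kubectl get nodes.management.cattle.io -n c-xxxxx

# Clear the finalizers on the stale node object so Rancher can drop it
kubectl patch nodes.management.cattle.io m-xxxxx -n c-xxxxx \
  --type=merge -p '{"metadata":{"finalizers":null}}'
```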
But I will check whether, with the cluster in a failed state, I can still change an item in the cluster's config, and whether that kicks off reconciliation with the new settings (i.e. without the affected node).