Removing node from downstream server

We need to migrate the content of /var/lib/rancher to a secondary disk because of space problems.
In order to do that, we thought that:

  • making a first rsync to the secondary disk
  • stopping and disabling the rancher-agent and rke2-server services
  • rebooting the node
  • running rsync again
  • clearing the current content of /var/lib/rancher
  • mounting the secondary disk under /var/lib/rancher
  • enabling and starting rancher-agent and rke2-server

would do the trick. Unfortunately not!

The first rsync takes ages to complete.
And once the services are restarted, the pods keep bouncing and the node never finishes converging.
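
For reference, the two-pass copy described above could be written down as a small dry-run plan. This is only a sketch under assumptions: the device name /dev/sdb1, the staging mount point /mnt/rancher-new, and the unit name rancher-system-agent are hypothetical (the post only says "rancher-agent"), and nothing here explains why the node failed to converge afterwards.

```python
import shlex
import subprocess

TARGET = "/var/lib/rancher"
STAGING = "/mnt/rancher-new"   # hypothetical temporary mount point of the secondary disk
DEVICE = "/dev/sdb1"           # hypothetical device name of the secondary disk
SERVICES = ["rancher-system-agent", "rke2-server"]  # assumed systemd unit names


def migration_plan(target=TARGET, staging=STAGING, device=DEVICE, services=SERVICES):
    """Return the commands for both phases of the two-pass copy."""
    # Trailing slashes make rsync copy the directory contents, not the directory.
    # -aHAX keeps permissions, hard links, ACLs and extended attributes.
    rsync = ["rsync", "-aHAX", "--numeric-ids", "--delete",
             "--exclude=/lost+found", target + "/", staging + "/"]
    before_reboot = [
        ["mkdir", "-p", staging],
        ["mount", device, staging],
        rsync,                                          # pass 1: bulk copy, services still up
        ["systemctl", "disable", "--now", *services],
        ["systemctl", "reboot"],                        # make sure no pod process survives
    ]
    after_reboot = [
        ["mount", device, staging],                     # staging mount does not survive the reboot
        rsync,                                          # pass 2: only what changed since pass 1
        ["umount", staging],
        ["find", target, "-mindepth", "1", "-delete"],  # empty the old directory
        ["mount", device, target],                      # an /etc/fstab entry is still needed
        ["systemctl", "enable", "--now", *services],
    ]
    return {"before_reboot": before_reboot, "after_reboot": after_reboot}


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print("+", shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
```

Calling run(migration_plan()["before_reboot"]) only prints the commands, which makes it easy to review the order before running anything as root with dry_run=False.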

I then tried to follow the steps described here: Removing Kubernetes Components from Nodes | Rancher

The idea is to remove the node from the cluster, clean /var/lib/rancher, and re-provision the node once the secondary disk is mounted.

As described, I go to the Rancher UI and delete the node from the cluster.
According to the documentation, the deletion process should trigger a “cleanup”.

It does not seem to be the case: the node automatically rejoins the cluster and the folders are not cleared.

Am I doing something wrong?
Should the script system-agent-uninstall.sh be called at some point?
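
One way to check whether any cleanup actually ran is to list which state directories are still populated on the node. The sketch below is an illustration only: /var/lib/rancher comes from this post, while the other paths are assumptions about where RKE2 and its CNI usually keep state.

```python
from pathlib import Path

# /var/lib/rancher comes from this thread; the other paths are assumptions
# about where RKE2 and its CNI usually keep state.
CANDIDATES = [
    "/var/lib/rancher",
    "/etc/rancher",
    "/var/lib/kubelet",
    "/etc/cni",
    "/opt/cni",
    "/var/lib/cni",
]


def leftovers(paths=CANDIDATES, root="/"):
    """Return the candidate paths that still exist and are not empty directories."""
    found = []
    for p in paths:
        full = Path(root) / p.lstrip("/")
        if not full.exists():
            continue
        # A file counts as a leftover; a directory only if it has entries.
        if not full.is_dir() or any(full.iterdir()):
            found.append(p)
    return found
```

Run as root on the node, leftovers() returning a non-empty list after the node was deleted in the UI would suggest that no cleanup happened. The root parameter only exists so the function can be tried against a scratch directory.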

Hi jaepetto,
We had the same problem. After you remove the node from the cluster, you need to manually clean up the RKE2 components. You can use the following guide: Cleaning up Nodes

Hi @marco76 ,

thanks for the tip. That’s what we finally ended up doing.
Basically, the flow looks like this:

  • Download and run the system-agent-uninstall.sh script
  • reboot
  • run the rke2-uninstall.sh script
  • reboot again
  • Make the changes on the machine (here, mount our secondary disk under /var/lib/rancher)
  • (manually remove the node and the machine in the Rancher UI)
  • Re-introduce the machine in the cluster with the registration script (system-agent-install.sh)
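
Because the flow above spans two reboots, one option is a small resumable runner that records the last completed step in a state file and picks up from there after each boot. This is a rough sketch, not the procedure we actually scripted: the device name /dev/sdb1, the script locations, and the step names are hypothetical, and the two Rancher-side steps stay manual.

```python
import json
import shlex
import subprocess
from pathlib import Path

DEVICE = "/dev/sdb1"  # hypothetical device name of the secondary disk

# (name, commands, reboot_after); commands=None marks a manual step.
# Script locations are assumptions: adjust to where the scripts really live.
STEPS = [
    ("uninstall-agent", [["sh", "./system-agent-uninstall.sh"]], True),
    ("uninstall-rke2", [["rke2-uninstall.sh"]], True),
    ("mount-disk", [["mkdir", "-p", "/var/lib/rancher"],
                    ["mount", DEVICE, "/var/lib/rancher"]], False),
    ("remove-in-rancher-ui", None, False),
    ("register", None, False),  # paste the registration command from the Rancher UI
]


def _done(state_file):
    """Names of the steps already recorded in the state file."""
    return json.loads(state_file.read_text()) if state_file.exists() else []


def pending_steps(state_file):
    """Steps not yet recorded in the state file, in order."""
    done = _done(state_file)
    return [step for step in STEPS if step[0] not in done]


def advance(state_file, dry_run=True):
    """Handle the first pending step, record it, return its name (None when finished)."""
    steps = pending_steps(state_file)
    if not steps:
        return None
    name, commands, reboot_after = steps[0]
    if commands is None:
        # Manual steps are only announced; the operator is trusted to do them.
        print(f"[manual] {name}")
    for cmd in commands or []:
        print("+", shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
    # Record progress before rebooting so the next run resumes at the next step.
    # Note: a dry run records progress too, so use a throwaway state file for it.
    state_file.write_text(json.dumps(_done(state_file) + [name]))
    if reboot_after:
        print("+ systemctl reboot")
        if not dry_run:
            subprocess.run(["systemctl", "reboot"], check=True)
    return name
```

The state file has to live outside /var/lib/rancher (for example somewhere under /root, path hypothetical) so that it survives the uninstall scripts. Calling advance() once per boot with dry_run=False would then walk through the list one step at a time.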

Unfortunately, this process is really far from perfect:

  • each reboot takes up to 10 minutes
  • it involves some manual processing (this is probably a different topic)
  • we had to do it on 75 machines…

I really hoped there was a better way to do this, since we’ll have to do it again in the near future (configuring LACP, which will change the network card names, which, in turn, will have an impact on Calico / iptables).

Emmanuel