Removing node from downstream server

jaepetto · July 23, 2024, 4:45am

We need to migrate the content of /var/lib/rancher to a secondary disk because of space problems.
In order to do that, we thought that:

making a first rsync to the secondary disk
stopping and disabling the rancher-agent and rke2-server services
rebooting the node
rsync again
clear the current content of /var/lib/rancher
mount the secondary disk under /var/lib/rancher
enable and start rancher-agent and rke2-server

would do the trick. Unfortunately not!

The first rsync takes ages to complete.
Upon restart of the services, the pods keep bouncing and the node never finishes converging.
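
For reference, the attempt above could be sketched roughly as the shell function below. This is only a sketch under assumptions: the device /dev/sdb1, the staging mount point /mnt/rancher-new and the `run` dry-run helper are made up for illustration, the unit names are the ones used in this post (they may be named differently on your nodes), and the reboot between the two rsync passes is left out. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch of the migration attempt described above. Prints the commands by
# default; set DRY_RUN=0 to really execute them (as root, at your own risk).
DRY_RUN="${DRY_RUN:-1}"
SRC="/var/lib/rancher"
NEW_DEV="${NEW_DEV:-/dev/sdb1}"                    # assumption: secondary disk partition
STAGING="${STAGING:-/mnt/rancher-new}"             # assumption: temporary mount point
SERVICES="${SERVICES:-rancher-agent rke2-server}"  # unit names as written in this post

# run: echo the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

migrate_rancher_dir() {
    run mkdir -p "$STAGING"
    run mount "$NEW_DEV" "$STAGING"
    # First pass while the services still run, to copy the bulk of the data.
    run rsync -aHAX --numeric-ids "$SRC/" "$STAGING/"
    for svc in $SERVICES; do
        run systemctl disable --now "$svc"
    done
    # Second pass with the services stopped; --delete drops files that
    # disappeared from the source since the first pass.
    run rsync -aHAX --numeric-ids --delete "$SRC/" "$STAGING/"
    run umount "$STAGING"
    # Empty the old directory (dotfiles included) before mounting over it.
    run find "$SRC" -mindepth 1 -delete
    run mount "$NEW_DEV" "$SRC"
    # Note: a matching /etc/fstab entry is still needed to survive a reboot.
    for svc in $SERVICES; do
        run systemctl enable --now "$svc"
    done
}

migrate_rancher_dir
```

Running it as-is prints the planned commands prefixed with `+`, which makes it easy to review the order before trying anything on a real node.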

I then tried to follow the steps described here: Removing Kubernetes Components from Nodes | Rancher

The idea is to remove the node from the cluster, clean /var/lib/rancher, and re-provision the node once the secondary disk is mounted.

As described, I go to the Rancher UI and delete the node from the cluster.
According to the documentation, the deletion process should trigger a “cleanup”.

It does not seem to be the case: the node automatically gets back in the cluster and the folders are not cleared.

Am I doing something wrong?
Should the script system-agent-uninstall.sh be called at some point?

marco76 · July 24, 2024, 3:45pm

Hi jaepetto,
We had the same problem. After you remove the node from the cluster, you need to manually clean up the RKE2 components. You can use the following guide: Cleaning up Nodes

jaepetto · July 25, 2024, 6:00am

Hi @marco76 ,

thanks for the tip. That’s what we finally ended up doing.
Basically, the flow looks like this:

Download and run the system-agent-uninstall.sh script
reboot
run the rke2-uninstall.sh script
reboot again
Make the changes on the machine (here, mount our secondary disk under /var/lib/rancher)
(manually remove the node and the machine in the Rancher UI)
Re-introduce the machine in the cluster with the registration script (system-agent-install.sh)
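
Since the reboots split the flow into separate runs, it could be sketched as a small phase-based script. This is a sketch under assumptions: the script locations in SYSTEM_AGENT_UNINSTALL and RKE2_UNINSTALL, the device /dev/sdb1, and the REGISTRATION_CMD and PHASE variables are illustrative and not something Rancher provides under those names. The registration command itself is cluster-specific and has to be copied from the Rancher UI. It only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Phase-based sketch of the re-provisioning flow listed above.
DRY_RUN="${DRY_RUN:-1}"
# Assumed locations; adjust to where the scripts actually live on your nodes.
SYSTEM_AGENT_UNINSTALL="${SYSTEM_AGENT_UNINSTALL:-./system-agent-uninstall.sh}"
RKE2_UNINSTALL="${RKE2_UNINSTALL:-/usr/local/bin/rke2-uninstall.sh}"
NEW_DEV="${NEW_DEV:-/dev/sdb1}"           # assumption: secondary disk partition
REGISTRATION_CMD="${REGISTRATION_CMD:-}"  # paste the command from the Rancher UI

# run: echo the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

reprovision_phase() {
    case "$1" in
        1)  # remove the system agent, then reboot
            run sh "$SYSTEM_AGENT_UNINSTALL"
            run reboot ;;
        2)  # remove RKE2 itself, then reboot again
            run "$RKE2_UNINSTALL"
            run reboot ;;
        3)  # make the machine changes: here, mount the secondary disk
            run mkdir -p /var/lib/rancher
            run mount "$NEW_DEV" /var/lib/rancher
            echo "# remove the node and the machine in the Rancher UI before phase 4" ;;
        4)  # re-register the machine with the cluster
            if [ -z "$REGISTRATION_CMD" ]; then
                echo "REGISTRATION_CMD is not set" >&2
                return 1
            fi
            run sh -c "$REGISTRATION_CMD" ;;
        *)  echo "usage: PHASE=1|2|3|4 $0" >&2
            return 2 ;;
    esac
}

# Only act when a phase was requested explicitly, e.g. PHASE=1 sh reprovision.sh
if [ -n "${PHASE:-}" ]; then
    reprovision_phase "$PHASE"
fi
```

Each phase would be run once per boot (PHASE=1, reboot, PHASE=2, reboot, and so on). It does not make the reboots any faster, but it keeps the manual part down to the Rancher UI step.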

Unfortunately, this process is really far from perfect:

each reboot takes up to 10 minutes
it involves some manual processing (this is probably a different topic)
we had to do it on 75 machines…

I really hoped there was a better way to do this, since we’ll have to do it again in the near future (configuring LACP, which will change the network card names, which, in turn, will have an impact on Calico / iptables).

Emmanuel
