What are the best practices when rebooting a Longhorn host?
History:
We have a 3-node Kubernetes cluster running Rancher 2.0 on top. We had to reboot one of the hosts for maintenance, so we drained the node in Kubernetes and powered it down. When we powered it back up, we noticed in Longhorn that, for our volumes with 3 replicas specified, two replicas were on one host and the third was on the second host. Longhorn did not migrate the replica back to the third host after it came back up.
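For reference, the maintenance steps looked roughly like this (a minimal sketch; the node name is hypothetical and the exact drain flags depend on your kubectl version and workloads):

```
# Cordon the node and evict its pods before maintenance
kubectl drain node-2 --ignore-daemonsets --delete-local-data

# ... power down, perform maintenance, power back up ...

# Allow pods to be scheduled onto the node again
kubectl uncordon node-2
```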
Is there a way to tell Longhorn to always keep the replicas on different hosts?
Longhorn tries to avoid disrupting the data path unless it’s necessary (e.g. a host has been lost). In your case, the replica on the powered-down host is considered lost because it’s not reachable while you reboot the host, so Longhorn rebuilt that replica on another available host to keep the volume healthy.
Longhorn won’t automatically detect that the host is available again and move the load back to it. Maybe we can add a feature later to auto-balance the load between nodes, but it would be pretty costly to migrate the data constantly.
For now, since you have 3 replicas, you can deliberately delete one of the two replicas that share a node. This triggers Longhorn’s rebuild process; the scheduler looks at the current state of the nodes and should decide that the third node is a better place for the new replica, due to the soft anti-affinity rule we’ve set.
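A minimal sketch of how to do that with kubectl, assuming the Longhorn Replica objects are exposed as the replicas.longhorn.io CRD in the longhorn-system namespace (the CRD group, namespace, and field names vary by Longhorn version, and the replica name below is hypothetical):

```
# List replicas with the volume and node each one is scheduled on
# (field paths are assumptions based on the Longhorn Replica CRD)
kubectl -n longhorn-system get replicas.longhorn.io \
  -o custom-columns=NAME:.metadata.name,VOLUME:.spec.volumeName,NODE:.spec.nodeID

# Delete one of the two replicas that share a node; Longhorn rebuilds it,
# and soft anti-affinity should schedule the rebuilt replica on the free node
kubectl -n longhorn-system delete replicas.longhorn.io pvc-xxxx-r-abcdef
```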
After we add support for updating the replica count of a volume (https://github.com/rancher/longhorn/issues/299), you should be able to increase the replica count temporarily to trigger the rebuild, then decrease it back to normal and remove the extra replica.
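Once that lands, the flow could look something like this (a sketch only; the volumes.longhorn.io CRD, the numberOfReplicas field, and the volume name are assumptions, not a confirmed API):

```
# Temporarily raise the replica count to trigger a rebuild on the free node
kubectl -n longhorn-system patch volumes.longhorn.io pvc-xxxx \
  --type merge -p '{"spec": {"numberOfReplicas": 4}}'

# Once the new replica is healthy, drop the count back to normal
kubectl -n longhorn-system patch volumes.longhorn.io pvc-xxxx \
  --type merge -p '{"spec": {"numberOfReplicas": 3}}'

# Then delete the leftover replica on the doubled-up node, as above
```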
For now, since you have 3 replicas, you can deliberately delete one of the two replicas that share a node. This triggers Longhorn’s rebuild process
That’s what we did to get it back to the third node. The problem is that if we have dozens or more volumes, it is going to be a lot of work to delete all of those replicas to get them moved back. And it’s unnecessary writing and deleting on the drives for what is only a temporary state change.
It would be nice if there were a hold-timer that prevents rebuilding a lost replica for a period of time, or something similar to the Ceph command ceph osd set noout, so that Longhorn won’t rebuild the replica while you are doing maintenance.
In Longhorn’s design, a replica is not supposed to be lost at any point; otherwise the volume becomes unhealthy and triggers Longhorn’s rebuild.
Temporarily stopping the rebuild is doable, but after the node reboots, Longhorn cannot reuse the same replica, because it’s already out of sync with the others.