We have a small cluster with a few nodes (running Longhorn 0.7.0) that have rather limited disk space.
The default Longhorn setting for Storage Minimal Available Percentage is 10%.
It seems the default threshold at which a Kubernetes node reports DiskPressure is 15% free disk space.
Thus, with the defaults, it is likely that Longhorn fills up a node's disk (because, from Longhorn's point of view, more than 10% is still available) while the node already has less than 15% free and therefore goes into DiskPressure.
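For reference, that threshold is a kubelet eviction setting rather than anything Longhorn controls. A minimal sketch, assuming the kubelet on these nodes is configured through a KubeletConfiguration file (the actual defaults depend on the distribution, so the 15% is just what we observed):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Hard eviction thresholds: when free space drops below these values the node
# reports the DiskPressure condition and the kubelet starts evicting pods.
evictionHard:
  nodefs.available: "15%"
  imagefs.available: "15%"
```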
Shouldn’t Longhorn also take into account the level of free disk space that would trigger DiskPressure, to avoid its workloads being evicted from the node?
In such a scenario the replicated volumes get rescheduled to other nodes, which may in turn run into the same DiskPressure problem.
How can one avoid such a chain reaction?
With such limited resources I have already had two incidents where multiple volumes were lost because a single node in the cluster went into DiskPressure.
Yeah, currently we don’t take the node’s available space into consideration with respect to DiskPressure (though we do take it into consideration when scheduling a replica). We should do that. Can you help file a GitHub issue for it?
One thing that might help in the current release is to reduce the OverProvisioningPercentage to e.g. 100% or even less, to stop Longhorn from overprovisioning the volumes.
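For example, assuming Longhorn is installed in the longhorn-system namespace and the setting names match the current docs (they may differ in 0.7.0; the same values are also exposed in the Longhorn UI under Settings):

```sh
# Lower the over-provisioning limit so Longhorn schedules at most the raw disk size
kubectl -n longhorn-system edit settings.longhorn.io storage-over-provisioning-percentage

# Raise the minimal-available threshold so replica scheduling stops earlier
kubectl -n longhorn-system edit settings.longhorn.io storage-minimal-available-percentage
```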
You can also add dedicated disks to Longhorn so it doesn’t have to compete for space with the system and Kubernetes processes.
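A rough sketch of what a dedicated-disk entry looks like on the Longhorn node resource; normally you would add it through the Longhorn UI (Node, then Edit Node and Disks), and the exact field names may differ between releases, so treat this as illustrative only (the path /mnt/longhorn-disk1 is a placeholder):

```yaml
# Hypothetical excerpt of a nodes.longhorn.io resource
spec:
  disks:
    dedicated-disk-1:
      path: /mnt/longhorn-disk1   # mount point of the disk reserved for Longhorn
      allowScheduling: true       # let Longhorn place replicas here
      storageReserved: 0          # nothing else needs space on this disk
```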
I had already reduced OverProvisioningPercentage to 125%, but soon afterwards the same DiskPressure issue happened again.
My current setup is StorageMinimalAvailablePercentage = 20% and OverProvisioningPercentage = 100%.
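As a rough sanity check of what those numbers mean (assuming a node with a single 100 GiB Longhorn disk and no reserved storage): with StorageMinimalAvailablePercentage = 20%, Longhorn should stop scheduling new replicas once less than 20 GiB is free on the disk, and with OverProvisioningPercentage = 100% the total size of all replicas scheduled to that disk should be capped at about 100 GiB.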
I will try attaching an additional disk for Longhorn to some of my nodes to improve stability.