Disk Pressure breaks cluster

Hi all!

I am having issues with my Kubernetes cluster, set up with RKE and managed by Rancher (2.1.3), when one node comes under disk pressure.

Currently using 4 nodes (4 vCPU, 120 GB disk and 12 GB RAM each) for a PoC of a future production workload. OS is Ubuntu 18.04.

I am hosting the block-storage Persistent Volumes as containers on the same hosts as the nodes, using the OpenEBS storage driver.

Once a node reaches 90% disk usage, it starts evicting all its pods, and the pods move to the remaining nodes. The problem is that those nodes then also have to host the pressured node’s disk images, which in turn fills their file systems, puts them under disk pressure too, and eventually breaks the whole cluster.

Have you experienced such problems, and how can I recover from this state?
Kubernetes should also run some garbage collection on old Docker images, the kubelet filesystem, etc. How can I check this?
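Not the original poster, but a rough way to inspect what the kubelet’s GC and eviction logic are doing (assuming you have kubectl access; the node name below is a placeholder):

```shell
# Node conditions include DiskPressure and when it last changed
kubectl describe node node-1 | grep -A 8 "Conditions:"

# GC/eviction problems surface as node events, e.g. ImageGCFailed
# or EvictionThresholdMet
kubectl get events --all-namespaces --field-selector involvedObject.kind=Node

# On the node itself, see how much space images/containers actually use
docker system df
```

The kubelet garbage-collects unused images on its own once its image-GC thresholds are crossed, so if disk keeps filling, it is usually live data (like the OpenEBS volumes here), not stale images.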

Also, in the case of OpenEBS (it’s based on Longhorn), the (test) disk images are size-limited to, e.g., 10 GB, but in some cases use 18 GB on disk due to snapshots.

I would appreciate any help.

Merry XMas

You can adjust the threshold if you really want, but the cascading failure ultimately happens because you have a workload that needs “n” nodes’ worth of capacity and fewer than “n+1” nodes in the cluster.
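For reference, the eviction thresholds are kubelet flags (the hard-eviction defaults include nodefs.available&lt;10% and imagefs.available&lt;15%). With RKE you can pass them through extra_args in cluster.yml; this is only a sketch, and the values are illustrative, not recommendations:

```yaml
# cluster.yml (RKE) - sketch; values are examples only
services:
  kubelet:
    extra_args:
      # evict later (less free space required) than the defaults
      eviction-hard: "memory.available<100Mi,nodefs.available<5%,imagefs.available<5%"
      # make image GC kick in a bit earlier
      image-gc-high-threshold: "85"
      image-gc-low-threshold: "80"
```

But as said above, tuning this only delays the cascade; the real fix is enough spare capacity to absorb one node’s workload.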

If one node dies for any reason, you’re going to end up unable to service all of the workload the cluster is supposed to be running. The disk-pressure threshold just triggers the failure at 90%, while things can still be shut down gracefully, instead of at 100% with a crash. Or when a Coke gets spilled into one node.