Rancher Monitoring is falsely alerting on /var/lib/lxcfs running out of disk space

shubbard343 · June 11, 2019, 7:58pm

We turned on Monitoring on our Rancher 2.2.4 clusters, and as soon as we added a Slack notifier, we started receiving alerts that the node disk is running full within 24 hours:

Alert Name: Node disk is running full within 24 hours
Severity: critical
Cluster Name: cluster1 (ID: c-ct7pv)
Namespace: cattle-prometheusPod Name: exporter-node-cluster-monitoring-qjdps
Expression: predict_linear(node_filesystem_files_free{mountpoint!~"^/etc/(?:resolv.conf|hosts|hostname)$"}[6h], 3600 * 24)<=1
Description: Threshold Crossed: datapoint value 0 was less or equal to the threshold (1) for (10m)

(Note the missing newline before "Pod Name" on the Namespace line.)

Looking at the alert, it fires for the path /var/lib/lxcfs. That path shows up as a mount point in the output of mount, but not in df, since we have not mounted it ourselves.
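To see why the alert fires, here is a simplified Python sketch of what predict_linear(v[6h], 3600 * 24) computes: a least-squares line fitted to the samples in the 6-hour window, extrapolated 24 hours past the evaluation time. It is an approximation, not Prometheus's own implementation, and the sample data is hypothetical. Note that node_filesystem_files_free is node_exporter's count of free inodes (file nodes), not free bytes. A mount that reports a constant 0 free inodes, as the alert's "datapoint value 0" suggests, predicts 0, and 0 <= 1 satisfies the threshold.

```python
# Simplified sketch of PromQL's predict_linear(v[6h], 3600 * 24):
# fit a least-squares line to the samples in the window and extrapolate
# it 24 hours past the evaluation time. An approximation, not Prometheus's code.

def predict_linear(samples, seconds_ahead):
    """samples: list of (timestamp_seconds, value); the last timestamp is
    treated as the evaluation time (a simplifying assumption)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var
    eval_time = samples[-1][0]
    # Value of the fitted line at eval_time + seconds_ahead.
    return mean_v + slope * (eval_time + seconds_ahead - mean_t)


def alert_fires(samples, threshold=1):
    # Mirrors the rule: predict_linear(...[6h], 3600 * 24) <= 1
    return predict_linear(samples, 3600 * 24) <= threshold


# Hypothetical data: one sample every 5 minutes across the 6h window.
window = range(0, 6 * 3600 + 1, 300)

# lxcfs-style series: free inodes stuck at 0.
lxcfs = [(t, 0.0) for t in window]

# A healthy filesystem with a large, steady number of free inodes.
healthy = [(t, 1_000_000.0) for t in window]

print(predict_linear(lxcfs, 86400), alert_fires(lxcfs))      # 0.0 True
print(predict_linear(healthy, 86400), alert_fires(healthy))  # 1000000.0 False
```

A flat series at 0 has zero slope, so the prediction stays at 0 and the alert fires even though nothing is actually filling up.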


The default alert expression for this is:
predict_linear(node_filesystem_files_free{mountpoint!~"^/etc/(?:resolv.conf|hosts|hostname)$"}[6h], 3600 * 24)

We changed it to the following so that it excludes only that mount point. The alert no longer fires, and it still covers the actually mounted filesystems that we care about.

predict_linear(node_filesystem_files_free{mountpoint!~"/var/lib/lxcfs"}[6h], 3600 * 24)
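Prometheus label regex matchers are fully anchored, so mountpoint!~"/var/lib/lxcfs" drops only that exact path, and the ^ and $ in the default expression are redundant. The Python sketch below approximates the matcher with re.fullmatch (close enough for these simple patterns, although Prometheus uses RE2 syntax) over a hypothetical list of mount points, and shows which series each filter keeps. It also shows that the modified filter no longer excludes the /etc/hostname, /etc/hosts and /etc/resolv.conf mount points that the default filtered out.

```python
import re

# Label matchers from the default and the modified alert expressions.
DEFAULT = r"^/etc/(?:resolv.conf|hosts|hostname)$"
MODIFIED = r"/var/lib/lxcfs"

# Hypothetical mount points reported by the node exporter.
MOUNTPOINTS = [
    "/",
    "/etc/hostname",
    "/etc/hosts",
    "/etc/resolv.conf",
    "/var/lib/lxcfs",
]


def kept(pattern, mountpoints):
    """Mount points that survive mountpoint!~pattern.

    Prometheus anchors regex matchers at both ends, so fullmatch is the
    closest Python equivalent."""
    return [m for m in mountpoints if not re.fullmatch(pattern, m)]


print(kept(DEFAULT, MOUNTPOINTS))   # ['/', '/var/lib/lxcfs']
print(kept(MODIFIED, MOUNTPOINTS))  # ['/', '/etc/hostname', '/etc/hosts', '/etc/resolv.conf']
```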

dwene · October 21, 2019, 2:41am

Thank you! This fixed that annoying alert right up. I modified your expression to also keep the default exclusions Rancher had:

predict_linear(node_filesystem_files_free{mountpoint!~"^/(?:etc/resolv.conf|etc/hosts|etc/hostname|var/lib/lxcfs)$"}[6h], 3600 * 24)
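A quick way to sanity-check the combined matcher is the same re.fullmatch approximation of Prometheus's fully anchored regex matching, run over a hypothetical list of mount points. Only the real filesystems should survive the filter. (The unescaped . in resolv.conf matches any character, which does no harm here.)

```python
import re

# Combined exclusion regex from the expression above.
COMBINED = r"^/(?:etc/resolv.conf|etc/hosts|etc/hostname|var/lib/lxcfs)$"

# Hypothetical mount points reported by the node exporter.
MOUNTPOINTS = ["/", "/boot", "/etc/hostname", "/etc/hosts",
               "/etc/resolv.conf", "/var/lib/lxcfs"]

# Series that survive mountpoint!~COMBINED.
remaining = [m for m in MOUNTPOINTS if not re.fullmatch(COMBINED, m)]
print(remaining)  # ['/', '/boot']
```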
