Rancher Monitoring is falsely alerting on /var/lib/lxcfs running out of disk space

We turned on Monitoring on our Rancher 2.2.4 clusters, and as soon as we added a Slack notifier we started receiving "Node disk is running full within 24 hours" alerts.

Alert Name: Node disk is running full within 24 hours
Severity: critical
Cluster Name: cluster1 (ID: c-ct7pv)
Namespace: cattle-prometheusPod Name: exporter-node-cluster-monitoring-qjdps
Expression: predict_linear(node_filesystem_files_free{mountpoint!~"^/etc/(?:resolv.conf|hosts|hostname)$"}[6h], 3600 * 24)<=1
Description: Threshold Crossed: datapoint value 0 was less or equal to the threshold (1) for (10m)

(Note the missing \n in front of Pod Name on the Namespace line.)

Looking at the alert, it is firing for the path /var/lib/lxcfs, which shows up as a mount point in the output of mount but not in df, since we have not mounted that path ourselves.

lxcfs

The default alert expression for this is:
predict_linear(node_filesystem_files_free{mountpoint!~"^/etc/(?:resolv.conf|hosts|hostname)$"}[6h], 3600 * 24)
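To see why a mount like this keeps tripping the rule, here is a rough Python sketch of what predict_linear does: fit a least-squares line through the samples in the window and extrapolate it forward. It is a simplification and not Prometheus's actual implementation (Prometheus extrapolates from the evaluation time, this sketch from the last sample). The 60s scrape interval and both sample series are made up for illustration. As far as I know, node_filesystem_files_free counts free inodes rather than bytes, and the alert above reported a datapoint value of 0 for the lxcfs mount.

```python
def predict_linear(samples, horizon_s):
    """Least-squares extrapolation over (timestamp, value) samples.

    Returns the fitted value horizon_s seconds after the last sample.
    Simplified stand-in for the PromQL function, not the real thing.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    t_end = samples[-1][0]
    # Shift timestamps so the last sample sits at x = 0.
    xs = [t - t_end for t, _ in samples]
    ys = [v for _, v in samples]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return intercept + slope * horizon_s


# Hypothetical 6h window, one sample every 60s.
WINDOW = range(0, 6 * 3600 + 1, 60)
# A mount that always reports 0 free inodes (what the alert saw for lxcfs).
lxcfs = [(t, 0.0) for t in WINDOW]
# A made-up healthy filesystem that loses one inode per minute.
healthy = [(t, 1_000_000.0 - t / 60) for t in WINDOW]

lxcfs_forecast = predict_linear(lxcfs, 3600 * 24)
healthy_forecast = predict_linear(healthy, 3600 * 24)

print("lxcfs fires:", lxcfs_forecast <= 1)
print("healthy fires:", healthy_forecast <= 1)
```

A flat series at 0 always extrapolates to 0, so the `<=1` threshold is crossed no matter how long the window is. That is why excluding the mount point, rather than tuning the threshold, looks like the sensible fix.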

We changed it to the following, which excludes only that mount point. The alert no longer fires, and the expression still covers the mounted filesystems that we actually care about.

predict_linear(node_filesystem_files_free{mountpoint!~"/var/lib/lxcfs"}[6h], 3600 * 24)


Thank you! This fixed that annoying alert right up. I modified your expression so that it also keeps the default exclusions Rancher had.

predict_linear(node_filesystem_files_free{mountpoint!~"^/(?:etc/resolv.conf|etc/hosts|etc/hostname|var/lib/lxcfs)$"}[6h], 3600 * 24)
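If you want to sanity-check which mount points each variant of the matcher leaves in, something like the following Python sketch can help. It assumes Prometheus fully anchors label regexes (which is my understanding) and mimics that with re.fullmatch. Python's re is not the RE2 engine Prometheus uses, but these particular patterns should behave the same in both. The list of mount points is made up for illustration.

```python
import re

# Patterns copied from the expressions in this thread.
DEFAULT = r"^/etc/(?:resolv.conf|hosts|hostname)$"
LXCFS_ONLY = r"/var/lib/lxcfs"
COMBINED = r"^/(?:etc/resolv.conf|etc/hosts|etc/hostname|var/lib/lxcfs)$"


def kept(pattern, mountpoints):
    """Return the mount points that survive mountpoint!~"pattern"."""
    rx = re.compile(pattern)
    # fullmatch stands in for Prometheus anchoring the regex at both ends.
    return [m for m in mountpoints if rx.fullmatch(m) is None]


# Hypothetical mount points as node_exporter might report them.
MOUNTS = ["/", "/var/lib/docker", "/var/lib/lxcfs", "/etc/hosts", "/etc/resolv.conf"]

kept_default = kept(DEFAULT, MOUNTS)
kept_lxcfs_only = kept(LXCFS_ONLY, MOUNTS)
kept_combined = kept(COMBINED, MOUNTS)

for name, result in [
    ("default", kept_default),
    ("lxcfs only", kept_lxcfs_only),
    ("combined", kept_combined),
]:
    print(name, "->", result)
```

With this sample list, the default pattern still lets /var/lib/lxcfs through, the lxcfs-only pattern lets the /etc bind mounts back in, and the combined pattern leaves only / and /var/lib/docker.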
