We have turned on Monitoring on our Rancher 2.2.4 clusters, and as soon as we added a Slack notifier, we started receiving messages about the node disk running full within 24 hours.
Alert Name: Node disk is running full within 24 hours
Severity: critical
Cluster Name: cluster1 (ID: c-ct7pv)
Namespace: cattle-prometheusPod Name: exporter-node-cluster-monitoring-qjdps
Expression: predict_linear(node_filesystem_files_free{mountpoint!~"^/etc/(?:resolv.conf|hosts|hostname)$"}[6h], 3600 * 24)<=1
Description: Threshold Crossed: datapoint value 0 was less or equal to the threshold (1) for (10m)
(Notice a missing \n in front of Pod Name on the 5th line.)
Looking at the alert, we can see that it is monitoring the path /var/lib/lxcfs, which shows up as a mount point in the output of mount but not in the output of df, since we have not mounted that path ourselves.
The default alert expression for this is:
predict_linear(node_filesystem_files_free{mountpoint!~"^/etc/(?:resolv.conf|hosts|hostname)$"}[6h], 3600 * 24)
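To see why a mount point like this trips the alert, here is a rough Python sketch of what predict_linear does: fit a least-squares line through the samples in the range and extrapolate it forward. It is a simplification, not Prometheus's actual code. In particular it extrapolates from the last sample rather than from the evaluation time, and the sample data (60-second spacing, a filesystem losing one inode a minute) is made up for illustration. A series that sits flat at 0, as the "datapoint value 0" in the alert suggests, is predicted to stay at 0 and therefore satisfies <=1 right away.

```python
from typing import Sequence, Tuple


def predict_linear(samples: Sequence[Tuple[float, float]], seconds_ahead: float) -> float:
    """Fit a least-squares line through (timestamp, value) samples and
    extrapolate it seconds_ahead past the last sample (simplified sketch)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("all samples share the same timestamp")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    slope = cov / var_t
    intercept = mean_v - slope * mean_t
    target = samples[-1][0] + seconds_ahead
    return slope * target + intercept


# Hypothetical data: six hours of samples at 60-second spacing.
timestamps = [i * 60.0 for i in range(361)]
flat_zero = [(t, 0.0) for t in timestamps]                  # always 0 free inodes
healthy = [(t, 1_000_000.0 - t / 60) for t in timestamps]   # loses one inode per minute

print(predict_linear(flat_zero, 3600 * 24) <= 1)  # True: the alert condition holds
print(predict_linear(healthy, 3600 * 24) <= 1)    # False: plenty of inodes left in 24h
```

Note that node_filesystem_files_free counts free inodes rather than free bytes, so a mount that never reports any free inodes will always look "full" to this expression.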
We changed the expression to the following so that it excludes only that mount point. The alert is no longer firing, and the expression still returns the actual mounted filesystems that we care about.
predict_linear(node_filesystem_files_free{mountpoint!~"/var/lib/lxcfs"}[6h], 3600 * 24)
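One thing worth checking with a change like this is which mount points each matcher actually lets through. Prometheus regex matchers are fully anchored, which Python's re.fullmatch can emulate for simple patterns like these. The sketch below is only an illustration: the list of mount points is hypothetical, and the COMBINED pattern is an untested suggestion that excludes both /var/lib/lxcfs and the /etc entries from the default expression.

```python
import re
from typing import List


def not_regex_match(pattern: str, value: str) -> bool:
    """Emulate a PromQL !~ matcher: keep the value unless the whole string matches."""
    return re.fullmatch(pattern, value) is None


def kept(pattern: str, mountpoints: List[str]) -> List[str]:
    """Return the mount points that survive mountpoint!~"pattern"."""
    return [m for m in mountpoints if not_regex_match(pattern, m)]


DEFAULT = r"^/etc/(?:resolv.conf|hosts|hostname)$"   # from the default alert
MODIFIED = r"/var/lib/lxcfs"                          # from the changed alert
# Assumption: a combined pattern that keeps both exclusions.
COMBINED = r"/var/lib/lxcfs|/etc/(?:resolv.conf|hosts|hostname)"

# Hypothetical mount points, for illustration only.
mountpoints = ["/", "/etc/hosts", "/etc/resolv.conf", "/var/lib/lxcfs", "/var/lib/docker"]

print(kept(DEFAULT, mountpoints))   # ['/', '/var/lib/lxcfs', '/var/lib/docker']
print(kept(MODIFIED, mountpoints))  # ['/', '/etc/hosts', '/etc/resolv.conf', '/var/lib/docker']
print(kept(COMBINED, mountpoints))  # ['/', '/var/lib/docker']
```

In other words, the changed matcher drops /var/lib/lxcfs but lets the /etc bind mounts back in. If those ever start firing, a combined pattern along the lines of COMBINED might be worth trying in the alert expression.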