Monitoring Rancher with Nagios

We use Nagios here, and need to integrate it with Rancher.

Since Rancher has a good authenticated API, what I would really like to have is a check_rancher Nagios plugin that would return the health of the cluster. Possibly, it should take an Environment name as a parameter, or the API endpoint URL. It should verify disk space remaining on hosts, host availability and host CPU/memory usage, with configurable thresholds.
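To make the "configurable thresholds" idea concrete, here is a rough Python sketch of the threshold logic only. It does not talk to Rancher at all: the `metrics` dictionary, the function names and the example numbers are all made up for illustration, and fetching real values from the API is left out. The exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) are the standard Nagios plugin ones.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}
# Ranking used to pick the worst result. Placing UNKNOWN between WARNING
# and CRITICAL is a design choice of this sketch, not a Nagios rule.
SEVERITY = {OK: 0, WARNING: 1, UNKNOWN: 2, CRITICAL: 3}


def check_threshold(value, warn, crit):
    """Return a status for a 'higher is worse' percentage metric."""
    if value is None:
        return UNKNOWN  # metric could not be read
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def summarise(metrics, thresholds):
    """Combine per-metric results into (exit_code, one-line plugin output).

    metrics:    {"cpu": 42.0, ...} in percent (hypothetical input shape)
    thresholds: {"cpu": (warn, crit), ...}
    """
    worst = OK
    parts = []
    for name in sorted(metrics):
        value = metrics[name]
        warn, crit = thresholds[name]
        status = check_threshold(value, warn, crit)
        if SEVERITY[status] > SEVERITY[worst]:
            worst = status
        shown = "?" if value is None else "%.1f%%" % value
        parts.append("%s=%s" % (name, shown))
    return worst, "%s - %s" % (LABELS[worst], " ".join(parts))
```

A real plugin would print the returned line and pass the code to `sys.exit()`, since Nagios reads the status from the exit code and the text from the first line of output.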

Nobody seems to have made one of these yet… I may end up doing it myself (can’t be that difficult) but don’t want to end up reinventing the wheel.

Any other Nagios users out there?

We use Nagios (the check_mk variant). To be honest, it’s not great at monitoring what I would classify as dynamic services, such as container platforms. That said, it’s still pretty good at monitoring the hosts themselves.

The API does make it easy to get this info, however. We ended up deploying Prometheus to give us more detailed container monitoring. You could, as you say, monitor the services through the API. I’d make sure to keep host-level monitoring such as disk usage separate from whatever service monitoring you choose to do through the API. If you’re set on Nagios, you could run the agent in a container on your hosts and use Rancher to schedule and manage it; this would give you all the metrics you’re after straight into Nagios.

I’d envision a plugin that can monitor Environment health (check CPU/memory/disk on all Hosts) and Stack health (check whether each Stack is active/degraded/down). This should be sufficient, and wouldn’t need much complexity. In Nagios, you’d create a virtual dummy ‘host’ for each Environment, with services under it for CPU/memory/disk and for each monitored Stack, so there is no need to dynamically move everything about. The individual hosts are just lightweight, easily rebuilt Docker servers anyway.
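As a rough Python sketch of that layout: one dummy ‘host’ per Environment, with one service result per Stack. The state strings `active`, `degraded` and `down` are taken from the description above and are placeholders, not values confirmed from the API; the function names and the example Stack names are likewise invented.

```python
# Standard Nagios status codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Assumed mapping from Stack health to Nagios status; adjust to taste.
STATE_TO_STATUS = {"active": OK, "degraded": WARNING, "down": CRITICAL}


def stack_status(state):
    """Map a Stack state string to a Nagios status; anything else is UNKNOWN."""
    return STATE_TO_STATUS.get(state.lower(), UNKNOWN)


def environment_services(env_name, stacks):
    """Return (dummy host, service description, status) for each Stack.

    env_name: name of the Environment, used as the Nagios dummy host name
    stacks:   {"stack name": "state"} (hypothetical input shape)
    """
    return [
        (env_name, "Stack %s" % name, stack_status(state))
        for name, state in sorted(stacks.items())
    ]


if __name__ == "__main__":
    example = {"web": "active", "db": "degraded", "cache": "down"}
    for host, service, status in environment_services("production", example):
        print(host, service, status)
```

Because the dummy host and its services are fixed by name, nothing in the Nagios configuration has to change when containers move between the real hosts.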

I have started on a plugin for Nagios and MRTG using the Rancher API. You can grab a copy of the alpha at https://github.com/sshipway/check_rancher

Comments, suggestions, etc welcome. I’ll be adding a lot to this over the Christmas break I think.

check_rancher now supports configuration via host labels, and monitoring of Environments, Stacks and Hosts. It checks CPU, memory, load average and disk space.

For MRTG it can graph disk space, and Environment average CPU/Mem usage.

Next to add: Certificate expiry, Host swap activity, individual host/disk usage.

There is a problem in that the Stack ‘Degraded’ status is not exposed via the API, which makes it hard to identify for services with more complex scheduling rules. Also, Certificate objects only expose the expiry time as a date string that needs to be parsed, not as a ‘time remaining’ integer, but I can work around that.
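The date-string workaround could look something like the Python sketch below. This is not how check_rancher does it; the date format is an assumption (an ISO 8601 UTC timestamp such as `2017-06-30T12:00:00Z`), and the function names and the 30/7 day thresholds are made up for illustration.

```python
from datetime import datetime, timezone

# ASSUMPTION: expiry strings look like "2017-06-30T12:00:00Z" (UTC).
ASSUMED_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def days_remaining(expires_at, now=None):
    """Parse the expiry string and return the days left (negative if expired)."""
    expiry = datetime.strptime(expires_at, ASSUMED_FORMAT)
    expiry = expiry.replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    return (expiry - now).total_seconds() / 86400.0


def cert_status(expires_at, warn_days=30, crit_days=7, now=None):
    """Return (Nagios exit code, message) for one certificate."""
    try:
        left = days_remaining(expires_at, now)
    except ValueError:
        # The string did not match the assumed format.
        return 3, "UNKNOWN - cannot parse expiry date %r" % expires_at
    if left <= crit_days:
        return 2, "CRITICAL - certificate expires in %.1f days" % left
    if left <= warn_days:
        return 1, "WARNING - certificate expires in %.1f days" % left
    return 0, "OK - certificate expires in %.1f days" % left
```

Passing `now` explicitly is only there to make the function easy to test; in normal use it defaults to the current UTC time.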