How are you monitoring services in Rancher?

Just curious what others are doing…
We still haven’t been able to find the right tool for the job.

Prometheus getting data from:

  • cAdvisor to collect container metrics
  • node exporter to collect host metrics
  • prometheus-rancher-exporter to export specific rancher metrics

Elasticsearch getting data from:

  • logstash+logspout getting logs from docker (and all its containers)
  • metricbeat from non-rancher hosts (nfs storages)
  • graylog for logs from non-rancher nodes (syslogs from nfs storages)
  • Grafana WorldPing for URL monitoring (response time, ping, etc)

All of that feeds dashboards in Grafana, with some alerts enabled to send webhooks and trigger automated actions.
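
For anyone wanting to try the Prometheus half of this, a scrape config might look roughly like the sketch below. The hostnames are placeholders, and the ports are assumptions: 9100 and 8080 are the usual node exporter and cAdvisor defaults, and 9173 is what I believe prometheus-rancher-exporter listens on, so check your own deployment.

```yaml
# prometheus.yml (sketch). Hostnames are placeholders; ports are assumed defaults.
global:
  scrape_interval: 30s

scrape_configs:
  - job_name: node            # host metrics from node exporter
    static_configs:
      - targets: ['host1.example.internal:9100']
  - job_name: cadvisor        # container metrics from cAdvisor
    static_configs:
      - targets: ['host1.example.internal:8080']
  - job_name: rancher         # Rancher-specific metrics (port is an assumption)
    static_configs:
      - targets: ['rancher-exporter.example.internal:9173']
```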

We use Sematext cloud metrics+logs for our Rancher clusters. They provide a sematext-agent that runs on every host. The agent collects all logs and metrics from running containers. Logs can be pre-processed by the agent to provide structured logging. The Sematext cloud UI for logging has Elasticsearch built in, which has been very handy for us.

We’ve been using New Relic Infrastructure to monitor host stats. So far it’s working well and is super easy to set up:

version: '2'
services:
  NewRelicInfrastructure:
    image: newrelic/infrastructure
    environment:
      NRIA_CUSTOM_ATTRIBUTES: '{"environment":"dev","type": "rancher"}'
      NRIA_LICENSE_KEY: keyHere
    stdin_open: true
    network_mode: host
    volumes:
    - /:/host:ro
    - /var/run/docker.sock:/var/run/docker.sock
    tty: true
    labels:
      io.rancher.container.pull_image: always
      io.rancher.scheduler.global: 'true'

Wow, that’s a lot to process. Can you tell me which parts of this system come into play when, say, alerting on a failed service (running in a container)? Like a web server process crash.

Datadog works fine for us and pulls all the data we need.

I’m attaching a sidekick filebeat container to grab app logs. I find it gives me a bit more control than other solutions.
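
Roughly, a sidekick setup like this can be sketched in a Rancher 1.x compose file as below. Both image names and the log path are hypothetical; the only Rancher-specific parts are the `io.rancher.sidekicks` label on the primary service and `volumes_from` on the sidekick, which lets filebeat read the log volume the app writes to.

```yaml
version: '2'
services:
  app:
    image: example/app            # hypothetical application image
    volumes:
      - /var/log/app              # anonymous volume the app writes logs to (assumed path)
    labels:
      io.rancher.sidekicks: filebeat   # schedule filebeat alongside every app container
  filebeat:
    image: example/filebeat       # hypothetical image with a filebeat.yml baked in
    volumes_from:
      - app                       # shares /var/log/app with the primary container
```

The filebeat.yml inside that image would need an input pointing at the shared log path and an output for wherever the logs should go.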

I also suggest prometheus & grafana for monitoring and alerting, and syslog-ng (on the host) to forward logs.
I’m using one prometheus per rancher environment.
Then there is a prometheus and grafana outside of rancher, which should be highly available.
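
As a sketch of the alerting side, a Prometheus rule like the one below fires when a scrape target has been unreachable for five minutes. It assumes Prometheus 2.x rule-file syntax; the group name, alert name, and severity label are made up for illustration.

```yaml
# rules.yml (sketch), referenced from prometheus.yml under rule_files
groups:
  - name: availability            # hypothetical group name
    rules:
      - alert: TargetDown         # hypothetical alert name
        expr: up == 0             # "up" is set to 0 when a scrape fails
        for: 5m                   # must stay down for 5 minutes before firing
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.job }} target {{ $labels.instance }} is down"
```

This only covers targets that Prometheus scrapes directly. A process that crashes inside a container that keeps running would need its own metric or health check.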

To collect the metrics of rancher services I have a confd setup that generates the prometheus config: https://github.com/marcbachmann/rancher-prometheus-config

To monitor hosts you’ll want to have a node-exporter and cAdvisor in your default setup. In my setup I installed them directly on the host, but you could run them as a rancher service instead.
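
If you go the rancher service route, a minimal sketch could look like the following. It is not what I run myself. The cAdvisor mounts are the ones from the cAdvisor docs, and the global scheduler label puts one container on every host. The node-exporter entry is deliberately bare: depending on the version you will probably want to mount the host’s /proc and /sys and pass the matching path flags, which are left out here.

```yaml
version: '2'
services:
  node-exporter:
    image: prom/node-exporter
    network_mode: host            # expose the exporter on the host's own address
    labels:
      io.rancher.scheduler.global: 'true'   # one instance per host
  cadvisor:
    image: google/cadvisor
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:rw
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    labels:
      io.rancher.scheduler.global: 'true'   # one instance per host
```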