Gather cAdvisor stats into an external system, e.g. Zabbix

According to the Rancher article at http://rancher.com/comparing-monitoring-options-for-docker-deployments/:

> Note that Rancher runs cAdvisor on each connected host, and exposes a limited set of stats through the UI, and all of the system stats through the API.

Rancher is already using cAdvisor to collect container and host stats and bring them together on the Rancher server. How can I leverage that to pull stats into an external monitoring service like Zabbix?

  1. Is there an event-driven option? If not…
  2. Where is the API documented? I assume I could have a Zabbix or other agent poll the Rancher API to gather the data?

@deitch Getting to the stats API for a container or host is a two-step process. If you open a container in the API, you should see a link to “containerStats”. If you follow that link, you get a url and a token. Append the token to the url as a query parameter, like ws://localhost:8080/v1/containerstats/<some uuid>/?token=<token value>.

You can then connect to that url (as a websocket), and you’ll start receiving stats. Same thing for hosts, but the link is ‘hoststats’.

Note that the link called just “stats” on both containers and hosts is legacy and we intend to remove it.
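To make that concrete, the whole flow could look something like this in Python (an untested sketch: it assumes the `requests` and `websocket-client` packages; the server address, container ID, and API key pair are placeholders, and the containerStats response is assumed to carry `url` and `token` fields as described above):

```python
import json

import requests   # pip install requests
import websocket  # pip install websocket-client

RANCHER = "http://localhost:8080/v1"
AUTH = ("ACCESS_KEY", "SECRET_KEY")  # placeholder API key pair

# Step 1: open the container in the API and follow its "containerStats" link.
container = requests.get(RANCHER + "/containers/1i12", auth=AUTH).json()
stats = requests.get(container["links"]["containerStats"], auth=AUTH).json()

# Step 2: append the token to the returned url as a query parameter and
# connect to it as a websocket.
ws = websocket.create_connection("%s?token=%s" % (stats["url"], stats["token"]))
try:
    while True:
        frame = json.loads(ws.recv())  # each frame is a JSON array of samples
        for sample in frame:
            print(sample["id"], sample["timestamp"])
finally:
    ws.close()
```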

Thanks @cjellick.

  1. What is the format of the data that comes across that Websocket?
  2. What authentication does that Websocket require? Is it that token you mentioned?

I assume there aren’t any existing libraries or CLIs for this? I could see real value in a direct integration between Rancher and third-party data management/monitoring systems.

If I had the time I would build it, but gotta focus on the deadlines set by people who pay me to do work. :slight_smile:

@deitch the auth is the token. It is a JWT that was signed by a private key held by rancher-server; the thing doing the authenticating (websocket-proxy) has the public key.

The data looks like this:

[
  {
    "id": "1i12",
    "resourceType": "container",
    "memLimit": 1044586496,
    "timestamp": "2016-01-27T19:41:34.001446366Z",
    "cpu": {
      "usage": {
        "total": 31581057,
        "per_cpu_usage": [
          31581057
        ],
        "user": 10000000,
        "system": 20000000
      },
      "load_average": 0
    },
    "diskio": {},
    "memory": {
      "usage": 524288,
      "working_set": 245760,
      "container_data": {
        "pgfault": 3163,
        "pgmajfault": 0
      },
      "hierarchical_data": {
        "pgfault": 3163,
        "pgmajfault": 0
      }
    },
    "network": {
      "name": "eth0",
      "rx_bytes": 648,
      "rx_packets": 8,
      "rx_errors": 0,
      "rx_dropped": 0,
      "tx_bytes": 648,
      "tx_packets": 8,
      "tx_errors": 0,
      "tx_dropped": 0,
      "interfaces": [
        {
          "name": "eth0",
          "rx_bytes": 648,
          "rx_packets": 8,
          "rx_errors": 0,
          "rx_dropped": 0,
          "tx_bytes": 648,
          "tx_packets": 8,
          "tx_errors": 0,
          "tx_dropped": 0
        }
      ]
    },
    "filesystem": [
      {
        "device": "/dev/sda1",
        "capacity": 19507089408,
        "usage": 12288,
        "available": 0,
        "reads_completed": 0,
        "reads_merged": 0,
        "sectors_read": 0,
        "read_time": 0,
        "writes_completed": 0,
        "writes_merged": 0,
        "sectors_written": 0,
        "write_time": 0,
        "io_in_progress": 0,
        "io_time": 0,
        "weighted_io_time": 0
      }
    ],
    "task_stats": {
      "nr_sleeping": 0,
      "nr_running": 0,
      "nr_stopped": 0,
      "nr_uninterruptible": 0,
      "nr_io_wait": 0
    }
  }
]

I just pulled that from an actual websocket message sent to the browser, via the Chrome developer tools. I don’t believe we have an official schema for this data.

The object is pretty much straight from cAdvisor, with id and resourceType keys added.

Most of the data is counters, so you have to compare against the previous datapoint to derive the most useful metrics like CPU/network/disk usage (here is the UI code that does it, FWIW).
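For example, CPU and network rates can be derived from two consecutive samples roughly like this (a sketch using the field names from the payload above; it assumes cpu.usage.total is cumulative nanoseconds of CPU time, which is how cAdvisor reports it):

```python
from datetime import datetime

def _ts(sample):
    # Trim the nanosecond timestamp to microseconds so strptime accepts it.
    return datetime.strptime(sample["timestamp"][:26], "%Y-%m-%dT%H:%M:%S.%f")

def cpu_percent(prev, cur):
    # Delta of cumulative CPU nanoseconds over the elapsed wall-clock time.
    elapsed = (_ts(cur) - _ts(prev)).total_seconds()
    delta = cur["cpu"]["usage"]["total"] - prev["cpu"]["usage"]["total"]
    return 100.0 * delta / (elapsed * 1e9)

def rx_bytes_per_second(prev, cur):
    elapsed = (_ts(cur) - _ts(prev)).total_seconds()
    return (cur["network"]["rx_bytes"] - prev["network"]["rx_bytes"]) / elapsed
```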

I think it would make more sense for us to expose a way for you to reach the cAdvisor API directly on the host (it’s currently bound to 127.0.0.1:9344 in rancher/agent). Then you could create a service that talks to it and pushes data to the 3rd party, without involving the Rancher API, the WebSocket wrapper we need for the UI, etc.
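Polling it directly would then be as simple as something like this (a sketch assuming the port were opened up beyond rancher/agent; /api/v1.3/containers/ is a standard cAdvisor REST endpoint that returns recent samples for the root container):

```python
import requests

info = requests.get("http://127.0.0.1:9344/api/v1.3/containers/").json()
for sample in info["stats"]:
    print(sample["timestamp"],
          sample["cpu"]["usage"]["total"],
          sample["memory"]["usage"])
```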

Thanks. I can work with that. Is a message sent with each sample (i.e. every 1 second)?

Actually, I’m of two minds on that. On the one hand, yes, it’s good to be able to talk to cAdvisor on each host. I could deploy a Zabbix monitoring container that does the special stuff unique to my environment, but instead of collecting the basic metrics itself, it would just talk to cAdvisor.

On the other hand, Rancher is already a collector system: it knows all of the hosts and containers, which I can retrieve by walking the API. It sure would simplify deployments to be able to treat it as a complete system and just talk to Rancher for the monitoring metrics.

Or did you mean a way to talk to the cAdvisor API directly via the Rancher server?

Each frame is an array and can contain multiple samples, so you should handle processing all of them (I don’t think it will actually ever contain multiple samples in the current release, but that will be fixed someday).

In 0.56 the collection interval for cAdvisor was changed to 5 seconds, but the WebSocket aggregator still sends frames every 1 second, so you currently get the same frame 5 times. The next release will either fix that so duplicate data isn’t sent, or revert to a 1-second interval, because the change has pretty much broken the UI graphs (https://github.com/rancher/rancher/issues/3216).
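Until that’s settled, a consumer can defend against both behaviors in a few lines (a sketch; `emit` stands in for whatever pushes to your monitoring system):

```python
import json

seen = set()  # unbounded here for brevity; a real consumer should prune it

def handle_frame(raw, emit):
    # A frame is a JSON array that may hold several samples; with the 0.56
    # behavior above, the same frame can also arrive repeatedly, so
    # deduplicate on (id, timestamp).
    for sample in json.loads(raw):
        key = (sample["id"], sample["timestamp"])
        if key not in seen:
            seen.add(key)
            emit(sample)
```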

I meant something like a global service that bind-mounts in a unix socket for cAdvisor that rancher/agent [doesn’t, but could] put in a known place. The service would collect the data for that one host, combine it with whatever info it needs from metadata, and report up to your 3rd party system. This is nice and simple: each host is responsible for reporting itself. Because the service is global, new hosts automatically get a container running it when they register.
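Declaratively, such a service might look like this in a Rancher docker-compose.yml (hypothetical: the image and socket path are placeholders, and as noted the socket doesn’t exist today; the global label is Rancher’s real scheduling label):

```yaml
stats-reporter:
  image: example/stats-reporter  # placeholder image
  volumes:
    # Assumes rancher/agent put a cAdvisor unix socket here (it doesn't yet).
    - /var/run/cadvisor.sock:/var/run/cadvisor.sock
  labels:
    io.rancher.scheduler.global: 'true'  # run one instance on every host
```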

If you want to get into container-level stats this is probably less practical, because you’d have to map the entries in cAdvisor to their entries in Rancher/metadata to get the additional info about them that you want to report up.

The other way, you’d have to do all of the following (a rough skeleton follows the list):

  • Have an account that has access to all projects (“projects” in the API are “Environments” in the UI), like an admin
  • Create an API key for it (for the user, not a particular project… this functionality exists but does not have a button in the UI)
  • Talk to the API to enumerate all the projects
  • Get the hostStats link for each one and open the WebSocket for each one
  • Massage data for all hosts and report up.
  • At some point that won’t scale very well with one container collecting stats for every host
    • or you’ll want some redundancy because if it dies you lose reporting for every host
    • so you’ll need to figure out how to shard the responsibility of collecting each host/project
    • and/or have more than one process running
    • without sending duplicate data to the 3rd party system, because that’s probably bad (maybe not, depending on the system).
  • Also the token you get from the {host,container}Stats link is only authorized for the resources that existed at the time it was created. If a new host or container is created it doesn’t start showing up in that socket’s stream.
    • so you’ll have to listen for new ones and reconnect with a fresh token every time one pops up.
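Pulling those steps together, the skeleton would look roughly like this (a sketch under the assumptions above: an admin-level API key, and a hostStats link on each project that returns a url/token pair; sharding, redundancy, and re-tokening when hosts appear are all left out):

```python
import requests   # pip install requests
import websocket  # pip install websocket-client

RANCHER = "http://localhost:8080/v1"
AUTH = ("ADMIN_ACCESS_KEY", "ADMIN_SECRET_KEY")  # user-level admin API key

# Enumerate all projects, then open one hostStats socket per project.
projects = requests.get(RANCHER + "/projects", auth=AUTH).json()["data"]
sockets = []
for project in projects:
    info = requests.get(project["links"]["hostStats"], auth=AUTH).json()
    sockets.append(websocket.create_connection(
        "%s?token=%s" % (info["url"], info["token"])))

# Each token only covers the hosts that existed when it was issued, so a
# real implementation would also watch for new hosts and reconnect with
# fresh tokens as they appear.
```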

Hmm, yeah, I see the issues. What you are saying is:

  1. Having a service that is global (i.e. runs on every host) and collects the cAdvisor stats from the rancher agent could work with the current design. The agent needs to expose a bit more interface, but nothing more fundamental than that.
  2. Having the Rancher server act as a gateway to all container stats could also work, but it requires some more fundamental changes, or “surgery”: new permissions, new access types, a way to get all of the stats for all of the containers and hosts through a single WebSocket, and a token valid for all existing and future resources. It is far from trivial.

The fundamental question is whether Rancher should be the API for both managing and monitoring hosts and containers - Rancher as the gateway to the entire environment - or whether Rancher should be the API just for managing, with each agent providing an interface for monitoring.

Architecturally, I think that Rancher as the complete gateway fits its philosophy: it allows Rancher to be the entry point to your entire containerized environment. Whether you run at small scale or large scale, Rancher abstracts the whole thing out for you, while providing enough detail to drill down if you want.

But I understand that it requires a lot more work to get there.