Rancher 2.2 - Enabling Prometheus causes GUI crash

Has anyone worked around this issue?

With the recent release of Rancher 2.2, monitoring was made available, via a helm install of a Prometheus operator.

Enabling this monitoring seems to hit a bug in Helm that causes instability and crashes the Rancher GUI.

See:

Steps to Recreate:

Step 1: Create an EKS cluster in AWS or your provider of choice, using these instructions.

Step 2: Enable Monitoring:
In the Rancher GUI, navigate to:
Cluster > Cluster Name > Tools > Monitoring > Enable Monitoring > Save

Step 3: Examine docker logs:
docker logs -f containerid

2019/03/22 20:07:58 [ERROR] AppController p-tjjnb/monitoring-operator [helm-controller] failed with : failed to install app monitoring-operator. Error: validation failed: unable to recognize "": no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
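
For anyone trying to reproduce this, the helm-controller failures can be pulled out of the container log with a small filter. This is only a sketch: the function name helm_errors is made up for illustration, and it assumes the log lines keep the [ERROR] and [helm-controller] tags seen in the line above.

```shell
# Hypothetical helper: keep only helm-controller error lines from a log on stdin.
helm_errors() {
  grep -F '[ERROR]' | grep -F '[helm-controller]'
}

# Usage ("containerid" is a placeholder, as in Step 3). The container's stderr
# is replayed on stderr, so merge it into the pipe:
#   docker logs containerid 2>&1 | helm_errors
#
# Because the error says the kind "ServiceMonitor" is not recognized, it may
# also be worth checking, assuming kubectl access to the cluster, whether the
# CRD is registered:
#   kubectl get crd servicemonitors.monitoring.coreos.com
```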

What do you mean by it crashes the Rancher GUI?

Hi cjellick:

I restarted Rancher and took some stats while Helm tried to install Prometheus.
Eventually, the GUI becomes unstable and won’t load the cluster page (https://localhost/c/c-2cmqm).

I’m new to Rancher, but if you could direct me to some better logs, I can provide them.

Below is some information on the following:

  • Version
  • Docker Stats
  • Screenshots of GUI

Version:
https://localhost/v3/settings/server-version

  • "name": "server-version",
  • "source": "env",
  • "type": "setting",
  • "uuid": "376bf4fd-4fde-11e9-b4c1-0242ac110002",
  • "value": "v2.2.0"

Docker Stats: CPU usage sometimes spikes above 300%.
Captured via: while true; do sudo docker stats -a --no-stream >> stats.txt; done

CONTAINER ID        NAME                CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
694760e998f6        rancher             213.22%             1.309GiB / 1.952GiB   67.07%              159MB / 160MB       219MB / 2.06GB      80
694760e998f6        rancher             186.48%             1.301GiB / 1.952GiB   66.68%              159MB / 160MB       219MB / 2.06GB      90
694760e998f6        rancher             155.89%             1.308GiB / 1.952GiB   67.04%              159MB / 160MB       219MB / 2.06GB      81
694760e998f6        rancher             158.79%             1.303GiB / 1.952GiB   66.75%              159MB / 160MB       219MB / 2.06GB      69
694760e998f6        rancher             259.55%             1.296GiB / 1.952GiB   66.40%              159MB / 160MB       219MB / 2.06GB      86
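
To summarize a long stats.txt instead of scanning it by eye, something like the following could work. It is a sketch that assumes the default docker stats column layout shown above (name in column 2, CPU % in column 3); the function name peak_cpu is made up for illustration.

```shell
# Hypothetical helper: print the peak CPU % for one container name from
# captured `docker stats` output (default column layout assumed).
peak_cpu() {
  # $1 = stats file, $2 = container name (column 2 in the default layout)
  awk -v name="$2" '
    BEGIN { max = 0 }
    $2 == name {
      cpu = $3
      sub(/%$/, "", cpu)            # strip the trailing percent sign
      if (cpu + 0 > max) max = cpu + 0
      seen = 1
    }
    END { if (seen) printf "%.2f\n", max }
  ' "$1"
}

# Usage: peak_cpu stats.txt rancher
```

Header lines are skipped automatically because their second field is not the container name.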

Screenshot of GUI
Errors started appearing roughly 5-10 minutes after Monitoring was enabled.
It looks like Helm continuously retries the install, judging by the new folders appearing in /tmp every second or so.
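
The rate at which folders appear in /tmp can be measured rather than eyeballed. A rough sketch, assuming a find that supports -mindepth, -maxdepth and -mmin (GNU and BSD find do); the function name recent_dirs is made up for illustration.

```shell
# Hypothetical helper: count direct subdirectories of a directory that were
# modified within the last N minutes.
recent_dirs() {
  # $1 = directory to inspect, $2 = age window in minutes
  find "$1" -mindepth 1 -maxdepth 1 -type d -mmin "-$2" | wc -l | tr -d ' '
}

# Usage, run wherever the /tmp folders were observed (host or container):
#   recent_dirs /tmp 1
```

A count that stays in the dozens per minute would fit the picture of an install being retried in a loop.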

https://localhost/g/clusters > Try to click on my cluster

Eventually, the following error appears, and then the site cannot be reached.

This site can’t be reached
localhost unexpectedly closed the connection.
Try:

Checking the connection
Checking the proxy and the firewall

You’re not going to be able to run Rancher and monitoring on a node with 2GB of RAM. Kubernetes (inside the Rancher container) and Prometheus, etc. all like to burn a lot of resources, and you’re probably just running out of memory, so the kernel is killing random processes.
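
One way to confirm the out-of-memory theory is to look for OOM-killer messages in the kernel log on the Docker host. A minimal sketch: the function name oom_events is made up for illustration, and the pattern is deliberately loose because the exact wording of these messages differs between kernel versions.

```shell
# Hypothetical helper: keep kernel log lines that look like OOM-killer activity.
oom_events() {
  grep -iE 'out of memory|oom-killer|killed process'
}

# Usage on the Docker host (may need root):
#   dmesg | oom_events
#
# Docker also records whether a container was OOM-killed, though this flag
# may not be set in every case ("rancher" is the NAME from docker stats):
#   docker inspect --format '{{.State.OOMKilled}}' rancher
```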

Is Kubernetes/Prometheus running inside the rancher container?

I was under the impression that the Prometheus operator was being deployed to the cluster in AWS that was provisioned by Rancher.

See below for a sample of the Docker stats.

CONTAINER ID        NAME                CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
694760e998f6        rancher             165.33%             1.357GiB / 1.952GiB   69.55%              36.9MB / 11.3MB     296MB / 146MB       73

Open this GH issue https://github.com/rancher/rancher/issues/19274
Please continue the conversation there