Rancher 2.2 - Enabling Prometheus causes GUI crash

JustinK · March 26, 2019, 5:07pm

Has anyone worked around this issue?

With the recent release of Rancher 2.2, monitoring was made available via a Helm install of the Prometheus operator.

Enabling this monitoring seems to hit a bug in Helm that causes instability and crashes the Rancher GUI.

See:

github.com/helm/charts

[stable/prometheus-operator] Validation fails because of missing CRDs even though they have been created

opened 07:23PM - 13 Nov 18 UTC

closed 10:47AM - 14 Nov 19 UTC

vsliouniaev

This is a **BUG REPORT**

**Version of Helm and Kubernetes**:
- Helm - 2….10+
- Kubernetes - 1.11.3 - 1.11.4 - 1.13.x

**Which chart**: `stable/prometheus-operator`. Probably all versions of the chart, but it occurred on 0.1.7, 0.1.21, 4.0.0.

**What happened**: Under some circumstances (apparently just the above Helm and Kubernetes versions), the install with default values fails with

```
Error: validation failed: [unable to recognize "": no matches for kind "Alertmanager" in version "monitoring.coreos.com/v1", unable to recognize "": no matches for kind "Prometheus" in version "monitoring.coreos.com/v1", unable to recognize "": no matches for kind "PrometheusRule" in version . . . . . .
```

There appears to be a race condition, because running `kubectl get crd | grep coreos` a few times shows the resources gradually appearing (4 in total), but only well after the error has occurred and the chart install has failed. This is reproducible multiple times if the resources are deleted and the chart installation is attempted again.

**What you expected to happen**: The `crd-install` hook used to create the 4 CRDs in this chart should either succeed or fail.

**How to reproduce it**: Attempt to install the chart on a cluster without the coreos CRDs and the install fails.

**UPDATE**: A proposed fix for this issue can be seen here: https://github.com/helm/helm/pull/5112

github.com/helm/helm

fix(helm): Wait for CRDs to reach established state for crd_install hook

helm:master ← mortent:WaitCRDEstablished

opened 09:51AM - 29 Dec 18 UTC

mortent

+225 -5

**What this PR does / why we need it**: There is a race condition in the crd_install hook implementation, where there is a chance a CRD is not yet ready by the time CRs are being created. This is reported in issue #4925. This change makes sure CRDs installed through the crd_install hook reach the `established` state before the hook is considered complete.

Fixes #4925

**Special notes for your reviewer**: Unit-testing code in the kubernetes client is difficult, as the builder/infos is tightly coupled with the API server. I will look into how to improve testing for this part of the codebase, but I would like to separate it from this PR.

**If applicable**:
- [x] this PR contains documentation
- [x] this PR contains unit tests
- [x] this PR has been tested for backwards compatibility
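To see whether the CRDs eventually reach the established state on an affected cluster, something like the following could be run while the install is retrying. This is only a rough sketch: it assumes kubectl is pointed at the affected cluster and that the four CRDs are the ones the prometheus-operator chart creates. The function name `wait_for_monitoring_crds` and the 60s default timeout are made up for illustration and are not part of Helm or Rancher.

```shell
# Sketch: wait until each prometheus-operator CRD reports the Established
# condition. The function name and default timeout are hypothetical.
MONITORING_CRDS="alertmanagers.monitoring.coreos.com prometheuses.monitoring.coreos.com prometheusrules.monitoring.coreos.com servicemonitors.monitoring.coreos.com"

wait_for_monitoring_crds() {
  timeout="${1:-60s}"
  for crd in $MONITORING_CRDS; do
    # kubectl wait exits non-zero if the condition is not met in time.
    if ! kubectl wait --for=condition=established --timeout="$timeout" "crd/$crd"; then
      echo "CRD not established: $crd" >&2
      return 1
    fi
  done
}
```

Calling `wait_for_monitoring_crds 120s` returns non-zero as soon as one CRD fails to become established within the timeout. It is a diagnostic only and does not change how Rancher drives the Helm install.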

Steps to Recreate:

Step 1: Create an EKS cluster in AWS or your provider of choice, using these instructions.

Step 2: Enable monitoring:
In the Rancher GUI, navigate to:
Cluster > Cluster Name > Tools > Monitoring > Enable Monitoring > Save

Step 3: Examine docker logs:
docker logs -f containerid

2019/03/22 20:07:58 [ERROR] AppController p-tjjnb/monitoring-operator [helm-controller] failed with : failed to install app monitoring-operator. Error: validation failed: unable to recognize "": no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
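To get a feel for how often the install is being retried, the log can be filtered for this error. A minimal sketch, assuming the message format matches the line above; the helper name `count_crd_validation_errors` is hypothetical.

```shell
# Sketch: count log lines that report a missing monitoring.coreos.com/v1 kind.
# Reads log text on stdin and prints the number of matching lines.
count_crd_validation_errors() {
  # grep -c exits non-zero when nothing matches; "|| true" keeps the
  # helper's exit status at 0 while still printing the count (0).
  grep -c 'no matches for kind ".*" in version "monitoring.coreos.com/v1"' || true
}
```

For example: `docker logs containerid 2>&1 | count_crd_validation_errors`. The `2>&1` is there because `docker logs` replays the container's stderr stream on stderr.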

cjellick · March 26, 2019, 7:56pm

What do you mean by it crashes the Rancher GUI?

JustinK · March 27, 2019, 1:58pm

Hi cjellick:

I restarted Rancher and took some stats while Helm tried to install Prometheus.
Eventually, the GUI becomes unstable and won’t return the cluster page. [https://localhost/c/c-2cmqm]

I’m new to Rancher, but if you could direct me to some better logs, I can provide them.

Below is some information on the following:

Version
Docker Stats
Screenshots of GUI

Version:
https://localhost/v3/settings/server-version

"name": "server-version",
"source": "env",
"type": "setting",
"uuid": "376bf4fd-4fde-11e9-b4c1-0242ac110002",
"value": "v2.2.0"

Docker Stats: CPU usage sometimes spikes above 300%.
Captured via: while true; do sudo docker stats -a --no-stream >> stats.txt; done

CONTAINER ID        NAME                CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
694760e998f6        rancher             213.22%             1.309GiB / 1.952GiB   67.07%              159MB / 160MB       219MB / 2.06GB      80
694760e998f6        rancher             186.48%             1.301GiB / 1.952GiB   66.68%              159MB / 160MB       219MB / 2.06GB      90
694760e998f6        rancher             155.89%             1.308GiB / 1.952GiB   67.04%              159MB / 160MB       219MB / 2.06GB      81
694760e998f6        rancher             158.79%             1.303GiB / 1.952GiB   66.75%              159MB / 160MB       219MB / 2.06GB      69
694760e998f6        rancher             259.55%             1.296GiB / 1.952GiB   66.40%              159MB / 160MB       219MB / 2.06GB      86
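Since stats.txt grows quickly, the peaks can be pulled out with awk. This is only a sketch, assuming the default `docker stats` column layout shown above; `peak_usage` is a hypothetical helper name.

```shell
# Sketch: print the peak CPU % and MEM % seen for one container.
# $1 = container name; reads repeated `docker stats` output on stdin.
peak_usage() {
  awk -v name="$1" '
    $2 == name {
      # Field 3 is CPU %, field 7 is MEM % (fields 4-6 are "USAGE / LIMIT").
      cpu = $3; sub(/%/, "", cpu)
      mem = $7; sub(/%/, "", mem)
      if (cpu + 0 > max_cpu) max_cpu = cpu + 0
      if (mem + 0 > max_mem) max_mem = mem + 0
    }
    END { printf "peak cpu %.2f%% peak mem %.2f%%\n", max_cpu, max_mem }
  '
}
```

Run as `peak_usage rancher < stats.txt`. On the five samples above it prints `peak cpu 259.55% peak mem 67.07%`. Header lines are skipped because their second field is not the container name.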

Screenshot of GUI
Errors started appearing roughly 5-10 minutes after Monitoring was enabled.
It looks like Helm continuously retries the install, judging by the new folders appearing in /tmp every second or so.

https://localhost/g/clusters > Try to click on my cluster

JustinK · March 27, 2019, 2:19pm

Eventually, the following error appears, and then the site cannot be reached.

This site can’t be reached
localhost unexpectedly closed the connection.
Try:

Checking the connection
Checking the proxy and the firewall

vincent · March 27, 2019, 3:50pm

You’re not going to be able to run Rancher and monitoring on a node with 2GB of RAM. Kubernetes (inside the rancher container) and Prometheus/etc. all like to burn a lot of resources, and you’re probably just running into the kernel killing random processes because the node is out of memory.
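One way to check for this on the Docker host is to look through the kernel log for OOM-killer messages. A rough sketch, assuming a Linux host whose kernel log uses the usual "Out of memory" / "oom-kill" wording; the exact message text varies by kernel version, and `oom_kill_lines` is a hypothetical helper name.

```shell
# Sketch: print kernel log lines that look like OOM-killer activity.
# Reads kernel log text on stdin; prints nothing if there are no matches.
oom_kill_lines() {
  # "|| true" keeps the exit status at 0 when grep finds no matches.
  grep -i -E 'out of memory|oom-kill' || true
}
```

Usage would be along the lines of `dmesg | oom_kill_lines` (may need sudo). Any output around the time the GUI dropped would support the out-of-memory explanation.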

JustinK · March 27, 2019, 5:14pm

Is Kubernetes/Prometheus running inside the rancher container?

I was under the impression that the Prometheus operator was being deployed to the cluster in AWS that was provisioned by Rancher.

See below for a sample of the Docker stats.

CONTAINER ID        NAME                CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
694760e998f6        rancher             165.33%             1.357GiB / 1.952GiB   69.55%              36.9MB / 11.3MB     296MB / 146MB       73

cjellick · March 29, 2019, 5:26pm

Opened this GH issue: https://github.com/rancher/rancher/issues/19274
Please continue the conversation there.
