Intermittent crash loops on the Rancher Monitoring Prometheus container

Howdy!
We are experiencing intermittent crash loops on the Rancher Monitoring Prometheus container, prometheus-rancher-monitoring. Thanks in advance for the help!

  • I can't find any errors in the logs
  • It seems to happen at random
  • The Prometheus server throws 503s prior to crashing

Cluster info:

  • Rancher v2.5.8
  • Kubernetes v1.20.6-rancher1-1
  • AWS Infra provisioned
  • rancher-monitoring:14.5.100
  • rancher-monitoring-crd:14.5.100

I have tried the following:

  • Rolling all Worker and CP nodes
  • Re-installing after removing secrets / config left over
  • Installing w/ and w/o persistent storage

Logs from the prometheus-rancher-monitoring container after a crash:

level=info ts=2021-10-13T14:54:54.055Z caller=main.go:364 msg="Starting Prometheus" version="(version=2.24.0, branch=HEAD, revision=02e92236a8bad3503ff5eec3e04ac205a3b8e4fe)"
level=info ts=2021-10-13T14:54:54.056Z caller=main.go:369 build_context="(go=go1.15.6, user=root@d9f90f0b1f76, date=20210106-13:48:37)"
level=info ts=2021-10-13T14:54:54.056Z caller=main.go:370 host_details="(Linux 5.4.0-1057-aws #60~18.04.1-Ubuntu SMP Thu Sep 9 20:38:09 UTC 2021 x86_64 prometheus-rancher-monitoring-prometheus-0 (none))"
level=info ts=2021-10-13T14:54:54.056Z caller=main.go:371 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2021-10-13T14:54:54.056Z caller=main.go:372 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2021-10-13T14:54:54.061Z caller=web.go:530 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2021-10-13T14:54:54.061Z caller=main.go:738 msg="Starting TSDB ..."
level=info ts=2021-10-13T14:54:54.062Z caller=tls_config.go:192 component=web msg="TLS is disabled." http2=false
level=info ts=2021-10-13T14:54:54.063Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1634052626009 maxt=1634061600000 ulid=01FHVBR2KS2M28YRBFZ2X2571V
level=info ts=2021-10-13T14:54:54.063Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1634061600032 maxt=1634083200000 ulid=01FHW0B849ZJENV3N790QP8G24
level=info ts=2021-10-13T14:54:54.064Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1634104800000 maxt=1634112000000 ulid=01FHWEAK76X2EQ8BQTMPCRQ00D
level=info ts=2021-10-13T14:54:54.064Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1634112000000 maxt=1634119200000 ulid=01FHWN6DJTB9RQ39T0Q6E7N87Q
level=info ts=2021-10-13T14:54:54.065Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1634083200008 maxt=1634104800000 ulid=01FHWN6J4CJMSMDNZTB6X0TH3F
level=info ts=2021-10-13T14:54:54.065Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1634119200000 maxt=1634126400000 ulid=01FHWVWJTN31DMM5F67319WJXD
level=info ts=2021-10-13T14:54:54.159Z caller=head.go:645 component=tsdb msg="Replaying on-disk memory mappable chunks if any"
level=info ts=2021-10-13T14:54:54.347Z caller=head.go:659 component=tsdb msg="On-disk memory mappable chunks replay completed" duration=187.918326ms
level=info ts=2021-10-13T14:54:54.347Z caller=head.go:665 component=tsdb msg="Replaying WAL, this may take a while"
level=info ts=2021-10-13T14:54:56.812Z caller=head.go:691 component=tsdb msg="WAL checkpoint loaded"
level=info ts=2021-10-13T14:54:56.891Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=63 maxSegment=86
level=info ts=2021-10-13T14:54:57.236Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=64 maxSegment=86
level=info ts=2021-10-13T14:54:57.530Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=65 maxSegment=86
level=info ts=2021-10-13T14:54:57.831Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=66 maxSegment=86
level=info ts=2021-10-13T14:54:58.511Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=67 maxSegment=86
level=info ts=2021-10-13T14:54:59.328Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=68 maxSegment=86
level=info ts=2021-10-13T14:54:59.631Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=69 maxSegment=86
level=info ts=2021-10-13T14:54:59.928Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=70 maxSegment=86
level=info ts=2021-10-13T14:54:59.930Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=71 maxSegment=86
level=info ts=2021-10-13T14:55:00.224Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=72 maxSegment=86
level=info ts=2021-10-13T14:55:00.531Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=73 maxSegment=86
level=info ts=2021-10-13T14:55:00.926Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=74 maxSegment=86
level=info ts=2021-10-13T14:55:02.213Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=75 maxSegment=86
level=info ts=2021-10-13T14:55:02.415Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=76 maxSegment=86
level=info ts=2021-10-13T14:55:02.715Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=77 maxSegment=86
level=info ts=2021-10-13T14:55:02.916Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=78 maxSegment=86
level=info ts=2021-10-13T14:55:04.322Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=79 maxSegment=86
level=info ts=2021-10-13T14:55:04.728Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=80 maxSegment=86
level=info ts=2021-10-13T14:55:05.016Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=81 maxSegment=86
level=info ts=2021-10-13T14:55:05.414Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=82 maxSegment=86
level=info ts=2021-10-13T14:55:05.624Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=83 maxSegment=86
level=info ts=2021-10-13T14:55:06.022Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=84 maxSegment=86
level=info ts=2021-10-13T14:55:06.712Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=85 maxSegment=86
level=info ts=2021-10-13T14:55:06.713Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=86 maxSegment=86
level=info ts=2021-10-13T14:55:06.713Z caller=head.go:722 component=tsdb msg="WAL replay completed" checkpoint_replay_duration=2.465477227s wal_replay_duration=9.900407411s total_replay_duration=12.553850778s
level=info ts=2021-10-13T14:55:07.816Z caller=main.go:758 fs_type=EXT4_SUPER_MAGIC
level=info ts=2021-10-13T14:55:07.816Z caller=main.go:761 msg="TSDB started"
level=info ts=2021-10-13T14:55:07.816Z caller=main.go:887 msg="Loading configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
level=info ts=2021-10-13T14:55:07.821Z caller=kubernetes.go:264 component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
level=info ts=2021-10-13T14:55:07.821Z caller=kubernetes.go:264 component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
level=info ts=2021-10-13T14:55:07.822Z caller=kubernetes.go:264 component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
level=info ts=2021-10-13T14:55:07.823Z caller=kubernetes.go:264 component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
level=info ts=2021-10-13T14:55:07.823Z caller=kubernetes.go:264 component="discovery manager notify" discovery=kubernetes msg="Using pod service account via in-cluster config"
level=info ts=2021-10-13T14:55:07.930Z caller=main.go:918 msg="Completed loading of configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml totalDuration=113.322685ms remote_storage=2.104µs web_handler=533ns query_engine=981ns scrape=226.501µs scrape_sd=2.888992ms notify=21.976µs notify_sd=893.582µs rules=105.401968ms
level=info ts=2021-10-13T14:55:07.930Z caller=main.go:710 msg="Server is ready to receive web requests."

For anyone who has the same issue: the fix was to increase the memory limit on the Prometheus deployment.
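
For reference, here is a rough sketch of what raising the limit might look like as a Helm values override for the rancher-monitoring chart. This assumes the chart exposes the same prometheus.prometheusSpec.resources key as the upstream kube-prometheus-stack chart it is based on, so check it against the values.yaml of your chart version. The numbers are placeholders, not the values from this cluster; size them from the memory usage you actually observe.

```yaml
# Hypothetical values override; key path assumed from kube-prometheus-stack.
prometheus:
  prometheusSpec:
    resources:
      requests:
        memory: 1Gi   # placeholder, not a recommendation
      limits:
        memory: 3Gi   # placeholder; raise until the container stops being killed
```

A container killed for exceeding its memory limit is stopped by the kernel rather than by Prometheus itself, which would be consistent with the logs above showing a clean startup and no errors.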