GRPC cross-cluster routing setup

Hello all,

I’m struggling to expose the Thanos sidecar via gRPC (deployed with the monitoring chart) to a Thanos querier located in another cluster. All clusters are managed by Rancher with its default nginx-ingress-controller, and we’re using Cloudflare for DNS along with the default ingress certificates.

This is the error I see in the ingress-controller logs when trying to access the host address of the ingress via browser:

[error] upstream rejected request with error 2 while reading response header from upstream, client: , server: thanos-sidecar-clusterXYZ.domain.com, request: "GET / HTTP/2.0", upstream: "grpc://10.42.5.138:10901", host: "thanos-sidecar-clusterXYZ.domain.com"
12/Nov/2021:17:06:22 +0000 [source IP: ], server: thanos-sidecar-clusterXYZ.domain.com, method: GET, uri: /, request_filename: /usr/local/nginx/html/, bytes_sent: 659, request_time: 0.001, status: 502, request_proto: HTTP/2.0, duration: 0.001, http_user_agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36
2021/11/12 17:06:22

Manifests of the cluster with the sidecar:

apiVersion: v1
kind: Service
metadata:
  annotations:
    meta.helm.sh/release-name: rancher-monitoring
    meta.helm.sh/release-namespace: cattle-monitoring-system
spec:
  clusterIP: 10.43.202.158
  clusterIPs:
  - 10.43.202.158
  ports:
  - name: grpc
    port: 10901
    protocol: TCP
    targetPort: 10901
  selector:
    app.kubernetes.io/name: prometheus
    prometheus: rancher-monitoring-prometheus
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}
---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  annotations:
    kubernetes.io/ingress.class: nginx
    meta.helm.sh/release-name: rancher-monitoring
    meta.helm.sh/release-namespace: cattle-monitoring-system
    nginx.ingress.kubernetes.io/backend-protocol: grpc
    rancher.io/globalDNS.hostname: thanos-sidecar-clusterXYZ.domain.com
spec:
  rules:
  - host: thanos-sidecar-clusterXYZ.domain.com
    http:
      paths:
      - backend:
          serviceName: rancher-monitoring-thanos-external
          servicePort: 10901
        path: /
        pathType: ImplementationSpecific
  tls:
  - hosts:
    - thanos-sidecar-clusterXYZ.domain.com

In the Thanos-querier cluster, the querier config:

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    meta.helm.sh/release-name: thanos
    meta.helm.sh/release-namespace: thanos
spec:
    spec:
      containers:
      - args:
        - query
        - --log.level=debug
        - --log.format=logfmt
        - --grpc-address=0.0.0.0:10901
        - --http-address=0.0.0.0:10902
        - --query.replica-label=replica
        - --store=dnssrv+_grpc._tcp.rancher-monitoring-thanos-discovery.cattle-monitoring-system.svc.cluster.local
        - --store=dnssrv+_grpc._tcp.thanos-storegateway.thanos.svc.cluster.local
        - --store=thanos-sidecar-clusterXYZ.domain.com:10901
        name: query
        ports:
        - containerPort: 10902
          name: http
          protocol: TCP
        - containerPort: 10901
          name: grpc
          protocol: TCP

only this warning occurs:

level=warn caller=endpointset.go:525 component=endpointset msg="update of node failed" err="getting metadata: fallback fetching info from thanos-sidecar-clusterXYZ.domain.com:10901: rpc error: code = Unavailable desc = connection closed" address=thanos-sidecar-clusterXYZ.domain.com:10901

Can someone point me in the right direction or give me hints on setting up the connection properly? Or does anybody have general thoughts on or experience with handling cross-cluster gRPC communication with Rancher + Cloudflare + its certificates?

Hi @SebManGK

I’m facing the same problem. Did you find a solution?

Many thanks and best regards!

I’m also having the same issue. Any solution for this?

Hi guys,

we found out that the problem with our setup (we use the Helm charts) was that a single Thanos query component apparently cannot discover both cluster-internal and cluster-external stores at the same time.

We got it working on our infrastructure by splitting the Thanos components:

  • On the observer cluster, set up one querier per observee cluster (pointing at that cluster’s external Thanos sidecar) with

extraFlags:
  - '--grpc-client-tls-secure'

and

dnsDiscovery:
  enabled: false

  • On the observer cluster, set up a main querier without this flag and with

dnsDiscovery:
  enabled: true

(so the cluster-internal Thanos sidecar gets discovered)

  • On this main querier, set the store targets to the per-observee-cluster queriers:

stores:
  - dnssrv+_grpc._tcp.thanos-querier-observee-cluster-one.thanos.svc.cluster.local
  - dnssrv+_grpc._tcp.thanos-querier-observee-cluster-two.thanos.svc.cluster.local
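For reference, the split setup above could be expressed as two Helm values files. This is only a sketch, assuming a chart layout like the Bitnami Thanos chart where `extraFlags`, `dnsDiscovery` and `stores` sit under a `query:` key; the hostname, the `:443` ingress port, and the release names are placeholders from this thread, not a verified config:

```yaml
# values-querier-observee-one.yaml
# Dedicated querier for one observee cluster: talks TLS to the
# sidecar exposed behind the TLS-terminating nginx ingress.
query:
  extraFlags:
    - '--grpc-client-tls-secure'  # gRPC client uses TLS toward the ingress
  dnsDiscovery:
    enabled: false                # no cluster-internal store discovery here
  stores:
    - thanos-sidecar-clusterXYZ.domain.com:443  # assumed ingress HTTPS port
---
# values-querier-main.yaml
# Main querier: discovers the cluster-internal sidecar via DNS and
# fans out to the per-observee-cluster queriers.
query:
  dnsDiscovery:
    enabled: true
  stores:
    - dnssrv+_grpc._tcp.thanos-querier-observee-cluster-one.thanos.svc.cluster.local
    - dnssrv+_grpc._tcp.thanos-querier-observee-cluster-two.thanos.svc.cluster.local
```

The key point is that `--grpc-client-tls-secure` applies to all stores of a querier, which is why the TLS (external) and plaintext (internal) stores have to live on separate querier instances.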