GRPC cross-cluster routing setup

Hello all,

I’m struggling to expose the Thanos sidecar via gRPC (deployed with the monitoring chart) to a Thanos querier located in another cluster. All clusters are managed by Rancher with its default nginx-ingress-controller, and we’re using Cloudflare for DNS along with the default ingress certificates.

This is the error I see in the ingress-controller logs when trying to access the host address of the ingress via browser:

[error] upstream rejected request with error 2 while reading response header from upstream, client: , server: thanos-sidecar-clusterXYZ.domain.com, request: "GET / HTTP/2.0", upstream: "grpc://10.42.5.138:10901", host: "thanos-sidecar-clusterXYZ.domain.com"
12/Nov/2021:17:06:22 +0000 [source IP: ], server: thanos-sidecar-clusterXYZ.domain.com, method: GET, uri: /, request_filename: /usr/local/nginx/html/, bytes_sent: 659, request_time: 0.001, status: 502, request_proto: HTTP/2.0, duration: 0.001, http_user_agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36
2021/11/12 17:06:22

Manifests of the cluster with the sidecar:

apiVersion: v1
kind: Service
metadata:
  annotations:
    meta.helm.sh/release-name: rancher-monitoring
    meta.helm.sh/release-namespace: cattle-monitoring-system
spec:
  clusterIP: 10.43.202.158
  clusterIPs:
  - 10.43.202.158
  ports:
  - name: grpc
    port: 10901
    protocol: TCP
    targetPort: 10901
  selector:
    app.kubernetes.io/name: prometheus
    prometheus: rancher-monitoring-prometheus
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}
---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  annotations:
    kubernetes.io/ingress.class: nginx
    meta.helm.sh/release-name: rancher-monitoring
    meta.helm.sh/release-namespace: cattle-monitoring-system
    nginx.ingress.kubernetes.io/backend-protocol: grpc
    rancher.io/globalDNS.hostname: thanos-sidecar-clusterXYZ.domain.com
spec:
  rules:
  - host: thanos-sidecar-clusterXYZ.domain.com
    http:
      paths:
      - backend:
          serviceName: rancher-monitoring-thanos-external
          servicePort: 10901
        path: /
        pathType: ImplementationSpecific
  tls:
  - hosts:
    - thanos-sidecar-clusterXYZ.domain.com

In the Thanos-querier cluster, the querier config:

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    meta.helm.sh/release-name: thanos
    meta.helm.sh/release-namespace: thanos
spec:
    spec:
      containers:
      - args:
        - query
        - --log.level=debug
        - --log.format=logfmt
        - --grpc-address=0.0.0.0:10901
        - --http-address=0.0.0.0:10902
        - --query.replica-label=replica
        - --store=dnssrv+_grpc._tcp.rancher-monitoring-thanos-discovery.cattle-monitoring-system.svc.cluster.local
        - --store=dnssrv+_grpc._tcp.thanos-storegateway.thanos.svc.cluster.local
        - --store=thanos-sidecar-clusterXYZ.domain.com:10901
        name: query
        ports:
        - containerPort: 10902
          name: http
          protocol: TCP
        - containerPort: 10901
          name: grpc
          protocol: TCP

only this warning occurs:

level=warn caller=endpointset.go:525 component=endpointset msg="update of node failed" err="getting metadata: fallback fetching info from thanos-sidecar-clusterXYZ.domain.com:10901: rpc error: code = Unavailable desc = connection closed" address=thanos-sidecar-clusterXYZ.domain.com:10901

Can someone point me in the right direction or give me hints on setting up the connection properly? Or does anybody have general thoughts on or experience with handling cross-cluster gRPC communication with Rancher + Cloudflare + its certificates?

Hi @SebManGK

I’m facing the same problem. Did you find a solution?

Many thanks and best regards!

I’m also having the same issue. Any solution for this?

Hi guys,

we found out that the problem with our setup (we use the Helm charts) was that a single Thanos query component apparently cannot discover both cluster-internal and cluster-external stores at the same time.

We got it working on our infrastructure by splitting the Thanos components:

  • On the observer cluster, set up one querier per observee cluster (pointing at that cluster’s external Thanos sidecar) with

extraFlags:
  - '--grpc-client-tls-secure'

and

dnsDiscovery:
  enabled: false

  • On the observer cluster, set up a main querier without this flag and with

dnsDiscovery:
  enabled: true

(so the cluster-internal Thanos sidecar gets discovered)

  • On this main querier, set the store targets to the per-observee-cluster queriers:

stores:
  - dnssrv+_grpc._tcp.thanos-querier-observee-cluster-one.thanos.svc.cluster.local
  - dnssrv+_grpc._tcp.thanos-querier-observee-cluster-two.thanos.svc.cluster.local
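For reference, the split setup above could be expressed as two Helm values files. This is only a sketch, assuming a chart layout like the Bitnami Thanos chart where `extraFlags`, `dnsDiscovery` and `stores` sit under a `query:` key; the hostname, the `:443` ingress port, and the release names are placeholders from this thread, not a verified config:

```yaml
# values-querier-observee-one.yaml
# Dedicated querier for one observee cluster: talks TLS to the
# sidecar exposed behind the TLS-terminating nginx ingress.
query:
  extraFlags:
    - '--grpc-client-tls-secure'  # gRPC client uses TLS toward the ingress
  dnsDiscovery:
    enabled: false                # no cluster-internal store discovery here
  stores:
    - thanos-sidecar-clusterXYZ.domain.com:443  # assumed ingress HTTPS port
---
# values-querier-main.yaml
# Main querier: discovers the cluster-internal sidecar via DNS and
# fans out to the per-observee-cluster queriers.
query:
  dnsDiscovery:
    enabled: true
  stores:
    - dnssrv+_grpc._tcp.thanos-querier-observee-cluster-one.thanos.svc.cluster.local
    - dnssrv+_grpc._tcp.thanos-querier-observee-cluster-two.thanos.svc.cluster.local
```

The key point is that `--grpc-client-tls-secure` applies to all stores of a querier, which is why the TLS (external) and plaintext (internal) stores have to live on separate querier instances.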