Is the Pod Resources API disabled on Rancher Kubernetes Engine?

ash · May 6, 2021, 11:42am

Problem Summary:

We’re using DCGM Exporter to collect metrics about GPU workloads. When deployed on RKE (and also on GKE), the exporter does not return GPU information about other pods or containers, although it is expected to.

The exporter runs one replica on every node and queries the Pod Resources API exposed by the kubelet to get the data it needs. It seems that on RKE (and also GKE), this API is disabled or configured differently than on other Kubernetes distributions.

Problem Demonstration:

Our test scenario consists of deploying a one-node cluster with dcgm-exporter running on it along with a one-replica deployment (called cuda-test in this demo) that uses GPU resources.

We query the exporter through its /metrics endpoint, and the results are as follows.

When running on Rancher k3s v1.20.4+k3s1, the container, namespace, and pod labels are populated:

dcgm_sm_clock{gpu="0",UUID="GPU-a2bf9768-0411-f0bb-791c-67d5fec65e2f",device="nvidia0",Hostname="dcgm-exporter-xzpt9",container="cuda-test-main",namespace="default",pod="cuda-test-687bddf45c-qjl6x"} 1860
dcgm_memory_clock{gpu="0",UUID="GPU-a2bf9768-0411-f0bb-791c-67d5fec65e2f",device="nvidia0",Hostname="dcgm-exporter-xzpt9",container="cuda-test-main",namespace="default",pod="cuda-test-687bddf45c-qjl6x"} 9501
dcgm_gpu_temp{gpu="0",UUID="GPU-a2bf9768-0411-f0bb-791c-67d5fec65e2f",device="nvidia0",Hostname="dcgm-exporter-xzpt9",container="cuda-test-main",namespace="default",pod="cuda-test-687bddf45c-qjl6x"} 41

But when running on RKE v1.19.6-rancher1-1, the container, namespace, and pod labels are all empty:

dcgm_sm_clock{gpu="0",UUID="GPU-31275fe8-8f4e-e2d2-b057-43e5d88b6965",device="nvidia0",Hostname="dcgm-exporter-tzmgg",container="",namespace="",pod=""} 1410
dcgm_memory_clock{gpu="0",UUID="GPU-31275fe8-8f4e-e2d2-b057-43e5d88b6965",device="nvidia0",Hostname="dcgm-exporter-tzmgg",container="",namespace="",pod=""} 1215
dcgm_gpu_temp{gpu="0",UUID="GPU-31275fe8-8f4e-e2d2-b057-43e5d88b6965",device="nvidia0",Hostname="dcgm-exporter-tzmgg",container="",namespace="",pod=""} 54
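
To compare the two outputs above without reading them by eye, something like the following Python sketch could flag metric lines that lack pod attribution. It is not part of dcgm-exporter; the function names `parse_line` and `missing_pod_attribution` are made up for this illustration, and the parser only handles simple `name{labels} value` lines without escaped quotes in label values.

```python
import re

# One exposition line of the form: name{key="value",...} number
LINE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$'
)
# One key="value" pair; escaped quotes inside values are not supported here.
LABEL_RE = re.compile(r'(?P<key>[A-Za-z_][A-Za-z0-9_]*)="(?P<val>[^"]*)"')


def parse_line(line):
    """Return (metric_name, labels, value), or None for blank, comment, or unparsable lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = LINE_RE.match(line)
    if match is None:
        return None
    labels = {
        m.group("key"): m.group("val")
        for m in LABEL_RE.finditer(match.group("labels"))
    }
    return match.group("name"), labels, float(match.group("value"))


def missing_pod_attribution(metrics_text):
    """Return the names of metrics whose pod, namespace, or container label is empty or absent."""
    missing = []
    for line in metrics_text.splitlines():
        parsed = parse_line(line)
        if parsed is None:
            continue
        name, labels, _value = parsed
        if not all(labels.get(key) for key in ("pod", "namespace", "container")):
            missing.append(name)
    return missing
```

Passing the body of a `/metrics` response to `missing_pod_attribution` returns an empty list for the k3s output and the three metric names for the RKE output.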

I haven’t been able to find any information about whether RKE (or GKE) disables this API (which was introduced in Kubernetes 1.13) or restricts certain values from being exposed. I’d like to learn more about this and find a way for the exporter to access and collect the information.

ash · May 6, 2021, 6:08pm

It seems that adding the argument --kubernetes-gpu-id-type device-name to dcgm-exporter fixes it.

Apparently, on GKE the Pod Resources API does not return a value for the GPU’s uid (which is the default value for this argument), and that’s why the information about the pods and containers was being skipped.
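
To illustrate why the ID type matters, here is a rough Python sketch of the matching step. It is not dcgm-exporter’s actual code: the function `label_gpus` and the dictionary layouts are hypothetical, and it assumes, for illustration only, that the kubelet reports the device ID as a device name such as `nvidia0` instead of the GPU UUID.

```python
def label_gpus(gpus, pod_resources, id_type="uid"):
    """Attach namespace/pod/container to each GPU by matching device IDs.

    gpus:          list of {"uuid": ..., "device": ...} (hypothetical shape)
    pod_resources: list of {"namespace", "name", "containers": [{"name", "device_ids"}]}
                   (hypothetical stand-in for a Pod Resources API response)
    id_type:       "uid" matches on the GPU UUID, "device-name" on e.g. "nvidia0"
    """
    if id_type not in ("uid", "device-name"):
        raise ValueError(f"unsupported id_type: {id_type!r}")
    key = "uuid" if id_type == "uid" else "device"

    # Index every device ID reported by the kubelet by its owning container.
    owners = {}
    for pod in pod_resources:
        for container in pod["containers"]:
            for device_id in container["device_ids"]:
                owners[device_id] = (pod["namespace"], pod["name"], container["name"])

    # A GPU whose identifier is not found keeps empty labels, as in the RKE output.
    labelled = {}
    for gpu in gpus:
        namespace, pod_name, container_name = owners.get(gpu[key], ("", "", ""))
        labelled[gpu["device"]] = {
            "namespace": namespace,
            "pod": pod_name,
            "container": container_name,
        }
    return labelled
```

With a device ID of `nvidia0` in `pod_resources`, the default `id_type="uid"` finds no match and leaves the labels empty, while `id_type="device-name"` fills them in, which is consistent with the behaviour described above.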
