Is the Pod Resources API disabled on Rancher Kubernetes Engine?

Problem Summary:

We’re using DCGM Exporter to collect metrics about GPU workloads. When deployed on RKE (and also on GKE), the exporter does not attribute GPU metrics to the pods and containers that are using the GPUs, even though it is expected to.

The exporter runs a replica on every node (as a DaemonSet) and queries the Pod Resources API exposed by the kubelet to map GPUs to the pods and containers they are allocated to. It seems that on RKE (and also on GKE) this API is either disabled or configured differently than on other Kubernetes distributions.
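For reference, the lookup the exporter does boils down to a single gRPC List() call against the kubelet's pod-resources socket. Below is a minimal sketch of that call (not dcgm-exporter's actual code); it assumes the v1 API, which only exists on Kubernetes 1.20+ (older kubelets expose v1alpha1), and the default socket path /var/lib/kubelet/pod-resources/kubelet.sock, which a distribution could place elsewhere.

package main

import (
    "context"
    "fmt"
    "log"
    "time"

    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"
    podresourcesapi "k8s.io/kubelet/pkg/apis/podresources/v1"
)

// Default kubelet pod-resources socket; adjust if your distribution moves it.
const socket = "unix:///var/lib/kubelet/pod-resources/kubelet.sock"

func main() {
    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()

    conn, err := grpc.DialContext(ctx, socket,
        grpc.WithTransportCredentials(insecure.NewCredentials()))
    if err != nil {
        log.Fatalf("dial kubelet socket: %v", err)
    }
    defer conn.Close()

    client := podresourcesapi.NewPodResourcesListerClient(conn)
    resp, err := client.List(ctx, &podresourcesapi.ListPodResourcesRequest{})
    if err != nil {
        log.Fatalf("List: %v", err)
    }

    // Print the device IDs allocated to each container; this is the data the
    // exporter joins against its GPU metrics to fill the pod/container labels.
    for _, pod := range resp.GetPodResources() {
        for _, c := range pod.GetContainers() {
            for _, d := range c.GetDevices() {
                fmt.Printf("%s/%s container=%s resource=%s devices=%v\n",
                    pod.GetNamespace(), pod.GetName(), c.GetName(),
                    d.GetResourceName(), d.GetDeviceIds())
            }
        }
    }
}

Running something like this directly on a node is a quick way to see exactly which device IDs the kubelet reports for each container.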

Problem Demonstration:

Our test scenario is a one-node cluster running dcgm-exporter alongside a one-replica deployment (called cuda-test in this demo) that requests GPU resources.
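For concreteness, the test workload looks roughly like the client-go sketch below: a one-replica Deployment whose container requests a single nvidia.com/gpu. The image and command are placeholders, not the exact manifest we used.

package main

import (
    "context"
    "log"

    appsv1 "k8s.io/api/apps/v1"
    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/api/resource"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func int32Ptr(i int32) *int32 { return &i }

func main() {
    cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        log.Fatal(err)
    }
    clientset, err := kubernetes.NewForConfig(cfg)
    if err != nil {
        log.Fatal(err)
    }

    deploy := &appsv1.Deployment{
        ObjectMeta: metav1.ObjectMeta{Name: "cuda-test", Namespace: "default"},
        Spec: appsv1.DeploymentSpec{
            Replicas: int32Ptr(1),
            Selector: &metav1.LabelSelector{MatchLabels: map[string]string{"app": "cuda-test"}},
            Template: corev1.PodTemplateSpec{
                ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{"app": "cuda-test"}},
                Spec: corev1.PodSpec{
                    Containers: []corev1.Container{{
                        Name:    "cuda-test-main",
                        Image:   "nvidia/cuda:11.0-base",       // placeholder image
                        Command: []string{"sleep", "infinity"}, // placeholder workload
                        Resources: corev1.ResourceRequirements{
                            Limits: corev1.ResourceList{
                                // The GPU limit is what makes the kubelet report this
                                // container (and its device IDs) via the Pod Resources API.
                                "nvidia.com/gpu": resource.MustParse("1"),
                            },
                        },
                    }},
                },
            },
        },
    }

    _, err = clientset.AppsV1().Deployments("default").Create(context.TODO(), deploy, metav1.CreateOptions{})
    if err != nil {
        log.Fatal(err)
    }
    log.Println("created deployment cuda-test")
}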

We query the exporter through its /metrics endpoint (a minimal sketch of that check is included just below), and the results on the two distributions are as follows.
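For reproducibility: the check itself is just an HTTP GET against the exporter. The small sketch below assumes the endpoint has been made reachable on localhost:9400 (dcgm-exporter's default port), for example via kubectl port-forward, and prints only the dcgm_* lines.

package main

import (
    "bufio"
    "fmt"
    "log"
    "net/http"
    "strings"
)

func main() {
    // Assumes the exporter is reachable here, e.g. after a port-forward.
    resp, err := http.Get("http://localhost:9400/metrics")
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    sc := bufio.NewScanner(resp.Body)
    sc.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // metric lines can be long
    for sc.Scan() {
        if strings.HasPrefix(sc.Text(), "dcgm_") {
            fmt.Println(sc.Text())
        }
    }
    if err := sc.Err(); err != nil {
        log.Fatal(err)
    }
}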

When running on Rancher k3s v1.20.4+k3s1, the container, namespace, and pod labels are populated:

dcgm_sm_clock{gpu="0",UUID="GPU-a2bf9768-0411-f0bb-791c-67d5fec65e2f",device="nvidia0",Hostname="dcgm-exporter-xzpt9",container="cuda-test-main",namespace="default",pod="cuda-test-687bddf45c-qjl6x"} 1860
dcgm_memory_clock{gpu="0",UUID="GPU-a2bf9768-0411-f0bb-791c-67d5fec65e2f",device="nvidia0",Hostname="dcgm-exporter-xzpt9",container="cuda-test-main",namespace="default",pod="cuda-test-687bddf45c-qjl6x"} 9501
dcgm_gpu_temp{gpu="0",UUID="GPU-a2bf9768-0411-f0bb-791c-67d5fec65e2f",device="nvidia0",Hostname="dcgm-exporter-xzpt9",container="cuda-test-main",namespace="default",pod="cuda-test-687bddf45c-qjl6x"} 41

But when running on RKE v1.19.6-rancher1-1, those same labels are empty:

dcgm_sm_clock{gpu="0",UUID="GPU-31275fe8-8f4e-e2d2-b057-43e5d88b6965",device="nvidia0",Hostname="dcgm-exporter-tzmgg",container="",namespace="",pod=""} 1410
dcgm_memory_clock{gpu="0",UUID="GPU-31275fe8-8f4e-e2d2-b057-43e5d88b6965",device="nvidia0",Hostname="dcgm-exporter-tzmgg",container="",namespace="",pod=""} 1215
dcgm_gpu_temp{gpu="0",UUID="GPU-31275fe8-8f4e-e2d2-b057-43e5d88b6965",device="nvidia0",Hostname="dcgm-exporter-tzmgg",container="",namespace="",pod=""} 54

I haven’t been able to find any information on whether RKE (or GKE) disables this API (which was introduced in Kubernetes 1.13) or restricts which values it exposes. I’d like to understand what is going on and find a way for the exporter to collect this information.

Solution:

It seems that adding the argument --kubernetes-gpu-id-type device-name to dcgm-exporter resolves the issue.

Apparently, on GKE the Pod Resources API does not report the GPU’s uid (which is the default value for that argument), so the exporter could not match the reported devices to its metrics, and that’s why the pod and container information was being skipped.
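To make the failure mode concrete, here is a simplified illustration (not dcgm-exporter's actual code) of what that id-type switch changes: metrics are joined to the kubelet's per-container device IDs by a key, and that key is either the GPU UUID or the device name. If the kubelet reports something other than the UUID, the default uid join matches nothing and the labels come out empty.

package main

import "fmt"

type gpuMetric struct {
    UUID, DeviceName string // e.g. "GPU-31275fe8-...", "nvidia0"
}

type podInfo struct {
    Namespace, Pod, Container string
}

// join attaches pod info to each metric, keyed by UUID or device name
// depending on the configured id type.
func join(metrics []gpuMetric, allocated map[string]podInfo, idType string) map[gpuMetric]podInfo {
    out := map[gpuMetric]podInfo{}
    for _, m := range metrics {
        key := m.UUID // default: "uid"
        if idType == "device-name" {
            key = m.DeviceName
        }
        if info, ok := allocated[key]; ok {
            out[m] = info
        }
    }
    return out
}

func main() {
    metrics := []gpuMetric{{UUID: "GPU-31275fe8", DeviceName: "nvidia0"}}
    // Pretend the kubelet reported the device as "nvidia0" rather than its
    // UUID, which appears to be what happens on the affected clusters.
    allocated := map[string]podInfo{
        "nvidia0": {"default", "cuda-test-687bddf45c-qjl6x", "cuda-test-main"},
    }

    fmt.Println("uid:        ", join(metrics, allocated, "uid"))         // no match -> empty labels
    fmt.Println("device-name:", join(metrics, allocated, "device-name")) // matches
}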