Problem Summary:
We’re using DCGM Exporter to collect metrics about GPU workloads. When deployed on RKE (and also on GKE), the exporter does not return information about the pods or containers that are using the GPUs, even though it is expected to.
The exporter runs one replica on every node and queries the Pod Resources API exposed by the kubelet to get the data it needs. It seems that on RKE (and also GKE), this API is disabled or configured differently than on other Kubernetes distributions.
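One way to narrow this down is to check, on the node or from inside the exporter pod, whether the kubelet's Pod Resources socket is visible at all. The sketch below assumes the upstream default location /var/lib/kubelet/pod-resources/kubelet.sock; whether RKE or GKE uses or mounts that same path is exactly what is unknown here, and the helper name socket_status is only illustrative.

```python
import os
import stat

# Assumed upstream kubelet default; not confirmed for RKE or GKE.
DEFAULT_SOCKET = "/var/lib/kubelet/pod-resources/kubelet.sock"


def socket_status(path=DEFAULT_SOCKET):
    """Classify a path as 'missing', 'unreadable', 'not-a-socket' or 'socket'."""
    try:
        mode = os.stat(path).st_mode
    except (FileNotFoundError, NotADirectoryError):
        # Nothing exists at this path (or a parent directory is absent).
        return "missing"
    except PermissionError:
        # The path may exist, but this process is not allowed to inspect it.
        return "unreadable"
    return "socket" if stat.S_ISSOCK(mode) else "not-a-socket"


if __name__ == "__main__":
    print(DEFAULT_SOCKET, "->", socket_status())
```

A result of "missing" inside the exporter container would point to the socket directory not being mounted into the pod, or not being exposed by the kubelet on that distribution. A result of "socket" only shows that the file exists, not that the API answers requests.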
Problem Demonstration:
Our test scenario consists of deploying a one-node cluster with dcgm-exporter running on it along with a one-replica deployment (called cuda-test in this demo) that uses GPU resources.
We query the exporter through its /metrics endpoint, and the results are as follows.
When running on Rancher k3s v1.20.4+k3s1, the container, namespace and pod labels are populated:
dcgm_sm_clock{gpu="0",UUID="GPU-a2bf9768-0411-f0bb-791c-67d5fec65e2f",device="nvidia0",Hostname="dcgm-exporter-xzpt9",container="cuda-test-main",namespace="default",pod="cuda-test-687bddf45c-qjl6x"} 1860
dcgm_memory_clock{gpu="0",UUID="GPU-a2bf9768-0411-f0bb-791c-67d5fec65e2f",device="nvidia0",Hostname="dcgm-exporter-xzpt9",container="cuda-test-main",namespace="default",pod="cuda-test-687bddf45c-qjl6x"} 9501
dcgm_gpu_temp{gpu="0",UUID="GPU-a2bf9768-0411-f0bb-791c-67d5fec65e2f",device="nvidia0",Hostname="dcgm-exporter-xzpt9",container="cuda-test-main",namespace="default",pod="cuda-test-687bddf45c-qjl6x"} 41
But when running on RKE v1.19.6-rancher1-1, the container and pod labels both have empty values:
dcgm_sm_clock{gpu="0",UUID="GPU-31275fe8-8f4e-e2d2-b057-43e5d88b6965",device="nvidia0",Hostname="dcgm-exporter-tzmgg",container="",namespace="",pod=""} 1410
dcgm_memory_clock{gpu="0",UUID="GPU-31275fe8-8f4e-e2d2-b057-43e5d88b6965",device="nvidia0",Hostname="dcgm-exporter-tzmgg",container="",namespace="",pod=""} 1215
dcgm_gpu_temp{gpu="0",UUID="GPU-31275fe8-8f4e-e2d2-b057-43e5d88b6965",device="nvidia0",Hostname="dcgm-exporter-tzmgg",container="",namespace="",pod=""} 54
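To compare clusters without reading the output by eye, the same check can be scripted. This is a minimal sketch that parses sample lines like the ones above and lists the metrics whose pod or container label is empty. The function names are hypothetical, and the regular expressions cover only the simple name{labels} value form shown here, not the full Prometheus text format.

```python
import re

# One sample line of the form: name{label="value",...} number
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$'
)
# One label pair; the value may be empty and may contain escaped characters.
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Return (name, labels, value) for a sample line, or None if it does not match."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        return None
    labels = dict(LABEL_RE.findall(match.group("labels")))
    return match.group("name"), labels, float(match.group("value"))


def missing_pod_info(text):
    """List metric names whose pod or container label is empty or absent."""
    missing = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        parsed = parse_sample(line)
        if parsed is None:
            continue  # skip lines outside the simple form handled here
        name, labels, _value = parsed
        if not labels.get("pod") or not labels.get("container"):
            missing.append(name)
    return missing
```

Passing the body returned by the exporter's /metrics endpoint to missing_pod_info gives an empty list for the k3s output above and all three metric names for the RKE output.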
I haven’t been able to find any information about whether RKE (or GKE) disables this API (which was introduced in k8s 1.13) or restricts certain values from being exposed. I’d like to learn more about the matter and find a solution so that the exporter can access and collect this information.