Hi,
I’m trying out automatic NVIDIA container runtime detection on my personal k3s cluster (running a desktop Ubuntu variant), as described in Advanced Options / Configuration | K3s. I followed the instructions, but I observed that I can’t see any GPU resources when I describe the node (see below).
The install seems to have succeeded, and I can exercise CUDA from Docker containers.
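For reference, the Docker-side sanity check I mean is along these lines (the image tag below is just an example, not necessarily the exact one I ran; it assumes the NVIDIA Container Toolkit is installed on the host):

```shell
# Verify the NVIDIA runtime works outside Kubernetes entirely.
# If this prints the nvidia-smi table, the driver and container
# toolkit are fine and the problem is on the k8s side.
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```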
pop-os:~/workspace/k3s-nvidia$ sudo grep nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
BinaryName = "/usr/bin/nvidia-container-runtime"
vinayb@pop-os:~/workspace/k3s-nvidia$
vinayb@pop-os:~/workspace/k3s-nvidia$ kubectl describe runtimeclass nvidia
Name: nvidia
Namespace:
Labels: <none>
Annotations: <none>
API Version: node.k8s.io/v1
Handler: nvidia
Kind: RuntimeClass
Metadata:
Creation Timestamp: 2023-09-24T05:51:42Z
Resource Version: 675
UID: bc0f32eb-8517-42af-9019-c2ee780b1feb
Events: <none>
However, when I attempt to launch the demo pod, I get “Insufficient nvidia.com/gpu”. Is this because I’m running desktop Ubuntu with a UI, or is something else wrong? Any solutions / workarounds? I think I read that I can’t request a fractional GPU.
vinayb@pop-os:~/workspace/k3s-nvidia$ kubectl get po
NAME READY STATUS RESTARTS AGE
...
nbody-gpu-benchmark 0/1 Pending 0 6m18s
vinayb@pop-os:~/workspace/k3s-nvidia$ kubectl describe po nbody-gpu-benchmark
Name: nbody-gpu-benchmark
Namespace: default
Priority: 0
Runtime Class Name: nvidia
Service Account: default
Node: <none>
Labels: <none>
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Containers:
cuda-container:
Image: nvcr.io/nvidia/k8s/cuda-sample:nbody
Port: <none>
Host Port: <none>
Args:
nbody
-gpu
-benchmark
Limits:
nvidia.com/gpu: 1
Requests:
nvidia.com/gpu: 1
Environment:
NVIDIA_VISIBLE_DEVICES: all
NVIDIA_DRIVER_CAPABILITIES: all
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-lnhgl (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
kube-api-access-lnhgl:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 6m32s default-scheduler 0/1 nodes are available: 1 Insufficient nvidia.com/gpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod..
Warning FailedScheduling 64s default-scheduler 0/1 nodes are available: 1 Insufficient nvidia.com/gpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod..
vinayb@pop-os:~/workspace/k3s-nvidia$ kubectl get po nbody-gpu-benchmark -o yaml
apiVersion: v1
kind: Pod
metadata:
annotations:
kubectl.kubernetes.io/last-applied-configuration: |
{"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{},"name":"nbody-gpu-benchmark","namespace":"default"},"spec":{"containers":[{"args":["nbody","-gpu","-benchmark"],"env":[{"name":"NVIDIA_VISIBLE_DEVICES","value":"all"},{"name":"NVIDIA_DRIVER_CAPABILITIES","value":"all"}],"image":"nvcr.io/nvidia/k8s/cuda-sample:nbody","name":"cuda-container","resources":{"limits":{"nvidia.com/gpu":1}}}],"restartPolicy":"OnFailure","runtimeClassName":"nvidia"}}
creationTimestamp: "2023-10-01T15:16:06Z"
name: nbody-gpu-benchmark
namespace: default
resourceVersion: "6301"
uid: 7f016e25-0bd4-4547-8d5e-a2e9c1518302
spec:
containers:
- args:
- nbody
- -gpu
- -benchmark
env:
- name: NVIDIA_VISIBLE_DEVICES
value: all
- name: NVIDIA_DRIVER_CAPABILITIES
value: all
image: nvcr.io/nvidia/k8s/cuda-sample:nbody
imagePullPolicy: IfNotPresent
name: cuda-container
resources:
limits:
nvidia.com/gpu: "1"
requests:
nvidia.com/gpu: "1"
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: kube-api-access-lnhgl
readOnly: true
dnsPolicy: ClusterFirst
enableServiceLinks: true
preemptionPolicy: PreemptLowerPriority
priority: 0
restartPolicy: OnFailure
runtimeClassName: nvidia
schedulerName: default-scheduler
securityContext: {}
serviceAccount: default
serviceAccountName: default
terminationGracePeriodSeconds: 30
tolerations:
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 300
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 300
volumes:
- name: kube-api-access-lnhgl
projected:
defaultMode: 420
sources:
- serviceAccountToken:
expirationSeconds: 3607
path: token
- configMap:
items:
- key: ca.crt
path: ca.crt
name: kube-root-ca.crt
- downwardAPI:
items:
- fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
path: namespace
status:
conditions:
- lastProbeTime: null
lastTransitionTime: "2023-10-01T15:16:06Z"
message: '0/1 nodes are available: 1 Insufficient nvidia.com/gpu. preemption:
0/1 nodes are available: 1 No preemption victims found for incoming pod..'
reason: Unschedulable
status: "False"
type: PodScheduled
phase: Pending
qosClass: BestEffort
vinayb@pop-os:~/workspace/k3s-nvidia$ kubectl describe node
Name: pop-os
Roles: control-plane,master
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=k3s
beta.kubernetes.io/os=linux
kubernetes.io/arch=amd64
kubernetes.io/hostname=pop-os
kubernetes.io/os=linux
node-role.kubernetes.io/control-plane=true
node-role.kubernetes.io/master=true
node.kubernetes.io/instance-type=k3s
Annotations: flannel.alpha.coreos.com/backend-data: {"VNI":1,"VtepMAC":"12:18:79:fc:d4:02"}
flannel.alpha.coreos.com/backend-type: vxlan
flannel.alpha.coreos.com/kube-subnet-manager: true
flannel.alpha.coreos.com/public-ip: 192.168.1.122
k3s.io/hostname: pop-os
k3s.io/internal-ip: 192.168.1.122,2603:8081:1411:3f39:5fb:15a2:5a80:4980
k3s.io/node-args: ["server"]
k3s.io/node-config-hash: FSKYKL5MB6SV7YY24RGBKRQGJIMBTFHBB36N4HSFP4X67IH35ZLQ====
k3s.io/node-env: {"K3S_DATA_DIR":"/var/lib/rancher/k3s/data/3dfc950bd39d2e2b435291ab8c1333aa6051fcaf46325aee898819f3b99d4b21"}
node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Sun, 24 Sep 2023 00:47:40 -0500
Taints: <none>
Unschedulable: false
Lease:
HolderIdentity: pop-os
AcquireTime: <unset>
RenewTime: Sun, 01 Oct 2023 11:52:54 -0500
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Sun, 01 Oct 2023 11:50:22 -0500 Sun, 24 Sep 2023 00:47:40 -0500 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Sun, 01 Oct 2023 11:50:22 -0500 Sun, 24 Sep 2023 00:47:40 -0500 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Sun, 01 Oct 2023 11:50:22 -0500 Sun, 24 Sep 2023 00:47:40 -0500 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Sun, 01 Oct 2023 11:50:22 -0500 Sun, 24 Sep 2023 00:47:40 -0500 KubeletReady kubelet is posting ready status. AppArmor enabled
Addresses:
InternalIP: 192.168.1.122
InternalIP: 2603:8081:1411:3f39:5fb:15a2:5a80:4980
Hostname: pop-os
Capacity:
cpu: 8
ephemeral-storage: 240891760Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 16314104Ki
pods: 110
Allocatable:
cpu: 8
ephemeral-storage: 234339503945
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 16314104Ki
pods: 110
System Info:
Machine ID: 2b48dd704000a021364361a764e62368
System UUID: 0f6528c0-1730-11e1-8f3d-5404a6487a00
Boot ID: 7dc64dd8-d1b9-42ed-8722-29ca3002e943
Kernel Version: 6.2.6-76060206-generic
OS Image: Pop!_OS 22.04 LTS
Operating System: linux
Architecture: amd64
Container Runtime Version: containerd://1.7.6-k3s1.27
Kubelet Version: v1.27.6+k3s1
Kube-Proxy Version: v1.27.6+k3s1
PodCIDR: 10.42.0.0/24
PodCIDRs: 10.42.0.0/24
ProviderID: k3s://pop-os
Non-terminated Pods: (7 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
kube-system local-path-provisioner-957fdf8bc-f4d64 0 (0%) 0 (0%) 0 (0%) 0 (0%) 7d11h
kube-system coredns-77ccd57875-fmrnk 100m (1%) 0 (0%) 70Mi (0%) 170Mi (1%) 7d11h
kube-system svclb-traefik-7c2380ab-h7qcq 0 (0%) 0 (0%) 0 (0%) 0 (0%) 7d11h
kube-system traefik-64f55bb67d-swpvr 0 (0%) 0 (0%) 0 (0%) 0 (0%) 7d11h
kube-system nvidia-device-plugin-daemonset-vct5b 0 (0%) 0 (0%) 0 (0%) 0 (0%) 7d10h
kube-system metrics-server-648b5df564-nxhls 100m (1%) 0 (0%) 70Mi (0%) 0 (0%) 7d11h
default php-apache-5bdbb8dbf8-kgpkk 200m (2%) 500m (6%) 0 (0%) 0 (0%) 47s
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 400m (5%) 500m (6%)
memory 140Mi (0%) 170Mi (1%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
Events: <none>
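In case it helps narrow things down, these are the follow-up checks I’m planning to run. The daemonset name is taken from the pod list above; the jsonpath expression is my assumption of the right way to query the extended resource:

```shell
# Did the node ever advertise the GPU as an allocatable resource?
# (Empty output would match the "Insufficient nvidia.com/gpu" error.)
kubectl get node pop-os -o jsonpath="{.status.allocatable['nvidia\.com/gpu']}"

# Look for registration errors from the NVIDIA device plugin.
kubectl -n kube-system logs daemonset/nvidia-device-plugin-daemonset

# Confirm the driver is visible on the host itself.
nvidia-smi
```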