Hi,
I’m trying out automatic NVIDIA container runtime detection on my personal k3s cluster (running a desktop Ubuntu variant), as described in Advanced Options / Configuration | K3s. I followed the instructions, but I observed that I can’t see any GPU resources when I describe the node (see below).
The install seems to have succeeded, and I can exercise CUDA from Docker containers.
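For reference, the Docker-side sanity check I mean is along these lines (the image tag below is just an example, not necessarily the exact one I ran; it assumes the NVIDIA Container Toolkit is installed on the host):

```shell
# Verify the NVIDIA runtime works outside Kubernetes entirely.
# If this prints the nvidia-smi table, the driver and container
# toolkit are fine and the problem is on the k8s side.
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```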
pop-os:~/workspace/k3s-nvidia$ sudo grep nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
BinaryName = "/usr/bin/nvidia-container-runtime"
vinayb@pop-os:~/workspace/k3s-nvidia$
vinayb@pop-os:~/workspace/k3s-nvidia$ kubectl describe runtimeclass nvidia
Name: nvidia
Namespace:
Labels: <none>
Annotations: <none>
API Version: node.k8s.io/v1
Handler: nvidia
Kind: RuntimeClass
Metadata:
Creation Timestamp: 2023-09-24T05:51:42Z
Resource Version: 675
UID: bc0f32eb-8517-42af-9019-c2ee780b1feb
Events: <none>
However, when I attempt to launch the demo pod, I get “Insufficient nvidia.com/gpu”. Is this because I’m running desktop Ubuntu with a UI, or is something else wrong? Any solutions / workarounds? I think I read that I can’t request a fractional GPU.
vinayb@pop-os:~/workspace/k3s-nvidia$ kubectl get po
NAME READY STATUS RESTARTS AGE
...
nbody-gpu-benchmark 0/1 Pending 0 6m18s
vinayb@pop-os:~/workspace/k3s-nvidia$ kubectl describe po nbody-gpu-benchmark
Name: nbody-gpu-benchmark
Namespace: default
Priority: 0
Runtime Class Name: nvidia
Service Account: default
Node: <none>
Labels: <none>
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Containers:
cuda-container:
Image: nvcr.io/nvidia/k8s/cuda-sample:nbody
Port: <none>
Host Port: <none>
Args:
nbody
-gpu
-benchmark
Limits:
nvidia.com/gpu: 1
Requests:
nvidia.com/gpu: 1
Environment:
NVIDIA_VISIBLE_DEVICES: all
NVIDIA_DRIVER_CAPABILITIES: all
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-lnhgl (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
kube-api-access-lnhgl:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 6m32s default-scheduler 0/1 nodes are available: 1 Insufficient nvidia.com/gpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod..
Warning FailedScheduling 64s default-scheduler 0/1 nodes are available: 1 Insufficient nvidia.com/gpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod..
vinayb@pop-os:~/workspace/k3s-nvidia$ kubectl get po nbody-gpu-benchmark -o yaml
apiVersion: v1
kind: Pod
metadata:
annotations:
kubectl.kubernetes.io/last-applied-configuration: |
{"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{},"name":"nbody-gpu-benchmark","namespace":"default"},"spec":{"containers":[{"args":["nbody","-gpu","-benchmark"],"env":[{"name":"NVIDIA_VISIBLE_DEVICES","value":"all"},{"name":"NVIDIA_DRIVER_CAPABILITIES","value":"all"}],"image":"nvcr.io/nvidia/k8s/cuda-sample:nbody","name":"cuda-container","resources":{"limits":{"nvidia.com/gpu":1}}}],"restartPolicy":"OnFailure","runtimeClassName":"nvidia"}}
creationTimestamp: "2023-10-01T15:16:06Z"
name: nbody-gpu-benchmark
namespace: default
resourceVersion: "6301"
uid: 7f016e25-0bd4-4547-8d5e-a2e9c1518302
spec:
containers:
- args:
- nbody
- -gpu
- -benchmark
env:
- name: NVIDIA_VISIBLE_DEVICES
value: all
- name: NVIDIA_DRIVER_CAPABILITIES
value: all
image: nvcr.io/nvidia/k8s/cuda-sample:nbody
imagePullPolicy: IfNotPresent
name: cuda-container
resources:
limits:
nvidia.com/gpu: "1"
requests:
nvidia.com/gpu: "1"
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: kube-api-access-lnhgl
readOnly: true
dnsPolicy: ClusterFirst
enableServiceLinks: true
preemptionPolicy: PreemptLowerPriority
priority: 0
restartPolicy: OnFailure
runtimeClassName: nvidia
schedulerName: default-scheduler
securityContext: {}
serviceAccount: default
serviceAccountName: default
terminationGracePeriodSeconds: 30
tolerations:
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 300
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 300
volumes:
- name: kube-api-access-lnhgl
projected:
defaultMode: 420
sources:
- serviceAccountToken:
expirationSeconds: 3607
path: token
- configMap:
items:
- key: ca.crt
path: ca.crt
name: kube-root-ca.crt
- downwardAPI:
items:
- fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
path: namespace
status:
conditions:
- lastProbeTime: null
lastTransitionTime: "2023-10-01T15:16:06Z"
message: '0/1 nodes are available: 1 Insufficient nvidia.com/gpu. preemption:
0/1 nodes are available: 1 No preemption victims found for incoming pod..'
reason: Unschedulable
status: "False"
type: PodScheduled
phase: Pending
qosClass: BestEffort
vinayb@pop-os:~/workspace/k3s-nvidia$ kubectl describe node
Name: pop-os
Roles: control-plane,master
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=k3s
beta.kubernetes.io/os=linux
kubernetes.io/arch=amd64
kubernetes.io/hostname=pop-os
kubernetes.io/os=linux
node-role.kubernetes.io/control-plane=true
node-role.kubernetes.io/master=true
node.kubernetes.io/instance-type=k3s
Annotations: flannel.alpha.coreos.com/backend-data: {"VNI":1,"VtepMAC":"12:18:79:fc:d4:02"}
flannel.alpha.coreos.com/backend-type: vxlan
flannel.alpha.coreos.com/kube-subnet-manager: true
flannel.alpha.coreos.com/public-ip: 192.168.1.122
k3s.io/hostname: pop-os
k3s.io/internal-ip: 192.168.1.122,2603:8081:1411:3f39:5fb:15a2:5a80:4980
k3s.io/node-args: ["server"]
k3s.io/node-config-hash: FSKYKL5MB6SV7YY24RGBKRQGJIMBTFHBB36N4HSFP4X67IH35ZLQ====
k3s.io/node-env: {"K3S_DATA_DIR":"/var/lib/rancher/k3s/data/3dfc950bd39d2e2b435291ab8c1333aa6051fcaf46325aee898819f3b99d4b21"}
node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Sun, 24 Sep 2023 00:47:40 -0500
Taints: <none>
Unschedulable: false
Lease:
HolderIdentity: pop-os
AcquireTime: <unset>
RenewTime: Sun, 01 Oct 2023 11:52:54 -0500
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Sun, 01 Oct 2023 11:50:22 -0500 Sun, 24 Sep 2023 00:47:40 -0500 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Sun, 01 Oct 2023 11:50:22 -0500 Sun, 24 Sep 2023 00:47:40 -0500 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Sun, 01 Oct 2023 11:50:22 -0500 Sun, 24 Sep 2023 00:47:40 -0500 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Sun, 01 Oct 2023 11:50:22 -0500 Sun, 24 Sep 2023 00:47:40 -0500 KubeletReady kubelet is posting ready status. AppArmor enabled
Addresses:
InternalIP: 192.168.1.122
InternalIP: 2603:8081:1411:3f39:5fb:15a2:5a80:4980
Hostname: pop-os
Capacity:
cpu: 8
ephemeral-storage: 240891760Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 16314104Ki
pods: 110
Allocatable:
cpu: 8
ephemeral-storage: 234339503945
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 16314104Ki
pods: 110
System Info:
Machine ID: 2b48dd704000a021364361a764e62368
System UUID: 0f6528c0-1730-11e1-8f3d-5404a6487a00
Boot ID: 7dc64dd8-d1b9-42ed-8722-29ca3002e943
Kernel Version: 6.2.6-76060206-generic
OS Image: Pop!_OS 22.04 LTS
Operating System: linux
Architecture: amd64
Container Runtime Version: containerd://1.7.6-k3s1.27
Kubelet Version: v1.27.6+k3s1
Kube-Proxy Version: v1.27.6+k3s1
PodCIDR: 10.42.0.0/24
PodCIDRs: 10.42.0.0/24
ProviderID: k3s://pop-os
Non-terminated Pods: (7 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
kube-system local-path-provisioner-957fdf8bc-f4d64 0 (0%) 0 (0%) 0 (0%) 0 (0%) 7d11h
kube-system coredns-77ccd57875-fmrnk 100m (1%) 0 (0%) 70Mi (0%) 170Mi (1%) 7d11h
kube-system svclb-traefik-7c2380ab-h7qcq 0 (0%) 0 (0%) 0 (0%) 0 (0%) 7d11h
kube-system traefik-64f55bb67d-swpvr 0 (0%) 0 (0%) 0 (0%) 0 (0%) 7d11h
kube-system nvidia-device-plugin-daemonset-vct5b 0 (0%) 0 (0%) 0 (0%) 0 (0%) 7d10h
kube-system metrics-server-648b5df564-nxhls 100m (1%) 0 (0%) 70Mi (0%) 0 (0%) 7d11h
default php-apache-5bdbb8dbf8-kgpkk 200m (2%) 500m (6%) 0 (0%) 0 (0%) 47s
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 400m (5%) 500m (6%)
memory 140Mi (0%) 170Mi (1%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
Events: <none>
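In case it helps narrow things down, these are the follow-up checks I’m planning to run. The daemonset name is taken from the pod list above; the jsonpath expression is my assumption of the right way to query the extended resource:

```shell
# Did the node ever advertise the GPU as an allocatable resource?
# (Empty output would match the "Insufficient nvidia.com/gpu" error.)
kubectl get node pop-os -o jsonpath="{.status.allocatable['nvidia\.com/gpu']}"

# Look for registration errors from the NVIDIA device plugin.
kubectl -n kube-system logs daemonset/nvidia-device-plugin-daemonset

# Confirm the driver is visible on the host itself.
nvidia-smi
```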