GPU passthrough in Harvester

Does anyone have experience passing a GPU through to a VM in Harvester?

I know that Harvester supports vGPU, but only a few graphics card models support vGPU.
I have an nVidia Quadro P5000 (Pascal) with 16GB of VRAM, and it doesn’t support vGPU.
I know there are hacks and workarounds. I tried a few things, but was unable to figure out everything.

Here is my experience passing through the nVidia Quadro P5000:

Section 1 (not sure if needed):

  • SSH to the Harvester node with the GPU
sudo su

# make sure the card is present on the host:
lspci | grep -i nvidia
0000:d5:00.0 VGA compatible controller: NVIDIA Corporation GP104GL [Quadro P5000] (rev a1)
0000:d5:00.1 Audio device: NVIDIA Corporation GP104 High Definition Audio Controller (rev a1)

# prevent host from using GPU card:
# https://forums.unraid.net/topic/99478-solved-gpu-passthrough-issue-bar-0-cant-reserve/?do=findComment&comment=923145
echo 0 > /sys/class/vtconsole/vtcon0/bind
echo 0 > /sys/class/vtconsole/vtcon1/bind
echo efi-framebuffer.0 > /sys/bus/platform/drivers/efi-framebuffer/unbind
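
To sanity-check that the host really let go of the card, something like this should work (the PCI address is from my host, adjust to yours):

# show which kernel driver, if any, is bound to the GPU;
# once passthrough is enabled it should say vfio-pci
lspci -k -s d5:00.0

# watch the kernel log for vfio / "can't reserve" BAR errors while the VM starts
dmesg -Tw | grep -iE 'vfio|BAR'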

Section 2
In the Harvester web UI:

  • Advanced → Addons → pcidevices-controller → Enable
  • wait until all PCI devices are listed
  • enable passthrough for the NVIDIA devices
  • create a VM. I used an Ubuntu 22 cloud image (server, no GUI) with 4 cores and 12GB of RAM
  • add PCI devices. I only added “VGA compatible controller: NVIDIA Corporation GP104GL [Quadro P5000]”. I didn’t add “Audio device: NVIDIA Corporation GP104 High Definition Audio Controller”
  • start the VM
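
For reference, the same should be visible from kubectl once the addon is enabled (a sketch; I’m assuming the CRD names used by the pcidevices-controller, check kubectl api-resources if your Harvester version differs):

# list the PCI devices the controller discovered and find the nvidia ones
kubectl get pcidevices | grep -i nvidia

# the “enable passthrough” toggle in the UI corresponds to a PCIDeviceClaim object
kubectl get pcideviceclaims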

Section 3
SSH to VM and run:

# confirm the GPU is present in VM:
lspci | grep -i nvidia
0a:00.0 VGA compatible controller: NVIDIA Corporation GP104GL [Quadro P5000] (rev a1)

Download the nVidia drivers by following the instructions here:
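
If you’d rather use Ubuntu’s packaged driver than the .run installer, something like this should also work (let ubuntu-drivers pick the version, the one in the comment is only an example):

# install the recommended driver for the detected GPU, then reboot
sudo apt update
sudo apt install -y ubuntu-drivers-common
sudo ubuntu-drivers install
# or pin a specific packaged version, for example:
# sudo apt install nvidia-driver-535-server
sudo reboot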

# see the status:
nvidia-smi

I also installed the CUDA toolkit from here:
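
To confirm the toolkit itself is usable (assuming the default install location under /usr/local/cuda):

# nvcc lives under the CUDA install dir; add it to PATH if needed
export PATH=/usr/local/cuda/bin:$PATH
nvcc --version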

Currently running OpenAI Whisper. Works great.
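
In case it helps, the rough setup I’d expect for Whisper on the GPU looks like this (audio.mp3 and the model size are just placeholders):

# install whisper plus ffmpeg, then transcribe using CUDA
pip install -U openai-whisper
sudo apt install -y ffmpeg
whisper audio.mp3 --model medium --device cuda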

Next I want to install Docker on that Ubuntu host and run Frigate and Immich.
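
For GPU containers the usual route is the NVIDIA Container Toolkit, roughly like this (the repository setup step is omitted, see NVIDIA’s install docs; the CUDA image tag is just an example):

# install the toolkit, wire it into Docker and restart the daemon
sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# quick test: the container should see the P5000
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi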

I would be curious to hear your experience with supported vGPU on Harvester. I’m currently considering getting a system with an nVidia RTX 5000.

Or maybe someone has figured out how to run vGPU even on unsupported cards?

Another question: since my P5000 doesn’t support vGPU, should I try installing Ubuntu bare-metal on that server, install the regular nVidia drivers, and join the Harvester cluster as a worker node with GPU support? Or will that not work?

After the node was rebooted, the nVidia PCI pass-through stopped working.

That’s because the settings from Section 1 were gone.

In the kernel log I saw this:

dmesg -T
[Wed Jun 12 02:42:37 2024] vfio-pci 0000:b3:00.0: vfio_ecap_init: hiding ecap 0x19@0x900
[Wed Jun 12 02:42:37 2024] vfio-pci 0000:b3:00.0: BAR 1: can't reserve [mem 0xe0000000-0xefffffff 64bit pref]
[Wed Jun 12 02:43:45 2024] vfio-pci 0000:b3:00.0: BAR 1: can't reserve [mem 0xe0000000-0xefffffff 64bit pref]

So, it looks like Section 1 needs to be executed after every reboot. I need to figure out how to add these steps so they’re executed automatically.
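
To see what is still holding that memory window after a reboot, /proc/iomem should show it (the address comes from the dmesg output above, run as root):

# show which driver claims the BAR region from the error message
grep -i -B1 -A2 'e0000000' /proc/iomem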

@Slavik_Fursov have you figured out how to do this? Your comments help me a lot.

I mean the persistence part specifically. Also, I am curious if there’s a way to export some GPU metrics to Prometheus. Have you explored any of that?

One reason why I have not done persistence is to avoid locking myself out.

Think about it:

  • you enable persistence
  • the host starts having some issues / you need to fix things via a local terminal
  • you attach a monitor, keyboard … and you can’t see anything, because the GPU is disconnected from the host OS. Rebooting will not help, because persistence is enabled.

Anyway, I think to enable persistence you need to apply a cloud-init config like this (modify the hostname):

For nVidia Quadro P5000:

# kubectl apply -f VM-P5000/release-p5000.yaml

apiVersion: node.harvesterhci.io/v1beta1
kind: CloudInit
metadata:
  name: release-gpu
spec:
  matchSelector:
    kubernetes.io/hostname: "t7920"
  filename: 99_p5000.yaml
  contents: |
    stages:
      network:
      - name: "disconnect GPU from host OS"
        commands:
          - echo "disconnecting GPU from host OS" > /dev/kmsg
          - echo 0 > /sys/class/vtconsole/vtcon0/bind
          - echo 0 > /sys/class/vtconsole/vtcon1/bind
          - echo efi-framebuffer.0 > /sys/bus/platform/drivers/efi-framebuffer/unbind

For nVidia RTX 3090:

apiVersion: node.harvesterhci.io/v1beta1
kind: CloudInit
metadata:
  name: release-rtx3090
spec:
  matchSelector:
    kubernetes.io/hostname: "t5820"
  filename: 99_rtx3090.yaml
  contents: |
    stages:
      network:
      - name: "disconnect GPU from host OS"
        commands:
          - echo "disconnecting GPU from host OS" > /dev/kmsg
          - echo 0 > /sys/class/vtconsole/vtcon0/bind

I have not tested that, so let me know here how it works…
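
If someone does try it, it should be possible to check whether the config landed on the node, something like this (I’m assuming the CloudInit CRD writes the file under /oem; adjust names if your version differs):

# the CloudInit resources are cluster-wide objects
kubectl get cloudinits.node.harvesterhci.io

# on the matching node the generated file should show up under /oem
ls -l /oem/99_p5000.yaml

# and after a reboot the commands should leave a trace in the kernel log
dmesg -T | grep "disconnecting GPU"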