Setting Up NVIDIA vGPU Devices

To enhance the AI capabilities of your virtual machines (VMs) on a RHEL host, you can create multiple vGPU profiles from a physical GPU and assign these devices to multiple guests. This is supported on select NVIDIA GPUs with RHEL KVM Virtualization, and only one mediated device can be assigned to a single guest.

Note

NVIDIA AI Enterprise supports a maximum of 16 vGPUs per VM on Red Hat Enterprise Linux with KVM.

Note

Red Hat Enterprise Linux guest OS support is limited to running containers by using Docker without Kubernetes. NVIDIA AI Enterprise features that depend on Kubernetes, for example, the use of GPU Operator, are not supported on Red Hat Enterprise Linux.

NVIDIA vGPU technology makes it possible to divide a physical NVIDIA GPU device into multiple virtual devices. These mediated devices can then be assigned to multiple VMs as virtual GPUs. As a result, these VMs can share the performance of a single physical GPU.

Important

Assigning a physical GPU to VMs, with or without using vGPU devices, makes it impossible for the host to use the GPU.

To set up the NVIDIA vGPU feature, download the NVIDIA vGPU drivers for your GPU device, create mediated devices, and assign them to the intended virtual machines. For detailed instructions, see below.

Note

For a list of GPUs that support NVIDIA vGPU with RHEL KVM, refer here.

If you do not know which GPU your host is using, install the lshw package and run the lshw -C display command. The following example shows that the system is using an NVIDIA A100 GPU, which is compatible with vGPU.


# lshw -C display

rhel-vgpu-01.png

Now that we have verified the presence of an NVIDIA GPU on the host, we will install the NVIDIA AI Enterprise Guest driver within the VM to fully enable GPU operation.

NVIDIA Driver

The NVIDIA driver is the software driver that is installed on the OS and is responsible for communicating with the NVIDIA GPU.

NVIDIA AI Enterprise drivers are available from the NVIDIA Enterprise Licensing Portal, from the NVIDIA Download Drivers web page, or from the NGC Catalog. Review the NVIDIA AI Enterprise Quick Start Guide for details regarding licensing entitlement certificates.

Installing the NVIDIA Driver using CLS

This section covers the steps required to properly install, configure, and license the NVIDIA driver for CLS users.

Now that you have installed Linux, the NVIDIA AI Enterprise Driver will fully enable GPU operation. Before proceeding with the NVIDIA Driver installation, confirm that Nouveau is disabled. Instructions to confirm this are located in the RHEL section.

Downloading the NVIDIA AI Enterprise Software Driver Using NGC

Important

Before you begin, you will need to generate an API key or use an existing one.

  1. From a browser, go to https://ngc.nvidia.com/signin/email and then enter your email and password.

  2. In the top right corner, click your user account icon and select Setup.

  3. Click Get API Key to open the Setup > API Key page.

    Note

    The API Key is the mechanism used to authenticate your access to the NGC container registry.


  4. Click Generate API Key to generate your API key.

    Note

    A warning message appears to let you know that your old API key will become invalid if you create a new key.


  5. Click Confirm to generate the key.

  6. Your API key appears.

    Important

    You only need to generate an API Key once. NGC does not save your key, so store it in a secure place. (You can copy your API Key to the clipboard by clicking the copy icon to the right of the API key.) Should you lose your API Key, you can generate a new one from the NGC website. When you generate a new API Key, the old one is invalidated.


    1. Run the following commands to install the NGC CLI for AMD64.

    AMD64 Linux Install: The NGC CLI binary for Linux is supported on Ubuntu 16.04 and later distributions.

    • Download, unzip, and install from the command line by moving to a directory where you have execute permissions and then running the following command:


    wget --content-disposition https://ngc.nvidia.com/downloads/ngccli_linux.zip && unzip ngccli_linux.zip && chmod u+x ngc-cli/ngc

    Note

    The NGC CLI installations for Windows, Arm64 macOS, or Intel macOS can be found here.

    • Check the binary’s MD5 hash to ensure the file wasn’t corrupted during download.


    $ md5sum -c ngc.md5

    • Add your current directory to your PATH.


    $ echo "export PATH=\"\$PATH:$(pwd)\"" >> ~/.bash_profile && source ~/.bash_profile

    • You must configure the NGC CLI for your use so that you can run the commands. Enter the following command, and provide your API key when prompted.

    $ ngc config set
    Enter API key [no-apikey]. Choices: [<VALID_APIKEY>, 'no-apikey']:
    Enter CLI output format type [ascii]. Choices: [ascii, csv, json]: ascii
    Enter org [no-org]. Choices: ['no-org']:
    Enter team [no-team]. Choices: ['no-team']:
    Enter ace [no-ace]. Choices: ['no-ace']:
    Successfully saved NGC configuration to /home/$username/.ngc/config

    • Download the NVIDIA AI Enterprise Software Driver.

Installing the NVIDIA Driver using the .run file with RHEL

Important

Before starting the driver installation, Secure Boot will need to be disabled, as shown in the Installing Red Hat Enterprise Linux 8.4 section.

  1. Register the machine with RHEL by using the subscription-manager command below.


    $ subscription-manager register


  2. Install the EPEL repository to satisfy the external dependency for DKMS.


    $ dnf install https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm


  3. For RHEL 8, ensure that the system has the correct Linux kernel sources from the Red Hat repositories.


    $ dnf install -y kernel-devel-$(uname -r) kernel-headers-$(uname -r)

    Note

    The NVIDIA driver requires that the kernel headers and development packages for the running version of the kernel be installed at the time of the driver installation, as well as whenever the driver is rebuilt. For example, if your system is running kernel version 4.4.0, the 4.4.0 kernel headers and development packages must also be installed.


  4. Install additional dependencies for NVIDIA drivers.

    $ dnf install elfutils-libelf-devel.x86_64
    $ dnf install -y tar bzip2 make automake gcc gcc-c++ pciutils libglvnd-devel


  5. Update the running kernel:


    $ dnf install -y kernel kernel-core kernel-modules


  6. Confirm the system has the correct Linux kernel sources from the Red Hat repositories after update.


    $ dnf install -y kernel-devel-$(uname -r) kernel-headers-$(uname -r)


  7. Download the NVIDIA AI Enterprise Software Driver.


    $ ngc registry resource download-version "nvaie/vgpu_guest_driver_x_x:xxx.xx.xx"

    Note

    Where x_x:xxx.xx.xx is the current driver version from NGC Enterprise Catalog.


  8. Navigate to the directory containing the NVIDIA Driver .run file. Then, add the executable permission to the NVIDIA Driver file by using the chmod command.


    $ sudo chmod +x NVIDIA-Linux-x86_64-xxx.xx.xx-grid.run

    Note

    Where xxx.xx.xx is the current driver version from NGC Enterprise Catalog.


  9. From the console shell, run the driver installer and accept defaults.


    $ sudo sh ./NVIDIA-Linux-x86_64-xxx.xx.xx-grid.run

    Note

    Where xxx.xx.xx is the current driver version from NGC Enterprise Catalog.

    Note

    Accept any warnings and ignore the CC version check.


  10. Reboot the system.


    $ sudo reboot


  11. After the system has rebooted, confirm that you can see your NVIDIA vGPU device in the output from nvidia-smi.


    $ nvidia-smi


After installing the NVIDIA vGPU compute driver, you can license any NVIDIA AI Enterprise Software licensed products you are using.

Once completed, check that the kernel has loaded the nvidia_vgpu_vfio module and that the nvidia-vgpu-mgr.service service is running.


# lsmod | grep nvidia_vgpu_vfio

nvidia_vgpu_vfio 45011 0
nvidia 14333621 10 nvidia_vgpu_vfio
mdev 20414 2 vfio_mdev,nvidia_vgpu_vfio
vfio 32695 3 vfio_mdev,nvidia_vgpu_vfio,vfio_iommu_type1


# systemctl status nvidia-vgpu-mgr.service

nvidia-vgpu-mgr.service - NVIDIA vGPU Manager Daemon
   Loaded: loaded (/usr/lib/systemd/system/nvidia-vgpu-mgr.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2018-03-16 10:17:36 CET; 5h 8min ago
 Main PID: 1553 (nvidia-vgpu-mgr)
[...]

If creating a vGPU based on an NVIDIA Ampere GPU device, ensure that virtual functions are enabled for the physical GPU. For instructions, please refer to Creating an NVIDIA vGPU that supports SR-IOV on Linux with KVM Hypervisor.

Generate a device UUID.


# uuidgen

Example result.


30820a6f-b1a5-4503-91ca-0c10ba58692a
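Before writing the UUID into the mediated-device configuration, it can help to sanity-check its format. The is_uuid helper below is a hypothetical sketch, not part of any NVIDIA or libvirt tooling:

```shell
# Hypothetical helper: verify a string is a well-formed lowercase UUID
# (8-4-4-4-12 hexadecimal groups) before using it in the mdev XML.
is_uuid() {
  printf '%s\n' "$1" | grep -Eq '^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$'
}

is_uuid "30820a6f-b1a5-4503-91ca-0c10ba58692a" && echo "valid UUID"  # → valid UUID
```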

Prepare an XML file with a configuration of the mediated device, based on the detected GPU hardware. For example, the following configures a mediated device of the nvidia-321 vGPU type on an NVIDIA T4 card that runs on the 0000:65:00.0 PCI bus and uses the UUID generated in the previous step.

<device>
  <parent>pci_0000_65_00_0</parent>
  <capability type="mdev">
    <type id="nvidia-321"/>
    <uuid>30820a6f-b1a5-4503-91ca-0c10ba58692a</uuid>
  </capability>
</device>

rhel-vgpu-02.png
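The XML file can also be generated from a short script, substituting your own parent PCI device, vGPU type, and UUID. The values below are illustrative, not prescribed:

```shell
# Illustrative values; replace with your GPU's PCI address, the vGPU type
# you selected from mdev_supported_types, and the UUID from uuidgen.
PARENT="pci_0000_65_00_0"
VGPU_TYPE="nvidia-321"
UUID="30820a6f-b1a5-4503-91ca-0c10ba58692a"

# Write the mediated-device definition that virsh nodedev-define will consume.
cat > vgpu.xml <<EOF
<device>
  <parent>${PARENT}</parent>
  <capability type="mdev">
    <type id="${VGPU_TYPE}"/>
    <uuid>${UUID}</uuid>
  </capability>
</device>
EOF
```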

To find your vGPU profile name and description, navigate to the mdev_supported_types directory and list the description and name files. The example below uses the NVIDIA T4 with profile name nvidia-321, which corresponds to the T4-16C NVIDIA vGPU profile.


cd /sys/bus/pci/devices/0000:65:00.0/mdev_supported_types/nvidia-321

rhel-vgpu-03.png
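All available profiles for a GPU can be enumerated with a short loop over mdev_supported_types. The helper below is a sketch; it takes the directory as an argument so it can be pointed at any GPU's PCI address:

```shell
# Sketch: print each vGPU type ID with its human-readable name, e.g.
#   list_vgpu_types /sys/bus/pci/devices/0000:65:00.0/mdev_supported_types
# Each type directory in sysfs contains a "name" file with the profile name.
list_vgpu_types() {
  dir=$1
  for t in "$dir"/*; do
    printf '%s: %s\n' "$(basename "$t")" "$(cat "$t/name")"
  done
}
```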

Note

For more information on how to locate the correct profile name for the various NVIDIA vGPU profiles, refer here.

Define a vGPU mediated device based on the XML file you prepared. For example:


# virsh nodedev-define vgpu.xml

Tip

Verify that the mediated device is listed as inactive.


# virsh nodedev-list --cap mdev --inactive


mdev_30820a6f_b1a5_4503_91ca_0c10ba58692a_0000_01_00_0
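As the listed name shows, libvirt derives the node-device name from the device UUID and the parent's PCI address, with dashes, colons, and dots replaced by underscores. A hypothetical helper to reconstruct that name for use in later virsh commands:

```shell
# Hypothetical helper: build the libvirt node-device name for a mediated
# device from its UUID and its parent's PCI address.
mdev_node_name() {
  uuid=$1
  pci=$2
  printf 'mdev_%s_%s\n' \
    "$(printf '%s' "$uuid" | tr '-' '_')" \
    "$(printf '%s' "$pci" | tr ':.' '__')"
}

mdev_node_name "30820a6f-b1a5-4503-91ca-0c10ba58692a" "0000:01:00.0"
# → mdev_30820a6f_b1a5_4503_91ca_0c10ba58692a_0000_01_00_0
```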

Start the vGPU mediated device you created.


# virsh nodedev-start mdev_30820a6f_b1a5_4503_91ca_0c10ba58692a_0000_01_00_0


Device mdev_30820a6f_b1a5_4503_91ca_0c10ba58692a_0000_01_00_0 started

Tip

Ensure that the mediated device is listed as active.


# virsh nodedev-list --cap mdev


mdev_30820a6f_b1a5_4503_91ca_0c10ba58692a_0000_01_00_0

Set the vGPU device to start automatically after the host reboots.


# virsh nodedev-autostart mdev_30820a6f_b1a5_4503_91ca_0c10ba58692a_0000_01_00_0

Device mdev_30820a6f_b1a5_4503_91ca_0c10ba58692a_0000_01_00_0 marked as autostarted

Attach the mediated device to a VM that you want to share the vGPU resources with. To do so, add the following lines, along with the previously generated UUID, to the <devices/> section in the XML configuration of the VM.

First, navigate to the path where the VM's XML configurations are located:


cd /etc/libvirt/qemu/

Then open the VM's XML configuration in an editor such as nano, locate the <devices/> section, and add the following lines.

<hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci' display='off'>
  <source>
    <address uuid='30820a6f-b1a5-4503-91ca-0c10ba58692a'/>
  </source>
</hostdev>

rhel-vgpu-04.png

Important

Each UUID can only be assigned to one VM at a time. In addition, if the VM does not have QEMU video devices, such as virtio-vga, also add the ramfb='on' parameter on the <hostdev> line.
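As an alternative to editing the configuration by hand, the hostdev fragment can be written to a file and attached with virsh attach-device. This is a sketch: the UUID shown is illustrative and the VM name "rhel-guest" is a placeholder for your own guest.

```shell
# Replace UUID with the UUID of the mediated device you created.
UUID="30820a6f-b1a5-4503-91ca-0c10ba58692a"

# Write the hostdev fragment for the vGPU mediated device.
cat > vgpu-hostdev.xml <<EOF
<hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci' display='off'>
  <source>
    <address uuid='${UUID}'/>
  </source>
</hostdev>
EOF

# Attach to a VM (placeholder name "rhel-guest"); --config persists the
# change in the guest's definition:
#   virsh attach-device rhel-guest vgpu-hostdev.xml --config
```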

Now verify the capabilities of the vGPU you created, and ensure that it is listed as active and persistent.


# virsh nodedev-info mdev_30820a6f_b1a5_4503_91ca_0c10ba58692a_0000_01_00_0

Name:         mdev_30820a6f_b1a5_4503_91ca_0c10ba58692a_0000_01_00_0
Parent:       pci_0000_01_00_0
Active:       yes
Persistent:   yes
Autostart:    yes

Start the VM and verify that the guest operating system detects the mediated device as an NVIDIA GPU. For example:


# lspci -d 10de: -k

rhel-vgpu-05.png

Note

For additional information on how to manage your NVIDIA vGPU within the KVM hypervisor, refer to the NVIDIA vGPU software documentation, in addition to the man virsh command for managing guests with virsh.

Tip

Please refer to the NVIDIA vGPU release notes for instructions on how to change the vGPU scheduling policy for time-sliced vGPUs.

To use an NVIDIA vGPU software licensed product, each client system to which a physical or virtual GPU is assigned must be able to obtain a license from the NVIDIA License System. A client system can be a VM that is configured with NVIDIA vGPU, a VM that is configured for GPU pass through, or a physical host to which a physical GPU is assigned in a bare-metal deployment.

© Copyright 2022-2023, NVIDIA. Last updated on Sep 11, 2023.