NVIDIA vGPU

About Installing the Operator and NVIDIA vGPU

This document provides an overview of the workflow for getting started with the GPU Operator and NVIDIA vGPU.

NVIDIA vGPU is only supported with the NVIDIA License System.

The installation steps assume gpu-operator as the default namespace for installing the NVIDIA GPU Operator. On Red Hat OpenShift Container Platform, the default namespace is nvidia-gpu-operator. Change the namespace in the commands that follow to match your cluster configuration, and replace kubectl with oc when running on Red Hat OpenShift.
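
For example, a command to list pods in the Operator namespace differs between the two platforms as follows:

    $ kubectl get pods -n gpu-operator

    $ oc get pods -n nvidia-gpu-operator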

The sections that follow outline the high-level steps for using the NVIDIA GPU Operator with NVIDIA vGPU: download the vGPU software, build a driver container image, configure the cluster with the vGPU license information and the driver container image, and then install the Operator.

Platform Support

For information about the supported platforms, see Supported Deployment Options, Hypervisors, and NVIDIA vGPU Based Products.

For Red Hat OpenShift Virtualization, see NVIDIA GPU Operator with OpenShift Virtualization.

Prerequisites

  • You must have access to the NVIDIA Enterprise Application Hub at https://nvid.nvidia.com/dashboard/ and the NVIDIA Licensing Portal.

  • Your organization must have an instance of a Cloud License Service (CLS) or a Delegated License Service (DLS).

  • You must generate and download a client configuration token for your CLS instance or DLS instance. Refer to the NVIDIA License System Quick Start Guide for information about generating a token.

  • You must have access to a private container registry, such as the NVIDIA NGC Private Registry, and be able to push container images to the registry.
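
The examples in this document use Docker to build and push the driver container image, kubectl (or oc) to configure the cluster, and Helm to install the Operator. As an optional sanity check, assuming Docker is your container build tool, confirm that the clients are available:

    $ docker --version

    $ kubectl version --client

    $ helm version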

Download vGPU Software

Perform the following steps to download the vGPU software and the latest NVIDIA vGPU driver catalog file from the NVIDIA Licensing Portal.

  1. Log in to the NVIDIA Enterprise Application Hub at https://nvid.nvidia.com/dashboard and then click NVIDIA LICENSING PORTAL.

  2. In the left navigation pane of the NVIDIA Licensing Portal, click SOFTWARE DOWNLOADS.

  3. Locate vGPU Driver Catalog in the table of driver downloads and click Download.

  4. Click the PRODUCT FAMILY menu and select vGPU to filter the downloads to vGPU only.

  5. Locate the vGPU software for your platform in the table of software downloads and click Download.

The vGPU software is packaged as a ZIP file. Unzip the file to obtain the NVIDIA vGPU Linux guest driver. The guest driver file name follows the pattern NVIDIA-Linux-x86_64-<version>-grid.run.
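
For example, assuming the downloaded package is saved as <vgpu-software-package>.zip (a placeholder for the actual file name), the extraction might look like the following:

    $ unzip <vgpu-software-package>.zip -d vgpu-software

    $ find vgpu-software -name '*-grid.run'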

Build the Driver Container

Perform the following steps to build and push a container image that includes the vGPU Linux guest driver.

  1. Clone the driver container repository and change directory into the repository:

    $ git clone https://gitlab.com/nvidia/container-images/driver
    
    $ cd driver
    
  2. Change to the directory under the driver repository that matches the guest operating system name and version:

    $ cd ubuntu20.04
    

    For Red Hat OpenShift Container Platform, use a directory that includes rhel in the directory name.

  3. Copy the NVIDIA vGPU guest driver from the extracted ZIP file, together with the NVIDIA vGPU driver catalog file, into the drivers directory:

    $ cp <local-driver-download-directory>/*-grid.run drivers/
    
    $ cp <local-driver-download-directory>/vgpuDriverCatalog.yaml drivers/
    
  4. Set environment variables for building the driver container image.

    • Specify your private registry URL:

      $ export PRIVATE_REGISTRY=<private-registry-url>
      
    • Specify the OS_TAG environment variable to identify the guest operating system name and version:

      $ export OS_TAG=ubuntu20.04
      

      The value must match the guest operating system version. For Red Hat OpenShift Container Platform, specify rhcos4.<x>, where <x> is the supported OpenShift Container Platform minor version. Refer to Supported Operating Systems and Kubernetes Platforms for the list of supported OS distributions.

    • Specify the driver container image tag, such as 1.0.0:

      $ export VERSION=1.0.0
      

      The value can be any user-defined string. It is used when you install the Operator in a subsequent step.

    • Specify the version of the CUDA base image to use when building the driver container:

      $ export CUDA_VERSION=11.8.0
      

      The CUDA version only specifies which base image is used to build the driver container. The version does not have any correlation to the version of CUDA that is associated with or supported by the resulting driver container.

    • Specify the Linux guest vGPU driver version that you downloaded from the NVIDIA Licensing Portal and append -grid:

      $ export VGPU_DRIVER_VERSION=525.60.13-grid
      

      The Operator automatically selects the compatible guest driver version from the drivers bundled with the driver image. If you disable the version check by specifying --build-arg DISABLE_VGPU_VERSION_CHECK=true when you build the driver image, then the VGPU_DRIVER_VERSION value is used as the default.

  5. Build the driver container image:

    $ sudo docker build \
        --build-arg DRIVER_TYPE=vgpu \
        --build-arg DRIVER_VERSION=$VGPU_DRIVER_VERSION \
        --build-arg CUDA_VERSION=$CUDA_VERSION \
        --build-arg TARGETARCH=amd64 \
        -t ${PRIVATE_REGISTRY}/driver:${VERSION}-${OS_TAG} .

    Set TARGETARCH to amd64 or arm64 to match the CPU architecture of your worker nodes.

  6. Push the driver container image to your private registry.

    1. Log in to your private registry:

      $ sudo docker login ${PRIVATE_REGISTRY} --username=<username>
      

      Enter your password when prompted.

    2. Push the driver container image to your private registry:

      $ sudo docker push ${PRIVATE_REGISTRY}/driver:${VERSION}-${OS_TAG}
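
As an optional check, confirm that the image is tagged as expected and can be pulled back from the registry:

    $ sudo docker images ${PRIVATE_REGISTRY}/driver

    $ sudo docker pull ${PRIVATE_REGISTRY}/driver:${VERSION}-${OS_TAG}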
      

Configure the Cluster with the vGPU License Information and the Driver Container Image

  1. Create an NVIDIA vGPU license file named gridd.conf with contents like the following example:

    # Description: Set Feature to be enabled
    # Data type: integer
    # Possible values:
    # 0 => for unlicensed state
    # 1 => for NVIDIA vGPU
    # 2 => for NVIDIA RTX Virtual Workstation
    # 4 => for NVIDIA Virtual Compute Server
    FeatureType=1
    
  2. Rename the client configuration token file that you downloaded to client_configuration_token.tok using a command like the following example:

    $ cp ~/Downloads/client_configuration_token_03-28-2023-16-16-36.tok client_configuration_token.tok
    

    The file must be named client_configuration_token.tok.

  3. Create the gpu-operator namespace:

    $ kubectl create namespace gpu-operator
    
  4. Create a config map that is named licensing-config using the gridd.conf and client_configuration_token.tok files:

    $ kubectl create configmap licensing-config \
        -n gpu-operator --from-file=gridd.conf --from-file=client_configuration_token.tok
    
  5. Create an image pull secret in the gpu-operator namespace so that the cluster can pull the driver container image from your private registry.

    1. Set an environment variable with the name of the secret:

      $ export REGISTRY_SECRET_NAME=registry-secret
      
    2. Create the secret:

      $ kubectl create secret docker-registry ${REGISTRY_SECRET_NAME} \
          --docker-server=${PRIVATE_REGISTRY} --docker-username=<username> \
          --docker-password=<password> \
          --docker-email=<email-id> -n gpu-operator
      

    You need to specify the secret name REGISTRY_SECRET_NAME when you install the GPU Operator with Helm.
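
Before installing the Operator, you can optionally confirm that the licensing config map and the image pull secret exist in the gpu-operator namespace:

    $ kubectl get configmap licensing-config -n gpu-operator

    $ kubectl get secret ${REGISTRY_SECRET_NAME} -n gpu-operator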

Next Steps

Refer to Install NVIDIA GPU Operator to install the GPU Operator on Kubernetes using a Helm chart.
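
The exact Helm command is described on that page. As a rough sketch only, an installation that points the Operator at the driver image and licensing config map created above might look like the following; the chart value names shown here (driver.repository, driver.version, driver.imagePullSecrets, and driver.licensingConfig.configMapName) and the nvidia/gpu-operator chart reference (which assumes the NVIDIA Helm repository has already been added) are assumptions, so confirm them against the installation page:

    $ helm install --wait --generate-name \
        -n gpu-operator --create-namespace \
        nvidia/gpu-operator \
        --set driver.repository=${PRIVATE_REGISTRY} \
        --set driver.version=${VERSION} \
        --set driver.imagePullSecrets[0]=${REGISTRY_SECRET_NAME} \
        --set driver.licensingConfig.configMapName=licensing-config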

Refer to Installing the NVIDIA GPU Operator to install the GPU Operator using NVIDIA vGPU on Red Hat OpenShift Container Platform.