Install the NVIDIA GPU Operator#
The NVIDIA GPU Operator is a key element of the Red Hat AI Factory. It simplifies cluster administration and improves operational efficiency by automating the provisioning and management of NVIDIA GPU resources on the Red Hat OpenShift cluster. It prepares bare metal GPU nodes to be used as accelerated compute resources for AI and machine learning workloads by handling the installation and configuration of essential software: the NVIDIA drivers, the Kubernetes device plugin, the NVIDIA Container Toolkit, and monitoring utilities such as NVIDIA DCGM. Together, these components provide the resource management required for production AI deployments.
Refer to the GPU Operator documentation on installing the GPU Operator on Red Hat OpenShift for installation and configuration guidance.
Install the NVIDIA GPU Operator using the Web Console#
Install the NVIDIA GPU Operator using the Red Hat Software Catalog (Red Hat OperatorHub in versions before 4.20).
Access the Red Hat OpenShift console.
Navigate to Ecosystem -> Software Catalog.
Search for “NVIDIA GPU Operator” (or GPU Operator).
Select the NVIDIA GPU Operator.
Click Install.
Select the desired Installation mode (usually “All namespaces on the cluster (default)” for cluster-wide functionality).
Select the Installed Namespace (usually nvidia-gpu-operator).
Select the desired Update approval strategy (Manual or Automatic).
Click Install.
Wait for the Operator to be installed and its status to change to Succeeded on the Installed Operators page.
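The same installation can also be performed from the command line by creating the Namespace, OperatorGroup, and Subscription objects directly. The following is a minimal sketch, not the definitive procedure: it assumes the certified-operators catalog source provides the gpu-operator-certified package and that a single-namespace OperatorGroup is acceptable for your cluster; adjust the channel and approval strategy to match your requirements.

# Look up the default channel for the certified GPU Operator package
# (assumes the certified-operators catalog source is available).
CHANNEL=$(oc get packagemanifest gpu-operator-certified -n openshift-marketplace \
  -o jsonpath='{.status.defaultChannel}')

# Create the namespace, OperatorGroup, and Subscription.
oc apply -f - <<EOF
apiVersion: v1
kind: Namespace
metadata:
  name: nvidia-gpu-operator
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: nvidia-gpu-operator-group
  namespace: nvidia-gpu-operator
spec:
  targetNamespaces:
  - nvidia-gpu-operator
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: gpu-operator-certified
  namespace: nvidia-gpu-operator
spec:
  channel: "${CHANNEL}"
  installPlanApproval: Automatic
  name: gpu-operator-certified
  source: certified-operators
  sourceNamespace: openshift-marketplace
EOF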
Note
After installation, you will need to create an instance of the ClusterPolicy Custom Resource (CR) to deploy the necessary GPU components. This is typically done on the Operator Details page by clicking Create instance on the “ClusterPolicy” API.
Refer to the GPU Operator documentation for guidance on creating the ClusterPolicy and configuring the GPU Operator for your cluster’s requirements.
The steps to create the default ClusterPolicy for the NVIDIA GPU Operator using the OpenShift web console are:
In the OpenShift Container Platform web console, from the side menu, select Operators > Installed Operators, and click NVIDIA GPU Operator.
Select the ClusterPolicy tab, then click Create ClusterPolicy. (The platform assigns the default name gpu-cluster-policy.)
Click Create. (The default settings are sufficient to configure and run the GPU.)
Once created, the GPU Operator will proceed to create pods in the nvidia-gpu-operator namespace and install all necessary components to set up the NVIDIA GPUs.
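The default ClusterPolicy can also be created from the command line. The sketch below is one possible approach, assuming jq is installed on the workstation: it copies the example ClusterPolicy that the Operator publishes in its ClusterServiceVersion (the alm-examples annotation), which typically matches the default prepopulated in the web console form.

# Extract the Operator's example ClusterPolicy from its CSV and apply it.
CSV_NAME=$(oc get csv -n nvidia-gpu-operator -o name | grep gpu-operator-certified)
oc get -n nvidia-gpu-operator "${CSV_NAME}" \
  -o jsonpath='{.metadata.annotations.alm-examples}' | jq '.[0]' > clusterpolicy.json
oc apply -n nvidia-gpu-operator -f clusterpolicy.json

# Check the ClusterPolicy state; it reports "ready" once all components are deployed.
oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.status.state}{"\n"}'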
Verify availability of GPU resources#
The GPU Operator deploys a pod on each GPU node that manages the drivers and can be used to query information about the GPUs. The nvidia-smi command shows memory usage, GPU utilization, and the temperature of each GPU. Test GPU access by running the nvidia-smi command within the pod.
The following steps will verify the driver installation and confirm GPU resources are ready to be used within the cluster.
Change to the nvidia-gpu-operator project:
oc project nvidia-gpu-operator

Run the following command to view GPU Operator pods:

oc get pod -o wide -l openshift.driver-toolkit=true

Example Output:
NAME                                           READY   STATUS    RESTARTS   AGE   IP            NODE                     NOMINATED NODE   READINESS GATES
nvidia-driver-daemonset-9.6.20251219-0-mv6mb   2/2     Running   2          15d   10.129.0.8    inference-tme-worker-0   <none>           <none>
nvidia-driver-daemonset-9.6.20251219-0-n2wwl   2/2     Running   2          15d   10.128.1.23   inference-tme-master     <none>           <none>
nvidia-driver-daemonset-9.6.20251219-0-pkrvt   2/2     Running   2          15d   10.130.0.8    inference-tme-worker-1   <none>           <none>
Run the nvidia-smi command within the GPU pods:

for pod in $(oc get pods -n nvidia-gpu-operator -l app.kubernetes.io/component=nvidia-driver -o jsonpath='{.items[*].metadata.name}'); do
  echo "=== $pod ==="
  oc exec -n nvidia-gpu-operator -c nvidia-driver-ctr "$pod" -- nvidia-smi
  echo ""
done
Example Output:
Defaulted container "nvidia-driver-ctr" out of: nvidia-driver-ctr, openshift-driver-toolkit-ctr, k8s-driver-manager (init)
Mon Apr 11 15:02:23 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   33C    P8    15W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Two tables are generated. The first table shows information about all available GPUs (the example shows one GPU). The second table provides details about the processes using the GPUs.
For more information describing the contents of the tables, refer to the nvidia-smi man page.
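As a final check, you can confirm that the GPUs are advertised to the Kubernetes scheduler as allocatable nvidia.com/gpu resources. The commands below are a minimal sketch, assuming GPU Feature Discovery (enabled by the default ClusterPolicy) has applied the nvidia.com/gpu.present label to the GPU nodes; <gpu-node-name> is a placeholder for one of your GPU nodes.

# List GPU nodes and the number of allocatable nvidia.com/gpu resources on each.
oc get nodes -l nvidia.com/gpu.present=true \
  -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'

# Alternatively, inspect the capacity and allocatable sections of a single node.
oc describe node <gpu-node-name> | grep nvidia.com/gpu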