Install the NVIDIA GPU Operator#
The NVIDIA GPU Operator is a key element of the Red Hat AI Factory. It simplifies cluster administration and improves operational efficiency by automating the provisioning and management of NVIDIA GPU resources on the Red Hat OpenShift cluster. It prepares bare metal GPU nodes to be used as accelerated compute resources for AI and machine learning workloads by handling the installation and configuration of essential software: the NVIDIA drivers, the Kubernetes device plugin, the NVIDIA Container Toolkit, and monitoring utilities such as NVIDIA DCGM. Together, these components provide the resource management required for production AI deployments.
Refer to the GPU Operator documentation on installing the GPU Operator on Red Hat OpenShift for installation and configuration guidance.
Install the NVIDIA GPU Operator using the Web Console#
Install the NVIDIA GPU Operator using the Red Hat Software Catalog (Red Hat OperatorHub in versions before 4.20).
Access the Red Hat OpenShift console.
Navigate to Ecosystem -> Software Catalog.
Search for “NVIDIA GPU Operator” (or GPU Operator).
Select the NVIDIA GPU Operator.
Click Install.
Select the desired Installation mode (usually “All namespaces on the cluster (default)” for cluster-wide functionality).
Select the Installed Namespace (usually nvidia-gpu-operator).
Select the desired Update approval strategy (Manual or Automatic).
Click Install.
Wait for the Operator to be installed and its status to change to Succeeded on the Installed Operators page.
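The same installation can also be performed from the command line by creating the Namespace, OperatorGroup, and Subscription objects directly. The following is a minimal sketch, not the definitive procedure: it assumes the certified-operators catalog source provides the gpu-operator-certified package and that a single-namespace OperatorGroup is acceptable for your cluster; adjust the channel and approval strategy to match your requirements.

# Look up the default channel for the certified GPU Operator package
# (assumes the certified-operators catalog source is available).
CHANNEL=$(oc get packagemanifest gpu-operator-certified -n openshift-marketplace \
  -o jsonpath='{.status.defaultChannel}')

# Create the namespace, OperatorGroup, and Subscription.
oc apply -f - <<EOF
apiVersion: v1
kind: Namespace
metadata:
  name: nvidia-gpu-operator
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: nvidia-gpu-operator-group
  namespace: nvidia-gpu-operator
spec:
  targetNamespaces:
  - nvidia-gpu-operator
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: gpu-operator-certified
  namespace: nvidia-gpu-operator
spec:
  channel: "${CHANNEL}"
  installPlanApproval: Automatic
  name: gpu-operator-certified
  source: certified-operators
  sourceNamespace: openshift-marketplace
EOF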
Note
After installation, you will need to create an instance of the ClusterPolicy Custom Resource (CR) to deploy the necessary GPU components. This is typically done on the Operator Details page by clicking Create instance on the “ClusterPolicy” API.
Refer to the GPU Operator documentation for guidance on creating the ClusterPolicy and configuring the GPU Operator for your cluster’s requirements.
The steps to create the default ClusterPolicy for the NVIDIA GPU Operator using the OpenShift web console are:
In the OpenShift Container Platform web console, from the side menu, select Operators > Installed Operators, and click NVIDIA GPU Operator.
Select the ClusterPolicy tab, then click Create ClusterPolicy. (The platform assigns the default name gpu-cluster-policy.)
Click Create. (The default settings are sufficient to configure and run the GPU.)
Once created, the GPU Operator will proceed to create pods in the nvidia-gpu-operator namespace and install all necessary components to set up the NVIDIA GPUs.
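The default ClusterPolicy can also be created from the command line. The sketch below is one possible approach, assuming jq is installed on the workstation: it copies the example ClusterPolicy that the Operator publishes in its ClusterServiceVersion (the alm-examples annotation), which typically matches the default prepopulated in the web console form.

# Extract the Operator's example ClusterPolicy from its CSV and apply it.
CSV_NAME=$(oc get csv -n nvidia-gpu-operator -o name | grep gpu-operator-certified)
oc get -n nvidia-gpu-operator "${CSV_NAME}" \
  -o jsonpath='{.metadata.annotations.alm-examples}' | jq '.[0]' > clusterpolicy.json
oc apply -n nvidia-gpu-operator -f clusterpolicy.json

# Check the ClusterPolicy state; it reports "ready" once all components are deployed.
oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.status.state}{"\n"}'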
Verify availability of GPU resources#
The GPU Operator deploys a pod on each GPU node that manages the drivers and can be used to query information about the GPUs. The nvidia-smi command shows memory usage, GPU utilization, and the temperature of each GPU. Test GPU access by running the nvidia-smi command within the pod.
The following steps will verify the driver installation and confirm GPU resources are ready to be used within the cluster.
Change to the nvidia-gpu-operator project:
oc project nvidia-gpu-operator

Run the following command to view GPU Operator pods:

oc get pod -o wide -l openshift.driver-toolkit=true

Example Output:
NAME                                           READY   STATUS    RESTARTS   AGE   IP            NODE                     NOMINATED NODE   READINESS GATES
nvidia-driver-daemonset-9.6.20251219-0-mv6mb   2/2     Running   2          15d   10.129.0.8    inference-tme-worker-0   <none>           <none>
nvidia-driver-daemonset-9.6.20251219-0-n2wwl   2/2     Running   2          15d   10.128.1.23   inference-tme-master     <none>           <none>
nvidia-driver-daemonset-9.6.20251219-0-pkrvt   2/2     Running   2          15d   10.130.0.8    inference-tme-worker-1   <none>           <none>
Run the nvidia-smi command within the GPU pods:

for pod in $(oc get pods -n nvidia-gpu-operator -l app.kubernetes.io/component=nvidia-driver -o jsonpath='{.items[*].metadata.name}'); do
  echo "=== $pod ==="
  oc exec -n nvidia-gpu-operator -c nvidia-driver-ctr "$pod" -- nvidia-smi
  echo ""
done
Example Output:
Defaulted container "nvidia-driver-ctr" out of: nvidia-driver-ctr, openshift-driver-toolkit-ctr, k8s-driver-manager (init)
Mon Apr 11 15:02:23 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   33C    P8    15W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Two tables are generated. The first table shows information about all available GPUs (the example shows one GPU). The second table provides details about the processes using the GPUs.
For more information describing the contents of the tables, refer to the nvidia-smi man page.
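As a final check, you can confirm that the GPUs are advertised to the Kubernetes scheduler as allocatable nvidia.com/gpu resources. The commands below are a minimal sketch, assuming GPU Feature Discovery (enabled by the default ClusterPolicy) has applied the nvidia.com/gpu.present label to the GPU nodes; <gpu-node-name> is a placeholder for one of your GPU nodes.

# List GPU nodes and the number of allocatable nvidia.com/gpu resources on each.
oc get nodes -l nvidia.com/gpu.present=true \
  -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'

# Alternatively, inspect the capacity and allocatable sections of a single node.
oc describe node <gpu-node-name> | grep nvidia.com/gpu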