# Install the NVIDIA GPU Operator
The NVIDIA GPU Operator is a key element of the Red Hat AI Factory. It simplifies cluster administration and improves operational efficiency by automating the provisioning and management of NVIDIA GPU resources on a Red Hat OpenShift cluster, preparing bare-metal GPU nodes for use as accelerated compute resources for AI and machine learning workloads. The operator handles the installation and configuration of the essential software stack, including the NVIDIA drivers, the Kubernetes device plugin, the NVIDIA Container Toolkit, and monitoring utilities such as DCGM, which together provide the resource management required for production AI deployments.
Refer to the NVIDIA GPU Operator documentation on installing the GPU Operator on Red Hat OpenShift for installation and configuration guidance.
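If you prefer the CLI to the OperatorHub console, the installation can be sketched with manifests like the following. This is a hedged sketch, not the authoritative procedure: the `channel`, `source`, and operator package name values are assumptions drawn from the certified-operators catalog and should be verified against the GPU Operator documentation for your cluster version.

```shell
# Create the target namespace, then subscribe to the GPU Operator via OLM.
# NOTE: the channel and catalog source below are illustrative assumptions --
# confirm them against the OperatorHub catalog for your cluster.
oc create namespace nvidia-gpu-operator

cat <<'EOF' | oc apply -f -
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: nvidia-gpu-operator-group
  namespace: nvidia-gpu-operator
spec:
  targetNamespaces:
  - nvidia-gpu-operator
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: gpu-operator-certified
  namespace: nvidia-gpu-operator
spec:
  channel: stable                    # assumed channel; verify in the catalog
  name: gpu-operator-certified
  source: certified-operators
  sourceNamespace: openshift-marketplace
EOF
```

After the Subscription is created, OLM installs the operator; a `ClusterPolicy` custom resource must then be created to trigger deployment of the driver and toolkit daemonsets, as described in the GPU Operator documentation.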
## Verify availability of GPU resources
The GPU Operator deploys a pod on each GPU node that manages the drivers and can be used to query information about the GPUs. The `nvidia-smi` command reports memory usage, GPU utilization, and GPU temperature. Test GPU access by running `nvidia-smi` within the pod.
The following steps verify the driver installation and confirm that GPU resources are ready for use within the cluster.
Change to the `nvidia-gpu-operator` project:

```shell
oc project nvidia-gpu-operator
```

Run the following command to view the GPU Operator pods:

```shell
oc get pod -o wide -l openshift.driver-toolkit=true
```

Example output:
```
NAME                                           READY   STATUS    RESTARTS   AGE   IP            NODE                     NOMINATED NODE   READINESS GATES
nvidia-driver-daemonset-9.6.20251219-0-mv6mb   2/2     Running   2          15d   10.129.0.8    inference-tme-worker-0   <none>           <none>
nvidia-driver-daemonset-9.6.20251219-0-n2wwl   2/2     Running   2          15d   10.128.1.23   inference-tme-master     <none>           <none>
nvidia-driver-daemonset-9.6.20251219-0-pkrvt   2/2     Running   2          15d   10.130.0.8    inference-tme-worker-1   <none>           <none>
```
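Once the daemonset pods are running, you can also confirm that the device plugin has advertised the GPUs to the scheduler. The check below is a sketch that assumes the standard `nvidia.com/gpu` resource name:

```shell
# List each node's allocatable nvidia.com/gpu count.
# The dot in "nvidia.com" is escaped so custom-columns treats it literally;
# nodes without GPUs print an empty value in the GPUS column.
oc get nodes -o custom-columns='NODE:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
```

A non-empty, non-zero count for each GPU node indicates the device plugin has registered the GPUs with Kubernetes.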
Run the `nvidia-smi` command within the GPU pods:

```shell
for pod in $(oc get pods -n nvidia-gpu-operator -l app.kubernetes.io/component=nvidia-driver -o jsonpath='{.items[*].metadata.name}'); do
  echo "=== $pod ==="
  oc exec -n nvidia-gpu-operator -c nvidia-driver-ctr "$pod" -- nvidia-smi
  echo ""
done
```
Example Output:
```
Defaulted container "nvidia-driver-ctr" out of: nvidia-driver-ctr, openshift-driver-toolkit-ctr, k8s-driver-manager (init)
Mon Apr 11 15:02:23 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   33C    P8    15W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```
Two tables are generated. The first table lists all available GPUs (the example shows one GPU). The second table provides details about the processes using the GPUs.
For more information about the contents of these tables, refer to the `nvidia-smi` man page.
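As a final end-to-end check, you can schedule a small CUDA workload that requests a GPU and verify that it completes. This is a sketch, not part of the official procedure: the sample image tag below is an assumption based on NVIDIA's published CUDA sample images, so substitute whatever CUDA test image is available in your environment.

```shell
# Run NVIDIA's vectorAdd sample as a pod that requests one GPU,
# wait for it to finish, then inspect its logs for "Test PASSED".
cat <<'EOF' | oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
  namespace: nvidia-gpu-operator
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04  # assumed sample image
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

oc wait --for=jsonpath='{.status.phase}'=Succeeded pod/cuda-vectoradd \
  -n nvidia-gpu-operator --timeout=300s
oc logs cuda-vectoradd -n nvidia-gpu-operator
```

If the pod schedules onto a GPU node and the logs report a passing test, the driver, container toolkit, and device plugin are all functioning end to end.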