NVIDIA AI Enterprise 2.0 or later
In the OpenShift Container Platform web console, from the side menu, select Operators > Installed Operators, and click NVIDIA GPU Operator.
Select the ClusterPolicy tab, then click Create ClusterPolicy.
Note: The platform assigns the default name gpu-cluster-policy.
You can use this screen to customize the ClusterPolicy; however, the defaults are sufficient to get the GPU configured and running.
Click Create.
At this point, the GPU Operator installs all the components required to set up the NVIDIA GPUs in the OpenShift 4 cluster. This can take some time, so wait at least 10-20 minutes before you start troubleshooting.
When the installation succeeds, the status of the newly deployed ClusterPolicy gpu-cluster-policy for the NVIDIA GPU Operator changes to State:ready.
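The wait for State:ready can also be scripted from the CLI. The following is a minimal sketch, not part of the official procedure: the function name wait_for_clusterpolicy, the 20-minute timeout, and the 30-second polling interval are illustrative choices, and it assumes that `oc` is logged in to the cluster and that the state is exposed in the ClusterPolicy's `.status.state` field.

```shell
# Poll the ClusterPolicy until it reports "ready" or a timeout expires.
# Assumptions: `oc` is logged in to the cluster, the policy uses the default
# name gpu-cluster-policy, and the state is read from .status.state.
# wait_for_clusterpolicy is a hypothetical helper, not an NVIDIA tool.
wait_for_clusterpolicy() {
  policy="${1:-gpu-cluster-policy}"
  timeout_s="${2:-1200}"   # 20 minutes, the upper end of the suggested wait
  interval_s="${3:-30}"    # seconds between polls
  elapsed=0
  while [ "$elapsed" -le "$timeout_s" ]; do
    state="$(oc get clusterpolicy "$policy" -o jsonpath='{.status.state}' 2>/dev/null)"
    if [ "$state" = "ready" ]; then
      echo "ClusterPolicy $policy is ready"
      return 0
    fi
    echo "ClusterPolicy $policy state: ${state:-unknown} (${elapsed}s elapsed)"
    sleep "$interval_s"
    elapsed=$((elapsed + interval_s))
  done
  echo "Timed out waiting for ClusterPolicy $policy" >&2
  return 1
}
```

Call it as `wait_for_clusterpolicy` to use the defaults, or pass a policy name, timeout, and interval, for example `wait_for_clusterpolicy gpu-cluster-policy 1800 60`.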

To verify from the CLI that GPUs are available to the nodes, run:
$ oc get nodes -o=custom-columns='Node:metadata.name,GPUs:status.capacity.nvidia\.com/gpu'
This lists each node and the number of GPUs it has available to Kubernetes.
Example output:
$ oc get nodes -o=custom-columns='Node:metadata.name,GPUs:status.capacity.nvidia\.com/gpu'
Node GPUs
nvaie-ocp-7rfr8-master-0 <none>
nvaie-ocp-7rfr8-master-1 <none>
nvaie-ocp-7rfr8-master-2 <none>
nvaie-ocp-7rfr8-worker-7x5km 1
nvaie-ocp-7rfr8-worker-9jgmk <none>
nvaie-ocp-7rfr8-worker-jntsp 1
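On larger clusters it can help to reduce this listing to the GPU nodes and a total. The sketch below is one way to do that with awk. The function name summarize_gpus is hypothetical, and it assumes the two-column output format shown above, where nodes without GPUs report `<none>`.

```shell
# Summarize GPU capacity from the custom-columns listing.
# Reads the table on stdin and keeps only rows whose second column is a
# positive integer, so the header row and "<none>" rows are skipped.
# summarize_gpus is a hypothetical helper name.
summarize_gpus() {
  awk '$2 ~ /^[0-9]+$/ && $2 + 0 > 0 {
         printf "%s %d\n", $1, $2
         nodes++
         total += $2
       }
       END { printf "GPU nodes: %d, total GPUs: %d\n", nodes, total }'
}
```

To use it, pipe the command into the function: `oc get nodes -o=custom-columns='Node:metadata.name,GPUs:status.capacity.nvidia\.com/gpu' | summarize_gpus`. For the example output above, this would list the two worker nodes that have one GPU each, followed by a summary line.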
Verify the successful installation of the NVIDIA GPU Operator
Run the following command to view the new pods and daemonsets:
$ oc get pods,daemonset -n nvidia-gpu-operator
NAME READY STATUS RESTARTS AGE
pod/bb0dd90f1b757a8c7b338785a4a65140732d30447093bc2c4f6ae8e75844gfv 0/1 Completed 0 94m
pod/gpu-feature-discovery-hlpgs 1/1 Running 0 91m
pod/gpu-operator-8dc8d6648-jzhnr 1/1 Running 0 94m
pod/nvidia-container-toolkit-daemonset-z2wh7 1/1 Running 0 91m
pod/nvidia-cuda-validator-8fx22 0/1 Completed 0 86m
pod/nvidia-dcgm-exporter-ds9xd 1/1 Running 0 91m
pod/nvidia-dcgm-k7tz6 1/1 Running 0 91m
pod/nvidia-device-plugin-daemonset-nqxmc 1/1 Running 0 91m
pod/nvidia-device-plugin-validator-87zdl 0/1 Completed 0 86m
pod/nvidia-driver-daemonset-48.84.202110270303-0-9df9j 2/2 Running 0 91m
pod/nvidia-node-status-exporter-7bhdk 1/1 Running 0 91m
pod/nvidia-operator-validator-kjznr 1/1 Running 0 91m
pod/openshift-psap-ci-artifacts-operator-bundle-gpu-operator-master 1/1 Running 0 94m
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
daemonset.apps/gpu-feature-discovery 1 1 1 1 1 nvidia.com/gpu.deploy.gpu-feature-discovery=true 91m
daemonset.apps/nvidia-container-toolkit-daemonset 1 1 1 1 1 nvidia.com/gpu.deploy.container-toolkit=true 91m
daemonset.apps/nvidia-dcgm 1 1 1 1 1 nvidia.com/gpu.deploy.dcgm=true 91m
daemonset.apps/nvidia-dcgm-exporter 1 1 1 1 1 nvidia.com/gpu.deploy.dcgm-exporter=true 91m
daemonset.apps/nvidia-device-plugin-daemonset 1 1 1 1 1 nvidia.com/gpu.deploy.device-plugin=true 91m
daemonset.apps/nvidia-driver-daemonset-48.84.202110270303-0 1 1 1 1 1 feature.node.kubernetes.io/system-os_release.OSTREE_VERSION=48.84.202110270303-0,nvidia.com/gpu.deploy.driver=true 91m
daemonset.apps/nvidia-mig-manager 0 0 0 0 0 nvidia.com/gpu.deploy.mig-manager=true 91m
daemonset.apps/nvidia-node-status-exporter 1 1 1 1 1 nvidia.com/gpu.deploy.node-status-exporter=true 91m
daemonset.apps/nvidia-operator-validator 1 1 1 1 1 nvidia.com/gpu.deploy.operator-validator=true 91m
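In the output above, every DaemonSet reports the same value for DESIRED and READY (nvidia-mig-manager reports 0 for both). That comparison can be automated. The following is a sketch under stated assumptions: check_daemonsets is a hypothetical helper, and it assumes the column order NAME, DESIRED, CURRENT, READY shown above.

```shell
# Report DaemonSets whose READY count differs from DESIRED.
# Reads `oc get daemonset` rows on stdin (NAME DESIRED CURRENT READY ...).
# Rows whose DESIRED or READY column is not an integer, such as the header
# row or pod rows, are ignored. Returns non-zero if any DaemonSet lags.
# check_daemonsets is a hypothetical helper name.
check_daemonsets() {
  awk '$2 ~ /^[0-9]+$/ && $4 ~ /^[0-9]+$/ {
         if ($2 + 0 != $4 + 0) {
           printf "NOT READY: %s (desired=%d ready=%d)\n", $1, $2, $4
           bad++
         }
       }
       END {
         if (bad) exit 1
         print "All DaemonSets report READY equal to DESIRED"
       }'
}
```

A possible invocation is `oc get daemonset -n nvidia-gpu-operator --no-headers | check_daemonsets`. A DaemonSet with zero desired pods counts as ready, which matches how nvidia-mig-manager appears in the example.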
The nvidia-driver-daemonset pod runs on each worker node that contains a supported NVIDIA GPU.
When the Driver Toolkit is active, the DaemonSet is named nvidia-driver-daemonset-<RHCOS-version>, where RHCOS-version equals <OCP XY>.<RHEL XY>.<related date YYYYMMDDHHSS>-0. The pods of the DaemonSet are named nvidia-driver-daemonset-<RHCOS-version>-<UUID>.
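To make the naming scheme concrete, the sketch below splits a driver DaemonSet name into its parts using shell parameter expansion. The function name parse_driver_daemonset and its output format are illustrative, and the parsing assumes the name follows the pattern described above exactly.

```shell
# Split a driver DaemonSet name into the parts of its RHCOS version.
# Assumes the form nvidia-driver-daemonset-<OCP XY>.<RHEL XY>.<date>-0,
# optionally prefixed with "daemonset.apps/" as in `oc get` output.
# parse_driver_daemonset is a hypothetical helper name.
parse_driver_daemonset() {
  name="${1#daemonset.apps/}"                 # drop the resource prefix if present
  version="${name#nvidia-driver-daemonset-}"  # e.g. 48.84.202110270303-0
  if [ "$version" = "$name" ]; then
    # No version suffix, so the name does not follow the Driver Toolkit pattern.
    echo "not a Driver Toolkit DaemonSet name: $1" >&2
    return 1
  fi
  ocp="${version%%.*}"    # OCP XY, e.g. 48
  rest="${version#*.}"    # e.g. 84.202110270303-0
  rhel="${rest%%.*}"      # RHEL XY, e.g. 84
  stamp="${rest#*.}"      # e.g. 202110270303-0
  stamp="${stamp%-*}"     # date part, e.g. 202110270303
  echo "rhcos=$version ocp=$ocp rhel=$rhel date=$stamp"
}
```

For the DaemonSet in the example output, `parse_driver_daemonset daemonset.apps/nvidia-driver-daemonset-48.84.202110270303-0` prints `rhcos=48.84.202110270303-0 ocp=48 rhel=84 date=202110270303`.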