Installing the NVIDIA GPU Operator

With the proper Red Hat entitlement in place and the Node Feature Discovery Operator installed, you can continue with the final step: installing the NVIDIA GPU Operator.

  1. In the OpenShift Container Platform web console, select Operators > OperatorHub from the side menu, then search for the NVIDIA GPU Operator. For additional information, see the Red Hat OpenShift Container Platform documentation.

  2. Select the NVIDIA GPU Operator and click Install. On the subsequent screen, click Install again.
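
If you prefer the command line, the operator can also be installed by creating a Subscription. The following is a minimal sketch, not the authoritative procedure: the namespace, OperatorGroup name, and channel used here (nvidia-gpu-operator, nvidia-gpu-operator-group, stable) are assumptions, so verify them against the values shown on the OperatorHub page for your cluster before applying.

    $ cat << EOF | oc create -f -
    apiVersion: v1
    kind: Namespace
    metadata:
      name: nvidia-gpu-operator
    ---
    apiVersion: operators.coreos.com/v1
    kind: OperatorGroup
    metadata:
      name: nvidia-gpu-operator-group
      namespace: nvidia-gpu-operator
    spec:
      targetNamespaces:
      - nvidia-gpu-operator
    ---
    apiVersion: operators.coreos.com/v1alpha1
    kind: Subscription
    metadata:
      name: gpu-operator-certified
      namespace: nvidia-gpu-operator
    spec:
      channel: stable            # assumed channel name; verify in OperatorHub
      name: gpu-operator-certified
      source: certified-operators
      sourceNamespace: openshift-marketplace
    EOF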

Create the cluster policy for the NVIDIA GPU Operator

When you install the NVIDIA GPU Operator in OpenShift Container Platform, a custom resource definition for a ClusterPolicy is created. The ClusterPolicy configures the GPU stack to be deployed: the image names and repositories, pod restrictions and credentials, and so on.

Note

If you create a ClusterPolicy that contains an empty specification, such as spec: {}, the ClusterPolicy fails to deploy.
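
For reference, a manifest like the following (assuming the nvidia.com/v1 API version served by the operator) is an example of an empty specification that would fail:

    apiVersion: nvidia.com/v1
    kind: ClusterPolicy
    metadata:
      name: gpu-cluster-policy
    spec: {}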

  1. In the OpenShift Container Platform web console, from the side menu, select Operators > Installed Operators, then click NVIDIA GPU Operator.

  2. Select the ClusterPolicy tab, then click Create ClusterPolicy. The platform assigns the default name gpu-cluster-policy.

    Note

    You can use this screen to customize the ClusterPolicy; however, the defaults are sufficient to get the GPU configured and running.

  3. Click Create.

    At this point, the GPU Operator proceeds to install all the required components to set up the NVIDIA GPUs in the OpenShift 4 cluster. This can take a while, so be patient and wait at least 10-20 minutes before digging deeper into any form of troubleshooting.

  4. The status of the newly deployed ClusterPolicy gpu-cluster-policy for the NVIDIA GPU Operator changes to State: ready once the installation succeeds.

(Figure: the ClusterPolicy gpu-cluster-policy showing State: ready.)
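
While waiting, you can follow the pods with oc get pods -n gpu-operator-resources -w. You can also confirm the final state from the command line; a quick sketch, assuming the operator reports its state in the .status.state field of the ClusterPolicy as shown in the console:

    $ oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.status.state}{"\n"}'
    
    ready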

Verify the successful installation of the NVIDIA GPU Operator

The commands below show various ways to verify the successful installation of the NVIDIA GPU Operator.

  1. Run the following command to view these new pods and daemonsets:

    $ oc get pods,daemonset -n gpu-operator-resources
    
    NAME                                           READY   STATUS      RESTARTS   AGE
    pod/gpu-feature-discovery-vwhnt                1/1     Running     0          6m32s
    pod/nvidia-container-toolkit-daemonset-k8x28   1/1     Running     0          6m33s
    pod/nvidia-cuda-validator-xr5sz                0/1     Completed   0          90s
    pod/nvidia-dcgm-5grvn                          1/1     Running     0          6m32s
    pod/nvidia-dcgm-exporter-cp8ml                 1/1     Running     0          6m32s
    pod/nvidia-device-plugin-daemonset-p9dp4       1/1     Running     0          6m32s
    pod/nvidia-device-plugin-validator-mrhst       0/1     Completed   0          48s
    pod/nvidia-driver-daemonset-pbplc              1/1     Running     0          6m33s
    pod/nvidia-node-status-exporter-s2ml2          1/1     Running     0          6m33s
    pod/nvidia-operator-validator-44jdf            1/1     Running     0          6m32s
    
    NAME                                                DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                      AGE
    daemonset.apps/gpu-feature-discovery                1         1         1       1            1           nvidia.com/gpu.deploy.gpu-feature-discovery=true   6m32s
    daemonset.apps/nvidia-container-toolkit-daemonset   1         1         1       1            1           nvidia.com/gpu.deploy.container-toolkit=true       6m33s
    daemonset.apps/nvidia-dcgm                          1         1         1       1            1           nvidia.com/gpu.deploy.dcgm=true                    6m33s
    daemonset.apps/nvidia-dcgm-exporter                 1         1         1       1            1           nvidia.com/gpu.deploy.dcgm-exporter=true           6m33s
    daemonset.apps/nvidia-device-plugin-daemonset       1         1         1       1            1           nvidia.com/gpu.deploy.device-plugin=true           6m33s
    daemonset.apps/nvidia-driver-daemonset              1         1         1       1            1           nvidia.com/gpu.deploy.driver=true                  6m33s
    daemonset.apps/nvidia-mig-manager                   0         0         0       0            0           nvidia.com/gpu.deploy.mig-manager=true             6m32s
    daemonset.apps/nvidia-node-status-exporter          1         1         1       1            1           nvidia.com/gpu.deploy.node-status-exporter=true    6m34s
    daemonset.apps/nvidia-operator-validator            1         1         1       1            1           nvidia.com/gpu.deploy.operator-validator=true      6m33s
    

    The nvidia-driver-daemonset pod runs on each worker node that contains a supported NVIDIA GPU.
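
As an additional check, you can confirm that the GPU nodes now advertise the nvidia.com/gpu resource. A quick sketch, assuming the nvidia.com/gpu.present=true node label applied by GPU Feature Discovery:

    $ oc describe node -l nvidia.com/gpu.present=true | grep 'nvidia.com/gpu:'

The resource should be listed twice per node, once under Capacity and once under Allocatable, with the number of GPUs in the node.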

Running a sample GPU application

Run a simple CUDA VectorAdd sample, which adds two vectors together, to ensure the GPUs have bootstrapped correctly.

  1. Run the following:

    $ cat << EOF | oc create -f -
    
    apiVersion: v1
    kind: Pod
    metadata:
      name: cuda-vectoradd
    spec:
      restartPolicy: OnFailure
      containers:
      - name: cuda-vectoradd
        image: "nvidia/samples:vectoradd-cuda11.2.1"
        resources:
          limits:
            nvidia.com/gpu: 1
    EOF
    
    pod/cuda-vectoradd created
    
  2. Check the logs of the container:

    $ oc logs cuda-vectoradd
    
    [Vector addition of 50000 elements]
    Copy input data from the host memory to the CUDA device
    CUDA kernel launch with 196 blocks of 256 threads
    Copy output data from the CUDA device to the host memory
    Test PASSED
    Done
    
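
Once the test passes, the sample pod is no longer needed and can be deleted:

    $ oc delete pod cuda-vectoradd
    
    pod "cuda-vectoradd" deleted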

Getting information on the GPU

The nvidia-smi command shows memory usage, GPU utilization, and the temperature of the GPU.

To test GPU access and view utilization, run the popular nvidia-smi command from a pod in the GPU Operator driver daemonset.

  1. Change to the gpu-operator-resources project:

    $ oc project gpu-operator-resources
    
  2. Run the following command to view the driver daemonset pod and the node it is running on:

    $ oc get pod -owide -lapp=nvidia-driver-daemonset
    
    NAME                            READY   STATUS    RESTARTS   AGE     IP            NODE                          NOMINATED NODE   READINESS GATES
    nvidia-driver-daemonset-pbplc   1/1     Running   0          8m17s   10.130.2.28   ip-10-0-143-64.ec2.internal   <none>           <none>
    

    Note

    The node is shown in the output above, so together with the pod name you can execute nvidia-smi on the correct node.

  3. Run the nvidia-smi command within the pod:

    $ oc exec -it nvidia-driver-daemonset-pbplc -- nvidia-smi
    
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
    | N/A   40C    P8    16W /  70W |      0MiB / 15109MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
    

Two tables are generated: the first reflects information about all available GPUs (the example shows one GPU); the second tells you about the processes using the GPUs.

For more information on the contents of these tables, refer to the nvidia-smi man page.
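
If you only need specific metrics rather than the full tables, nvidia-smi also supports a query mode; see nvidia-smi --help-query-gpu for the available fields. A brief sketch, with illustrative output based on the Tesla T4 shown above:

    $ oc exec -it nvidia-driver-daemonset-pbplc -- nvidia-smi \
        --query-gpu=name,temperature.gpu,utilization.gpu,memory.used,memory.total \
        --format=csv
    
    name, temperature.gpu, utilization.gpu [%], memory.used [MiB], memory.total [MiB]
    Tesla T4, 40, 0 %, 0 MiB, 15109 MiB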