Create the Cluster Policy Instance#

Added in version 2.0.

Create the Cluster Policy using the web console#

  1. In the OpenShift Container Platform web console, from the side menu, select Operators > Installed Operators, and click NVIDIA GPU Operator.

  2. Select the ClusterPolicy tab, then click Create ClusterPolicy.

    Note

    The platform assigns the default name gpu-cluster-policy.

  3. You can use this screen to customize the ClusterPolicy; however, the defaults are sufficient to get the GPUs configured and running.

  4. Click Create.

  5. At this point, the GPU Operator proceeds to install all of the components required to set up the NVIDIA GPUs in the OpenShift 4 cluster. This can take some time, so wait at least 10-20 minutes before digging into any troubleshooting. If you prefer the command line, a CLI alternative is sketched after this list.
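As an alternative to the web console, you can create the ClusterPolicy from the CLI. The following is a minimal sketch, assuming the Operator is installed in the nvidia-gpu-operator namespace and that jq is available; the exact ClusterServiceVersion (CSV) name depends on the installed Operator version, so substitute the name returned by the first command for <csv-name>:

$ oc get csv -n nvidia-gpu-operator

$ oc get csv <csv-name> -n nvidia-gpu-operator -o jsonpath='{.metadata.annotations.alm-examples}' | jq '.[0]' > clusterpolicy.json

$ oc apply -f clusterpolicy.json

This extracts the default ClusterPolicy definition shipped in the CSV's alm-examples annotation and applies it unchanged, which is equivalent to accepting the defaults in the web console.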

Verify Cluster Policy#

The status of the newly deployed ClusterPolicy gpu-cluster-policy for the NVIDIA GPU Operator changes to State:ready when the installation succeeds.

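You can also check the state from the CLI. A minimal sketch, assuming the default name gpu-cluster-policy (the ClusterPolicy resource is cluster-scoped, so no namespace is needed):

$ oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.status.state}{"\n"}'
ready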

To verify from the CLI that GPUs are available to the nodes, run:

$ oc get nodes -o=custom-columns='Node:metadata.name,GPUs:status.capacity.nvidia\.com/gpu'

This lists each node and the number of GPUs it has available to Kubernetes.

Example output:

$ oc get nodes -o=custom-columns='Node:metadata.name,GPUs:status.capacity.nvidia\.com/gpu'
Node                           GPUs
nvaie-ocp-7rfr8-master-0       <none>
nvaie-ocp-7rfr8-master-1       <none>
nvaie-ocp-7rfr8-master-2       <none>
nvaie-ocp-7rfr8-worker-7x5km   1
nvaie-ocp-7rfr8-worker-9jgmk   <none>
nvaie-ocp-7rfr8-worker-jntsp   1
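To confirm that a workload can actually be scheduled on one of these GPUs, you can run a short test pod. This is a sketch only; the pod name gpu-smoke-test is illustrative and the CUDA image tag is an assumption, so substitute an image available in your environment:

$ cat << EOF | oc create -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: nvidia-smi
    image: nvcr.io/nvidia/cuda:12.2.0-base-ubi8
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

$ oc logs pod/gpu-smoke-test

If the pod schedules and nvidia-smi prints the GPU details, the device plugin is advertising the GPU correctly.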

Verify the successful installation of the NVIDIA GPU Operator#

Verify the successful installation of the NVIDIA GPU Operator by running the following command to view the new pods and daemonsets:

$ oc get pods,daemonset -n nvidia-gpu-operator

NAME                                                                  READY   STATUS      RESTARTS   AGE
pod/bb0dd90f1b757a8c7b338785a4a65140732d30447093bc2c4f6ae8e75844gfv   0/1     Completed   0          94m
pod/gpu-feature-discovery-hlpgs                                       1/1     Running     0          91m
pod/gpu-operator-8dc8d6648-jzhnr                                      1/1     Running     0          94m
pod/nvidia-container-toolkit-daemonset-z2wh7                          1/1     Running     0          91m
pod/nvidia-cuda-validator-8fx22                                       0/1     Completed   0          86m
pod/nvidia-dcgm-exporter-ds9xd                                        1/1     Running     0          91m
pod/nvidia-dcgm-k7tz6                                                 1/1     Running     0          91m
pod/nvidia-device-plugin-daemonset-nqxmc                              1/1     Running     0          91m
pod/nvidia-device-plugin-validator-87zdl                              0/1     Completed   0          86m
pod/nvidia-driver-daemonset-48.84.202110270303-0-9df9j                2/2     Running     0          91m
pod/nvidia-node-status-exporter-7bhdk                                 1/1     Running     0          91m
pod/nvidia-operator-validator-kjznr                                   1/1     Running     0          91m
pod/openshift-psap-ci-artifacts-operator-bundle-gpu-operator-master   1/1     Running     0          94m

NAME                                                          DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                                                                                        AGE
daemonset.apps/gpu-feature-discovery                          1         1         1       1            1           nvidia.com/gpu.deploy.gpu-feature-discovery=true                                                                     91m
daemonset.apps/nvidia-container-toolkit-daemonset             1         1         1       1            1           nvidia.com/gpu.deploy.container-toolkit=true                                                                         91m
daemonset.apps/nvidia-dcgm                                    1         1         1       1            1           nvidia.com/gpu.deploy.dcgm=true                                                                                      91m
daemonset.apps/nvidia-dcgm-exporter                           1         1         1       1            1           nvidia.com/gpu.deploy.dcgm-exporter=true                                                                             91m
daemonset.apps/nvidia-device-plugin-daemonset                 1         1         1       1            1           nvidia.com/gpu.deploy.device-plugin=true                                                                             91m
daemonset.apps/nvidia-driver-daemonset-48.84.202110270303-0   1         1         1       1            1           feature.node.kubernetes.io/system-os_release.OSTREE_VERSION=48.84.202110270303-0,nvidia.com/gpu.deploy.driver=true   91m
daemonset.apps/nvidia-mig-manager                             0         0         0       0            0           nvidia.com/gpu.deploy.mig-manager=true                                                                               91m
daemonset.apps/nvidia-node-status-exporter                    1         1         1       1            1           nvidia.com/gpu.deploy.node-status-exporter=true                                                                      91m
daemonset.apps/nvidia-operator-validator                      1         1         1       1            1           nvidia.com/gpu.deploy.operator-validator=true                                                                        91m
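The Completed nvidia-cuda-validator pod indicates that the CUDA validation workload ran. As a quick sanity check you can inspect its logs; the pod suffix below is taken from the example above and will differ in your cluster, and the exact success message may vary by Operator version:

$ oc logs -n nvidia-gpu-operator nvidia-cuda-validator-8fx22
cuda workload validation is successful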

The nvidia-driver-daemonset pod runs on each worker node that contains a supported NVIDIA GPU.

Note

When the Driver Toolkit is active, the DaemonSet is named nvidia-driver-daemonset-<RHCOS-version>, where RHCOS-version is <OCP XY>.<RHEL XY>.<related date YYYYMMDDHHSS>-0. The pods of the DaemonSet are named nvidia-driver-daemonset-<RHCOS-version>-<UUID>.
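To verify the driver directly on a node, you can run nvidia-smi inside the corresponding driver pod. A sketch, using the pod name from the listing above (yours will differ, following the naming convention in the note; if oc complains about multiple containers, select the driver container with -c, for example -c nvidia-driver-ctr, which is an assumption about the container name):

$ oc exec -n nvidia-gpu-operator nvidia-driver-daemonset-48.84.202110270303-0-9df9j -- nvidia-smi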