Create the Cluster Policy Instance

NVIDIA AI Enterprise 2.0 or later

  1. In the OpenShift Container Platform web console, from the side menu, select Operators > Installed Operators, and click NVIDIA GPU Operator.

  2. Select the ClusterPolicy tab, then click Create ClusterPolicy.

    Note

    The platform assigns the default name gpu-cluster-policy.


  3. You can use this screen to customize the ClusterPolicy; however, the defaults are sufficient to get the GPU configured and running.

  4. Click Create.

  5. The GPU Operator now installs all the components required to set up the NVIDIA GPUs in the OpenShift 4 cluster. The installation can take some time, so wait at least 10-20 minutes before you begin any troubleshooting.

The status of the newly deployed ClusterPolicy gpu-cluster-policy for the NVIDIA GPU Operator changes to State:ready when the installation succeeds.
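
The same state can also be polled from the CLI. The following is a minimal sketch, not part of the documented procedure: it assumes the ClusterPolicy reports its state in the .status.state field (the value shown as State in the console) and that the default name gpu-cluster-policy is used. The function name wait_for_cluster_policy and its timeout and interval parameters are hypothetical.

```shell
# Poll the ClusterPolicy until its state is "ready" or a timeout expires.
# Usage: wait_for_cluster_policy [name] [timeout_seconds] [interval_seconds]
wait_for_cluster_policy() (
  name="${1:-gpu-cluster-policy}"
  timeout="${2:-1200}"    # 20 minutes, the upper end of the suggested wait
  interval="${3:-30}"
  elapsed=0
  while :; do
    # Assumption: the state shown in the console is stored at .status.state
    state="$(oc get clusterpolicy "$name" -o jsonpath='{.status.state}' 2>/dev/null || true)"
    if [ "$state" = "ready" ]; then
      echo "ClusterPolicy $name is ready"
      return 0
    fi
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "Timed out waiting for ClusterPolicy $name (last state: ${state:-unknown})" >&2
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
)
```

For example, wait_for_cluster_policy gpu-cluster-policy 1200 30 would poll every 30 seconds for up to 20 minutes and return a non-zero status if the state never becomes ready.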



To verify from the CLI that GPUs are available on the nodes, run:

$ oc get nodes -o=custom-columns='Node:metadata.name,GPUs:status.capacity.nvidia\.com/gpu'


This lists each node and the number of GPUs it has available to Kubernetes.

Example output:

$ oc get nodes -o=custom-columns='Node:metadata.name,GPUs:status.capacity.nvidia\.com/gpu'
Node                           GPUs
nvaie-ocp-7rfr8-master-0       <none>
nvaie-ocp-7rfr8-master-1       <none>
nvaie-ocp-7rfr8-master-2       <none>
nvaie-ocp-7rfr8-worker-7x5km   1
nvaie-ocp-7rfr8-worker-9jgmk   <none>
nvaie-ocp-7rfr8-worker-jntsp   1
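
To narrow this view to the nodes that actually expose GPUs, the output can be filtered with awk. This is only a sketch: the function name gpu_nodes is hypothetical, and it assumes that nodes without GPUs are reported as <none>, as in the example output.

```shell
# Print only the nodes that report at least one GPU, followed by a total.
# Uses the same custom-columns view as the command above, without the header row.
gpu_nodes() {
  oc get nodes --no-headers \
    -o=custom-columns='Node:metadata.name,GPUs:status.capacity.nvidia\.com/gpu' |
    awk '$2 != "<none>" { print $1, $2; total += $2 }
         END { print "Total GPUs:", total + 0 }'
}
```

Run against the cluster in the example output, gpu_nodes would list the two worker nodes that have one GPU each and report a total of 2.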


Verify the successful installation of the NVIDIA GPU Operator

Run the following command to view the new pods and daemonsets that the GPU Operator created:

$ oc get pods,daemonset -n nvidia-gpu-operator
NAME                                                                  READY   STATUS      RESTARTS   AGE
pod/bb0dd90f1b757a8c7b338785a4a65140732d30447093bc2c4f6ae8e75844gfv   0/1     Completed   0          94m
pod/gpu-feature-discovery-hlpgs                                       1/1     Running     0          91m
pod/gpu-operator-8dc8d6648-jzhnr                                      1/1     Running     0          94m
pod/nvidia-container-toolkit-daemonset-z2wh7                          1/1     Running     0          91m
pod/nvidia-cuda-validator-8fx22                                       0/1     Completed   0          86m
pod/nvidia-dcgm-exporter-ds9xd                                        1/1     Running     0          91m
pod/nvidia-dcgm-k7tz6                                                 1/1     Running     0          91m
pod/nvidia-device-plugin-daemonset-nqxmc                              1/1     Running     0          91m
pod/nvidia-device-plugin-validator-87zdl                              0/1     Completed   0          86m
pod/nvidia-driver-daemonset-48.84.202110270303-0-9df9j                2/2     Running     0          91m
pod/nvidia-node-status-exporter-7bhdk                                 1/1     Running     0          91m
pod/nvidia-operator-validator-kjznr                                   1/1     Running     0          91m
pod/openshift-psap-ci-artifacts-operator-bundle-gpu-operator-master   1/1     Running     0          94m

NAME                                                          DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                                                                                        AGE
daemonset.apps/gpu-feature-discovery                          1         1         1       1            1           nvidia.com/gpu.deploy.gpu-feature-discovery=true                                                                     91m
daemonset.apps/nvidia-container-toolkit-daemonset             1         1         1       1            1           nvidia.com/gpu.deploy.container-toolkit=true                                                                         91m
daemonset.apps/nvidia-dcgm                                    1         1         1       1            1           nvidia.com/gpu.deploy.dcgm=true                                                                                      91m
daemonset.apps/nvidia-dcgm-exporter                           1         1         1       1            1           nvidia.com/gpu.deploy.dcgm-exporter=true                                                                             91m
daemonset.apps/nvidia-device-plugin-daemonset                 1         1         1       1            1           nvidia.com/gpu.deploy.device-plugin=true                                                                             91m
daemonset.apps/nvidia-driver-daemonset-48.84.202110270303-0   1         1         1       1            1           feature.node.kubernetes.io/system-os_release.OSTREE_VERSION=48.84.202110270303-0,nvidia.com/gpu.deploy.driver=true   91m
daemonset.apps/nvidia-mig-manager                             0         0         0       0            0           nvidia.com/gpu.deploy.mig-manager=true                                                                               91m
daemonset.apps/nvidia-node-status-exporter                    1         1         1       1            1           nvidia.com/gpu.deploy.node-status-exporter=true                                                                      91m
daemonset.apps/nvidia-operator-validator                      1         1         1       1            1           nvidia.com/gpu.deploy.operator-validator=true                                                                        91m


The nvidia-driver-daemonset pod runs on each worker node that contains a supported NVIDIA GPU.
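
The DaemonSet rows can also be checked from a script instead of being read by eye. The sketch below flags any DaemonSet whose READY count differs from its DESIRED count. The function name unready_daemonsets is hypothetical, and the column positions are assumed to match the default oc get daemonset table shown in the output above.

```shell
# List daemonsets in the operator namespace whose READY count differs from DESIRED.
# Assumed column order of the default table: NAME DESIRED CURRENT READY ...
# Exits with status 1 if any daemonset is not fully ready, 0 otherwise.
unready_daemonsets() (
  namespace="${1:-nvidia-gpu-operator}"
  oc get daemonset -n "$namespace" --no-headers |
    awk '$2 != $4 { print $1 " desired=" $2 " ready=" $4; bad = 1 }
         END { exit bad + 0 }'
)
```

A DaemonSet such as nvidia-mig-manager, which shows 0 desired and 0 ready in the output above, is not flagged because the two counts match.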

Note

When the Driver Toolkit is active, the DaemonSet is named nvidia-driver-daemonset-<RHCOS-version>, where RHCOS-version has the form <OCP XY>.<RHEL XY>.<related date YYYYMMDDHHSS>-0. The pods of the DaemonSet are named nvidia-driver-daemonset-<RHCOS-version>-<UUID>.
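
As an illustration of this naming scheme, the sketch below splits a driver DaemonSet name into its parts using shell parameter expansion. The function name parse_driver_daemonset and the output field names are hypothetical. It expects a DaemonSet name, not a pod name, because pod names carry an extra suffix.

```shell
# Split a driver DaemonSet name into the parts described in the note.
# Usage: parse_driver_daemonset nvidia-driver-daemonset-48.84.202110270303-0
parse_driver_daemonset() (
  version="${1#nvidia-driver-daemonset-}"   # strip the fixed prefix
  ocp="${version%%.*}"                      # text before the first dot
  rest="${version#*.}"
  rhel="${rest%%.*}"                        # text between the first and second dot
  build="${rest#*.}"                        # date stamp plus the trailing -0
  echo "rhcos_version=$version ocp=$ocp rhel=$rhel build=$build"
)
```

For the DaemonSet in the example output, this prints rhcos_version=48.84.202110270303-0 ocp=48 rhel=84 build=202110270303-0.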

© Copyright 2024, NVIDIA. Last updated on Apr 2, 2024.