Create the Cluster Policy Instance#
Added in version 2.0.
Create the Cluster Policy using the web console#
In the OpenShift Container Platform web console, from the side menu, select Operators > Installed Operators, and click NVIDIA GPU Operator.
Select the ClusterPolicy tab, then click Create ClusterPolicy.
Note
The platform assigns the default name gpu-cluster-policy.
You can use this screen to customize the ClusterPolicy; however, the defaults are sufficient to get the GPU configured and running.
Click Create.
At this point, the GPU Operator installs all the components required to set up the NVIDIA GPUs in the OpenShift 4 cluster. This can take 10-20 minutes, so wait at least that long before starting any troubleshooting.
Verify Cluster Policy#
The status of the newly deployed ClusterPolicy gpu-cluster-policy for the NVIDIA GPU Operator changes to State:ready when the installation succeeds.
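The same status can be checked from the CLI by reading the state field that the GPU Operator publishes on the ClusterPolicy resource. A minimal sketch, assuming the default instance name gpu-cluster-policy:

$ # Prints "ready" once all operands are deployed.
$ oc get clusterpolicies.nvidia.com gpu-cluster-policy -o jsonpath='{.status.state}{"\n"}'

If the command prints notReady for an extended period, inspect the pods in the nvidia-gpu-operator namespace for errors.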
To verify that GPUs are available to the nodes, run the following command from the CLI:
$ oc get nodes -o=custom-columns='Node:metadata.name,GPUs:status.capacity.nvidia\.com/gpu'
This lists each node and the number of GPUs it has available to Kubernetes.
Example output:
$ oc get nodes -o=custom-columns='Node:metadata.name,GPUs:status.capacity.nvidia\.com/gpu'
Node                           GPUs
nvaie-ocp-7rfr8-master-0       <none>
nvaie-ocp-7rfr8-master-1       <none>
nvaie-ocp-7rfr8-master-2       <none>
nvaie-ocp-7rfr8-worker-7x5km   1
nvaie-ocp-7rfr8-worker-9jgmk   <none>
nvaie-ocp-7rfr8-worker-jntsp   1
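If you want a quick summary rather than the per-node listing, the output above can be filtered with awk to count schedulable GPUs. A sketch using sample data in the same format; in a real cluster, pipe the oc command above into the awk filter instead of printf:

```shell
# Sample lines matching the custom-columns output format;
# replace the printf with the `oc get nodes` command above.
printf '%s\n' \
  'Node                           GPUs' \
  'nvaie-ocp-7rfr8-master-0       <none>' \
  'nvaie-ocp-7rfr8-worker-7x5km   1' \
  'nvaie-ocp-7rfr8-worker-jntsp   1' |
awk 'NR > 1 && $2 != "<none>" { nodes++; gpus += $2 }
     END { printf "GPU nodes: %d, total GPUs: %d\n", nodes, gpus }'
```

For the sample data this prints `GPU nodes: 2, total GPUs: 2`.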
Verify the successful installation of the NVIDIA GPU Operator#
Verify the successful installation of the NVIDIA GPU Operator as shown here:
Run the following command to view these new pods and daemonsets:
$ oc get pods,daemonset -n nvidia-gpu-operator
NAME                                                                  READY   STATUS      RESTARTS   AGE
pod/bb0dd90f1b757a8c7b338785a4a65140732d30447093bc2c4f6ae8e75844gfv   0/1     Completed   0          94m
pod/gpu-feature-discovery-hlpgs                                       1/1     Running     0          91m
pod/gpu-operator-8dc8d6648-jzhnr                                      1/1     Running     0          94m
pod/nvidia-container-toolkit-daemonset-z2wh7                          1/1     Running     0          91m
pod/nvidia-cuda-validator-8fx22                                       0/1     Completed   0          86m
pod/nvidia-dcgm-exporter-ds9xd                                        1/1     Running     0          91m
pod/nvidia-dcgm-k7tz6                                                 1/1     Running     0          91m
pod/nvidia-device-plugin-daemonset-nqxmc                              1/1     Running     0          91m
pod/nvidia-device-plugin-validator-87zdl                              0/1     Completed   0          86m
pod/nvidia-driver-daemonset-48.84.202110270303-0-9df9j                2/2     Running     0          91m
pod/nvidia-node-status-exporter-7bhdk                                 1/1     Running     0          91m
pod/nvidia-operator-validator-kjznr                                   1/1     Running     0          91m
pod/openshift-psap-ci-artifacts-operator-bundle-gpu-operator-master   1/1     Running     0          94m

NAME                                                           DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                                                                                        AGE
daemonset.apps/gpu-feature-discovery                           1         1         1       1            1           nvidia.com/gpu.deploy.gpu-feature-discovery=true                                                                     91m
daemonset.apps/nvidia-container-toolkit-daemonset              1         1         1       1            1           nvidia.com/gpu.deploy.container-toolkit=true                                                                         91m
daemonset.apps/nvidia-dcgm                                     1         1         1       1            1           nvidia.com/gpu.deploy.dcgm=true                                                                                      91m
daemonset.apps/nvidia-dcgm-exporter                            1         1         1       1            1           nvidia.com/gpu.deploy.dcgm-exporter=true                                                                             91m
daemonset.apps/nvidia-device-plugin-daemonset                  1         1         1       1            1           nvidia.com/gpu.deploy.device-plugin=true                                                                             91m
daemonset.apps/nvidia-driver-daemonset-48.84.202110270303-0    1         1         1       1            1           feature.node.kubernetes.io/system-os_release.OSTREE_VERSION=48.84.202110270303-0,nvidia.com/gpu.deploy.driver=true   91m
daemonset.apps/nvidia-mig-manager                              0         0         0       0            0           nvidia.com/gpu.deploy.mig-manager=true                                                                               91m
daemonset.apps/nvidia-node-status-exporter                     1         1         1       1            1           nvidia.com/gpu.deploy.node-status-exporter=true                                                                      91m
daemonset.apps/nvidia-operator-validator                       1         1         1       1            1           nvidia.com/gpu.deploy.operator-validator=true                                                                        91m
The nvidia-driver-daemonset pod runs on each worker node that contains a supported NVIDIA GPU.
Note
When the Driver Toolkit is active, the DaemonSet is named nvidia-driver-daemonset-<RHCOS-version>, where RHCOS-version equals <OCP XY>.<RHEL XY>.<related date YYYYMMDDHHSS>-0. The pods of the DaemonSet are named nvidia-driver-daemonset-<RHCOS-version>-<UUID>.
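To confirm that the driver is loaded and can see the GPU, you can run nvidia-smi inside one of the driver pods. A sketch, assuming the driver container is named nvidia-driver-ctr as in recent GPU Operator releases; <driver-pod-name> is a placeholder to replace with a pod name from your own cluster:

$ # List the driver pods; names vary per cluster and RHCOS version.
$ oc get pods -n nvidia-gpu-operator | grep nvidia-driver-daemonset
$ # Run nvidia-smi in one of them to display the driver version and GPUs.
$ oc exec -n nvidia-gpu-operator <driver-pod-name> -c nvidia-driver-ctr -- nvidia-smi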