Troubleshooting
This section describes errors you may encounter, and the checks used to diagnose them, when installing the NVIDIA GPU Operator on an OpenShift Container Platform cluster.
Node Feature Discovery checks
Verify that the Node Feature Discovery instance has been created:
$ oc get NodeFeatureDiscovery -n openshift-nfd
NAME           AGE
nfd-instance   4h11m
Note
If the output is empty, the Node Feature Discovery Custom Resource (CR) must be created.
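If you prefer the CLI to the console workflow, the instance can also be created with oc apply. The manifest below is only a minimal sketch: the nfd.openshift.io/v1 API version and the use of an empty spec (relying on the operator's defaults) are assumptions, so check the NFD Operator documentation or the console-generated YAML for the exact fields.
$ cat <<EOF | oc apply -f -
apiVersion: nfd.openshift.io/v1
kind: NodeFeatureDiscovery
metadata:
  name: nfd-instance
  namespace: openshift-nfd
# Sketch only: the console-created instance also populates the operand
# image, service port and worker configuration under spec.
spec: {}
EOF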
Check that there are nodes with a GPU. In this example the check is performed for NVIDIA GPUs, which use the PCI vendor ID 10de.
$ oc get nodes -l feature.node.kubernetes.io/pci-10de.present
NAME                           STATUS   ROLES    AGE     VERSION
ip-10-0-133-209.ec2.internal   Ready    worker   4h13m   v1.21.1+9807387
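If no nodes are returned even though GPU nodes exist, you can inspect the labels that NFD applied to a specific node; <gpu-node-name> below is a placeholder for one of your worker nodes:
$ oc describe node <gpu-node-name> | grep 'feature.node.kubernetes.io/pci-10de'
If no pci-10de labels are present, the NFD workers are not labeling the node and the NFD deployment should be checked first.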
GPU Operator checks
Check that the ClusterPolicy Custom Resource Definition (CRD) is deployed:
$ oc get crd/clusterpolicies.nvidia.com
NAME                         CREATED AT
clusterpolicies.nvidia.com   2021-09-02T10:33:50Z
Note
If the CRD is missing, the OperatorHub installation did not complete successfully.
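To find out why, you can inspect the Operator Lifecycle Manager objects for the GPU Operator; their status conditions and events usually contain the failure reason. The subscription name gpu-operator-certified below assumes the default installation from OperatorHub:
$ oc get sub,installplan,csv -n openshift-operators
$ oc describe sub gpu-operator-certified -n openshift-operators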
Check that the cluster policy is deployed:
$ oc get clusterpolicy
NAME                 AGE
gpu-cluster-policy   8m25s
Note
If missing, the custom resource (CR) must be created from the OperatorHub.
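As an alternative to the console, the default ClusterPolicy can be extracted from the alm-examples annotation of the Operator's CSV and applied from the CLI. This is a sketch that assumes jq is available and that the ClusterPolicy is the first entry in alm-examples:
$ CSV=$(oc get csv -n openshift-operators -o name | grep gpu-operator-certified)
$ oc get $CSV -n openshift-operators -o jsonpath='{.metadata.annotations.alm-examples}' | jq '.[0]' > clusterpolicy.json
$ oc apply -f clusterpolicy.json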
Check that the Operator is running:
$ oc get pods -n openshift-operators -lapp=gpu-operator
NAME                            READY   STATUS    RESTARTS   AGE
gpu-operator-6b8b8c5fd9-zcs9r   1/1     Running   0          3h55m
Note
If ImagePullBackOff is reported, the NVIDIA registry may be unreachable. If CrashLoopBackOff is reported, review the Operator logs:
$ oc logs -f -n openshift-operators -lapp=gpu-operator
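The pod description and the namespace events often show more detail than the pod status alone, for example the exact image pull error:
$ oc describe pod -n openshift-operators -l app=gpu-operator
$ oc get events -n openshift-operators --sort-by=.lastTimestamp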
Validate the GPU stack
The GPU Operator validates the stack through the nvidia-device-plugin-validator and nvidia-cuda-validator pods. If both have completed successfully, the stack works as expected.
$ oc get po -n gpu-operator-resources
NAME                                       READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-kfmcm                1/1     Running     0          4h14m
nvidia-container-toolkit-daemonset-t5vgq   1/1     Running     0          4h14m
nvidia-cuda-validator-2wjlm                0/1     Completed   0          97m
nvidia-dcgm-exporter-tsjk7                 1/1     Running     0          4h14m
nvidia-dcgm-r7qbd                          1/1     Running     0          4h14m
nvidia-device-plugin-daemonset-zlchl       1/1     Running     0          4h14m
nvidia-device-plugin-validator-76pts       0/1     Completed   0          96m
nvidia-driver-daemonset-6zk6b              1/1     Running     32         4h14m
nvidia-node-status-exporter-27jdc          1/1     Running     1          4h14m
nvidia-operator-validator-cjsw7            1/1     Running     0          4h14m
Check the nvidia-cuda-validator logs:
$ oc logs -f nvidia-cuda-validator-2wjlm -n gpu-operator-resources
cuda workload validation is successful
Check the nvidia-device-plugin-validator logs:
$ oc logs nvidia-device-plugin-validator-76pts -n gpu-operator-resources | tail
device-plugin workload validation is successful
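As an additional check, you can confirm that the GPU nodes now advertise the nvidia.com/gpu resource and run a small test workload against it. The pod below is a sketch; the sample image tag is an assumption and may need to be adjusted to one that is available to your cluster:
$ oc get nodes -l feature.node.kubernetes.io/pci-10de.present "-o=custom-columns=NODE:.metadata.name,GPUs:.status.allocatable.nvidia\.com/gpu"
$ cat <<EOF | oc create -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vectoradd
      image: nvidia/samples:vectoradd-cuda11.2.1   # assumption: replace with an available CUDA sample image
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
$ oc logs cuda-vectoradd
The pod logs should report a successful vector addition once the pod has completed.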
Check the NVIDIA driver deployment
This is an illustrative example of a situation where the deployment of the Operator is not proceeding as expected.
Check the pods deployed to the gpu-operator-resources namespace:
$ oc get pods -n gpu-operator-resources
NAME                                       READY   STATUS             RESTARTS   AGE
gpu-feature-discovery-kfmcm                0/1     Init:0/1           0          53m
nvidia-container-toolkit-daemonset-t5vgq   0/1     Init:0/1           0          53m
nvidia-dcgm-exporter-tsjk7                 0/1     Init:0/2           0          53m
nvidia-dcgm-r7qbd                          0/1     Init:0/1           0          53m
nvidia-device-plugin-daemonset-zlchl       0/1     Init:0/1           0          53m
nvidia-driver-daemonset-6zk6b              0/1     CrashLoopBackOff   13         53m
nvidia-node-status-exporter-27jdc          1/1     Running            0          53m
nvidia-operator-validator-cjsw7            0/1     Init:0/4           0          53m
The Init status indicates that these pods are waiting for the driver pod to become ready. In this example the driver pod is in the CrashLoopBackOff state; combined with the RESTARTS count of 13, this indicates a problem.
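Before looking at the driver logs, the pod description and namespace events can help narrow down whether the failure occurs during image pull, in an init container, or while building the driver:
$ oc describe pod nvidia-driver-daemonset-6zk6b -n gpu-operator-resources
$ oc get events -n gpu-operator-resources --sort-by=.lastTimestamp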
Check the main console page:
The first alert shows that the “nvidia driver could not be deployed”.
Note
Alerts are automatically enabled and logged in the console. For more information on alerts, see the OpenShift Container Platform documentation.
Check the NVIDIA driver logs:
$ oc logs -f nvidia-driver-daemonset-6zk6b -n gpu-operator-resources
+ echo 'Installing elfutils...'
Installing elfutils...
+ dnf install -q -y elfutils-libelf.x86_64 elfutils-libelf-devel.x86_64
Error: Unable to find a match: elfutils-libelf-devel.x86_64
++ rm -rf /tmp/tmp.3jt46if6eF
+ _shutdown
+ _unload_driver
+ rmmod_args=()
+ local rmmod_args
+ local nvidia_deps=0
+ local nvidia_refs=0
+ local nvidia_uvm_refs=0
+ local nvidia_modeset_refs=0
+ echo 'Stopping NVIDIA persistence daemon...'
Stopping NVIDIA persistence daemon...
In the logs, the lines below indicate that there is an entitlement issue:
+ dnf install -q -y elfutils-libelf.x86_64 elfutils-libelf-devel.x86_64
Error: Unable to find a match: elfutils-libelf-devel.x86_64
This error indicates that the UBI-based driver pod does not have the subscription entitlements correctly mounted, so the additional required UBI packages cannot be found. Refer to the section Obtaining an entitlement certificate.
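A quick way to confirm whether entitlements work at all on the cluster is to run a plain UBI pod and search for a RHEL package, as in the sketch below. It assumes cluster-wide entitlements have been deployed as described in Obtaining an entitlement certificate; if the search returns no match here either, the entitlement itself is the problem rather than the driver container:
$ cat <<EOF | oc create -f -
apiVersion: v1
kind: Pod
metadata:
  name: entitlement-test
spec:
  restartPolicy: Never
  containers:
    - name: entitlement-test
      image: registry.access.redhat.com/ubi8/ubi:latest
      command: ["/bin/sh", "-c", "dnf search kernel-devel --showduplicates"]
EOF
$ oc logs entitlement-test
Wait for the pod to complete before reading its logs; a working entitlement lists the available kernel-devel packages.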