Troubleshooting

This section describes errors that users may encounter when performing various checks while installing the NVIDIA GPU Operator on an OpenShift Container Platform cluster.

Node Feature Discovery checks

  1. Verify that the Node Feature Discovery custom resource has been created:

    $ oc get NodeFeatureDiscovery -n openshift-nfd
    
    NAME           AGE
    nfd-instance   4h11m
    

    Note

    If the output is empty, the Node Feature Discovery custom resource (CR) must be created; a minimal sketch is shown after this list.

  2. Check that there are nodes with a GPU. In this example the check is performed for NVIDIA GPUs, which use the PCI vendor ID 10de. A way to list the NFD-generated labels on a node is also shown after this list.

    $ oc get nodes -l feature.node.kubernetes.io/pci-10de.present
    
    NAME                           STATUS   ROLES    AGE     VERSION
    ip-10-0-133-209.ec2.internal   Ready    worker   4h13m   v1.21.1+9807387
    
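If the NodeFeatureDiscovery CR is missing, it is normally created from the Node Feature Discovery Operator page in OperatorHub. As a minimal CLI sketch, assuming the NFD Operator is already installed in the openshift-nfd namespace, a CR along the following lines can be applied; the exact spec fields and the operand image vary by operator version, so treat the values as placeholders:

# nfd-instance.yaml -- minimal NodeFeatureDiscovery sketch; field values are illustrative.
apiVersion: nfd.openshift.io/v1
kind: NodeFeatureDiscovery
metadata:
  name: nfd-instance
  namespace: openshift-nfd
spec:
  operand:
    # Use the operand image that matches your cluster version.
    image: registry.redhat.io/openshift4/ose-node-feature-discovery:<version>

$ oc apply -f nfd-instance.yaml
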
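To see which PCI vendor IDs NFD has detected on a particular node, the NFD-generated labels can be inspected directly. This is a generic label query, not specific to the GPU Operator; substitute a node name from your cluster:

$ oc get node ip-10-0-133-209.ec2.internal --show-labels | tr ',' '\n' | grep feature.node.kubernetes.io/pci-

For a node with an NVIDIA GPU, a label such as feature.node.kubernetes.io/pci-10de.present=true should appear.
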

GPU Operator checks

  1. Check that the ClusterPolicy Custom Resource Definition (CRD) is deployed:

    $ oc get crd/clusterpolicies.nvidia.com
    
    NAME                         CREATED AT
    clusterpolicies.nvidia.com   2021-09-02T10:33:50Z
    

    Note

    If the CRD is missing, the installation from OperatorHub did not complete successfully. The OLM objects created by the install can be inspected, as shown in the sketch after this list.

  2. Check that the cluster policy is deployed:

    $ oc get clusterpolicy
    
    NAME                  AGE
    gpu-cluster-policy    8m25s
    

    Note

    If missing, the ClusterPolicy custom resource (CR) must be created from OperatorHub; a CLI-based alternative is sketched after this list.

  3. Check that the Operator is running:

    $ oc get pods -n openshift-operators -lapp=gpu-operator
    
    gpu-operator-6b8b8c5fd9-zcs9r   1/1     Running   0          3h55m
    

    Note

    If ImagePullBackOff is reported, the NVIDIA registry may be unreachable. If CrashLoopBackOff is reported, review the Operator logs:

    $ oc logs -f -n openshift-operators -lapp=gpu-operator
    
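If the ClusterPolicy CRD or the Operator pod is missing, the OLM objects that the OperatorHub install creates can be inspected to narrow down the failure. This is a general sketch; <gpu-operator-csv-name> is a placeholder for the CSV name returned by the earlier commands:

$ oc get subscription -n openshift-operators
$ oc get installplan -n openshift-operators
$ oc get csv -n openshift-operators
$ oc describe csv <gpu-operator-csv-name> -n openshift-operators

The status conditions of the Subscription and ClusterServiceVersion usually indicate why the install did not complete.
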
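If only the ClusterPolicy CR is missing, it is normally created from the Operator's details page in OperatorHub. As a CLI alternative, and assuming the default ClusterPolicy is the first entry in the CSV's alm-examples annotation and that jq is available, it can be extracted and applied as follows (the CSV name is again a placeholder):

$ oc get csv <gpu-operator-csv-name> -n openshift-operators -o jsonpath='{.metadata.annotations.alm-examples}' | jq '.[0]' > clusterpolicy.json
$ oc apply -f clusterpolicy.json
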

Validate the GPU stack

The GPU Operator validates the stack through the nvidia-device-plugin-validator and nvidia-cuda-validator pods. If both have completed successfully, the stack works as expected. A simple GPU workload can also be run manually, as sketched after the checks below.

$ oc get po -n gpu-operator-resources
NAME                                       READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-kfmcm                1/1     Running     0          4h14m
nvidia-container-toolkit-daemonset-t5vgq   1/1     Running     0          4h14m
nvidia-cuda-validator-2wjlm                0/1     Completed   0          97m
nvidia-dcgm-exporter-tsjk7                 1/1     Running     0          4h14m
nvidia-dcgm-r7qbd                          1/1     Running     0          4h14m
nvidia-device-plugin-daemonset-zlchl       1/1     Running     0          4h14m
nvidia-device-plugin-validator-76pts       0/1     Completed   0          96m
nvidia-driver-daemonset-6zk6b              1/1     Running     32         4h14m
nvidia-node-status-exporter-27jdc          1/1     Running     1          4h14m
nvidia-operator-validator-cjsw7            1/1     Running     0          4h14m
  1. Check the nvidia-cuda-validator logs:

    $ oc logs -f nvidia-cuda-validator-2wjlm -n gpu-operator-resources
    
    cuda workload validation is successful
    
  2. Check the nvidia-device-plugin-validator logs:

    $ oc logs nvidia-device-plugin-validator-76pts -n gpu-operator-resources | tail
    
    device-plugin workload validation is successful
    
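In addition to the validator pods, a simple workload that requests a GPU can be scheduled to confirm that the device plugin is advertising nvidia.com/gpu resources. This is a generic sketch rather than part of the Operator itself; the pod name and CUDA image tag are illustrative:

# gpu-test.yaml -- minimal pod that requests one GPU and prints the nvidia-smi output.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      # Illustrative image; nvidia-smi is made available in the container by the NVIDIA container toolkit.
      image: nvidia/cuda:11.4.2-base-ubi8
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1

$ oc apply -f gpu-test.yaml
$ oc logs gpu-test

If the stack is healthy, the pod completes and its log shows the nvidia-smi table for the allocated GPU.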

Check the NVIDIA driver deployment

The following is an example of a situation where the deployment of the Operator is not proceeding as expected.

  1. Check the pods deployed to the gpu-operator-resources namespace:

    $ oc get pods -n gpu-operator-resources
    
    NAME                                       READY   STATUS             RESTARTS   AGE
    gpu-feature-discovery-kfmcm                0/1     Init:0/1           0          53m
    nvidia-container-toolkit-daemonset-t5vgq   0/1     Init:0/1           0          53m
    nvidia-dcgm-exporter-tsjk7                 0/1     Init:0/2           0          53m
    nvidia-dcgm-r7qbd                          0/1     Init:0/1           0          53m
    nvidia-device-plugin-daemonset-zlchl       0/1     Init:0/1           0          53m
    nvidia-driver-daemonset-6zk6b              0/1     CrashLoopBackOff   13         53m
    nvidia-node-status-exporter-27jdc          1/1     Running            0          53m
    nvidia-operator-validator-cjsw7            0/1     Init:0/4           0          53m
    

    The Init status on the other pods indicates that they are waiting for the driver pod to become ready. In this example the driver pod is in the CrashLoopBackOff state, which, combined with a RESTARTS count of 13, indicates a problem.

  2. Check the main console page:

    [Figure: OpenShift Container Platform main console showing firing alerts]

    The first alert reports "nvidia driver could not be deployed".

    Note

    Alerts are enabled automatically and displayed in the console. For more information on alerts, see the OpenShift Container Platform documentation.

  3. Check the NVIDIA driver logs:

    $ oc logs -f nvidia-driver-daemonset-6zk6b -n gpu-operator-resources
    
    + echo 'Installing elfutils...'
    Installing elfutils...
    + dnf install -q -y elfutils-libelf.x86_64 elfutils-libelf-devel.x86_64
    Error: Unable to find a match: elfutils-libelf-devel.x86_64
    ++ rm -rf /tmp/tmp.3jt46if6eF
    + _shutdown
    + _unload_driver
    + rmmod_args=()
    + local rmmod_args
    + local nvidia_deps=0
    + local nvidia_refs=0
    + local nvidia_uvm_refs=0
    + local nvidia_modeset_refs=0
    + echo 'Stopping NVIDIA persistence daemon...'
      Stopping NVIDIA persistence daemon...
    

    In the logs, the following lines indicate an entitlement issue:

    + dnf install -q -y elfutils-libelf.x86_64 elfutils-libelf-devel.x86_64
    Error: Unable to find a match: elfutils-libelf-devel.x86_64
    

    This error indicates that the UBI-based driver pod does not have subscription entitlements correctly mounted, so the additional required UBI packages cannot be found. Refer to the section Obtaining an entitlement certificate. A quick way to check whether entitlements are visible to pods is sketched below.
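
    A throwaway UBI pod that searches for a package available only from entitled repositories can confirm whether entitlements are visible to pods. The following sketch follows the common entitled-build validation pattern; the pod name and file name are arbitrary:

    # entitlement-test.yaml -- the dnf search succeeds only if RHEL entitlements are mounted in the pod.
    apiVersion: v1
    kind: Pod
    metadata:
      name: cluster-entitled-build-pod
    spec:
      restartPolicy: Never
      containers:
        - name: entitlement-check
          image: registry.access.redhat.com/ubi8:latest
          command: ["/bin/sh", "-c", "dnf search kernel-devel --showduplicates"]

    $ oc create -f entitlement-test.yaml
    $ oc logs cluster-entitled-build-pod

    If the log lists kernel-devel packages, entitlements are available to pods; if it reports no matches, revisit the entitlement certificate setup.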