Troubleshooting

This section describes errors that users may encounter when performing various checks while installing the NVIDIA GPU Operator on an OpenShift Container Platform cluster.

Node Feature Discovery checks

  1. Verify that the Node Feature Discovery custom resource has been created:

    $ oc get NodeFeatureDiscovery -n openshift-nfd
    
    NAME           AGE
    nfd-instance   4h11m
    

    Note

    If the output is empty, the Node Feature Discovery custom resource (CR) must be created; a minimal sketch is shown after this list.

  2. Check that there are nodes with a GPU. In this example the check is performed for NVIDIA GPUs, which use the PCI vendor ID 10de. A way to list the NFD-generated labels on a node is also shown after this list.

    $ oc get nodes -l feature.node.kubernetes.io/pci-10de.present
    
    NAME                           STATUS   ROLES    AGE     VERSION
    ip-10-0-133-209.ec2.internal   Ready    worker   4h13m   v1.21.1+9807387
    
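If the NodeFeatureDiscovery CR is missing, it is normally created from the Node Feature Discovery Operator page in OperatorHub. As a minimal CLI sketch, assuming the NFD Operator is already installed in the openshift-nfd namespace, a CR along the following lines can be applied; the exact spec fields and the operand image vary by operator version, so treat the values as placeholders:

# nfd-instance.yaml -- minimal NodeFeatureDiscovery sketch; field values are illustrative.
apiVersion: nfd.openshift.io/v1
kind: NodeFeatureDiscovery
metadata:
  name: nfd-instance
  namespace: openshift-nfd
spec:
  operand:
    # Use the operand image that matches your cluster version.
    image: registry.redhat.io/openshift4/ose-node-feature-discovery:<version>

$ oc apply -f nfd-instance.yaml
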
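To see which PCI vendor IDs NFD has detected on a particular node, the NFD-generated labels can be inspected directly. This is a generic label query, not specific to the GPU Operator; substitute a node name from your cluster:

$ oc get node ip-10-0-133-209.ec2.internal --show-labels | tr ',' '\n' | grep feature.node.kubernetes.io/pci-

For a node with an NVIDIA GPU, a label such as feature.node.kubernetes.io/pci-10de.present=true should appear.
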

GPU Operator checks

  1. Check that the ClusterPolicy Custom Resource Definition (CRD) is deployed:

    $ oc get crd/clusterpolicies.nvidia.com
    
    NAME                         CREATED AT
    clusterpolicies.nvidia.com   2021-09-02T10:33:50Z
    

    Note

    If the CRD is missing, the installation from OperatorHub did not complete successfully. The OLM objects created by the install can be inspected, as shown in the sketch after this list.

  2. Check that the cluster policy is deployed:

    $ oc get clusterpolicy
    
    NAME                  AGE
    gpu-cluster-policy    8m25s
    

    Note

    If missing, the ClusterPolicy custom resource (CR) must be created from OperatorHub; a CLI-based alternative is sketched after this list.

  3. Check that the Operator is running:

    $ oc get pods -n openshift-operators -lapp=gpu-operator
    
    gpu-operator-6b8b8c5fd9-zcs9r   1/1     Running   0          3h55m
    

    Note

    If ImagePullBackOff is reported, the NVIDIA registry may be unreachable. If CrashLoopBackOff is reported, review the Operator logs:

    $ oc logs -f -n openshift-operators -lapp=gpu-operator
    
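If the ClusterPolicy CRD or the Operator pod is missing, the OLM objects that the OperatorHub install creates can be inspected to narrow down the failure. This is a general sketch; <gpu-operator-csv-name> is a placeholder for the CSV name returned by the earlier commands:

$ oc get subscription -n openshift-operators
$ oc get installplan -n openshift-operators
$ oc get csv -n openshift-operators
$ oc describe csv <gpu-operator-csv-name> -n openshift-operators

The status conditions of the Subscription and ClusterServiceVersion usually indicate why the install did not complete.
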
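If only the ClusterPolicy CR is missing, it is normally created from the Operator's details page in OperatorHub. As a CLI alternative, and assuming the default ClusterPolicy is the first entry in the CSV's alm-examples annotation and that jq is available, it can be extracted and applied as follows (the CSV name is again a placeholder):

$ oc get csv <gpu-operator-csv-name> -n openshift-operators -o jsonpath='{.metadata.annotations.alm-examples}' | jq '.[0]' > clusterpolicy.json
$ oc apply -f clusterpolicy.json
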

Validate the GPU stack

The GPU Operator validates the stack through the nvidia-device-plugin-validator and nvidia-cuda-validator pods. If both have completed successfully, the stack works as expected. A simple GPU workload can also be run manually, as sketched after the checks below.

$ oc get po -n gpu-operator-resources
NAME                                       READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-kfmcm                1/1     Running     0          4h14m
nvidia-container-toolkit-daemonset-t5vgq   1/1     Running     0          4h14m
nvidia-cuda-validator-2wjlm                0/1     Completed   0          97m
nvidia-dcgm-exporter-tsjk7                 1/1     Running     0          4h14m
nvidia-dcgm-r7qbd                          1/1     Running     0          4h14m
nvidia-device-plugin-daemonset-zlchl       1/1     Running     0          4h14m
nvidia-device-plugin-validator-76pts       0/1     Completed   0          96m
nvidia-driver-daemonset-6zk6b              1/1     Running     32         4h14m
nvidia-node-status-exporter-27jdc          1/1     Running     1          4h14m
nvidia-operator-validator-cjsw7            1/1     Running     0          4h14m
  1. Check the nvidia-cuda-validator logs:

    $ oc logs -f nvidia-cuda-validator-2wjlm -n gpu-operator-resources
    
    cuda workload validation is successful
    
  2. Check the nvidia-device-plugin-validator logs:

    $ oc logs nvidia-device-plugin-validator-76pts -n gpu-operator-resources | tail
    
    device-plugin workload validation is successful
    
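In addition to the validator pods, a simple workload that requests a GPU can be scheduled to confirm that the device plugin is advertising nvidia.com/gpu resources. This is a generic sketch rather than part of the Operator itself; the pod name and CUDA image tag are illustrative:

# gpu-test.yaml -- minimal pod that requests one GPU and prints the nvidia-smi output.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      # Illustrative image; nvidia-smi is made available in the container by the NVIDIA container toolkit.
      image: nvidia/cuda:11.4.2-base-ubi8
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1

$ oc apply -f gpu-test.yaml
$ oc logs gpu-test

If the stack is healthy, the pod completes and its log shows the nvidia-smi table for the allocated GPU.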

Check the NVIDIA driver deployment

The following is an example of a situation where the deployment of the Operator is not proceeding as expected.

  1. Check the pods deployed to the gpu-operator-resources namespace:

    $ oc get pods -n gpu-operator-resources
    
    NAME                                       READY   STATUS             RESTARTS   AGE
    gpu-feature-discovery-kfmcm                0/1     Init:0/1           0          53m
    nvidia-container-toolkit-daemonset-t5vgq   0/1     Init:0/1           0          53m
    nvidia-dcgm-exporter-tsjk7                 0/1     Init:0/2           0          53m
    nvidia-dcgm-r7qbd                          0/1     Init:0/1           0          53m
    nvidia-device-plugin-daemonset-zlchl       0/1     Init:0/1           0          53m
    nvidia-driver-daemonset-6zk6b              0/1     CrashLoopBackOff   13         53m
    nvidia-node-status-exporter-27jdc          1/1     Running            0          53m
    nvidia-operator-validator-cjsw7            0/1     Init:0/4           0          53m
    

    The Init status on the other pods indicates that they are waiting for the driver pod to become ready. In this example the driver pod is in the CrashLoopBackOff state, which, combined with a RESTARTS count of 13, indicates a problem.

  2. Check the main console page:

    [Figure: OpenShift Container Platform main console showing firing alerts]

    The first alert reports "nvidia driver could not be deployed".

    Note

    Alerts are enabled automatically and displayed in the console. For more information on alerts, see the OpenShift Container Platform documentation.

  3. Check the NVIDIA driver logs:

    $ oc logs -f nvidia-driver-daemonset-6zk6b -n gpu-operator-resources
    
    + echo 'Installing elfutils...'
    Installing elfutils...
    + dnf install -q -y elfutils-libelf.x86_64 elfutils-libelf-devel.x86_64
    Error: Unable to find a match: elfutils-libelf-devel.x86_64
    ++ rm -rf /tmp/tmp.3jt46if6eF
    + _shutdown
    + _unload_driver
    + rmmod_args=()
    + local rmmod_args
    + local nvidia_deps=0
    + local nvidia_refs=0
    + local nvidia_uvm_refs=0
    + local nvidia_modeset_refs=0
    + echo 'Stopping NVIDIA persistence daemon...'
      Stopping NVIDIA persistence daemon...
    

    In the logs, the following lines indicate an entitlement issue:

    + dnf install -q -y elfutils-libelf.x86_64 elfutils-libelf-devel.x86_64
    Error: Unable to find a match: elfutils-libelf-devel.x86_64
    

    This error indicates that the UBI-based driver pod does not have subscription entitlements correctly mounted, so the additional required UBI packages cannot be found. Refer to the section Obtaining an entitlement certificate. A quick way to check whether entitlements are visible to pods is sketched below.
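
    A throwaway UBI pod that searches for a package available only from entitled repositories can confirm whether entitlements are visible to pods. The following sketch follows the common entitled-build validation pattern; the pod name and file name are arbitrary:

    # entitlement-test.yaml -- the dnf search succeeds only if RHEL entitlements are mounted in the pod.
    apiVersion: v1
    kind: Pod
    metadata:
      name: cluster-entitled-build-pod
    spec:
      restartPolicy: Never
      containers:
        - name: entitlement-check
          image: registry.access.redhat.com/ubi8:latest
          command: ["/bin/sh", "-c", "dnf search kernel-devel --showduplicates"]

    $ oc create -f entitlement-test.yaml
    $ oc logs cluster-entitled-build-pod

    If the log lists kernel-devel packages, entitlements are available to pods; if it reports no matches, revisit the entitlement certificate setup.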