Troubleshooting the NVIDIA GPU Operator

GPU Operator Validator: Failed to Create Pod Sandbox

Issue

On some occasions, the driver container is unable to unload the nouveau Linux kernel module.

Observation

  • Running kubectl describe pod -n gpu-operator -l app=nvidia-operator-validator includes the following event:

    Events:
      Type     Reason                  Age                 From     Message
      ----     ------                  ----                ----     -------
      Warning  FailedCreatePodSandBox  8s (x21 over 9m2s)  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
    
  • Running one of the following commands on the node indicates that the nouveau Linux kernel module is loaded:

    $ lsmod | grep -i nouveau
    $ dmesg | grep -i nouveau
    $ journalctl -xb | grep -i nouveau
    

Root Cause

The nouveau Linux kernel module is loaded and the driver container is unable to unload the module. Because the nouveau module is loaded, the driver container cannot load the nvidia module.

Action

On each node, run the following commands to prevent loading the nouveau Linux kernel module on boot:

$ sudo tee /etc/modules-load.d/ipmi.conf <<< "ipmi_msghandler" \
    && sudo tee /etc/modprobe.d/blacklist-nouveau.conf <<< "blacklist nouveau" \
    && sudo tee -a /etc/modprobe.d/blacklist-nouveau.conf <<< "options nouveau modeset=0"

$ sudo update-initramfs -u

$ sudo init 6
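After the nodes reboot, you can confirm that the blacklist took effect before the driver container retries; this is a minimal check that reuses the command from the Observation section:

$ lsmod | grep -i nouveau

The command should print no output. Once the driver container loads the NVIDIA kernel modules, lsmod | grep -i nvidia lists them.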

No GPU Driver or Operand Pods Running

Issue

On some clusters, taints are applied to nodes with a taint effect of NoSchedule.

Observation

  • Running kubectl get ds -n gpu-operator shows 0 for DESIRED, CURRENT, READY and so on.

    NAME                              DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                                                                                         AGE
    gpu-feature-discovery             0         0         0       0            0           nvidia.com/gpu.deploy.gpu-feature-discovery=true                                                                      11m
    ...
    

Root Cause

The NoSchedule taint prevents the Operator from deploying the GPU Driver and other Operand pods.

Action

Describe each node, identify the taints, and either remove the taints from the nodes or add the taints as tolerations to the daemon sets.
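For example, the following commands list the taints on every node and then remove a hypothetical taint with the key gpu-reserved and the NoSchedule effect; substitute the taint keys that your cluster actually reports:

$ kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints

$ kubectl taint nodes <node-name> gpu-reserved:NoSchedule-

Alternatively, keep the taints and add matching tolerations to the Operator-managed daemon sets, for example through the daemonsets.tolerations value of the GPU Operator Helm chart (verify the exact value name against the chart version you deployed).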

GPU Operator Pods Stuck in Crash Loop

Issue

On large clusters, such as clusters with 300 or more nodes, the GPU Operator pods can get stuck in a crash loop.

Observation

  • The GPU Operator pod is not running:

    $ kubectl get pod -n gpu-operator -l app=gpu-operator
    

    Example Output

    NAME                            READY   STATUS             RESTARTS      AGE
    gpu-operator-568c7ff7f6-chg5b   0/1     CrashLoopBackOff   4 (85s ago)   4m42s
    
  • The node that is running the GPU Operator pod has sufficient resources and the node is Ready:

    $ kubectl describe node <node-name>
    

    Example Output

    Conditions:
      Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
      ----                 ------  -----------------                 ------------------                ------                       -------
      MemoryPressure       False   Tue, 26 Dec 2023 14:01:31 +0000   Tue, 12 Dec 2023 19:47:47 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
      DiskPressure         False   Tue, 26 Dec 2023 14:01:31 +0000   Thu, 14 Dec 2023 19:15:03 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
      PIDPressure          False   Tue, 26 Dec 2023 14:01:31 +0000   Tue, 12 Dec 2023 19:47:47 +0000   KubeletHasSufficientPID      kubelet has sufficient PID available
      Ready                True    Tue, 26 Dec 2023 14:01:31 +0000   Thu, 14 Dec 2023 19:15:13 +0000   KubeletReady                 kubelet is posting ready status
    

Root Cause

The memory resource limit for the GPU Operator is too low for the cluster size.

Action

Increase the memory request and limit for the GPU Operator pod:

  • Set the memory request to a value that matches the average memory consumption over a large time window.

  • Set the memory limit to match the spikes in memory consumption that occur occasionally.

  1. Increase the memory resource limit for the GPU Operator pod:

    $ kubectl patch deployment gpu-operator -n gpu-operator --type='json' \
        -p='[{"op":"replace", "path":"/spec/template/spec/containers/0/resources/limits/memory", "value":"1400Mi"}]'
    
  2. Optional: Increase the memory resource request for the pod:

    $ kubectl patch deployment gpu-operator -n gpu-operator --type='json' \
        -p='[{"op":"replace", "path":"/spec/template/spec/containers/0/resources/requests/memory", "value":"600Mi"}]'
    

Monitor the GPU Operator pod. Increase the memory request and limit again if the pod remains stuck in a crash loop.
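A minimal way to watch the pod and its memory consumption after patching, assuming the metrics-server is installed so that kubectl top works:

$ kubectl get pod -n gpu-operator -l app=gpu-operator -w

$ kubectl top pod -n gpu-operator -l app=gpu-operator

If the reported usage stays close to the new limit, or the pod is terminated with OOMKilled again, raise the request and limit further.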

infoROM is corrupted (nvidia-smi return code 14)

Issue

The nvidia-operator-validator pod fails, and the nvidia-driver-daemonset pod fails as well.

Observation

The output from the driver validation container indicates that the infoROM is corrupt:

$ kubectl logs -n gpu-operator nvidia-operator-validator-xxxxx -c driver-validation

Example Output

| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 00000000:0B:00.0 Off |                    0 |
| N/A   42C    P0    29W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
WARNING: infoROM is corrupted at gpu 0000:0B:00.0
14

The GPU emits warning messages related to the infoROM, and nvidia-smi exits with return code 14. The return code reflects whether the operation succeeded or failed and, if it failed, the reason for the failure. The return codes for the nvidia-smi command are:

  • Return code 0 - Success
  • Return code 2 - A supplied argument or flag is invalid
  • Return code 3 - The requested operation is not available on the target device
  • Return code 4 - The current user does not have permission to access this device or perform this operation
  • Return code 6 - A query to find an object was unsuccessful
  • Return code 8 - A device's external power cables are not properly attached
  • Return code 9 - The NVIDIA driver is not loaded
  • Return code 10 - The NVIDIA kernel detected an interrupt issue with a GPU
  • Return code 12 - The NVML shared library could not be found or loaded
  • Return code 13 - The local version of NVML does not implement this function
  • Return code 14 - infoROM is corrupted
  • Return code 15 - The GPU has fallen off the bus or has otherwise become inaccessible
  • Return code 255 - Other error or internal driver error occurred
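To confirm the return code outside the validator pod, you can run nvidia-smi on the affected node (or inside the driver container) and print the shell exit status; this is a minimal check that assumes nvidia-smi is available on the path:

$ nvidia-smi > /dev/null; echo $?

An exit status of 0 indicates success; an exit status of 14 matches the corrupted infoROM condition shown above.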

Root Cause

The nvidia-smi command must return a success code (return code 0) for the driver-validation container to pass and for the GPU Operator to deploy the driver pod on the node. In this case, nvidia-smi returns code 14 because the infoROM is corrupted, so the validation fails.

Action

Replace the faulty GPU.
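Before servicing the hardware, you typically cordon and drain the node so that workloads are rescheduled elsewhere; a possible sequence, using standard kubectl commands:

$ kubectl cordon <node-name>

$ kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

After the GPU is replaced and the node rejoins the cluster, uncordon it:

$ kubectl uncordon <node-name>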

EFI + Secure Boot

Issue

The GPU Driver pod fails to deploy.

Root Cause

EFI Secure Boot is currently not supported with the GPU Operator.

Action

Disable EFI Secure Boot on the server.
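To confirm whether Secure Boot is enabled before changing firmware settings, you can query the state on the node; a minimal check, assuming the mokutil utility is installed:

$ mokutil --sb-state

Example Output

SecureBoot enabled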