Troubleshooting the NVIDIA GPU Operator#

Pods stuck in Pending state in mixed MIG + full GPU environments#

Issue

For drivers 570.124.06, 570.133.20, 570.148.08, and 570.158.01, GPU workloads cannot be scheduled on nodes that have a mix of MIG slices and full GPUs. For more detailed information, see GitHub issue NVIDIA/gpu-operator#1361.

Observation

When a GPU pod is created on a node that has a mix of MIG slices and full GPUs, the GPU pod gets stuck indefinitely in the Pending state.

Root Cause

This is due to a regression in NVML introduced in the R570 drivers starting from 570.124.06.

Action

It’s recommended that you downgrade to driver version 570.86.15 to work around this issue.

GPU Operator Validator: Failed to Create Pod Sandbox#

Issue

On some occasions, the driver container is unable to unload the nouveau Linux kernel module.

Observation

Running kubectl describe pod -n gpu-operator -l app=nvidia-operator-validator includes the following event:

Events:
  Type     Reason                  Age                 From     Message
  ----     ------                  ----                ----     -------
  Warning  FailedCreatePodSandBox  8s (x21 over 9m2s)  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured

Running one of the following commands on the node indicates that the nouveau Linux kernel module is loaded:
```
$ lsmod | grep -i nouveau
$ dmesg | grep -i nouveau
$ journalctl -xb | grep -i nouveau
```

Root Cause

The nouveau Linux kernel module is loaded and the driver container is unable to unload the module. Because the nouveau module is loaded, the driver container cannot load the nvidia module.

Action

On each node, run the following commands to prevent loading the nouveau Linux kernel module on boot:

$ sudo tee /etc/modules-load.d/ipmi.conf <<< "ipmi_msghandler" \
    && sudo tee /etc/modprobe.d/blacklist-nouveau.conf <<< "blacklist nouveau" \
    && sudo tee -a /etc/modprobe.d/blacklist-nouveau.conf <<< "options nouveau modeset=0"

$ sudo update-initramfs -u

$ sudo init 6

No GPU Driver or Operand Pods Running#

Issue

On some clusters, taints are applied to nodes with a taint effect of NoSchedule.

Observation

Running kubectl get ds -n gpu-operator shows 0 for DESIRED, CURRENT, READY and so on.

NAME                              DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                                                                                         AGE
gpu-feature-discovery             0         0         0       0            0           nvidia.com/gpu.deploy.gpu-feature-discovery=true                                                                      11m
...

Root Cause

The NoSchedule taint prevents the Operator from deploying the GPU Driver and other Operand pods.

Action

Describe each node, identify the taints, and either remove the taints from the nodes or add the taints as tolerations to the daemon sets.

GPU Operator Pods Stuck in Crash Loop#

Issue

On large clusters, such as 300 or more nodes, the GPU Operator pods can get stuck in a crash loop.

Observation

The GPU Operator pod is not running:

$ kubectl get pod -n gpu-operator -l app=gpu-operator

Example Output

NAME                            READY   STATUS             RESTARTS      AGE
gpu-operator-568c7ff7f6-chg5b   0/1     CrashLoopBackOff   4 (85s ago)   4m42s

The node that is running the GPU Operator pod has sufficient resources and the node is Ready:

$ kubectl describe node <node-name>

Example Output

Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  MemoryPressure       False   Tue, 26 Dec 2023 14:01:31 +0000   Tue, 12 Dec 2023 19:47:47 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Tue, 26 Dec 2023 14:01:31 +0000   Thu, 14 Dec 2023 19:15:03 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Tue, 26 Dec 2023 14:01:31 +0000   Tue, 12 Dec 2023 19:47:47 +0000   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Tue, 26 Dec 2023 14:01:31 +0000   Thu, 14 Dec 2023 19:15:13 +0000   KubeletReady                 kubelet is posting ready status

Root Cause

The memory resource limit for the GPU Operator is too low for the cluster size.

Action

Increase the memory request and limit for the GPU Operator pod:

Set the memory request to a value that matches the average memory consumption over an large time window.
Set the memory limit to match the spikes in memory consumption that occur occasionally.

Increase the memory resource limit for the GPU Operator pod:

$ kubectl patch deployment gpu-operator -n gpu-operator --type='json' \
    -p='[{"op":"replace", "path":"/spec/template/spec/containers/0/resources/limits/memory", "value":"1400Mi"}]'

Optional: Increase the memory resource request for the pod:

$ kubectl patch deployment gpu-operator -n gpu-operator --type='json' \
    -p='[{"op":"replace", "path":"/spec/template/spec/containers/0/resources/requests/memory", "value":"600Mi"}]'

Monitor the GPU Operator pod. Increase the memory request and limit again if the pod remains stuck in a crash loop.

infoROM is corrupted (nvidia-smi return code 14)#

Issue

The nvidia-operator-validator pod fails and nvidia-driver-daemonsets fails as well.

Observation

The output from the driver validation container indicates that the infoROM is corrupt:

$ kubectl logs -n gpu-operator nvidia-operator-validator-xxxxx -c driver-validation

Example Output

| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 00000000:0B:00.0 Off |                    0 |
| N/A   42C    P0    29W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
WARNING: infoROM is corrupted at gpu 0000:0B:00.0
14

The GPU emits some warning messages related to infoROM. The return values for the nvidia-smi command are listed below.

RETURN VALUE

Return code reflects whether the operation succeeded or failed and what
was the reason of failure.

Â·      Return code 0 - Success

Â·      Return code 2 - A supplied argument or flag is invalid
Â·      Return code 3 - The requested operation is not available on target device
Â·      Return code 4 - The current user does  not  have permission  to access this device or perform this operation
Â·      Return code 6 - A query to find an object was unsuccessful
Â·      Return code 8 - A device's external power cables are not properly attached
Â·      Return code 9 - NVIDIA driver is not loaded
Â·      Return code 10 - NVIDIA Kernel detected an interrupt issue  with a GPU
Â·      Return code 12 - NVML Shared Library couldn't be found or loaded
Â·      Return code 13 - Local version of NVML  doesn't  implement  this function
Â·      Return code 14 - infoROM is corrupted
Â·      Return code 15 - The GPU has fallen off the bus or has otherwise become inaccessible
Â·      Return code 255 - Other error or internal driver error occurred

Root Cause

The nvidi-smi command should return a success code (return code 0) for the driver-validator container to pass and GPU operator to successfully deploy driver pod on the node.

Action

Replace the faulty GPU.

EFI + Secure Boot#

Issue

GPU Driver pod fails to deploy.

Root Cause

EFI Secure Boot is currently not supported with GPU Operator

Action

Disable EFI Secure Boot on the server.