Troubleshooting the NVIDIA GPU Operator
GPU Operator Pods Stuck in Crash Loop
Issue
On large clusters, such as 300 or more nodes, the GPU Operator pods can get stuck in a crash loop.
Observation
The GPU Operator pod is not running:
$ kubectl get pod -n gpu-operator -l app=gpu-operator
Example Output
NAME                            READY   STATUS             RESTARTS      AGE
gpu-operator-568c7ff7f6-chg5b   0/1     CrashLoopBackOff   4 (85s ago)   4m42s
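To confirm that the restarts are caused by the pod exceeding its memory limit, you can inspect the last termination state of the container. This is a sketch; substitute the pod name reported by the previous command:
$ kubectl describe pod -n gpu-operator gpu-operator-568c7ff7f6-chg5b | grep -A 5 "Last State"
A Reason of OOMKilled in the last state indicates that the container was terminated for exceeding its memory limit, which matches the root cause described below.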
The node that is running the GPU Operator pod has sufficient resources and the node is Ready:
$ kubectl describe node <node-name>
Example Output
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Tue, 26 Dec 2023 14:01:31 +0000   Tue, 12 Dec 2023 19:47:47 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Tue, 26 Dec 2023 14:01:31 +0000   Thu, 14 Dec 2023 19:15:03 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Tue, 26 Dec 2023 14:01:31 +0000   Tue, 12 Dec 2023 19:47:47 +0000   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Tue, 26 Dec 2023 14:01:31 +0000   Thu, 14 Dec 2023 19:15:13 +0000   KubeletReady                 kubelet is posting ready status
Root Cause
The memory resource limit for the GPU Operator is too low for the cluster size.
Action
Increase the memory request and limit for the GPU Operator pod:
Set the memory request to a value that matches the average memory consumption over a large time window.
Set the memory limit to match the spikes in memory consumption that occur occasionally.
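To gauge suitable values, you can sample the pod's current memory consumption, assuming a metrics server is installed in the cluster:
$ kubectl top pod -n gpu-operator -l app=gpu-operator
Repeat the measurement over time, or use your monitoring stack, so that the request reflects typical usage and the limit leaves headroom for spikes.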
Increase the memory resource limit for the GPU Operator pod:
$ kubectl patch deployment gpu-operator -n gpu-operator --type='json' \
    -p='[{"op":"replace", "path":"/spec/template/spec/containers/0/resources/limits/memory", "value":"1400Mi"}]'
Optional: Increase the memory resource request for the pod:
$ kubectl patch deployment gpu-operator -n gpu-operator --type='json' \
    -p='[{"op":"replace", "path":"/spec/template/spec/containers/0/resources/requests/memory", "value":"600Mi"}]'
Monitor the GPU Operator pod. Increase the memory request and limit again if the pod remains stuck in a crash loop.
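If you installed the GPU Operator with Helm, a patch applied directly to the deployment is typically reverted by the next helm upgrade. The following sketch persists the same values through chart values; it assumes your chart version exposes operator.resources and that the release is named gpu-operator in the gpu-operator namespace:
$ helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator --reuse-values \
    --set operator.resources.requests.memory=600Mi \
    --set operator.resources.limits.memory=1400Mi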
infoROM is corrupted (nvidia-smi return code 14)
Issue
The nvidia-operator-validator pod fails and the nvidia-driver-daemonset pods fail as well.
Observation
The output from the driver validation container indicates that the infoROM is corrupt:
$ kubectl logs -n gpu-operator nvidia-operator-validator-xxxxx -c driver-validation
Example Output
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 00000000:0B:00.0 Off |                    0 |
| N/A   42C    P0    29W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
WARNING: infoROM is corrupted at gpu 0000:0B:00.0
14
The GPU emits warning messages related to the infoROM.
The return values for the nvidia-smi command are listed below.
RETURN VALUE
Return code reflects whether the operation succeeded or failed and what was the reason of failure.
· Return code 0 - Success
· Return code 2 - A supplied argument or flag is invalid
· Return code 3 - The requested operation is not available on target device
· Return code 4 - The current user does not have permission to access this device or perform this operation
· Return code 6 - A query to find an object was unsuccessful
· Return code 8 - A device's external power cables are not properly attached
· Return code 9 - NVIDIA driver is not loaded
· Return code 10 - NVIDIA Kernel detected an interrupt issue with a GPU
· Return code 12 - NVML Shared Library couldn't be found or loaded
· Return code 13 - Local version of NVML doesn't implement this function
· Return code 14 - infoROM is corrupted
· Return code 15 - The GPU has fallen off the bus or has otherwise become inaccessible
· Return code 255 - Other error or internal driver error occurred
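To confirm the failure outside of the validator pod, you can run nvidia-smi directly on the affected node (for example, over SSH) and check its exit code. A minimal sketch, assuming the driver is already installed on the node:
$ nvidia-smi
$ echo $?
14
An exit code of 14 confirms the corrupted infoROM reported by the driver-validation container.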
Root Cause
The nvidia-smi command must return a success code (return code 0) for the driver-validation container to pass and for the GPU Operator to deploy the driver pod on the node.
Action
Replace the faulty GPU.
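Before servicing the hardware, you typically cordon and drain the node so that workloads are rescheduled elsewhere; a minimal sketch, where <node-name> is the affected node:
$ kubectl cordon <node-name>
$ kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
After the GPU is replaced, run kubectl uncordon <node-name> so that the node can accept workloads again.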
EFI + Secure Boot
Issue
The GPU driver pod fails to deploy.
Root Cause
EFI Secure Boot is currently not supported with the GPU Operator.
Action
Disable EFI Secure Boot on the server.
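To verify whether Secure Boot is currently enabled on a node before changing firmware settings, you can query the state from the operating system; a sketch that assumes mokutil is available on the node:
$ mokutil --sb-state
SecureBoot enabled
After you disable Secure Boot in the firmware, the command should report SecureBoot disabled and the driver pod can be redeployed.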