Troubleshooting the NVIDIA GPU Operator
GPU Operator Pods Stuck in Crash Loop
Issue
On large clusters, such as those with 300 or more nodes, the GPU Operator pods can get stuck in a crash loop.
Observation
- The GPU Operator pod is not running:
  $ kubectl get pod -n gpu-operator -l app=gpu-operator
  Example Output
  NAME                            READY   STATUS             RESTARTS      AGE
  gpu-operator-568c7ff7f6-chg5b   0/1     CrashLoopBackOff   4 (85s ago)   4m42s
- The node that is running the GPU Operator pod has sufficient resources and the node is Ready:
  $ kubectl describe node <node-name>
  Example Output
  Conditions:
    Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
    ----             ------  -----------------                 ------------------                ------                       -------
    MemoryPressure   False   Tue, 26 Dec 2023 14:01:31 +0000   Tue, 12 Dec 2023 19:47:47 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
    DiskPressure     False   Tue, 26 Dec 2023 14:01:31 +0000   Thu, 14 Dec 2023 19:15:03 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
    PIDPressure      False   Tue, 26 Dec 2023 14:01:31 +0000   Tue, 12 Dec 2023 19:47:47 +0000   KubeletHasSufficientPID      kubelet has sufficient PID available
    Ready            True    Tue, 26 Dec 2023 14:01:31 +0000   Thu, 14 Dec 2023 19:15:13 +0000   KubeletReady                 kubelet is posting ready status
Root Cause
The memory resource limit for the GPU Operator is too low for the cluster size.
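One way to check for this root cause, as a minimal sketch: when a container exceeds its memory limit, Kubernetes typically records OOMKilled as the reason for its last termination. The function names below are hypothetical helpers, not part of the GPU Operator; the namespace and label are taken from the kubectl command in the Observation section.

```shell
# Hypothetical helper: print the reason each GPU Operator container last
# terminated. "OOMKilled" suggests the container exceeded its memory limit.
last_termination_reason() {
  kubectl get pod -n gpu-operator -l app=gpu-operator \
    -o jsonpath='{.items[*].status.containerStatuses[*].lastState.terminated.reason}'
}

# Hypothetical helper: succeed only if a last termination was an OOM kill.
was_oom_killed() {
  case "$(last_termination_reason)" in
    *OOMKilled*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example use (commented out so that sourcing this file has no side effects):
# if was_oom_killed; then echo "memory limit is likely too low"; fi
```

If the reported reason is something other than OOMKilled, the crash loop may have a different cause, and the pod logs are the next place to look.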
Action
Increase the memory request and limit for the GPU Operator pod:
- Set the memory request to a value that matches the average memory consumption over a large time window.
- Set the memory limit to match the spikes in memory consumption that occur occasionally. 
- Increase the memory resource limit for the GPU Operator pod:
  $ kubectl patch deployment gpu-operator -n gpu-operator --type='json' \
      -p='[{"op":"replace", "path":"/spec/template/spec/containers/0/resources/limits/memory", "value":"1400Mi"}]'
- Optional: Increase the memory resource request for the pod:
  $ kubectl patch deployment gpu-operator -n gpu-operator --type='json' \
      -p='[{"op":"replace", "path":"/spec/template/spec/containers/0/resources/requests/memory", "value":"600Mi"}]'
Monitor the GPU Operator pod. Increase the memory request and limit again if the pod remains stuck in a crash loop.
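The sizing guidance above can be sketched as a small calculation. This assumes you have already collected memory samples for the GPU Operator pod in Mi, one value per line, for example from repeated runs of kubectl top pod (which requires the Metrics Server). The helper name suggest_memory is hypothetical. It prints the rounded-up average as a candidate request and the observed peak as a candidate limit.

```shell
# Hypothetical helper: read memory samples (in Mi, one per line) from stdin
# and print a candidate request (average, rounded up) and limit (peak).
suggest_memory() {
  awk '
    BEGIN { max = 0 }
    NF {
      sum += $1
      n++
      if ($1 > max) max = $1
    }
    END {
      if (n == 0) {
        print "no samples" > "/dev/stderr"
        exit 1
      }
      avg = int(sum / n)
      if (avg * n < sum) avg++   # round the average up
      printf "request=%dMi limit=%dMi\n", avg, max
    }'
}

# Example use (commented out; samples.txt is an assumed file of Mi values):
# suggest_memory < samples.txt
```

The printed values are starting points for the kubectl patch commands above. You may want some headroom above the observed peak, because a short sampling window can miss the largest spikes.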
infoROM is corrupted (nvidia-smi return code 14)
Issue
The nvidia-operator-validator pod fails, and the nvidia-driver-daemonsets pods fail as well.
Observation
The output from the driver validation container indicates that the infoROM is corrupt:
$ kubectl logs -n gpu-operator nvidia-operator-validator-xxxxx -c driver-validation
Example Output
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 00000000:0B:00.0 Off |                    0 |
| N/A   42C    P0    29W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
WARNING: infoROM is corrupted at gpu 0000:0B:00.0
14
The warning at the end of the output reports that the infoROM of the GPU at 0000:0B:00.0 is corrupted. The final line, 14, is the nvidia-smi return code.
The return values for the nvidia-smi command are listed below.
RETURN VALUE
Return code reflects whether the operation succeeded or failed and what
was the reason of failure.
·      Return code 0 - Success
·      Return code 2 - A supplied argument or flag is invalid
·      Return code 3 - The requested operation is not available on target device
·      Return code 4 - The current user does not have permission to access this device or perform this operation
·      Return code 6 - A query to find an object was unsuccessful
·      Return code 8 - A device's external power cables are not properly attached
·      Return code 9 - NVIDIA driver is not loaded
·      Return code 10 - NVIDIA Kernel detected an interrupt issue with a GPU
·      Return code 12 - NVML Shared Library couldn't be found or loaded
·      Return code 13 - Local version of NVML doesn't implement this function
·      Return code 14 - infoROM is corrupted
·      Return code 15 - The GPU has fallen off the bus or has otherwise become inaccessible
·      Return code 255 - Other error or internal driver error occurred
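When checking a node by hand, it can help to translate the numeric return code into the matching description. The following is a minimal sketch that encodes only the return values listed above. The function name describe_nvidia_smi_rc is a hypothetical helper, not part of nvidia-smi.

```shell
# Hypothetical helper: map an nvidia-smi return code to its description,
# using only the return values listed in this section.
describe_nvidia_smi_rc() {
  case "$1" in
    0)   echo "Success" ;;
    2)   echo "A supplied argument or flag is invalid" ;;
    3)   echo "The requested operation is not available on target device" ;;
    4)   echo "The current user does not have permission to access this device or perform this operation" ;;
    6)   echo "A query to find an object was unsuccessful" ;;
    8)   echo "A device's external power cables are not properly attached" ;;
    9)   echo "NVIDIA driver is not loaded" ;;
    10)  echo "NVIDIA Kernel detected an interrupt issue with a GPU" ;;
    12)  echo "NVML Shared Library couldn't be found or loaded" ;;
    13)  echo "Local version of NVML doesn't implement this function" ;;
    14)  echo "infoROM is corrupted" ;;
    15)  echo "The GPU has fallen off the bus or has otherwise become inaccessible" ;;
    255) echo "Other error or internal driver error occurred" ;;
    *)   echo "Unknown return code: $1" ;;
  esac
}

# Example use on a GPU node (commented out so sourcing has no side effects):
# nvidia-smi > /dev/null; describe_nvidia_smi_rc "$?"
```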
Root Cause
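The nvidia-smi command must return a success code (return code 0) for the driver-validation container to pass and for the GPU Operator to successfully deploy the driver pod on the node. Because the infoROM is corrupted, nvidia-smi returns code 14 instead, so validation fails.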
Action
Replace the faulty GPU.
EFI + Secure Boot
Issue
The GPU driver pod fails to deploy.
Root Cause
EFI Secure Boot is currently not supported with the GPU Operator.
Action
Disable EFI Secure Boot on the server.
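Before and after changing the firmware setting, you may want to confirm the Secure Boot state from the running operating system. The sketch below assumes the mokutil utility is installed on the node, which this guide does not require and which is not available on every distribution. The function name secure_boot_state is hypothetical.

```shell
# Hypothetical helper: report the Secure Boot state of the current host.
# Assumes mokutil is installed; prints "unknown" if it is missing or fails
# (for example, on a system that did not boot through EFI).
secure_boot_state() {
  state=$(mokutil --sb-state 2>/dev/null) || { echo "unknown"; return 0; }
  case "$state" in
    *"SecureBoot enabled"*)  echo "enabled" ;;
    *"SecureBoot disabled"*) echo "disabled" ;;
    *)                       echo "unknown" ;;
  esac
}

# Example use on the server (commented out so sourcing has no side effects):
# secure_boot_state
```

A result of "enabled" matches the root cause described above. A result of "unknown" means the check could not be made, not that Secure Boot is off.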