Troubleshooting the NVIDIA GPU Operator
GPU Operator Validator: Failed to Create Pod Sandbox
Issue
On some occasions, the driver container is unable to unload the nouveau
Linux kernel module.
Observation
Running the following command includes the event shown in the example output:

$ kubectl describe pod -n gpu-operator -l app=nvidia-operator-validator

Events:
  Type     Reason                  Age                 From     Message
  ----     ------                  ----                ----     -------
  Warning  FailedCreatePodSandBox  8s (x21 over 9m2s)  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
Running any of the following commands on the node indicates that the nouveau Linux kernel module is loaded:

$ lsmod | grep -i nouveau
$ dmesg | grep -i nouveau
$ journalctl -xb | grep -i nouveau
Root Cause
The nouveau Linux kernel module is loaded and the driver container is unable to unload it.
Because the nouveau module is loaded, the driver container cannot load the nvidia module.
Action
On each node, run the following commands to prevent loading the nouveau
Linux kernel module on boot:
$ sudo tee /etc/modules-load.d/ipmi.conf <<< "ipmi_msghandler" \
&& sudo tee /etc/modprobe.d/blacklist-nouveau.conf <<< "blacklist nouveau" \
&& sudo tee -a /etc/modprobe.d/blacklist-nouveau.conf <<< "options nouveau modeset=0"
$ sudo update-initramfs -u
$ sudo init 6
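After the node reboots, you can quickly confirm that the change took effect: the first command below should print nothing, and once the driver container has started, the second should list the nvidia kernel modules.

$ lsmod | grep -i nouveau
$ lsmod | grep -i nvidia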
No GPU Driver or Operand Pods Running
Issue
On some clusters, taints are applied to nodes with a taint effect of NoSchedule.
Observation
Running the following command shows 0 for DESIRED, CURRENT, READY, and so on:

$ kubectl get ds -n gpu-operator

NAME                    DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                      AGE
gpu-feature-discovery   0         0         0       0            0           nvidia.com/gpu.deploy.gpu-feature-discovery=true   11m
...
Root Cause
The NoSchedule taint prevents the Operator from deploying the GPU Driver and other Operand pods.
Action
Describe each node, identify the taints, and either remove the taints from the nodes or add the taints as tolerations to the daemon sets.
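As a sketch, the following commands list the taints on every node, remove an offending taint from a node, or add the taint as a toleration to all operands through the ClusterPolicy. The placeholder <taint-key> is whatever key you find on your nodes, and the last command assumes the default ClusterPolicy name, cluster-policy, and an Operator version that exposes the daemonsets.tolerations field.

$ kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints

$ kubectl taint nodes <node-name> <taint-key>:NoSchedule-

$ kubectl patch clusterpolicy/cluster-policy --type merge \
    -p '{"spec": {"daemonsets": {"tolerations": [{"key": "<taint-key>", "operator": "Exists", "effect": "NoSchedule"}]}}}'

Note that a merge patch replaces the entire tolerations list, so include any tolerations that are already set, such as the default nvidia.com/gpu toleration.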
GPU Operator Pods Stuck in Crash Loop
Issue
On large clusters, such as 300 or more nodes, the GPU Operator pods can get stuck in a crash loop.
Observation
The GPU Operator pod is not running:
$ kubectl get pod -n gpu-operator -l app=gpu-operator
Example Output
NAME                            READY   STATUS             RESTARTS      AGE
gpu-operator-568c7ff7f6-chg5b   0/1     CrashLoopBackOff   4 (85s ago)   4m42s
The node that is running the GPU Operator pod has sufficient resources and the node is Ready:

$ kubectl describe node <node-name>
Example Output
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Tue, 26 Dec 2023 14:01:31 +0000   Tue, 12 Dec 2023 19:47:47 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Tue, 26 Dec 2023 14:01:31 +0000   Thu, 14 Dec 2023 19:15:03 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Tue, 26 Dec 2023 14:01:31 +0000   Tue, 12 Dec 2023 19:47:47 +0000   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Tue, 26 Dec 2023 14:01:31 +0000   Thu, 14 Dec 2023 19:15:13 +0000   KubeletReady                 kubelet is posting ready status
Root Cause
The memory resource limit for the GPU Operator is too low for the cluster size.
Action
Increase the memory request and limit for the GPU Operator pod:
Set the memory request to a value that matches the average memory consumption over a large time window.
Set the memory limit to accommodate the occasional spikes in memory consumption.
Increase the memory resource limit for the GPU Operator pod:
$ kubectl patch deployment gpu-operator -n gpu-operator --type='json' \
    -p='[{"op":"replace", "path":"/spec/template/spec/containers/0/resources/limits/memory", "value":"1400Mi"}]'
Optional: Increase the memory resource request for the pod:
$ kubectl patch deployment gpu-operator -n gpu-operator --type='json' \
    -p='[{"op":"replace", "path":"/spec/template/spec/containers/0/resources/requests/memory", "value":"600Mi"}]'
Monitor the GPU Operator pod. Increase the memory request and limit again if the pod remains stuck in a crash loop.
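To choose values that reflect actual usage, you can watch the pod's memory consumption over time, for example with metrics-server or an equivalent metrics API installed:

$ kubectl top pod -n gpu-operator -l app=gpu-operator

Use the observed average as a guide for the request and leave headroom above the observed peaks for the limit.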
infoROM is corrupted (nvidia-smi return code 14)
Issue
The nvidia-operator-validator pod fails and the nvidia-driver-daemonset pods fail as well.
Observation
The output from the driver validation container indicates that the infoROM is corrupt:
$ kubectl logs -n gpu-operator nvidia-operator-validator-xxxxx -c driver-validation
Example Output
| NVIDIA-SMI 470.82.01 Driver Version: 470.82.01 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla P100-PCIE... On | 00000000:0B:00.0 Off | 0 |
| N/A 42C P0 29W / 250W | 0MiB / 16280MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
WARNING: infoROM is corrupted at gpu 0000:0B:00.0
14
The GPU emits some warning messages related to the infoROM. The return values for the nvidia-smi command are listed below.
RETURN VALUE
Return code reflects whether the operation succeeded or failed and what was the reason of failure.
· Return code 0 - Success
· Return code 2 - A supplied argument or flag is invalid
· Return code 3 - The requested operation is not available on target device
· Return code 4 - The current user does not have permission to access this device or perform this operation
· Return code 6 - A query to find an object was unsuccessful
· Return code 8 - A device's external power cables are not properly attached
· Return code 9 - NVIDIA driver is not loaded
· Return code 10 - NVIDIA Kernel detected an interrupt issue with a GPU
· Return code 12 - NVML Shared Library couldn't be found or loaded
· Return code 13 - Local version of NVML doesn't implement this function
· Return code 14 - infoROM is corrupted
· Return code 15 - The GPU has fallen off the bus or has otherwise become inaccessible
· Return code 255 - Other error or internal driver error occurred
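To confirm the return code on a node, run nvidia-smi and check the shell exit status; with the containerized driver, run the command inside the driver pod instead. A GPU with a corrupted infoROM produces exit status 14, matching the example above.

$ nvidia-smi
$ echo $?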
Root Cause
The nvidia-smi command must return a success code (return code 0) for the driver-validation container to pass and for the GPU Operator to successfully deploy the driver pod on the node.
Action
Replace the faulty GPU.
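Before replacing the board, it can help to record which physical GPU is affected, for example by querying the device at the PCI bus ID reported in the warning (the bus ID below is taken from the example output above):

$ nvidia-smi -q -i 0000:0B:00.0 | grep -i -E "serial|bus id"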
EFI + Secure Boot
Issue
GPU Driver pod fails to deploy.
Root Cause
EFI Secure Boot is currently not supported with the GPU Operator.
Action
Disable EFI Secure Boot on the server.
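To verify whether Secure Boot is enabled on a node before changing firmware settings, one common check, assuming the mokutil utility is installed, is:

$ mokutil --sb-state

The command reports SecureBoot enabled or SecureBoot disabled; the setting itself is changed in the server's UEFI firmware menu.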