Troubleshooting

infoROM is corrupted (nvidia-smi return code 14)

Issue:

The nvidia-operator-validator pod fails, and the nvidia-driver-daemonset pods fail as well.
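
To see which pods are affected, list the pods in the operator namespace (the namespace is gpu-operator here; it may differ in your deployment):

kubectl get pods -n gpu-operator

In this situation the nvidia-operator-validator pod typically shows an Init:Error or Init:CrashLoopBackOff status.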

Observation:

Output from kubectl logs -n gpu-operator nvidia-operator-validator-xxxxx -c driver-validation:

| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 00000000:0B:00.0 Off |                    0 |
| N/A   42C    P0    29W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
WARNING: infoROM is corrupted at gpu 0000:0B:00.0
14

nvidia-smi reports a warning that the infoROM of the GPU is corrupted and exits with return code 14 (the final line of the log above).
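
If you have shell access to the node, the check performed by the validator can be reproduced by running nvidia-smi directly and inspecting its exit status (a sketch; assumes the NVIDIA driver utilities are available on the host):

nvidia-smi
echo $?    # prints 14 when the infoROM is corrupted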

Note:

The possible return values for nvidia-smi are listed below (reference: the nvidia-smi specification):

RETURN VALUE

Return code reflects whether the operation succeeded or failed and what
was the reason of failure.

·      Return code 0 - Success
·      Return code 2 - A supplied argument or flag is invalid
·      Return code 3 - The requested operation is not available on target device
·      Return code 4 - The current user does not have permission to access this device or perform this operation
·      Return code 6 - A query to find an object was unsuccessful
·      Return code 8 - A device's external power cables are not properly attached
·      Return code 9 - NVIDIA driver is not loaded
·      Return code 10 - NVIDIA Kernel detected an interrupt issue with a GPU
·      Return code 12 - NVML Shared Library couldn't be found or loaded
·      Return code 13 - Local version of NVML doesn't implement this function
·      Return code 14 - infoROM is corrupted
·      Return code 15 - The GPU has fallen off the bus or has otherwise become inaccessible
·      Return code 255 - Other error or internal driver error occurred
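
As a worked example of using this table, the following shell snippet (a sketch, not part of the GPU Operator) runs nvidia-smi and translates a few of the common return codes into messages:

#!/bin/sh
# Run nvidia-smi, discard its normal output, and keep only the exit code.
nvidia-smi > /dev/null 2>&1
rc=$?
case "$rc" in
  0)  echo "nvidia-smi succeeded" ;;
  9)  echo "NVIDIA driver is not loaded (return code 9)" ;;
  14) echo "infoROM is corrupted (return code 14)" ;;
  15) echo "GPU has fallen off the bus (return code 15)" ;;
  *)  echo "nvidia-smi failed with return code $rc" ;;
esac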

Root cause:

nvidia-smi must return a success code (return code 0) for the driver-validation check to pass and for the GPU Operator to successfully deploy the driver pod on the node. Because the infoROM on this GPU is corrupted, nvidia-smi exits with return code 14 instead, so validation fails.
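
The non-zero exit code of the driver-validation init container is also visible in the pod status (a sketch; the pod name suffix is a placeholder):

kubectl describe pod -n gpu-operator nvidia-operator-validator-xxxxx | grep -A 10 driver-validation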

Action:

Replace the faulty GPU.
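
Before the hardware is swapped, the node is usually taken out of service first; a minimal sketch, assuming the affected node is named gpu-node-1 (a hypothetical name):

# Stop new pods from being scheduled on the node, then evict running workloads
kubectl cordon gpu-node-1
kubectl drain gpu-node-1 --ignore-daemonsets --delete-emptydir-data

# After the GPU has been replaced and the node is healthy again
kubectl uncordon gpu-node-1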

EFI + Secure Boot

Issue: The GPU driver pod fails to deploy.

Root cause: EFI Secure Boot is currently not supported with the GPU Operator.

Action: Disable EFI Secure Boot on the server.
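
To confirm the Secure Boot state on the node before and after the change, mokutil can be used where it is installed (a sketch; the firmware setup screen or dmesg output can be checked instead):

# Reports "SecureBoot enabled" or "SecureBoot disabled"
mokutil --sb-state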