Troubleshooting#
This section describes how to troubleshoot the MNNVL rack.
Troubleshooting the Compute Tray#
This section provides information about troubleshooting the compute tray.
Driver Not Loading#
If you run nvidia-smi and it shows "No devices were found", the GPU driver isn’t loading correctly.
Collect the dmesg output and check for XID errors in the log.
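The check above can be sketched as a small shell helper. This is a minimal sketch, not an official procedure: the function name xid_lines is illustrative, and the pattern assumes that XID events appear in the kernel log with the "NVRM: Xid" prefix.

```shell
# Hypothetical helper: read kernel log text on stdin and print only XID lines.
# Assumption: XID events are logged with the "NVRM: Xid" prefix.
xid_lines() {
    # "|| true" keeps the exit status 0 when no XID line is present.
    grep -E 'NVRM: Xid' || true
}

# Typical use on a compute tray (dmesg may require root):
#   sudo dmesg -T | xid_lines
#   lsmod | grep '^nvidia'    # no output means no nvidia kernel module is loaded
```

An empty result from xid_lines only means that no XID event was logged; the rest of the dmesg output may still show why the driver failed to load.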
Troubleshooting the Switch Tray#
This section provides information about troubleshooting the switch tray.
Note
The NVOS User Manual contains additional troubleshooting information.
Switch Health#
If the nv show system health status is FATAL, reboot the switch tray.
nv action reboot system

Figure 13 Fatal system health#
If the error persists, contact NVIDIA.
BMC Firmware Issues#
If the nv show platform firmware output shows an error for the BMC, AC power-cycle the switch tray using the following command.
nv action power-cycle system
Note
The reboot command only reboots the COMe and not the switch BMC, whereas the power-cycle command performs a full AC power-cycle without gracefully shutting down the system.
Troubleshooting the NVLink Fabric#
This section provides information about troubleshooting the NVLink fabric.
NMX-Controller on Multiple Trays#
The NVLink fabric requires the GPUs on each compute node to communicate with all the NVLink switches. Only one NVLink switch runs the NMX-C services.
Running NMX-C on more than one NVLink switch isn’t supported.
Viewing Subnet Manager Logs#
You can find the Subnet Manager (SM) logs on the switch tray that runs NMX-C:
/var/log/nmx/nmx-c/nvlsm.log
Viewing Global Fabric Manager Logs#
You can find the Global Fabric Manager (GFM) logs on the switch tray that runs NMX-C:
/var/log/nmx/nmx-c/fabricmanager.log
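To look at the most recent SM and GFM entries together, a helper along the following lines may be convenient. It is only a sketch: the function name nmxc_recent and the default of 20 lines are illustrative choices, and the directory and file names are the ones listed above.

```shell
# Hypothetical helper: print the last N lines of the SM and GFM logs.
#   $1 = log directory (default: /var/log/nmx/nmx-c)
#   $2 = number of lines per log (default: 20)
nmxc_recent() {
    nmxc_dir=${1:-/var/log/nmx/nmx-c}
    nmxc_count=${2:-20}
    for nmxc_file in nvlsm.log fabricmanager.log; do
        echo "== $nmxc_dir/$nmxc_file =="
        if [ -r "$nmxc_dir/$nmxc_file" ]; then
            tail -n "$nmxc_count" "$nmxc_dir/$nmxc_file"
        else
            echo "(not readable)"
        fi
    done
}

# Typical use on the switch tray that runs NMX-C:
#   nmxc_recent
#   nmxc_recent /var/log/nmx/nmx-c 50
```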
Viewing IMEX Logs#
IMEX provides GPU-to-GPU memory management, and the mpi_memcpy application requires IMEX.
Each compute node writes its IMEX log to the /var/log/nvidia-imex.log file. To view the log, run the following command on the node.
cat /var/log/nvidia-imex.log
[Aug 05 2024 21:27:19] [INFO] [tid 2248] Node configuration validation from NODE 8 successfully matched this node's configuration.
[Aug 05 2024 21:27:19] [INFO] [tid 2248] Node configuration validation from NODE 1 successfully matched this node's configuration.
[Aug 05 2024 21:27:20] [INFO] [tid 2248] Node configuration validation from NODE 15 successfully matched this node's configuration.
[Aug 05 2024 21:27:24] [INFO] [tid 2248] Node configuration validation from NODE 13 successfully matched this node's configuration.
[Aug 05 2024 21:27:24] [WARNING] [tid 2242] Waiting for validation response for 45 seconds from the following nodes:
[Aug 05 2024 21:27:24] [WARNING] [tid 2242] 0 - 10.28.238.87
[Aug 05 2024 21:27:24] [WARNING] [tid 2242] 7 - 10.28.238.161
[Aug 05 2024 21:27:24] [WARNING] [tid 2242] 10 - 10.28.238.192
[Aug 05 2024 21:27:24] [WARNING] [tid 2242] 12 - 10.28.238.211
[Aug 05 2024 21:27:24] [WARNING] [tid 2242] 14 - 10.28.238.239
[Aug 05 2024 21:27:25] [INFO] [tid 2248] Node configuration validation from NODE 12 successfully matched this node's configuration.
[Aug 05 2024 21:27:29] [WARNING] [tid 2242] Waiting for validation response for 50 seconds from the following nodes:
[Aug 05 2024 21:27:29] [WARNING] [tid 2242] 0 - 10.28.238.87
[Aug 05 2024 21:27:29] [WARNING] [tid 2242] 7 - 10.28.238.161
[Aug 05 2024 21:27:29] [WARNING] [tid 2242] 10 - 10.28.238.192
[Aug 05 2024 21:27:29] [WARNING] [tid 2242] 14 - 10.28.238.239
[Aug 05 2024 21:27:29] [INFO] [tid 2248] Node configuration validation from NODE 14 successfully matched this node's configuration.
[Aug 05 2024 21:27:30] [INFO] [tid 2248] Node configuration validation from NODE 7 successfully matched this node's configuration.
[Aug 05 2024 21:27:34] [WARNING] [tid 2242] Waiting for validation response for 55 seconds from the following nodes:
[Aug 05 2024 21:27:34] [WARNING] [tid 2242] 0 - 10.28.238.87
[Aug 05 2024 21:27:34] [WARNING] [tid 2242] 10 - 10.28.238.192
[Aug 05 2024 21:27:34] [INFO] [tid 2248] Node configuration validation from NODE 0 successfully matched this node's configuration.
[Aug 05 2024 21:27:37] [INFO] [tid 2248] Node configuration validation from NODE 10 successfully matched this node's configuration.
[Aug 05 2024 21:27:37] [INFO] [tid 2242] Node map validation complete.
[Aug 05 2024 21:27:37] [INFO] [tid 2242] GPU event successfully subscribed
The message "GPU event successfully subscribed" indicates that IMEX has successfully registered with all other cluster GPUs.
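On a long log, it can help to script this check. The following is a minimal sketch that relies only on the message formats shown in the sample output above; the function names imex_subscribed and imex_waiting_nodes are illustrative and are not part of IMEX.

```shell
# Hypothetical helpers for scanning an IMEX log file.
# $1 = log file (default: /var/log/nvidia-imex.log)

# Exit status 0 if the log contains the subscription message.
imex_subscribed() {
    grep -q 'GPU event successfully subscribed' "${1:-/var/log/nvidia-imex.log}"
}

# Print "<node> - <address>" for nodes listed in the most recent
# "Waiting for validation response" warning that have not validated since.
imex_waiting_nodes() {
    awk '
        # A new warning block starts: forget the previous list.
        /Waiting for validation response/ { split("", pending); next }
        # Lines such as "[WARNING] [tid 2242] 10 - 10.28.238.192".
        /\[WARNING\]/ && / [0-9]+ - [0-9.]+$/ { pending[$(NF-2)] = $NF; next }
        # A node validated after the warning: drop it from the list.
        /validation from NODE [0-9]+ successfully matched/ {
            for (i = 1; i < NF; i++) if ($i == "NODE") delete pending[$(i+1)]
        }
        END { for (id in pending) print id " - " pending[id] }
    ' "${1:-/var/log/nvidia-imex.log}" | sort -n
}

# Typical use on a compute node:
#   imex_subscribed && echo "IMEX subscribed" || imex_waiting_nodes
```

If imex_waiting_nodes prints nothing and the subscription message is still missing, read the log directly, because the helper only understands the two message shapes shown in the sample.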
IMEX TRANSIENT_FAILURE State#
IMEX operates as a distributed service that runs on each compute node. If a single compute tray isn’t running IMEX, the IMEX service on the other compute trays never reaches the READY state.
Check the IMEX log, /var/log/nvidia-imex.log, for messages such as "State = TRANSIENT_FAILURE (NOT OK)" or "Not all clients connected. Waiting and retrying."
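A quick way to search for these messages is sketched below. The function name imex_not_ready_lines is illustrative, and the search strings are taken from the two messages quoted above.

```shell
# Hypothetical helper: print IMEX log lines that suggest the service is not ready.
# $1 = log file (default: /var/log/nvidia-imex.log)
imex_not_ready_lines() {
    # "|| true" keeps the exit status 0 when nothing matches.
    grep -E 'TRANSIENT_FAILURE|Not all clients connected' \
        "${1:-/var/log/nvidia-imex.log}" || true
}

# Typical use on a compute node:
#   imex_not_ready_lines | tail -n 5
```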
Restart the nvidia-imex service on all compute nodes to try to rebuild the IMEX cluster.
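The restart can be scripted from an administration host. The sketch below makes several assumptions that are not stated in this guide: a plain-text file with one compute node hostname per line (the name compute-nodes.txt is hypothetical), passwordless SSH and sudo on every node, and systemd managing the nvidia-imex service. Set DRY_RUN=1 to print the commands without running them.

```shell
# Hypothetical helper: restart nvidia-imex on every host listed in a file.
# $1 = file with one hostname per line; blank lines and "#" comments are skipped.
# Assumptions: passwordless SSH and sudo, and systemd on the compute nodes.
restart_imex_all() {
    while IFS= read -r imex_host || [ -n "$imex_host" ]; do
        case $imex_host in
            ''|'#'*) continue ;;
        esac
        if [ -n "${DRY_RUN:-}" ]; then
            echo "ssh -n $imex_host sudo systemctl restart nvidia-imex"
        else
            # -n stops ssh from consuming the host list on stdin.
            ssh -n "$imex_host" sudo systemctl restart nvidia-imex ||
                echo "restart failed on $imex_host" >&2
        fi
    done < "$1"
}

# Typical use:
#   DRY_RUN=1 restart_imex_all compute-nodes.txt   # preview
#   restart_imex_all compute-nodes.txt             # restart on all nodes
```

After the restart, check the IMEX log on each node again to confirm that the service reaches the READY state.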