Troubleshooting#

This section describes how to troubleshoot the MNNVL rack.

Troubleshooting the Compute Tray#

This section provides information about troubleshooting the compute tray.

Driver Not Loading#

If nvidia-smi reports "No devices were found", the GPU driver is not loading correctly.

Collect the dmesg output and check the kernel log for XID errors (lines that contain NVRM: Xid).
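The log check can be scripted. The following is a minimal sketch, not an NVIDIA tool: the helper names read_dmesg and find_xid_events are hypothetical, and the pattern assumes the usual driver log form "NVRM: Xid (PCI:<address>): <code>, ...". Reading dmesg may require root privileges, depending on the system configuration.

```python
import re
import subprocess

# Matches NVIDIA driver Xid lines such as:
# "NVRM: Xid (PCI:0000:01:00): 79, pid=1234, GPU has fallen off the bus."
XID_PATTERN = re.compile(r"NVRM: Xid \(([^)]*)\): (\d+)")


def read_dmesg():
    """Return the kernel ring buffer as text (hypothetical helper)."""
    result = subprocess.run(["dmesg"], capture_output=True, text=True, check=True)
    return result.stdout


def find_xid_events(dmesg_text):
    """Return (device, xid_code, full_line) for every Xid line in the log."""
    events = []
    for line in dmesg_text.splitlines():
        match = XID_PATTERN.search(line)
        if match:
            events.append((match.group(1), int(match.group(2)), line.strip()))
    return events
```

Calling find_xid_events(read_dmesg()) on the compute tray returns one tuple per Xid line. An empty list means that no Xid errors were logged, in which case the remaining dmesg output is still worth collecting.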

Troubleshooting the Switch Tray#

This section provides information about troubleshooting the switch tray.

Note

The NVOS User Manual contains additional troubleshooting information.

Switch Health#

If nv show system health reports a status of FATAL, reboot the switch tray with the following command:

nv action reboot system

Figure 13 Fatal system health#

If the error persists, contact NVIDIA.
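The escalation path in this section (reboot once, then contact NVIDIA if the status is still FATAL) can be written down as a small decision helper. This is only a sketch: next_health_step is a hypothetical function, and the status string is assumed to be read by the operator from the nv show system health output rather than parsed from it, because the output format is not described here.

```python
REBOOT_COMMAND = "nv action reboot system"  # command given in this section


def next_health_step(status, rebooted_already):
    """Suggest the next step for a status read from nv show system health."""
    # Only the FATAL status is covered by this procedure.
    if status.strip().upper() != "FATAL":
        return "no action from this procedure"
    # First occurrence: reboot the switch tray.
    if not rebooted_already:
        return "run: " + REBOOT_COMMAND
    # Still FATAL after a reboot: escalate.
    return "contact NVIDIA"
```

For example, next_health_step("FATAL", False) suggests the reboot command, and the same call with rebooted_already set to True suggests contacting NVIDIA.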

BMC Firmware Issues#

If nv show platform firmware reports an error for the BMC, AC power-cycle the switch tray with the following command:

nv action power-cycle system

Note

The reboot command reboots only the COMe, not the switch BMC. The power-cycle command performs a full AC power cycle and does not shut the system down gracefully.
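The difference between the two commands can be captured in a short lookup, which may help when scripting recovery steps. This is a sketch under one assumption: the resets_bmc flag reflects a reading of the note above (a full AC power cycle also restarts the switch BMC). The RestartCommand type and pick_restart function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RestartCommand:
    command: str
    resets_bmc: bool  # assumption: True when the switch BMC is restarted too


# Both command strings are taken from this section.
REBOOT = RestartCommand("nv action reboot system", resets_bmc=False)
POWER_CYCLE = RestartCommand("nv action power-cycle system", resets_bmc=True)


def pick_restart(bmc_needs_reset):
    """Return the power-cycle command only when the BMC itself must restart."""
    return POWER_CYCLE if bmc_needs_reset else REBOOT
```

Because the power cycle is not graceful, the helper returns it only when a BMC restart is actually required, and returns the reboot command otherwise.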