Overview
The NVIDIA® NVDebug tool (also known as nvdebug) can be deployed on server platforms or accessed remotely from client machines. This tool collects a variety of data to aid in troubleshooting server issues, including:
- Out-of-band (OOB) Baseboard Management Controller (BMC)-based logs and diagnostic information. 
- Logs gathered directly from the host system. 
This user guide covers the following topics:
- An overview of nvdebug. 
- System requirements and dependencies. 
- Usage instructions. 
- Summary reporting. 
- Guidance on log analysis. 
- Supported features and validation suites. 
Note
nvdebug currently supports specific NVIDIA platforms. NVIDIA will continue to expand its coverage, adding support for additional products over time.
Supported Products
Below is a list of the currently supported NVIDIA platforms:
- NVIDIA® DGX™ H100 
- NVIDIA HGX™ H100 8-GPU Baseboard 
- NVIDIA HGX™ H200 8-GPU Baseboard 
- NVIDIA HGX™ H800 8-GPU Baseboard 
- NVIDIA HGX™ B200 8-GPU Baseboard 
- NVIDIA HGX™ B300 
- NVIDIA MGX™ GH200 
- NVIDIA GB200 NVL 
- NVIDIA GB200 NVL NVSwitchTray 
- NVIDIA GB300 NVL 
- NVIDIA GB300 NVL NVSwitchTray 
- NVIDIA MGX™ C2 
Terminology
- BMC: Baseboard Management Controller, a specialized service processor that provides out-of-band management capabilities for server hardware. 
- CID: Collector ID, a unique identifier for each log collector (for example, R1, I1, S1, H1). 
- CPER: Common Platform Error Record, a standardized format for reporting hardware errors. 
- DUT: Device Under Test, the system being tested. 
- ERoT/IRoT: External/Internal Root of Trust, security components used for secure boot and firmware verification. 
- FDR: First Data Record, a record of system events and errors. 
- FPGA: Field-Programmable Gate Array, a programmable logic device used in some NVIDIA systems for hardware management. 
- FRU: Field Replaceable Unit, hardware components that can be replaced in the field, such as CPUs, memory modules, or storage devices. 
- GNMI: gNMI (gRPC Network Management Interface), a protocol used for network device management. 
- HMC: Hardware Management Controller, a specialized controller used in HGX systems for hardware management and monitoring. 
- I2C: Inter-Integrated Circuit, a serial communication protocol used for connecting low-speed peripherals. 
- IPMI: Intelligent Platform Management Interface, a standardized protocol for monitoring and managing server hardware. 
- Log Collector: A defined type of log collection process. 
  Refer to “Appendix A: Use Cases, IPMI, SSH APIs, and SSH Commands” for more information.
- Log Collector Group: A grouping mechanism for log collections, such as Redfish, IPMI, SSH, Host, and HealthCheck. 
  Each Log Collector Group contains at least one Log Collector ID.
- Log Collector ID: A unique identifier assigned to each log collection instance. 
- NodeType: The type of node in a system, which can be one of the following:
  - Compute
  - SwitchTray
  - PowerShelf
- NVLink: NVIDIA’s high-speed interconnect technology for GPU-to-GPU communication. 
- NVL: Abbreviation of NVLink, used in product names (for example, GB200 NVL) to denote systems built on NVLink high-speed interconnects. 
- NVOS: NVIDIA Operating System, which is used in NVSwitch systems. 
- OEM/ODM: Original Equipment Manufacturer/Original Design Manufacturer, companies that manufacture or design products for other companies. 
- OOB: Out-of-Band, management and monitoring capabilities that operate independently of the host operating system. 
- PCIe: Peripheral Component Interconnect Express, a high-speed serial computer expansion bus standard. 
- PLDM: Platform Level Data Model, a standard for platform management. 
- Redfish: A standard RESTful API for managing and monitoring server hardware and infrastructure. 
- SDR: Sensor Data Record, a record of sensor information and thresholds in the system. 
- SEL: System Event Log, a record of system events and errors stored by the BMC. 
- SMBIOS: System Management BIOS, a standard for delivering management information using the BIOS. 
- SPDM: Security Protocol and Data Model, a standard for secure device communication. 
- Target Platform Type: The expected categories of target platforms include:
  - NVIDIA HGX-HMC
  - DGX
  - x86_64
  - arm64
  - NVSwitch
  - PowerShelf
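The relationship between Log Collector Groups, Log Collector IDs, and NodeType described above can be pictured with a small data model. The following Python sketch is illustrative only and is not part of nvdebug: the class names, the `group_collector_ids` helper, and in particular the mapping from a CID's first letter to a group (R to Redfish, I to IPMI, S to SSH, H to Host) are assumptions inferred from the example IDs in this section.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class NodeType(Enum):
    """Node types listed in the Terminology section."""
    COMPUTE = "Compute"
    SWITCH_TRAY = "SwitchTray"
    POWER_SHELF = "PowerShelf"


# ASSUMPTION: the prefix-to-group mapping is a guess based on the example
# CIDs (R1, I1, S1, H1); the real tool may assign prefixes differently.
ASSUMED_PREFIX_TO_GROUP: Dict[str, str] = {
    "R": "Redfish",
    "I": "IPMI",
    "S": "SSH",
    "H": "Host",
}


@dataclass
class LogCollectorGroup:
    """A named group holding at least one Log Collector ID (CID)."""
    name: str
    collector_ids: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # The glossary states that each group contains at least one CID.
        if not self.collector_ids:
            raise ValueError(f"group {self.name!r} needs at least one CID")


def group_collector_ids(cids: List[str]) -> Dict[str, LogCollectorGroup]:
    """Bucket CIDs into groups using the assumed prefix mapping."""
    buckets: Dict[str, List[str]] = {}
    for cid in cids:
        prefix = cid[:1].upper()
        if prefix not in ASSUMED_PREFIX_TO_GROUP:
            raise ValueError(f"unknown CID prefix in {cid!r}")
        buckets.setdefault(ASSUMED_PREFIX_TO_GROUP[prefix], []).append(cid)
    return {name: LogCollectorGroup(name, ids) for name, ids in buckets.items()}


if __name__ == "__main__":
    for group in group_collector_ids(["R1", "I1", "S1", "H1", "R2"]).values():
        print(group.name, group.collector_ids)
```

Validating the "at least one Log Collector ID" rule in the constructor keeps an empty group from being created by accident. Everything else here is a convenience for reading the glossary, not a description of how nvdebug stores its collectors.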