Overview#
The NVIDIA® NVDebug tool (also known as nvdebug) can be deployed on server platforms or accessed remotely from client machines. This tool collects a variety of data to aid in troubleshooting server issues, including:
Out-of-band (OOB) Baseboard Management Controller (BMC)-based logs and diagnostic information.
Logs gathered directly from the host system.
This user guide covers the following topics:
An overview of nvdebug.
System requirements and dependencies.
Usage instructions.
Summary reporting.
Guidance on log analysis.
Supported features and validation suites.
Note
nvdebug currently supports specific NVIDIA platforms. NVIDIA will continue to expand its coverage, adding support for additional products over time.
Supported Products#
Below is a list of the currently supported NVIDIA platforms:
NVIDIA® DGX™ H100
NVIDIA HGX™ H100 8-GPU Baseboard
NVIDIA HGX™ H200 8-GPU Baseboard
NVIDIA HGX™ H800 8-GPU Baseboard
NVIDIA HGX™ B200 8-GPU Baseboard
NVIDIA HGX™ B300
NVIDIA MGX™ GH200
NVIDIA GB200 NVL
NVIDIA GB200 NVL NVSwitchTray
NVIDIA GB300 NVL
NVIDIA GB300 NVL NVSwitchTray
NVIDIA MGX™ C2
Terminology#
BMC: Baseboard Management Controller, a specialized service processor that provides out-of-band management capabilities for server hardware.
CID: Collector ID, a unique identifier for each log collector (for example, R1, I1, S1, H1).
CPER: Common Platform Error Record, a standardized format for reporting hardware errors.
DUT: Device Under Test, the system being tested.
ERoT/IRoT: External/Internal Root of Trust, security components used for secure boot and firmware verification.
FDR: First Data Record, a record of system events and errors.
FPGA: Field-Programmable Gate Array, a programmable logic device used in some NVIDIA systems for hardware management.
FRU: Field Replaceable Unit, hardware components that can be replaced in the field, such as CPUs, memory modules, or storage devices.
GNMI: gNMI (gRPC Network Management Interface), a protocol used for network device management.
HMC: Hardware Management Controller, a specialized controller used in HGX systems for hardware management and monitoring.
I2C: Inter-Integrated Circuit, a serial communication protocol used for connecting low-speed peripherals.
IPMI: Intelligent Platform Management Interface, a standardized protocol for monitoring and managing server hardware.
Log Collector: A defined type of log collection process.
Refer to “Appendix A: Use Cases, IPMI, SSH APIs, and SSH Commands” for more information.
Log Collector Group: A grouping mechanism for log collections, such as Redfish, IPMI, SSH, Host, and HealthCheck.
Each Log Collector Group contains at least one Log Collector ID.
Log Collector ID: A unique identifier assigned to each log collection instance.
NodeType: The type of node in a system, which can be one of the following:
Compute
SwitchTray
PowerShelf
NVLink: NVIDIA’s high-speed interconnect technology for GPU-to-GPU communication.
NVL: NVLink, referring to systems that use NVLink technology for high-speed interconnects.
NVOS: NVIDIA Operating System, which is used in NVSwitch systems.
OEM/ODM: Original Equipment Manufacturer/Original Design Manufacturer, companies that manufacture or design products for other companies.
OOB: Out-of-Band, refers to management and monitoring capabilities that operate independently of the main system’s operating system.
PCIe: Peripheral Component Interconnect Express, a high-speed serial computer expansion bus standard.
PLDM: Platform Level Data Model, a standard for platform management.
Redfish: A standard RESTful API to manage server hardware and provides a modern way to manage and monitor server infrastructure.
SDR: Sensor Data Record, a record of sensor information and thresholds in the system.
SEL: System Event Log, a record of system events and errors stored by the BMC.
SMBIOS: System Management BIOS, a standard for delivering management information using the BIOS.
SPDM: Security Protocol and Data Model, a standard for secure device communication.
Target Platform Type: The expected categories of target platforms include:
NVIDIA HGX-HMC
DGX
x86_64
arm64
NVSwitch
PowerShelf