Overview

The NVIDIA® NVDebug tool (also known as nvdebug) can be deployed on server platforms or accessed remotely from client machines. This tool collects a variety of data to aid in troubleshooting server issues, including:

  • Out-of-band (OOB) logs and diagnostic information collected through the Baseboard Management Controller (BMC).

  • Logs gathered directly from the host system.

This user guide covers the following topics:

  • An overview of nvdebug.

  • System requirements and dependencies.

  • Usage instructions.

  • Summary reporting.

  • Guidance on log analysis.

  • Supported features and validation suites.

Note

nvdebug currently supports specific NVIDIA platforms. NVIDIA will continue to expand its coverage, adding support for additional products over time.

Supported Products

nvdebug currently supports the following NVIDIA platforms:

  • NVIDIA® DGX™ H100

  • NVIDIA HGX™ H100 8-GPU Baseboard

  • NVIDIA HGX™ H200 8-GPU Baseboard

  • NVIDIA HGX™ H800 8-GPU Baseboard

  • NVIDIA HGX™ B200 8-GPU Baseboard

  • NVIDIA HGX™ B300

  • NVIDIA MGX™ GH200

  • NVIDIA GB200 NVL

  • NVIDIA GB200 NVL NVSwitchTray

  • NVIDIA GB300 NVL

  • NVIDIA GB300 NVL NVSwitchTray

  • NVIDIA MGX™ C2

Terminology

  • BMC: Baseboard Management Controller, a specialized service processor that provides out-of-band management capabilities for server hardware.

  • CID: Collector ID, a unique identifier for each log collector (for example, R1, I1, S1, H1).

  • CPER: Common Platform Error Record, a standardized format for reporting hardware errors.

  • DUT: Device Under Test, the system being tested.

  • ERoT/IRoT: External/Internal Root of Trust, security components used for secure boot and firmware verification.

  • FDR: First Data Record, a record of system events and errors.

  • FPGA: Field-Programmable Gate Array, a programmable logic device used in some NVIDIA systems for hardware management.

  • FRU: Field Replaceable Unit, hardware components that can be replaced in the field, such as CPUs, memory modules, or storage devices.

  • GNMI: gNMI (gRPC Network Management Interface), a protocol used for network device management.

  • HMC: Hardware Management Controller, a specialized controller used in HGX systems for hardware management and monitoring.

  • I2C: Inter-Integrated Circuit, a serial communication protocol used for connecting low-speed peripherals.

  • IPMI: Intelligent Platform Management Interface, a standardized protocol for monitoring and managing server hardware.

  • Log Collector: A defined type of log collection process. Refer to “Appendix A: Use Cases, IPMI, SSH APIs, and SSH Commands” for more information.

  • Log Collector Group: A grouping mechanism for log collections, such as Redfish, IPMI, SSH, Host, and HealthCheck. Each Log Collector Group contains at least one Log Collector ID.

  • Log Collector ID: A unique identifier assigned to each log collection instance.

  • NodeType: The type of node in a system, which can be one of the following:

    • Compute

    • SwitchTray

    • PowerShelf

  • NVLink: NVIDIA’s high-speed interconnect technology for GPU-to-GPU communication.

  • NVL: NVLink, referring to systems that use NVLink technology for high-speed interconnects.

  • NVOS: NVIDIA Operating System, which is used in NVSwitch systems.

  • OEM/ODM: Original Equipment Manufacturer/Original Design Manufacturer, companies that manufacture or design products for other companies.

  • OOB: Out-of-Band, management and monitoring capabilities that operate independently of the host operating system.

  • PCIe: Peripheral Component Interconnect Express, a high-speed serial computer expansion bus standard.

  • PLDM: Platform Level Data Model, a standard for platform management.

  • Redfish: A standard RESTful API for managing server hardware that provides a modern way to manage and monitor server infrastructure.

  • SDR: Sensor Data Record, a record of sensor information and thresholds in the system.

  • SEL: System Event Log, a record of system events and errors stored by the BMC.

  • SMBIOS: System Management BIOS, a standard for delivering management information using the BIOS.

  • SPDM: Security Protocol and Data Model, a standard for secure device communication.

  • Target Platform Type: The expected categories of target platforms include:

    • NVIDIA HGX-HMC

    • DGX

    • x86_64

    • arm64

    • NVSwitch

    • PowerShelf
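The relationship between NodeType, Log Collector Group, and Collector ID described above can be sketched as a small data model. This is an illustrative sketch only, not part of nvdebug: the names `NodeType`, `LogCollectorGroup`, `CollectorId`, `parse_cid`, and `group_cids` are hypothetical. The prefix-to-group mapping (R for Redfish, I for IPMI, S for SSH, H for Host) is an assumption inferred from the example CIDs R1, I1, S1, and H1. The prefix used by the HealthCheck group is not stated here, so it is left unmapped.

```python
import re
from dataclasses import dataclass
from enum import Enum


class NodeType(Enum):
    """Node types listed in the terminology section."""
    COMPUTE = "Compute"
    SWITCH_TRAY = "SwitchTray"
    POWER_SHELF = "PowerShelf"


class LogCollectorGroup(Enum):
    """Log Collector Groups listed in the terminology section."""
    REDFISH = "Redfish"
    IPMI = "IPMI"
    SSH = "SSH"
    HOST = "Host"
    HEALTH_CHECK = "HealthCheck"


# ASSUMPTION: prefix letters are inferred from the example CIDs (R1, I1, S1, H1).
# The HealthCheck prefix is unknown, so it is intentionally not mapped.
ASSUMED_PREFIXES = {
    "R": LogCollectorGroup.REDFISH,
    "I": LogCollectorGroup.IPMI,
    "S": LogCollectorGroup.SSH,
    "H": LogCollectorGroup.HOST,
}

# One or more letters followed by one or more digits, for example "R1".
_CID_PATTERN = re.compile(r"([A-Za-z]+)(\d+)")


@dataclass(frozen=True)
class CollectorId:
    """A parsed collector ID: the group it belongs to and its index."""
    group: LogCollectorGroup
    index: int


def parse_cid(cid: str) -> CollectorId:
    """Split a CID such as 'R1' into its assumed group and numeric index."""
    match = _CID_PATTERN.fullmatch(cid.strip())
    if match is None:
        raise ValueError(f"not a collector ID: {cid!r}")
    prefix = match.group(1).upper()
    if prefix not in ASSUMED_PREFIXES:
        raise ValueError(f"unknown collector prefix: {prefix!r}")
    return CollectorId(ASSUMED_PREFIXES[prefix], int(match.group(2)))


def group_cids(cids):
    """Bucket CIDs by Log Collector Group, preserving input order."""
    grouped = {}
    for cid in cids:
        grouped.setdefault(parse_cid(cid).group, []).append(cid)
    return grouped
```

Under these assumptions, `parse_cid("R1")` yields a `CollectorId` in the Redfish group with index 1, and `group_cids` shows how each group ends up holding at least one collector ID.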