Document History#

DU-11585-001_1.3

Table 1 Table Title#

Version

Date

Description of Change

0.1

September 13, 2023

Initial draft.

0.2

September 26, 2023

Support for Out of Band (OOB) log collection and Host-based log collection for NVIDIA Grace™ Hopper™ platforms including NVIDIA MGX systems.

0.3

October 24, 2023

Support for OOB log collection and Host-based log collection for NVIDIA® HGX™-H100x8 and NVIDIA DGX™ H100.

0.4

November 27, 2023

  • Allows running custom list of logs and the listing of supported logs for each type of platform.

  • Allows a customizing output directory.

0.5

January 19, 2024

  • CLI and config file improvements.

  • Added the Out-of-Band Health Check.

1.0

March 6, 2024

  • IPv6 support and Log Sanitization.

  • CLI and config file updates.

1.1

April 12, 2024

  • Host log collection improvement for local and remote hosts.

  • Initial support for NVIDIA NVLink®-related logs.

  • Integrated nvidia-bug-report.sh.

1.2

June 28, 2024

  • Added more Redfish log collectors for HMC dumps.

  • Added an OOB collection script for HMC log collection that can run on the BMC.

1.2.1

July 22, 2024

Added a parse option to decode binary dumps in NVDebug log outputs.

1.3

October 11, 2024

  • Added support for the log collection of GB200 NVL (NVIDIA NVLink™ Switch tray and Multi-DUT for Rack level).

  • Updated the output log dump format.

  • Added new CIDs for software inventory, NVOS fabric manager, and subnet manager dumps, and so on.

  • Changed Platform naming, and the old names will be deprecated in v1.5.

  • Updated the Platforms section of the tables in Appendix A to reflect newer supported products.

1.4

December 16, 2024

  • Added support for the log collection of GB200 NVL (NVIDIA PowerShelf tray).

  • Added new CIDs for Subnet Manager, NVMe, HMC FDR, Post Codes, PowerShelf, OneDiag.

  • Added support for CID tag overriding.

  • Added support for passwordless ssh connectivity, DUT level archive zipping.

  • Updated CID R4 to allow capture on GB200 NVL

  • Updated CID S15 to enable configurable I2CTRANSFER_WRITE_TO_ADDRESS

1.5

February 21, 2025

  • Added automated archive splitting with the 200MB threshold.

  • Added configurable ports for BMC SSH (Default 22) and HMC TCP (Default 80).

  • Added a default Collector flag.

  • Added new CIDs for Network Devices (R36), GPU Devices (R37), and Host OS Information (H18).

  • Added support for HTML reports.

  • Enhanced collector H9 to better handle NVSwitch.

  • Enhanced parser error handling for .zip and .tar files.

  • Modified collector R29 to handle changes for the RoT Naming string.

  • Removed collectors H16, H17, and H18 from PowerShelf.

  • Split collector R35 for the network device debug dump into the R35 network switch, R36 network devices, and R37 GPU devices.

  • Updated collector H11 to capture both safe mode and standard nvidia-bug-report runs.

  • Updated collector R21 with improved CPER URL handling.

1.6

April 18, 2025

  • Added support for parallelized collection groups (PARALLEL_GROUP_COLLECTIONS, default True).

  • Added new Host collector H19 (Node Memory information).

  • Added user-configurable timeouts for collectors.

  • Added support for multiple collector groups under the -g argument.

  • Added ability for users to skip collectors (COLLECTOR_TO_SKIP, default None).

  • Enhanced the One-Click OOB Script to support GB200 and later platforms. Added support for Virtual EEPROM, authenticated HMC, and HTTPS support.

  • Enhanced all IPMI collectors with retry logic on non-zero return code (retry with the -v flag for more information).

  • Enhanced HTML report logic to not display N/A rows, which results in more condensed reports.

  • Enhanced security: hidden password entry on the CLI.

  • Modified Host H1 (node dmesg) to support more Linux distros with dmesg and journalctl if available.

  • Modified Host H2 (lspci) to support new flags full config space dumps and user defined commands.

  • Updated the baseboard parameter to be required because of API deltas on platforms.

1.6.1

April 23, 2025

  • Optimized R35, R36, R37, and R38 collection times.

  • Updated the user guide with usage information.

1.7.0

June 10, 2025

  • Added support for NVIDIA Open Source Review Board compliance.

  • Added new Host Collector H20 (SOS report for network devices).

  • Added new Redfish collector for data on thermal subsystem leak detector (R41).

  • Added new SSH collectors for device data off management controller buses: VirtualEEPROM (S16), SMBus (S17), and SMBPBI (S18).

  • Added new NV switch collector to obtain switch platform and FAE logs (H21) and the enabled SEL (R1) BMC dump.

  • Enhanced the OneClick script with minor fixes and additions to reach parity with NVDebug on the BMC SSH group.

  • Enhanced the HTML report with dependency check status, pre-flight check status, and improved readability.

  • Enhanced the built-in FPGA and EROT parsers for GB200 platforms.

  • Updated the user guide with usage information.