Quick Start Guide#

Get nvdebug running in 5 minutes with this step-by-step guide.

Prerequisites#

Before you begin, ensure you have:

Required:

  • Linux system (Ubuntu 20.04+ supported, Ubuntu 24.04+ recommended)

  • Network access to target BMC

  • BMC administrative credentials

  • Target system powered on and accessible

Optional but Recommended:

  • SSH access to host system (for remote mode)

  • Host system credentials (for host log collection)
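
Before going further, you can confirm basic reachability with standard tools; a quick sketch (placeholders in angle brackets, and assuming ICMP is not blocked on the management network):

ping -c 3 <BMC_IP>                                        # BMC answers on the network
ssh -o ConnectTimeout=5 <HOST_USER>@<HOST_IP> hostname    # host reachable over SSH (remote mode only)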

Step 1: Gather Your Credentials#

The following credentials are required for your target system:

BMC Credentials:

  • BMC IP address (e.g., 192.168.1.100)

  • BMC username

  • BMC password

Host Credentials (for remote mode):

  • Host IP address (e.g., 192.168.1.50)

  • Host username (with sudo privileges)

  • Host password

Example Credentials:

BMC IP: 192.168.1.100
BMC User: admin
BMC Password: password123

Host IP: 192.168.1.50
Host User: some_user
Host Password: some_password
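
You can also exercise these credentials directly before running nvdebug; a minimal sketch, assuming the BMC exposes a standard Redfish endpoint over HTTPS and the host accepts password-based SSH:

# BMC credentials: query the Redfish service root with HTTP basic auth
curl -k -u admin:password123 https://192.168.1.100/redfish/v1/Systems

# Host credentials (remote mode): confirm SSH login and sudo access
ssh -t some_user@192.168.1.50 'sudo -v && echo "sudo OK"'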

Step 2: Determine Your Platform#

Identify your NVIDIA platform type:

Common Platform Types:

  • DGX - NVIDIA DGX systems

  • HGX-HMC - NVIDIA HGX systems with HMC

  • arm64 - ARM64-based systems

  • x86_64 - x86_64-based systems

  • NVSwitch - NVSwitch systems

  • PowerShelf - PowerShelf systems
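
If you are unsure whether your system is arm64 or x86_64, the host CPU architecture is a quick hint; a sketch assuming SSH access to the host (this identifies the platform family, not the exact baseboard):

ssh <HOST_USER>@<HOST_IP> uname -m    # prints aarch64 on ARM64 systems, x86_64 otherwise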

Auto-Detection:

nvdebug can automatically detect your platform:

$ nvdebug -i <BMC_IP> -u <BMC_USER> -p <BMC_PASS>

Read DUT config file details from /path/to/dut_config.yaml
Loaded 1 DUT(s): ['<dut-name>']
Read config file details from /path/to/config.yaml
DUT config contains 1 DUT(s): ['<dut-name>']
Platform or baseboard not provided and not found in config files. Attempting to detect...
<dut-name>: System has 32 CPU cores, 1 active DUTs
<dut-name>: Thread allocation: 10 threads for groups, 5 threads per group for collectors
<dut-name>: Parallelization settings:
- Group parallelization: Enabled
- Collector parallelization: Disabled
<dut-name>: All preflight checks passed

================================================================================
PLATFORM AND BASEBOARD DETECTION RESULTS
================================================================================
Detected Platform:     x86_64
Detected Baseboard:    Blackwell-HGX-8-GPU
Detected Node Type:    Compute
================================================================================

Do you want to proceed with these detected values?
Options:
[Y]es - Use detected values
[N]o - Manually select from available combinations
[Q]uit - Exit nvdebug

Enter your choice (Y/N/Q):

If auto-detection fails, you can either use the CLI's interactive selector or specify the platform and baseboard manually:

CLI Selector:

The CLI Selector does not list DGX or HGX-HMC as platform options. If you are using a DGX or HGX-HMC system, you must specify the platform and baseboard manually (see Manual Specification below).

================================================================================
MANUAL PLATFORM AND BASEBOARD SELECTION
================================================================================
Available platform-baseboard combinations:
--------------------------------------------------------------------------------

NVSWITCH PLATFORM:
1. GB200 NVL NVSwitchTray                   (Node Type: SwitchTray)
2. GB300 NVL NVSwitchTray                   (Node Type: SwitchTray)

POWERSHELF PLATFORM:
3. PowerShelfController                     (Node Type: PowerShelf)

ARM64 PLATFORM:
4. C2                                       (Node Type: Compute)
5. GB200 NVL                                (Node Type: Compute)
6. GB300 NVL                                (Node Type: Compute)
7. GH200                                    (Node Type: Compute)
8. GH200 NVL                                (Node Type: Compute)
9. MGX C2                                   (Node Type: Compute)
10. MGX-GH200                                (Node Type: Compute)
11. MGX-GH200-NVL2                           (Node Type: Compute)

X86_64 PLATFORM:
12. Blackwell-HGX-8-GPU                      (Node Type: Compute)
13. DC-Blackwell-PCIe                        (Node Type: Compute)
14. DC-Hopper-PCIe                           (Node Type: Compute)
15. HGX B300                                 (Node Type: Compute)
16. Hopper-HGX-8-GPU                         (Node Type: Compute)
17. MGX-4U-NVL16                             (Node Type: Compute)

================================================================================
Enter your choice (1-17):

Manual Specification:

nvdebug -i <BMC_IP> -u <BMC_USER> -p <BMC_PASS> -t "arm64" -b "GB200 NVL"
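
The same pattern works for any platform-baseboard combination from the selector list above, for example the x86_64 system detected earlier:

nvdebug -i <BMC_IP> -u <BMC_USER> -p <BMC_PASS> -t "x86_64" -b "Blackwell-HGX-8-GPU"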

Step 3: Run Your First Collection#

Run your first nvdebug collection:

# Basic Out-of-band collection with auto-detection
nvdebug -i 192.168.1.100 -u bmc_user -p bmc_password

# Basic Out-of-band collection for ARM64 system with auto-detection of target baseboard
nvdebug -i 192.168.1.100 -u bmc_user -p bmc_password -t "arm64"

# Basic Out-of-band collection for ARM64 system with manual specification of target baseboard
nvdebug -i 192.168.1.100 -u bmc_user -p bmc_password -t "arm64" -b "GB200 NVL"

# Basic In-band collection with auto-detection
nvdebug -I 192.168.1.50 -U host_user -H host_password

# Basic In-band collection for ARM64 system with auto-detection of target baseboard
nvdebug -I 192.168.1.50 -U host_user -H host_password -t "arm64"

# Basic In-band collection for ARM64 system with manual specification of target baseboard
nvdebug -I 192.168.1.50 -U host_user -H host_password -t "arm64" -b "GB200 NVL"

# Basic In-band and Out-of-band collection for ARM64 system with manual specification of target baseboard
nvdebug -i 192.168.1.100 -u bmc_user -p bmc_password -I 192.168.1.50 -U host_user -H host_password -t "arm64" -b "GB200 NVL" -o /path/to/output

# With verbose output for learning
nvdebug ... -u bmc_user -p bmc_password -v

# With custom output directory
nvdebug ... -o /path/to/output

What to Expect:

  • Platform detection and validation

  • Grace CPU diagnostics collection

  • Unified memory monitoring

  • ARM-specific system information gathering

  • Collection progress updates

  • Output files in the specified directory

Pro Tip: Use the -v (verbose) flag when learning to see detailed information about what nvdebug is doing during collection.
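
To keep the verbose console output for later review or for sharing with support, plain shell redirection works; this is standard shell, not an nvdebug option:

nvdebug -i 192.168.1.100 -u bmc_user -p bmc_password -v 2>&1 | tee nvdebug_console.log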

Step 4: Verify Your Collection#

Check the Output Directory:

Navigate to your output directory (the path you passed with -o, or /tmp/ as used in this guide), extract the archive, and list its contents:

cd /path/to/your/output/directory    # or: cd /tmp/

unzip nvdebug_logs_<date>_<time>.zip

cd nvdebug_logs_<date>_<time>

ls -la

You’ll find several key files and directories:

nvdebug_logs_<date>_<time>/
├── .log_signature.txt
├── .nvdebug_stdout.log
├── reports/
│   ├── index.html
│   ├── file_map.html
│   ├── status_complete.html
│   ├── status_error.html
│   ├── status_partial.html
│   └── status_skipped.html
└── <dut_name>/
    ├── config.json
    ├── dut_config.json
    ├── Execution_Summary_Report.txt
    ├── nvdebug_runtime_output.txt
    ├── nvdebug_runtime_output_structured.txt
    ├── .metadata/
    ├── error_logs/
    ├── healthcheck/
    ├── host/
    ├── ipmi/
    ├── redfish/
    └── ssh/

Root Files and Directories:

  • .log_signature.txt - Log integrity verification

  • .nvdebug_stdout.log - Standard output from nvdebug

  • reports/ - HTML reports and status summaries

<dut_name> Files:

  • Execution_Summary_Report.txt - Complete status of all collectors

  • nvdebug_runtime_output.txt - Detailed runtime logs

  • nvdebug_runtime_output_structured.txt - Structured runtime logs (parallel processing view)

  • config.json - Configuration used for this run

  • dut_config.json - Device-specific configuration

<dut_name> Directories:

  • .metadata/ - Collection metadata (JSON files)

  • error_logs/ - Detailed error information for failed collectors

  • healthcheck/ - System health check data

  • host/ - Data collected in-band from the host system

  • ipmi/ - Data collected via IPMI

  • redfish/ - Redfish API data (JSON files and directories)

  • ssh/ - SSH-collected system information
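
A quick way to triage a run from the command line is to scan the summary report and the error logs; a minimal sketch, assuming the summary lists one collector status per line:

grep -iE "error|fail|partial|skip" <dut_name>/Execution_Summary_Report.txt
ls <dut_name>/error_logs/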

View HTML Reports:

Open the reports in your browser for a visual summary:

<browser> reports/index.html    # e.g., firefox reports/index.html or xdg-open reports/index.html

Success Indicators:

  • ✅ Collection completed without errors

  • ✅ Output directory created

  • ✅ Collection summary file present

  • ✅ Log files generated for each collector
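
You can spot-check these indicators from the shell as well; a minimal sketch, run from inside the extracted nvdebug_logs_<date>_<time> directory:

test -f reports/index.html && echo "HTML reports present"
test -f <dut_name>/Execution_Summary_Report.txt && echo "collection summary present"
find <dut_name> -type f | wc -l    # total number of collected files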