Quick Start Guide#
Get nvdebug running in 5 minutes with this step-by-step guide.
Prerequisites#
Before you begin, ensure you have:
✅ Required:
Linux system (Ubuntu 20.04+ supported, Ubuntu 24.04+ recommended)
Network access to target BMC
BMC administrative credentials
Target system powered on and accessible
✅ Optional but Recommended:
SSH access to host system (for remote mode)
Host system credentials (for host log collection)
Step 1: Gather Your Credentials#
The following credentials are required for your target system:
BMC Credentials:
BMC IP address (e.g., 192.168.1.100)
BMC username
BMC password
Host Credentials (for remote mode):
Host IP address (e.g., 192.168.1.50)
Host username (with sudo privileges)
Host password
Example Credentials:
BMC IP: 192.168.1.100
BMC User: admin
BMC Password: password123
Host IP: 192.168.1.101
Host User: some_user
Host Password: some_password
Step 2: Determine Your Platform#
Identify your NVIDIA platform type:
Common Platform Types:
DGX - NVIDIA DGX systems
HGX-HMC - NVIDIA HGX systems with HMC
arm64 - ARM64-based systems
x86_64 - x86_64-based systems
NVSwitch - NVSwitch systems
PowerShelf - PowerShelf systems
Auto-Detection:
nvdebug can automatically detect your platform:
$ nvdebug -i <BMC_IP> -u <BMC_USER> -p <BMC_PASS>
Read DUT config file details from /path/to/dut_config.yaml
Loaded 1 DUT(s): ['<dut-name>']
Read config file details from /path/to/config.yaml
DUT config contains 1 DUT(s): ['<dut-name>']
Platform or baseboard not provided and not found in config files. Attempting to detect...
<dut-name>: System has 32 CPU cores, 1 active DUTs
<dut-name>: Thread allocation: 10 threads for groups, 5 threads per group for collectors
<dut-name>: Parallelization settings:
- Group parallelization: Enabled
- Collector parallelization: Disabled
<dut-name>: All preflight checks passed
================================================================================
PLATFORM AND BASEBOARD DETECTION RESULTS
================================================================================
Detected Platform: x86_64
Detected Baseboard: Blackwell-HGX-8-GPU
Detected Node Type: Compute
================================================================================
Do you want to proceed with these detected values?
Options:
[Y]es - Use detected values
[N]o - Manually select from available combinations
[Q]uit - Exit nvdebug
Enter your choice (Y/N/Q):
Manual Specification:
If auto-detection fails, either use the CLI’s selector or specify manually:
CLI Selector:
The CLI Selector does not select the DGX or HGX-HMC as a platform. If you are using a DGX or HGX-HMC, you must manually specify the platform and baseboard.
================================================================================
MANUAL PLATFORM AND BASEBOARD SELECTION
================================================================================
Available platform-baseboard combinations:
--------------------------------------------------------------------------------
NVSWITCH PLATFORM:
1. GB200 NVL NVSwitchTray (Node Type: SwitchTray)
2. GB300 NVL NVSwitchTray (Node Type: SwitchTray)
POWERSHELF PLATFORM:
3. PowerShelfController (Node Type: PowerShelf)
ARM64 PLATFORM:
4. C2 (Node Type: Compute)
5. GB200 NVL (Node Type: Compute)
6. GB300 NVL (Node Type: Compute)
7. GH200 (Node Type: Compute)
8. GH200 NVL (Node Type: Compute)
9. MGX C2 (Node Type: Compute)
10. MGX-GH200 (Node Type: Compute)
11. MGX-GH200-NVL2 (Node Type: Compute)
X86_64 PLATFORM:
12. Blackwell-HGX-8-GPU (Node Type: Compute)
13. DC-Blackwell-PCIe (Node Type: Compute)
14. DC-Hopper-PCIe (Node Type: Compute)
15. HGX B300 (Node Type: Compute)
16. Hopper-HGX-8-GPU (Node Type: Compute)
17. MGX-4U-NVL16 (Node Type: Compute)
================================================================================
Enter your choice (1-17):
Manual Specification:
nvdebug -i <BMC_IP> -u <BMC_USER> -p <BMC_PASS> -t "arm64" -b "GB200 NVL"
Step 3: Run Your First Collection#
Run your first nvdebug collection:
# Basic Out-of-band collection with auto-detection
nvdebug -i 192.168.1.100 -u bmc_user -p bmc_password
# Basic Out-of-band collection for ARM64 system with auto-detection of target baseboard
nvdebug -i 192.168.1.100 -u bmc_user -p bmc_password -t "arm64"
# Basic Out-of-band collection for ARM64 system with manual specification of target baseboard
nvdebug -i 192.168.1.100 -u bmc_user -p bmc_password -t "arm64" -b "GB200 NVL"
# Basic In-band collection with auto-detection
nvdebug -I 192.168.1.100 -U host_user -H host_password
# Basic In-band collection for ARM64 system with auto-detection of target baseboard
nvdebug -I 192.168.1.50 -U host_user -H host_password -t "arm64"
# Basic In-band collection for ARM64 system with manual specification of target baseboard
nvdebug -I 192.168.1.50 -U host_user -H host_password -t "arm64" -b "GB200 NVL"
# Basic In-band and Out-of-band collection for ARM64 system with manual specification of target baseboard
nvdebug -i 192.168.1.100 -u bmc_user -p bmc_password -I 192.168.1.50 -U host_user -H host_password -t "arm64" -b "GB200 NVL" -o /path/to/output
# With verbose output for learning
nvdebug ... -u bmc_user -p bmc_password -v
# With custom output directory
nvdebug ... -o /path/to/output
What to Expect:
Platform detection and validation
Grace CPU diagnostics collection
Unified memory monitoring
ARM-specific system information gathering
Collection progress updates
Output files in the specified directory
Pro Tip: Use the -v (verbose) flag when learning to see detailed information about what nvdebug is doing during collection.
Step 4: Verify Your Collection#
Check the Output Directory:
Navigate to your output directory to see the collected data:
cd /path/to/your/output/directory
or
cd /tmp/
unzip nvdebug_logs_<date>_<time>.zip
cd /tmp/nvdebug_logs_<date>_<time>
ls -la
You’ll find several key files and directories:
nvdebug_logs_<date>_<time>/
├── .log_signature.txt
├── .nvdebug_stdout.log
├── reports/
│ ├── index.html
│ ├── file_map.html
│ ├── status_complete.html
│ ├── status_error.html
│ ├── status_partial.html
│ └── status_skipped.html
└── <dut_name>/
├── config.json
├── dut_config.json
├── Execution_Summary_Report.txt
├── nvdebug_runtime_output.txt
├── nvdebug_runtime_output_structured.txt
├── .metadata/
├── error_logs/
├── healthcheck/
├── host/
├── ipmi/
├── redfish/
└── ssh/
Root Files:
.log_signature.txt
- Log integrity verification.nvdebug_stdout.log
- Standard output from nvdebug
<dut_name> Files:
Execution_Summary_Report.txt
- Complete status of all collectorsnvdebug_runtime_output.txt
- Detailed runtime logsnvdebug_runtime_output_structured.txt
- Structured runtime logs (parallel processing view)config.json
- Configuration used for this rundut_config.json
- Device-specific configuration
<dut_name> Directories:
.metadata/
- Collection metadata (JSON files)error_logs/
- Detailed error information for failed collectorshealthcheck/
- System health check datahost/
- System health check dataipmi/
- System health check datareports/
- HTML reports and status summariesredfish/
- Redfish API data (JSON files and directories)ssh/
- SSH-collected system information
View HTML Reports:
Open the reports in your browser for a visual summary:
<browser>reports/index.html
Success Indicators:
✅ Collection completed without errors
✅ Output directory created
✅ Collection summary file present
✅ Log files generated for each collector