Configuration Guide
Learn how to configure NVDebug for your specific environment and requirements.
Configuration Methods
NVDebug supports three configuration methods:
Command-line options - Direct parameters
Configuration files - YAML-based settings
Environment variables - System-wide defaults
NVDebug Runtime Configuration File (config.yaml)
The NVDebug runtime configuration file is a YAML file that defines the tool’s runtime settings and operational parameters.
Basic Runtime Configuration:
# Server Platform Type
# NVDebug will run tests for the listed platform
# The accepted values are HGX-HMC, DGX, arm64, x86_64, NVSwitch, and PowerShelf
PLATFORM: "arm64"
# Server Baseboard Type
# The accepted values are as follows:
# "Hopper-HGX-8-GPU", "Blackwell-HGX-8-GPU",
# "HGX B300",
# "GH200",
# "C2",
# "GH200 NVL", "GB200 NVL", "GB300 NVL",
# "GB200 NVL NVSwitchTray", "GB300 NVL NVSwitchTray",
# "MGX C2", "MGX-GH200", "MGX-GH200-NVL2",
# "MGX-4U-NVL16", "DC-Blackwell-PCIe", "DC-Hopper-PCIe",
# "PowerShelfController"
TargetBaseboard: "GB200 NVL"
# Set this flag to true to remove IP addresses, usernames, and passwords from plaintext/JSON logs
LogSanitization: True
# Skip BMC SSH logs flag
# When set to true, collection of BMC SSH logs will be skipped
SKIP_BMC_SSH_LOGS: true
# Skip Host logs flag
# When set to true, collection of Host logs will be skipped
SKIP_HOST_LOGS: false
# Skip IPMI logs flag
# When set to true, collection of IPMI logs will be skipped
SKIP_IPMI_LOGS: false
# Skip Redfish OOB logs flag
# When set to true, collection of Redfish OOB logs will be skipped
SKIP_REDFISH_OOB_LOGS: false
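As an illustration of how the skip flags combine, the fragment below restricts a run to Redfish OOB logs by setting the other three skip flags to true. It uses only the keys shown above and is a sketch, not a recommended profile; which logs are safe to skip depends on the issue being debugged.

```yaml
# Sketch: collect only Redfish OOB logs (illustrative combination of the flags above)
SKIP_BMC_SSH_LOGS: true
SKIP_HOST_LOGS: true
SKIP_IPMI_LOGS: true
SKIP_REDFISH_OOB_LOGS: false
```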
Advanced Configuration Options
# Automatically parse collected logs and re-zip the results
# When set to true, NVDebug will automatically run the parser on collected logs
# and create a new zip archive with the decoded content
AUTO_PARSE: false
# Skip logs that require port forwarding
# When set to true and the platform is HGX-HMC, all Redfish logs are skipped because they require port forwarding
SKIP_PORT_FW: false
# HGX I2C1 Bus Address, default value is 11
HGX_I2C1_BUS_ADDRESS: 11
# HGX I2C2 Bus Address, default value is 12
HGX_I2C2_BUS_ADDRESS: 12
# HGX I2C Transfer write to address
HGX_I2CTRANSFER_WRITE_TO_ADDRESS: "0x11"
# Path to user-provided FPGA register mapping file to be used when running --parse.
# If not provided, NVDebug uses the default file for the baseboard type set in TargetBaseboard.
FPGA_REG_MAPPING_FILE: ""
# Redfish identifier search string
# Default value is "Id"
RedfishIdentifierSearch: "Id"
# Redfish API timeout
# Default value is 5
RedfishAPITimeout: 5
# Redfish retry count
# Default value is 2
RedfishRetryCount: 2
# System and Manager IDs
SystemID: ""
ManagerID: ""
# Additional properties to collect for firmware inventory table (R20)
# Must be an array of property names. By default, Id and Version are collected.
# Some additional properties are Description, Manufacturer, Name, SoftwareId, Updateable, Version, WriteProtected
FW_INVENTORY_TABLE_PROPERTIES:
- "Updateable"
# Enables some additional log collection. Set to True only in case of kernel crash issues.
# This will run nvidia-bug-report.sh with --extra-system-data flag in addition to the default --safe-mode.
# --extra-system-data flag might cause the system to hang.
EXTRA_LOG_COLLECTION: False
# HTML Report Generation
GENERATE_HTML_REPORTS: True
# If set to True, the tool dumps some of the Redfish responses to file even if the HTTP status code is not 200
# Default value is True
DUMP_ERROR_REDFISH_RESPONSE: True
# List of collectors to skip
# COLLECTOR_TO_SKIP:
# - R2
# - R4
# If set to True, collect all pages regardless of collection level for paginated log collectors (R1, R4)
# This overrides the collection level skip logic and collects from the beginning
# Default value is False
# COLLECT_ALL_PAGES: False
# Maximum time in seconds to spend collecting paginated logs (R1, R4)
# This prevents infinite collection when there are very large log files
# Default value is 3600 seconds (1 hour)
# MAX_PAGE_COLLECTION_TIME: 3600
# List of system IDs to skip when collecting Redfish OOB logs
# This allows you to filter out specific systems from the collection on collectors that support dump collection
# -> R5, R6, R7, R16, R24, R28, R31, R35, R36, R37
#
# SYSTEM_ID_TO_SKIP:
# - "System-1"
# - "System-2"
# BMC Temp Directory
# BMC_TEMP_DIR: "/tmp"
# Task ID Prefix
# This prefix will be added to the task ID for all tasks collected by NVDebug.
# NVDebug checks whether the task ID is valid in the Task Service. If it is not and a prefix is provided, NVDebug adds the prefix to the task ID.
# Otherwise, it falls back to the default behavior.
# Default value is empty string
# TASK_ID_PREFIX: ""
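The skip lists above are shown commented out, so they have no effect until they are uncommented. A minimal sketch of enabling them follows; the collector IDs and the system name "System-1" are the placeholders from the comments above, not recommendations, and the list indentation is a stylistic assumption.

```yaml
# Sketch: skip two collectors entirely
COLLECTOR_TO_SKIP:
  - R2
  - R4

# Sketch: exclude one system from the dump-capable Redfish collectors
SYSTEM_ID_TO_SKIP:
  - "System-1"
```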
Timeout Configuration:
The following timeout values can be configured to adjust the collection time for different collectors. Default values are set to accommodate most use cases. If you encounter timeout issues, consider increasing the values. Keep in mind that higher timeout settings will result in longer log collection times.
# Timeout for nvidia-bug-report.sh in safe mode
# Default value is 1200 seconds (20 minutes)
NVIDIA_BUG_REPORT_SAFE_MODE_TIMEOUT: 1200
# Timeout for nvidia-bug-report.sh in standard mode
# Default value is 2400 seconds (40 minutes)
NVIDIA_BUG_REPORT_STANDARD_MODE_TIMEOUT: 2400
# NVOS Tech dump timeout in seconds. If the collector fails due to timeout, this value can be increased
# Default value is 450 seconds
NVOS_TECH_DUMP_TIMEOUT: 450
# Redfish dump timeout in seconds. If the collector fails due to timeout, this value can be increased
# Default value is 1500 seconds (25 minutes)
REDFISH_DUMP_TIMEOUT: 1500
# Redfish API timeout
# Default value is 5
RedfishAPITimeout: 5
# OOB Command timeout in seconds for SSH and IPMI operations
# Default value is 2000 seconds
OOB_COMMAND_TIMEOUT: 2000
# SSH Command timeout in seconds
# Default value is 180 seconds
SSH_COMMAND_TIMEOUT: 180
# Sleep duration in seconds between device collections for Redfish collectors
# Default value is 60 seconds
# Note: Do not modify this value unless you are told to do so by NVIDIA engineering
REDFISH_DEVICE_DUMP_SLEEP_DURATION: 60
# Maximum time in seconds to spend collecting paginated logs (R1, R4)
# This prevents infinite collection when there are very large log files
# Default value is 3600 seconds (1 hour)
MAX_PAGE_COLLECTION_TIME: 3600
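For example, if the Redfish dump and NVOS tech dump collectors fail with timeouts, the corresponding values can be raised in config.yaml while the other timeouts stay at their defaults. The numbers below simply double the documented defaults for illustration; they are not tuned recommendations.

```yaml
# Default is 1500 seconds; doubled here as an example
REDFISH_DUMP_TIMEOUT: 3000
# Default is 450 seconds; doubled here as an example
NVOS_TECH_DUMP_TIMEOUT: 900
```

Because higher timeouts lengthen the overall collection, it is usually worth raising only the value for the collector that actually timed out.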
Parallelization Options:
# Enable parallel group collections
PARALLEL_GROUP_COLLECTIONS: True
# Enable parallel collector collections
PARALLEL_COLLECTOR_COLLECTIONS: False
OOB (Out-of-Band) Configuration:
# OOB Maximum retries
OOB_MAX_RETRIES: 3
# OOB Retry interval in seconds
OOB_RETRY_INTERVAL: 5
# OOB Task check interval in seconds
OOB_TASK_CHECK_INTERVAL: 2
# OOB Maximum tracked tasks
OOB_MAX_TRACKED_TASKS: 100
# OOB Maximum messages per task
OOB_MAX_MESSAGES_PER_TASK: 1000
# OOB Compact logging
OOB_COMPACT_LOGGING: False
# OOB Device patterns for device discovery
OOB_DEVICE_PATTERNS: {}
Multi-Socket Configuration:
# Number of baseboards for multi-socket systems
MultiSocketNumberOfBaseboards: 1
Expanded Query Levels:
When running key collectors, you can expand the query level to capture more information. The default value is 1. Higher levels collect more data but also increase the time required for log collection.
# Expanded Query Levels on the collectors below. Set to the level that you want to capture.
# Default value is 1
EXPAND_QUERY_CHASSIS_LEVEL: 1
EXPAND_QUERY_FIRMWARE_INVENTORY_LEVEL: 1
EXPAND_QUERY_MANGER_LEVEL: 1
EXPAND_QUERY_SYSTEM_LEVEL: 1
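A minimal sketch of raising the query level for selected collectors only. The value 2 is an assumption for illustration: this guide states only that the default is 1 and that higher levels collect more data, so confirm which levels your NVDebug version accepts.

```yaml
# Sketch: deeper queries for chassis and system collectors (level 2 is illustrative)
EXPAND_QUERY_CHASSIS_LEVEL: 2
EXPAND_QUERY_SYSTEM_LEVEL: 2
# Left at the default
EXPAND_QUERY_FIRMWARE_INVENTORY_LEVEL: 1
EXPAND_QUERY_MANGER_LEVEL: 1
```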
Custom Dump Services:
Use the following configuration to collect custom dump services. For each dump service, specify the URI together with the payload and timeout that it requires.
Examples below are illustrative and based on various dump services that may be supported by different platforms.
CUSTOM_DUMP_SERVICES:
  - uri: "/redfish/v1/Systems/System_0/LogServices/DumpLogs/Actions/LogService.CollectDiagnosticData"
    payload:
      DiagnosticDataType: OEM
      OEMDiagnosticDataType: SystemDiagnostic
    timeout: 1500
  - uri: "/redfish/v1/Systems/HGX_Baseboard_0/LogServices/Dump/Actions/LogService.CollectDiagnosticData"
    payload:
      DiagnosticDataType: OEM
      OEMDiagnosticDataType: DiagnosticType=ROT
    timeout: 1500
IPMI Collector Configuration:
# NVBMC platforms for IPMI collector I3
I3_NVBMC_PLATFORMS: []
Host Collector Configuration:
# H2 LSPCI timeout configurations
H2_LSPCI_TIMEOUT_FULL: 300
H2_LSPCI_TIMEOUT_PHYSICAL_TREE: 300
H2_LSPCI_TIMEOUT_LOGICAL_TREE: 300
H2_LSPCI_TIMEOUT_IOMEM: 180
H2_LSPCI_TIMEOUT_IOPORTS: 180
H2_LSPCI_RUN_EXTENDED: False
H2_LSPCI_TIMEOUT_FULL_EXTENDED: 3000
H2_LSPCI_USER_DEFINED_COMMANDS: []
H2_LSPCI_USER_DEFINED_COMMANDS_IGNORE_ERRORS: False
H2_LSPCI_USER_DEFINED_COMMANDS_TIMEOUT: 300
# H20 SOS Report configuration
H20_USE_COMBINED_COMMAND: True
H20_CAPTURE_ALL_FILES: False
H20_FILES_TO_SKIP: ['private_map']
H20_SOS_COMMAND_OPTIONS: ['openvswitch', 'rdma', 'hbn', 'infiniband', 'doca', 'networking', 'mlx5_core']
H20_STDOUT_FILE_NAME: 'sos_report_output.stdout.log'
H20_FLAGS: ["--clean"]
H20_EXTENDED_TIMEOUT: 600
H20_STANDARD_TIMEOUT: 180
H20_NETWORKING_HARDWARE_QUERY: {}
# NVOS CLI dumps configuration
NVOS_CLI_DUMPS: []
Redfish Collector Configuration:
# HGX System ID prefix
HGX_SYSTEM_ID_PREFIX: "HGX"
# HGX Manager ID prefix
HGX_MANAGER_ID_PREFIX: "HGX"
# Skip non-HGX system IDs
SKIP_NON_HGX_SYSTEM_IDS: False
# Skip non-HGX managers
SKIP_NON_HGX_MANAGERS: False
# HGX NVBMC Manager baseboards for compute
HGX_NVBMC_MANAGER_BASEBOARDS_COMPUTE: ["Blackwell-HGX-8-GPU", "HGX B300", "GB200 NVL", "GB300 NVL"]
# HGX NVBMC Manager platforms for switch
HGX_NVBMC_MANAGER_PLATFORMS_SWITCH: ["NVSwitch"]
# Redfish payload configurations for specific collectors
REDFISH_R3_PAYLOAD: {"DiagnosticDataType": "Manager"}
REDFISH_R5_PAYLOAD: {"DiagnosticDataType": "OEM", "OEMDiagnosticDataType": "DiagnosticType=SystemDiagnostic"}
REDFISH_R6_PAYLOAD: None
REDFISH_R7_PAYLOAD: {"DiagnosticDataType": "OEM", "OEMDiagnosticDataType": "DiagnosticType=SystemDiagnostic"}
REDFISH_R16_PAYLOAD_TYPES: ["DiagnosticType=RetLTSSM", "DiagnosticType=RetRegister"]
REDFISH_R16_PAYLOAD: {"DiagnosticDataType": "OEM", "OEMDiagnosticDataType": "DiagnosticType=RetLTSSM"}
REDFISH_R17_PAYLOAD: {"DiagnosticDataType": "OEM"}
REDFISH_R21_LEGACY_PLATFORMS: ["GH200", "Hopper-HGX-8-GPU"]
REDFISH_R21_SKIP_NON_HGX_SYSTEM_IDS: None
REDFISH_R24_PAYLOAD: {"DiagnosticDataType": "OEM", "OEMDiagnosticDataType": "DiagnosticType=FirmwareAttributes"}
REDFISH_R27_NVSWITCH_MEMBER_KEY: ["MGX"]
REDFISH_R27_HGX_MEMBER_KEY: ["HGX"]
REDFISH_R28_PAYLOAD: {"DiagnosticDataType": "OEM", "OEMDiagnosticDataType": "DiagnosticType=HardwareCheckout"}
REDFISH_R31_PAYLOAD: {"DiagnosticDataType": "OEM", "OEMDiagnosticDataType": "DiagnosticType=SystemDiagnostic"}
# R32 Post Codes configuration
# Example POST Code URIs:
# - "/redfish/v1/Systems/System_0/LogServices/CurrentBIOSPostCodes/Entries"
# - "/redfish/v1/Systems/System_0/LogServices/PreviousBIOSPostCodes/Entries"
R32_CHECK_HGX_BASEBOARD_ONLY: False
R32_POST_CODES_URI: None
R32_PROCESS_ALL_SYSTEMS: False
# R35 Network diagnostics configuration
R35_CHECK_HGX_BASEBOARD_ONLY: True
R35_SKIP_NON_HGX_SYSTEM_IDS: True
R35_DIAGNOSTIC_TYPE_NVLINK: "Net_NVLinkManagementNIC"
R35_DIAGNOSTIC_TYPE_CONNECTX: "NetIR"
R35_DEVICE_ID_PREFIX_CONNECTX: "ConnectX_NIC_"
R35_DEVICE_ID_PREFIX_NVLINK: "NVLinkManagementNIC_"
# R36 Network diagnostics configuration
R36_CHECK_HGX_BASEBOARD_ONLY: True
R36_DIAGNOSTIC_TYPE: "Net_NVSwitch"
R36_DEVICE_ID_PREFIX: None
R36_DEVICE_ID_KEY: "DeviceID"
R36_SKIP_NON_HGX_SYSTEM_IDS: True
# R37 Network diagnostics configuration
R37_CHECK_HGX_BASEBOARD_ONLY: True
DUT Configuration File (dut_config.yaml)
The DUT configuration file uses YAML format to specify target systems and their properties.
Basic Configuration for a Single DUT:
# Currently supports only DUTs of the same type that share the config file passed through the -C option
# Settings that are common among DUTs can be filled in DUT_Defaults.
# DUT-specific details such as the BMC IP can be overridden for each DUT object
DUT_Defaults: &dut_defaults ## User should not modify this line or add anything before this section
  # Type of the DUT. Allowed options are Compute, SwitchTray and PowerShelf
  NodeType: "Compute"
  # BMC IP address
  # This can also be provided via CLI using the -i option
  BMC_IP: ""
  # BMC credentials
  # This can also be provided via CLI using the -u and -p options respectively
  BMC_USERNAME: ""
  BMC_PASSWORD: ""
  # BMC SSH Credentials
  BMC_SSH_USERNAME: ""
  BMC_SSH_PASSWORD: ""
  # BMC_SSH_KEY_PATH: ""
  # BMC_SSH_PASSWORDLESS: false
  # BMC_SSH_PORT: 22
  # HMC Configuration
  # HMC_TCP_PORT: 80 # Default is 80; use 443 for HMC with SSL enabled. If provided, HMC_RF_PROTOCOL is ignored.
  # HMC_RF_PROTOCOL: "http" # Default is "http"; use "https" for HMC with SSL enabled. If provided, HMC_TCP_PORT is ignored. http defaults to port 80, https to port 443.
  # Redfish credentials
  # If not provided, it is assumed to be the same as the BMC credentials
  RF_User: ""
  RF_Pass: ""
  # Port to be used for port forwarding
  # To collect certain logs from HMC, port forwarding needs to be set up. NVDebug will set up port forwarding using the provided port
  # If no port is provided below, the default port 18888 will be used
  TUNNEL_TCP_PORT: ""
  ipmi_cipher: "-C17"
  # Host details for Host log collection
  # If IP or credentials are not provided, NVDebug assumes it is running on the host
  HOST_IP: ""
  HOST_USERNAME: ""
  HOST_PASSWORD: ""
  # HOST_SSH_KEY_PATH: ""
  # HOST_SSH_PASSWORDLESS: false # Assumes passwordless sudo is configured via sudoers. If it is not, provide either a key path or a password.
  # HOST_SSH_PORT: 22
  # Redfish info
  RF_DEFAULT_PREFIX: "/redfish/v1" # Redfish service root. Default is "/redfish/v1"
  RF_AUTH: true # Indicates if REST API authentication is required by the Redfish service. True by default
  # When set to False, the tool expects port forwarding to be set up manually if a port is provided
  # Set it to true to make the tool handle port forwarding setup
  SETUP_PORT_FORWARDING: True
  # If set to true, the tool will clear any process on the ports defined via TUNNEL_TCP_PORT before setting up port forwarding
  FORCE_PORT_FW: False
  # Specify whether the BMC/Host network is "ipv4" or "ipv6". Default is "ipv4"
  IP_NETWORK: 'ipv4'
# Create a DUT object and inherit defaults.
# For any specific config details, add them below <<: *dut_defaults
dut-1: # Node name. Must be unique for each node
  <<: *dut_defaults
  #BMC_IP: ""
  #HOST_IP: ""
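For illustration, a DUT entry with node-specific overrides might look like the fragment below. It is meant to take the place of the dut-1 entry at the end of the file, with the DUT_Defaults section left unchanged above it, so it is not parseable on its own (the dut_defaults anchor is defined in that section). The IP addresses and credentials are placeholders, not real values.

```yaml
# Sketch: dut-1 with overrides placed below the inherited defaults
dut-1:
  <<: *dut_defaults
  BMC_IP: "192.168.1.100"          # placeholder
  BMC_USERNAME: "bmc_user"         # placeholder
  BMC_PASSWORD: "bmc_password"     # placeholder
  HOST_IP: "192.168.1.50"          # placeholder
  HOST_USERNAME: "host_user"       # placeholder
  HOST_PASSWORD: "host_password"   # placeholder
```

Keys set explicitly in dut-1 take precedence over the values inherited through the merge key, which is why the overrides go below the `<<: *dut_defaults` line.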
Rack Configuration for Multiple DUTs:
For multi-system collections where systems are in the same rack, use a rack configuration file:
Note
The DUT_Defaults are passed to each system in the rack via YAML anchors. Each system can override the DUT_Defaults by passing the values in the system-specific section. The example below is a simplified version of the DUT_Defaults section. Refer to the dut_config.yaml file for the complete object.
DUT_Defaults: &dut_defaults
  NodeType: ""
  BMC_USERNAME: ""
  BMC_PASSWORD: ""
  HOST_USERNAME: ""
  HOST_PASSWORD: ""
  RF_DEFAULT_PREFIX: "/redfish/v1"
  RF_AUTH: true
  SETUP_PORT_FORWARDING: true
  FORCE_PORT_FW: false
  IP_NETWORK: 'ipv4'
compute-1: &compute_defaults
  <<: *dut_defaults
  NodeType: "Compute"
  ConfigFileToUse: "config_compute.yaml"
  HOST_USERNAME: "host_user"
  HOST_PASSWORD: "host_password"
  BMC_USERNAME: "bmc_user"
  BMC_PASSWORD: "bmc_password"
  BMC_IP: "192.168.1.100"
  HOST_IP: "192.168.1.50"
compute-2:
  <<: *compute_defaults
  BMC_IP: "192.168.1.101"
  HOST_IP: "192.168.1.51"
compute-3:
  <<: *compute_defaults
  BMC_IP: "192.168.1.102"
  HOST_IP: "192.168.1.52"
switch-1: &switch_defaults
  <<: *dut_defaults
  NodeType: "SwitchTray"
  ConfigFileToUse: "config_switch.yaml"
  HOST_USERNAME: "host_user"
  HOST_PASSWORD: "host_password"
  BMC_USERNAME: "bmc_user"
  BMC_PASSWORD: "bmc_password"
  BMC_IP: "192.168.1.130"
  HOST_IP: "192.168.1.70"
switch-2:
  <<: *switch_defaults
  BMC_IP: "192.168.1.131"
  HOST_IP: "192.168.1.71"
powershelf-1: &powershelf_defaults
  <<: *dut_defaults
  NodeType: "PowerShelf"
  ConfigFileToUse: "config_powershelf.yaml"
  BMC_USERNAME: "bmc_user"
  BMC_PASSWORD: "bmc_password"
  BMC_IP: "192.168.1.170"
powershelf-2:
  <<: *powershelf_defaults
  BMC_IP: "192.168.1.171"
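To make the anchor inheritance concrete, the fragment below spells out the mapping that compute-2 resolves to once the merge keys in the rack example are expanded. It assumes a parser that supports YAML 1.1 merge keys (`<<`), where explicitly set keys win over merged ones; how NVDebug itself loads the file is not described here.

```yaml
# Effective settings for compute-2 after merging compute_defaults (and, through it, dut_defaults)
compute-2:
  NodeType: "Compute"                      # from compute-1
  ConfigFileToUse: "config_compute.yaml"   # from compute-1
  HOST_USERNAME: "host_user"               # from compute-1
  HOST_PASSWORD: "host_password"           # from compute-1
  BMC_USERNAME: "bmc_user"                 # from compute-1
  BMC_PASSWORD: "bmc_password"             # from compute-1
  BMC_IP: "192.168.1.101"                  # overridden in compute-2
  HOST_IP: "192.168.1.51"                  # overridden in compute-2
  RF_DEFAULT_PREFIX: "/redfish/v1"         # from DUT_Defaults
  RF_AUTH: true                            # from DUT_Defaults
  SETUP_PORT_FORWARDING: true              # from DUT_Defaults
  FORCE_PORT_FW: false                     # from DUT_Defaults
  IP_NETWORK: 'ipv4'                       # from DUT_Defaults
```

Only BMC_IP and HOST_IP differ from compute-1, which is why the per-node entries in the rack file can stay this short.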
Run batch collection:
# Collect from all systems in rack
nvdebug --dut-config rack_config.yaml --output-dir /path/to/output
Command-Line Configuration
Basic Collection with All Options Specified:
nvdebug -i 192.168.1.100 -u bmc_user -p bmc_pass \
-I 192.168.1.50 -U host_user -H host_pass \
-t "arm64" -b "GB200 NVL" \
-o /path/to/output -v
Basic Collection with Auto Detection:
nvdebug -i 192.168.1.100 -u bmc_user -p bmc_pass \
-I 192.168.1.50 -U host_user -H host_pass \
-o /path/to/output -v
Using Configuration File:
# Use runtime and DUT configuration files
nvdebug --config config.yaml --dut-config dut_config.yaml --output-dir /path/to/output
# Use runtime configuration file
nvdebug -c config.yaml -i 192.168.1.100 -u bmc_user -p bmc_pass
Network Configuration
IPv6 Network Configuration:
By default, NVDebug uses IPv4. For IPv6, set IP_NETWORK to ipv6 in the DUT configuration. When providing IPv6 addresses for the BMC/Host, do not use square brackets.
# IPv6 configuration
IP_NETWORK: 'ipv6'
BMC_IP: "2001:db8::100"
HOST_IP: "2001:db8::50"
Port Forwarding Configuration:
# Port forwarding settings
SETUP_PORT_FORWARDING: True
FORCE_PORT_FW: False
TUNNEL_TCP_PORT: "" # Default port 18888 will be used if not specified
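A sketch of the manual variant, based on the comments in the DUT configuration file: when SETUP_PORT_FORWARDING is False and a port is provided, the tool expects the tunnel on that port to exist already. The port value below is the documented default written out explicitly for illustration; these keys belong in the DUT configuration (in DUT_Defaults or in a node entry).

```yaml
# Sketch: port forwarding is set up manually, outside NVDebug
SETUP_PORT_FORWARDING: False
FORCE_PORT_FW: False
# Port on which the manually created tunnel is listening (illustrative value)
TUNNEL_TCP_PORT: "18888"
```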
Configuration Validation
The preflight check is a quick validation step that ensures the configuration is correct and that collectors can run. It does not collect any logs. NVIDIA recommends running the preflight check before executing collectors.
Validate Configuration:
# Test configuration with preflight check
nvdebug --dut-config dut_config.yaml --preflight
Note
Security: Always use secure methods for storing credentials and consider using separate configuration files for different environments.