The nvdebug Command-Line Interface#
The nvdebug tool provides a command-line interface (CLI) for collecting debug logs over Out-of-Band (OOB) connections. It is designed for use with Devices Under Test (DUTs) and interacts with Baseboard Management Controllers (BMCs) to gather necessary diagnostic information.
Modes of Operation#
nvdebug can be executed in one of three modes:
Remote Mode: Run nvdebug from a remote machine that has access to both the BMC and the Host system.
Local Mode: Run nvdebug directly on the Host machine, provided the Host can access the BMC.
BMC Shell OneClick: Run the targeted OOB Collector shell script directly on the BMC shell. Refer to OOB Collector Shell Script for more information.
If the Host IP is specified using the -I or --hostip option, either in the configuration file or via the CLI, nvdebug operates in Remote Mode. If the Host IP is not specified, nvdebug assumes it is running on the Host and collects host logs locally.
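The rule above can be sketched in a few lines of Python. This only illustrates the documented behavior and is not nvdebug's actual implementation; the function name `select_mode` and the returned strings are hypothetical.

```python
from typing import Optional


def select_mode(host_ip: Optional[str]) -> str:
    """Mirror the documented rule: a Host IP means Remote Mode."""
    # host_ip stands for the value of -I/--hostip, whether it came from
    # the CLI or from the configuration file.
    if host_ip:
        return "remote"
    # Without a Host IP, nvdebug assumes it is running on the Host itself.
    return "local"


print(select_mode("192.0.2.20"))
print(select_mode(None))
```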
Basic Syntax#
Basic usage with mandatory options:
nvdebug -i <BMC_IP> -u <BMC_USER> -p <BMC_PASS> -t <PLATFORM>
Common usage in Remote Mode:
nvdebug -i <BMC_IP> -u <BMC_USER> -p <BMC_PASS> -t <PLATFORM> \
-I <HOST_IP> -U <HOST_USER> -H <HOST_PASS>
Common usage with additional credentials:
nvdebug -i <BMC_IP> -u <BMC_USER> -p <BMC_PASS> -t <PLATFORM> \
-r <SSH_USER> -w <SSH_PASS> -R <RF_USER> -W <RF_PASS>
Common usage in Local Mode with BMC connection:
nvdebug -i <BMC_IP> -u <BMC_USER> -p <BMC_PASS> -t <PLATFORM> --local
In Local Mode with BMC connection:
Host credentials (-I, -U, -H) are not required.
The tool collects host logs directly from the local system.
BMC credentials are still required for collecting BMC-related logs.
All other options (such as -o, -v, -c) remain available.
Example with additional options in Local Mode with BMC connection:
nvdebug -i <BMC_IP> -u <BMC_USER> -p <BMC_PASS> -t <PLATFORM> \
--local -v -o /path/to/output -c
Local Mode without BMC Connection:
nvdebug -t <PLATFORM> --local
In Local Mode without BMC connection:
BMC credentials (-i, -u, -p) are not required.
Only Host-related logs will be collected.
BMC-related collectors will be skipped automatically.
Useful for collecting system information when BMC access is not available or needed.
Example with additional options in Local Mode without BMC:
nvdebug -t <PLATFORM> --local -v -o /path/to/output -c
Note
Local Mode is particularly useful when troubleshooting issues on the Host system itself, as it eliminates the need for remote access configuration.
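When nvdebug is driven from a script, the invocations shown in Basic Syntax can be assembled programmatically. The following is a minimal sketch, assuming only the flags that appear in the examples above; the helper name `build_nvdebug_argv` and the `Credentials` tuple are hypothetical and not part of nvdebug.

```python
from typing import List, Optional, Tuple

# (ip, user, password); a hypothetical convenience type for this sketch.
Credentials = Tuple[str, str, str]


def build_nvdebug_argv(
    platform: str,
    bmc: Optional[Credentials] = None,
    host: Optional[Credentials] = None,
    local: bool = False,
) -> List[str]:
    """Assemble an argument list that mirrors the Basic Syntax examples."""
    argv = ["nvdebug"]
    if bmc is not None:
        ip, user, password = bmc
        argv += ["-i", ip, "-u", user, "-p", password]
    argv += ["-t", platform]
    if host is not None:
        # Supplying Host credentials corresponds to Remote Mode.
        ip, user, password = host
        argv += ["-I", ip, "-U", user, "-H", password]
    if local:
        argv.append("--local")
    return argv


# Local Mode without a BMC connection.
print(build_nvdebug_argv("DGX", local=True))
```

Returning a list rather than a single string keeps passwords with special characters intact when the list is handed to `subprocess.run` without a shell.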
Options#
Below are the available options categorized by their functionality.
Mandatory Options#
These options are required to run nvdebug successfully.
Note
The -t/--platform and -b/--baseboard options are required for all operation modes.
-i, --ip <BMC_IP> : The IP address of the Baseboard Management Controller (BMC).
-u, --user <BMC_USER> : The username with administrative privileges on the BMC.
-p, --password <BMC_PASS>: The password for the BMC administrative user.
-t, --platform <PLATFORM>: The platform type of the DUT. Accepted values:
- DGX
- HGX-HMC
- arm64
- x86_64
- NVSwitch
-b, --baseboard <BASEBOARD> : The baseboard type of the DUT. This is required.
Refer to the `Supported Target Servers and NVIDIA Baseboards` table for valid values.
Additional BMC Credentials#
These options provide additional credentials for accessing the BMC.
-r, --sshuser <SSH_USER> : The SSH username for the BMC.
-w, --sshpass <SSH_PASS> : The SSH password for the BMC.
-R, --rfuser <RF_USER> : The Redfish username for the BMC.
-W, --rfpass <RF_PASS> : The Redfish password for the BMC.
Host Options#
These options are relevant when running nvdebug in Remote Mode.
-I, --hostip <HOST_IP> : The IP address of the Host machine.
If not provided, *nvdebug* assumes it is running on the Host.
-U, --hostuser <HOST_USER> : The username with administrative privileges on the Host.
-H, --hostpass <HOST_PASS> : The password for the Host user.
Other Options#
Additional configurations and settings.
-C, --config <CONFIG_PATH> : Path to the main configuration file.
Default: ./config.yaml
-d, --dutconfig <DUT_CONFIG_PATH> : Path to the DUT-specific configuration file.
Default: ./dut_config.yaml
-D, --default_collectors [GROUP (optional)] : Display the default log collectors of a platform type.
Requires the -t/--platform option.
Valid groups: Redfish, IPMI, SSH, Host, HealthCheck
When passed without specifying a group, prints all collectors.
-h, --help : Displays usage information.
-l, --list [GROUP (optional)] : Lists supported log collectors for the specified platform.
Requires the -t/--platform option.
Valid groups: Redfish, IPMI, SSH, Host, HealthCheck
When passed without specifying a group, prints all collectors.
--local : Enables Local Execution mode.
--parse <LOG_DUMP> : Parses an existing *nvdebug* log dump and decodes binary data.
-o, --outdir <OUTPUT_DIR> : Specifies the output directory for generated logs.
Default: /tmp
-P, --port <FW_PORT> : Port number used for forwarding.
Applies only to HGX-Baseboard platforms.
Default: 18888
-v, --verbose : Displays runtime logging output with higher level of verbosity to the console.
Default: False
-V : NVDebug Log Collection level:
Default: (no flag) - All necessary collectors. Always included.
-V: Verbose Log Collections.
-VV: Verbose Log Collections + Optional collectors that take hours to run.
--version : Displays the current tool version.
-z, --skipzip : Skip zip creation
Default: Zip archive is created.
-Z, --skipzipsplit : Skip splitting zip archive at 200MB Size constraint
Default: Zip archive is split into multiple files if it exceeds 200MB.
--zipsplit_threshold <SIZE> : Set the size threshold for splitting the archive into multiple files.
Default: 200MB
Applies only to zipped archives.
Note
Upon completion, the tool generates a .zip file containing the collected logs in the directory specified by the -o/--outdir option. If no output directory is specified, logs are stored in /tmp.
The -z/--skipzip and -Z/--skipzipsplit options are mutually exclusive.
The -Z/--skipzipsplit option is ignored if -z/--skipzip is also specified.
Log Collection Options#
These options control which logs are collected.
-j, --vendor_file <VENDOR_JSON> : Uses a vendor-defined JSON file containing proprietary methods and tools.
-S, --cids <CID> [CID ...] : Runs only the specified log collectors by their CIDs.
-g, --loggroup <GROUP> : Runs all log collectors in the specified group. Supported groups:
- Redfish
- IPMI
- SSH
- Host
- HealthCheck
-e, --override_platform : Flag to override platform restrictions for collectors.
Note
The -S and -g options cannot be used together.
Multiple collector groups can be specified at a time with the -g option.
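The constraint in the note above can be illustrated with Python's standard `argparse` module. This is a sketch of the rule only, not nvdebug's own parser; in particular, accepting several groups as space-separated values after -g is an assumption, and the program name `nvdebug-sketch` is hypothetical.

```python
import argparse

LOG_GROUPS = ["Redfish", "IPMI", "SSH", "Host", "HealthCheck"]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="nvdebug-sketch")
    # -S and -g live in one mutually exclusive group, so passing both fails.
    selection = parser.add_mutually_exclusive_group()
    selection.add_argument("-S", "--cids", nargs="+", metavar="CID")
    # Assumption: multiple groups are given as space-separated values.
    selection.add_argument(
        "-g", "--loggroup", nargs="+", choices=LOG_GROUPS, metavar="GROUP"
    )
    parser.add_argument("-e", "--override_platform", action="store_true")
    return parser


args = build_parser().parse_args(["-g", "Redfish", "IPMI"])
print(args.loggroup)
```

With this parser, a command line that contains both -S and -g exits with a usage error before any collection would start.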
Config Files#
nvdebug utilizes two configuration files located in the same directory as the executable:
DUT Configuration File (default: dut_config.yaml)
NVDebug Configuration File (default: config.yaml)
These configuration files are optional and allow for additional configuration data. If a parameter is specified both through the CLI and a config file, the CLI value takes precedence.
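The precedence rule can be sketched as a layered merge in which later layers win. This is an illustration under stated assumptions rather than nvdebug's implementation: the function `resolve_options` and the dictionary key names are hypothetical, while the default values for the output directory and port are the ones listed under Other Options.

```python
from typing import Any, Dict


def resolve_options(
    defaults: Dict[str, Any],
    config_file: Dict[str, Any],
    cli: Dict[str, Any],
) -> Dict[str, Any]:
    """Merge option layers so that CLI > config file > built-in default."""
    resolved = dict(defaults)
    for layer in (config_file, cli):
        for key, value in layer.items():
            # None means "not provided" in this sketch and never overrides.
            if value is not None:
                resolved[key] = value
    return resolved


defaults = {"outdir": "/tmp", "port": 18888}
config_file = {"port": 19000}
cli = {"outdir": "/var/tmp/nvdebug", "port": None}
print(resolve_options(defaults, config_file, cli))
```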
dut_config.yaml#
The dut_config.yaml file defines parameters specific to the DUT.
Note
This file is optional and can be used to override default values. CLI arguments take precedence over the values in this file.
-d/--dutconfig is used to specify the path to the dut_config file.
Parameter | Description |
---|---|
BMC_IP | BMC IP address. |
BMC_USERNAME | BMC username. |
BMC_PASSWORD | BMC password. |
BMC_RF_PORT | (Optional) Port number for the BMC's Redfish service. |
BMC_SSH_USERNAME | (Optional) BMC SSH username. |
BMC_SSH_PASSWORD | (Optional) BMC SSH password. |
RF_User | (Optional) BMC Redfish username. |
RF_Pass | (Optional) BMC Redfish password. |
TUNNEL_TCP_PORT | (Optional) Port for port forwarding. |
HOST_IP | (Optional) Host OS IP address. |
HOST_USERNAME | (Optional) Host username. |
HOST_PASSWORD | (Optional) Host password. |
HOST_SSH_KEY_PATH | (Optional) Path to the host SSH key. Note: Host password is still required for some log collectors. |
HOST_SSH_PASSWORDLESS | (Optional) Set to True if passwordless SSH is required. Note: Setting this parameter assumes passwordless SSH has been enabled. Passwordless sudo is also required as a setup prerequisite. |
HMC_IP | (Optional) HMC IP address. |
ipmi_cipher | (Optional) IPMI cipher suite to use. |
SETUP_PORT_FORWARDING | (Optional) Set to True to enable HMC port forwarding. Default: False. |
RF_AUTH | (Optional) Set to True if Redfish requires authentication. Default: True. |
RF_DEFAULT_PREFIX | |
IP_NETWORK | (Optional) Specifies IPv4 or IPv6. Default: IPv4. |
NodeType | (Optional) Node type: |
ConfigFileToUse | (Optional) Path of the config file to use for this DUT. Defaults to the file passed with -C. |
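As an illustration of how these parameters fit together, the sketch below renders a small dut_config.yaml for Remote Mode. The key names come from the table above; the addresses and credentials are placeholders, the helper `render_dut_config` is hypothetical, and the flat `KEY: value` layout is an assumption about the file format rather than a documented schema.

```python
import json
from typing import Dict


def render_dut_config(values: Dict[str, str]) -> str:
    """Render flat KEY: "value" lines; JSON strings are valid YAML scalars."""
    lines = [f"{key}: {json.dumps(value)}" for key, value in values.items()]
    return "\n".join(lines) + "\n"


# Placeholder values only; replace them with the details of the real DUT.
remote_dut = {
    "BMC_IP": "192.0.2.10",
    "BMC_USERNAME": "admin",
    "BMC_PASSWORD": "example-password",
    "HOST_IP": "192.0.2.20",
    "HOST_USERNAME": "ubuntu",
    "HOST_PASSWORD": "example-password",
}

print(render_dut_config(remote_dut), end="")
```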
config.yaml#
The config.yaml file controls runtime options for nvdebug.
Note
This file is optional and can be used to override default values.
CLI arguments take precedence over values in this file.
If specified, users do not need to pass the CLI arguments for the options in this file.
-C/--config is used to specify the path to the config file.
For platform and baseboard type combinations and details, refer to Supported Target Servers and NVIDIA Baseboards.
Option | Description |
---|---|
PLATFORM | The platform type of the DUT. Accepted values: DGX, HGX-HMC, arm64, x86_64, NVSwitch, PowerShelf. |
TargetBaseboard (Optional) | The NVIDIA GPU baseboard type. Required for the parse option in NVDebug. Accepted values: Hopper-HGX-8-GPU, Blackwell-HGX-8-GPU, DC-Hopper-PCIe, DC-Blackwell-PCIe, C2, HGX B300, MGX-GH200, MGX C2, MGX-GH200-NVL2, MGX-4U-NVL16, GH200, GH200 NVL, GB200 NVL, GB300 NVL. |
LogSanitization (Optional) | If set to True, masks all IP addresses and provided credentials in the text/JSON logs. Default: True. |
SKIP_PORT_FW (Optional) | If set to True, all Redfish logs under HGX will be skipped because they require port forwarding. Default: False. |
SKIP_BMC_SSH_LOGS (Optional) | If set to True, all SSH log collection will be skipped. Default: False. |
SKIP_HOST_LOGS (Optional) | If set to True, all Host log collection will be skipped. Default: False. |
SKIP_IPMI_LOGS (Optional) | If set to True, all IPMI log collection will be skipped. Default: False. |
SKIP_REDFISH_OOB_LOGS (Optional) | If set to True, all Redfish OOB log collection will be skipped. Default: False. |
HGX_I2C1_BUS_ADDRESS (Optional) | I2C1 Bus number of NVIDIA HGX-Baseboard. Default: 11. |
HGX_I2CTRANSFER_WRITE_TO_ADDRESS (Optional) | I2C address (hex string) for i2ctransfer write operations. Default: "0x11". |
SystemID (Optional) | The Redfish SystemId resource name to be used. This is used to override the default Redfish SystemId resource name. Default: None. |
ManagerID (Optional) | The Host BMC manager ID name to be used. This is used to override the default Host BMC manager ID name. Default: None. |
FW_INVENTORY_TABLE_PROPERTIES (Optional) | List of additional properties to be collected by the firmware inventory table log collector (R20). Refer to Table 3 for more information. |
EXTRA_LOG_COLLECTION (Optional) | Enables additional log collection. Runs nvidia-bug-report (H11) with the --extra-system-data flag, which might cause the system to hang. Default: False. |
NVIDIA_BUG_REPORT_SAFE_MODE_TIMEOUT (Optional) | When set, overrides the default timeout for nvidia-bug-report (H11) in safe mode. Default: 1200 seconds (20 minutes). |
NVIDIA_BUG_REPORT_STANDARD_MODE_TIMEOUT (Optional) | When set, overrides the default timeout for nvidia-bug-report (H11) in standard mode. Default: 2400 seconds (40 minutes). |
NVLINK_OOB_URI (Optional) | List of Redfish URIs to be collected for NVLink-related logs. If left empty, System Event Logs are collected. |
ADDITIONAL_OOB_URI_COLLECTION (Optional) | List of additional Redfish URIs to be collected. |
NVOS_TECH_DUMP_TIMEOUT (Optional) | Timeout for NVOS Tech-support dump (H14). Default: 450 seconds (7.5 minutes). |
EXPAND_QUERY_CHASSIS_LEVEL (Optional) | Controls expanded query depth for chassis Redfish collectors. Integer. Default: 1. |
EXPAND_QUERY_FIRMWARE_INVENTORY_LEVEL (Optional) | Controls expanded query depth for firmware inventory. Integer. Default: 1. |
EXPAND_QUERY_MANGER_LEVEL (Optional) | Controls expanded query depth for manager Redfish collectors. Integer. Default: 1. |
EXPAND_QUERY_SYSTEM_LEVEL (Optional) | Controls expanded query depth for system Redfish collectors. Integer. Default: 1. |
GENERATE_HTML_REPORTS (Optional) | If True, generates HTML reports in addition to logs. Default: True. |
COLLECTOR_TO_SKIP (Optional) | List of collector IDs to skip during log collection. Example: ["R2", "R4"]. Default: None. |
SYSTEM_ID_TO_SKIP (Optional) | Provides a list of system IDs to skip during log collection. This option filters out specific systems from the collection on collectors that support dump collection. Applies to the R5, R6, R7, R16, R24, R28, R31, R35, R36, and R37 collector IDs. Example: ["System-1", "System-2"]. Default: None. |
TASK_ID_PREFIX (Optional) | An optional prefix for the ID for all collected tasks. NVDebug will verify whether the task ID is valid in Task Service. If the prefix is not valid, and is provided, NVDebug will add the prefix to the task ID. Default: None. |
CUSTOM_DUMP_SERVICES (Optional) | An optional log collection service for each service in the CUSTOM_DUMP_SERVICES - specified URI. A payload will be used to collect logs. Default: None. |
Additional Notes:
If BMC_IP, BMC_USERNAME, BMC_PASSWORD, and PLATFORM are defined in the configuration files (the BMC values in dut_config.yaml, PLATFORM in config.yaml), these values are used. Otherwise, they must be provided via the CLI using -i, -u, -p, and -t.
If BMC_SSH_USERNAME, BMC_SSH_PASSWORD, RF_User, and RF_Pass are not provided in either the config file or CLI, they default to the regular BMC credentials.
For HGX-Baseboard systems, port forwarding is required to collect some data/logs from the HMC. Specify the port via the config file or -P/--port. To skip these logs, set SKIP_PORT_FW to True. Default: False.
By default, the log archive is generated in /tmp. Specifying OUTPUT_DIR changes the output location.
By default, nvdebug assumes it runs on the Host OS and collects logs locally. If running from a remote machine, specify HOST_IP, HOST_USERNAME, and HOST_PASSWORD.
By default, nvdebug queries Redfish to discover SystemID and ManagerID. To restrict log collection to a single SystemID or ManagerID, specify these in the config file.
Collector R20 (firmware inventory table) collects ID and version by default. Additional properties can be defined under FW_INVENTORY_TABLE_PROPERTIES.
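The credential fallback described in these notes can be sketched as follows. The key names match the dut_config.yaml table; the function `effective_credentials` is hypothetical, and falling back field by field (rather than as a username/password pair) is an assumption of this sketch.

```python
from typing import Dict, Optional, Tuple


def effective_credentials(
    cfg: Dict[str, Optional[str]],
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Resolve SSH and Redfish credentials, defaulting to the BMC ones."""
    bmc_user = cfg.get("BMC_USERNAME")
    bmc_pass = cfg.get("BMC_PASSWORD")
    return {
        "bmc": (bmc_user, bmc_pass),
        # Assumption: each missing field falls back independently.
        "ssh": (
            cfg.get("BMC_SSH_USERNAME") or bmc_user,
            cfg.get("BMC_SSH_PASSWORD") or bmc_pass,
        ),
        "redfish": (
            cfg.get("RF_User") or bmc_user,
            cfg.get("RF_Pass") or bmc_pass,
        ),
    }


creds = effective_credentials(
    {"BMC_USERNAME": "admin", "BMC_PASSWORD": "pw", "RF_User": "rfadmin", "RF_Pass": "rfpw"}
)
print(creds["ssh"], creds["redfish"])
```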
Platform and Baseboard Type#
For details on supported servers and baseboards, see Supported Target Servers and NVIDIA Baseboards.
Target Server Platform Category | NVIDIA Baseboards | Comments |
---|---|---|
arm64 | GH200, C2, MGX C2, MGX-GH200, MGX-GH200-NVL2, MGX-4U-NVL16, GB200 NVL, GB300 NVL | Supports all server compute nodes with the arm64 architecture. |
HGX-HMC | Hopper-HGX-8-GPU, Blackwell-HGX-8-GPU, HGX-B300, GB200 NVL, GB300 NVL | Supports log collection directly from the HMC using Host BMC TCP port forwarding or the Redfish Aggregation. |
DGX | | Supports NVIDIA DGX-H100 and later series servers. |
x86_64 | MGX-PCIe-NVL16, DC-Blackwell-PCIe, DC-Hopper-PCIe | Supports all server compute nodes with the x86_64 architecture using NVIDIA PCIe cards. |
NVSwitch | GB200 NVL NVSwitchTray, GB300 NVL NVSwitchTray | Supports the NVIDIA NVSwitch Node. |
PowerShelf | PowerShelfController | Supports the NVIDIA PowerShelf Node. |
Note
Any platform with an HMC can use HGX-HMC as the platform type. This enables access to HMC-specific collectors. If you need to run additional collectors specific to your platform, you can use the override flag (-e/--override_platform) along with specific collector IDs (-S/--cids).
Using Platform Overrides#
When working with different platforms, you may need to run specific collectors that aren’t enabled by default for your platform type. The override functionality allows you to execute any collector regardless of platform restrictions.
Example usage with overrides:
# For a DGX platform, running specific Redfish collectors
./nvdebug --config <config.yaml> --dutconfig <dut_config.yaml> --cids R1 R2 R3 R4 R5 R8
# By default, only R8 would run.
# However, if you include the --override_platform flag, all specified collector IDs will be executed:
./nvdebug --config <config.yaml> --dutconfig <dut_config.yaml> --cids R1 R2 R3 R4 R5 R8 --override_platform
Note
When using overrides:
- Use the -e/--override_platform flag to bypass platform restrictions
- Specify collectors using -S/--cids followed by collector IDs
- You can combine multiple collectors from different groups (Redfish, IPMI, SSH, Host)
- Be cautious when overriding as some collectors may not be compatible with all platforms
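The effect of the override flag can be pictured as a filter that is simply bypassed. The sketch below is illustrative only: the `SUPPORTED_PLATFORMS` map is hypothetical data chosen to reproduce the DGX example above (only R8 runs by default) and does not reflect nvdebug's real collector-to-platform table.

```python
from typing import Dict, List, Set

# Hypothetical support map, invented for this sketch.
SUPPORTED_PLATFORMS: Dict[str, Set[str]] = {
    "R1": {"HGX-HMC"},
    "R2": {"HGX-HMC"},
    "R3": {"HGX-HMC"},
    "R4": {"HGX-HMC"},
    "R5": {"HGX-HMC"},
    "R8": {"DGX", "HGX-HMC"},
}


def select_collectors(
    cids: List[str], platform: str, override_platform: bool = False
) -> List[str]:
    """Return the requested collectors that would actually run."""
    if override_platform:
        # -e/--override_platform: run everything that was requested.
        return list(cids)
    return [cid for cid in cids if platform in SUPPORTED_PLATFORMS.get(cid, set())]


requested = ["R1", "R2", "R3", "R4", "R5", "R8"]
print(select_collectors(requested, "DGX"))
print(select_collectors(requested, "DGX", override_platform=True))
```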
Log Archiving#
nvdebug provides several options for controlling archive creation and management:
Skip creating the final zip archive
Control zip archive splitting
Configure archive size threshold
By default, if the archive exceeds 200 MB, it is split into 200 MB chunks that can be recombined later.
Archive Control Options
# Skip creating the final zip archive (-z/--skipzip)
./nvdebug -i <ip> -u <username> -p <pass> -t <platform> --skipzip
# Skip splitting large archives (-Z/--skipzipsplit)
./nvdebug -i <ip> -u <username> -p <pass> -t <platform> --skipzipsplit
# Set custom archive split threshold (in MB)
./nvdebug -i <ip> -u <username> -p <pass> -t <platform> --zipsplit_threshold 500
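To estimate how many files a collection will produce, the splitting rule can be sketched as a ceiling division. This is a rough planning aid under stated assumptions, not nvdebug's archiver: the function `planned_parts` is hypothetical, sizes are treated as plain megabytes, and the real tool may round or measure differently.

```python
import math


def planned_parts(
    archive_mb: float, threshold_mb: float = 200, skip_split: bool = False
) -> int:
    """Estimate how many archive files result from the split threshold."""
    # With -Z/--skipzipsplit, or at or below the threshold, one file is kept.
    if skip_split or archive_mb <= threshold_mb:
        return 1
    return math.ceil(archive_mb / threshold_mb)


print(planned_parts(450))
print(planned_parts(450, threshold_mb=500))
```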
Passwordless SSH#
nvdebug enables users to establish passwordless SSH connections for seamless interaction with DUTs, including the Host OS and BMC. To properly configure this feature:
Set Up Passwordless SSH: Generate an SSH key pair (if not already created) and copy the public key to each DUT to which you want to connect.
This can be achieved with the ssh-copy-id command for the Host OS and the BMC:
ssh-copy-id <DUT_HOST_USERNAME>@<DUT_HOST_IP>
ssh-copy-id <DUT_BMC_USERNAME>@<DUT_BMC_IP>
Ensure that SSH is configured to accept key-based authentication.
For example, on Ubuntu, add the following to the sshd_config file:
PubkeyAuthentication yes
AuthorizedKeysFile .ssh/authorized_keys
For other distributions, refer to the documentation for your specific Linux distribution.
Enable passwordless sudo. For nvdebug to function correctly, ensure that users have passwordless sudo access on the DUT. Update the sudoers file on the DUT with the following command:
sudo visudo
Add the following lines for the user and replace <DUT_HOST_USERNAME> with the actual username:
# System Information Collection
<DUT_HOST_USERNAME> ALL=(ALL) NOPASSWD: /usr/bin/dmesg
<DUT_HOST_USERNAME> ALL=(ALL) NOPASSWD: /usr/bin/lspci
<DUT_HOST_USERNAME> ALL=(ALL) NOPASSWD: /usr/sbin/dmidecode
<DUT_HOST_USERNAME> ALL=(ALL) NOPASSWD: /usr/bin/lshw
<DUT_HOST_USERNAME> ALL=(ALL) NOPASSWD: /usr/bin/nvidia-smi
<DUT_HOST_USERNAME> ALL=(ALL) NOPASSWD: /usr/bin/journalctl
<DUT_HOST_USERNAME> ALL=(ALL) NOPASSWD: /usr/sbin/nvme
<DUT_HOST_USERNAME> ALL=(ALL) NOPASSWD: /usr/bin/grep /tmp/nvdebug_*
<DUT_HOST_USERNAME> ALL=(ALL) NOPASSWD: /usr/bin/grep /var/log/nvidia*
<DUT_HOST_USERNAME> ALL=(ALL) NOPASSWD: /usr/bin/grep /proc/*
<DUT_HOST_USERNAME> ALL=(ALL) NOPASSWD: /usr/bin/grep /sys/*
<DUT_HOST_USERNAME> ALL=(ALL) NOPASSWD: /usr/bin/grep /etc/nvidia*
<DUT_HOST_USERNAME> ALL=(ALL) NOPASSWD: /usr/bin/grep /etc/os-release
<DUT_HOST_USERNAME> ALL=(ALL) NOPASSWD: /usr/bin/grep /etc/sos/sos-nvdebug.conf
<DUT_HOST_USERNAME> ALL=(ALL) NOPASSWD: /usr/bin/find /tmp/nvdebug_* -type f -exec cat {} \;
<DUT_HOST_USERNAME> ALL=(ALL) NOPASSWD: /usr/bin/find /var/log/nvidia* -type f -exec cat {} \;
<DUT_HOST_USERNAME> ALL=(ALL) NOPASSWD: /usr/bin/find /proc/* -type f -exec cat {} \;
<DUT_HOST_USERNAME> ALL=(ALL) NOPASSWD: /usr/bin/find /sys/* -type f -exec cat {} \;
<DUT_HOST_USERNAME> ALL=(ALL) NOPASSWD: /usr/bin/find /etc/nvidia* -type f -exec cat {} \;
<DUT_HOST_USERNAME> ALL=(ALL) NOPASSWD: /usr/bin/find /etc/sos/sos-nvdebug.conf -type f -exec cat {} \;
<DUT_HOST_USERNAME> ALL=(ALL) NOPASSWD: /usr/bin/cat /tmp/nvdebug_*
<DUT_HOST_USERNAME> ALL=(ALL) NOPASSWD: /usr/bin/cat /var/log/nvidia*
<DUT_HOST_USERNAME> ALL=(ALL) NOPASSWD: /usr/bin/cat /proc/*
<DUT_HOST_USERNAME> ALL=(ALL) NOPASSWD: /usr/bin/cat /sys/*
<DUT_HOST_USERNAME> ALL=(ALL) NOPASSWD: /usr/bin/cat /etc/nvidia*
<DUT_HOST_USERNAME> ALL=(ALL) NOPASSWD: /usr/bin/cat /etc/os-release
<DUT_HOST_USERNAME> ALL=(ALL) NOPASSWD: /usr/bin/cat /etc/sos/sos-nvdebug.conf
<DUT_HOST_USERNAME> ALL=(ALL) NOPASSWD: /usr/bin/test -w /tmp
<DUT_HOST_USERNAME> ALL=(ALL) NOPASSWD: /usr/bin/ls -al /var/log/fabricmanager.log
<DUT_HOST_USERNAME> ALL=(ALL) NOPASSWD: /usr/bin/ls -al /var/log/nmx/nmx-c/fabricmanager.log
# NVIDIA Tools
<DUT_HOST_USERNAME> ALL=(ALL) NOPASSWD: /usr/bin/nvflash
<DUT_HOST_USERNAME> ALL=(ALL) NOPASSWD: /usr/bin/nvidia-bug-report.sh
<DUT_HOST_USERNAME> ALL=(ALL) NOPASSWD: /usr/sbin/opensm
<DUT_HOST_USERNAME> ALL=(ALL) NOPASSWD: /usr/bin/nvos*
<DUT_HOST_USERNAME> ALL=(ALL) NOPASSWD: /usr/bin/which
# File Operations (restricted to specific directories)
<DUT_HOST_USERNAME> ALL=(ALL) NOPASSWD: /bin/mkdir /tmp/nvdebug_*
<DUT_HOST_USERNAME> ALL=(ALL) NOPASSWD: /bin/mkdir /tmp/nvdebug_transfer*
<DUT_HOST_USERNAME> ALL=(ALL) NOPASSWD: /bin/mkdir /var/log/nvidia*
<DUT_HOST_USERNAME> ALL=(ALL) NOPASSWD: /bin/cp /tmp/nvdebug_*/* /tmp/
<DUT_HOST_USERNAME> ALL=(ALL) NOPASSWD: /bin/cp /tmp/nvdebug_transfer*/* /tmp/
<DUT_HOST_USERNAME> ALL=(ALL) NOPASSWD: /bin/cp /var/log/nvidia*/* /tmp/
<DUT_HOST_USERNAME> ALL=(ALL) NOPASSWD: /bin/cp /var/log/fabricmanager.log /tmp/
<DUT_HOST_USERNAME> ALL=(ALL) NOPASSWD: /bin/cp /var/log/nmx/nmx-c/fabricmanager.log /tmp/
<DUT_HOST_USERNAME> ALL=(ALL) NOPASSWD: /bin/chmod 644 /tmp/nvdebug_*/*
<DUT_HOST_USERNAME> ALL=(ALL) NOPASSWD: /bin/chmod 644 /tmp/nvdebug_transfer*/*
<DUT_HOST_USERNAME> ALL=(ALL) NOPASSWD: /bin/chmod 644 /var/log/nvidia*/*
<DUT_HOST_USERNAME> ALL=(ALL) NOPASSWD: /bin/rm /tmp/nvdebug_*/*
<DUT_HOST_USERNAME> ALL=(ALL) NOPASSWD: /bin/rm /tmp/nvdebug_transfer*/*
<DUT_HOST_USERNAME> ALL=(ALL) NOPASSWD: /bin/rm /var/log/nvidia*/*
<DUT_HOST_USERNAME> ALL=(ALL) NOPASSWD: /bin/rm -rf /tmp/nvdebug_*
<DUT_HOST_USERNAME> ALL=(ALL) NOPASSWD: /bin/rm -rf /tmp/nvdebug_transfer*
<DUT_HOST_USERNAME> ALL=(ALL) NOPASSWD: /bin/tar -czf /tmp/nvdebug_*/*.tar.gz /tmp/nvdebug_*
<DUT_HOST_USERNAME> ALL=(ALL) NOPASSWD: /bin/tar -czf /tmp/nvdebug_*/*.tar.gz /tmp/nvdebug_transfer*
<DUT_HOST_USERNAME> ALL=(ALL) NOPASSWD: /bin/tar -czf /tmp/nvdebug_*/*.tar.gz /var/log/nvidia*
Note
The exact paths might vary depending on your Linux distribution. Use the which command to verify the correct paths on your system.
This configuration provides the minimum required sudo permissions for nvdebug to function.
Never use NOPASSWD: ALL because this option poses a significant security risk.
Some commands might be in different locations on different distributions (for example, /usr/bin versus /bin).
The tool uses these commands to collect system information, manage temporary files, and execute NVIDIA-specific tools.
If you encounter permission issues, verify that all required commands are included in the sudoers configuration.
The sudo commands are executed with the -S flag to read the password from stdin, which is handled by the tool.
To prevent unauthorized access to system files, file operations are restricted to specific directories.
The tool creates temporary files in /tmp/nvdebug_*, /tmp/nvdebug_transfer*, and NVIDIA logs in /var/log/nvidia*.
The tool accesses system information from /proc/*, /sys/*, and /etc/* for diagnostic purposes.
Each command and path combination is specified separately to ensure the correct sudoers syntax.
Warning
Here are some security considerations:
Principle of Least Privilege: This configuration grants only the minimum required permissions for nvdebug to function. Do not add unnecessary permissions.
Command Path Security: To prevent path-based attacks, always use absolute paths in the sudoers configuration.
Password Handling: The tool uses sudo with the -S flag for password handling. Ensure proper password management in your environment.
Temporary Files: The tool creates temporary files in /tmp/nvdebug_* and /tmp/nvdebug_transfer*. Ensure proper file permissions and cleanup.
File Access Restrictions: To prevent unauthorized access to system files, file operations are restricted to specific directories.
System Information Access: The tool needs read access to /proc/* and /sys/* for system diagnostics. These are read-only operations.
Command Arguments: To prevent command injection, each command’s arguments are explicitly specified.
Audit Trail: To track command execution, consider enabling sudo logging:
# Add to /etc/sudoers
Defaults logfile="/var/log/sudo.log"
Defaults log_input,log_output
Regular Review: Periodically review the sudoers configuration to ensure it remains secure and necessary.
Environment Variables: sudo might not preserve all environment variables. The tool handles this internally.
To verify the correct paths on your system, run the following command:
which dmesg lspci dmidecode lshw nvidia-smi journalctl nvme nvflash nvidia-bug-report.sh opensm tar cat test ls
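The same check can be scripted so that missing tools are reported in one pass. The sketch below uses Python's standard `shutil.which`, which searches PATH much like the `which` command; the helper names are hypothetical, and the command list is copied from the line above.

```python
import shutil
from typing import Dict, Iterable, List, Optional

REQUIRED_COMMANDS = [
    "dmesg", "lspci", "dmidecode", "lshw", "nvidia-smi", "journalctl",
    "nvme", "nvflash", "nvidia-bug-report.sh", "opensm", "tar", "cat",
    "test", "ls",
]


def locate_commands(names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each command name to its absolute path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


def missing_commands(names: Iterable[str]) -> List[str]:
    """List the commands that could not be found on PATH."""
    return [name for name, path in locate_commands(names).items() if path is None]


if __name__ == "__main__":
    for name, path in locate_commands(REQUIRED_COMMANDS).items():
        print(f"{name}: {path or 'NOT FOUND'}")
```

The resolved paths are the ones to use in the sudoers entries. Note that this searches the PATH of the user running the script, which may differ from root's PATH on the DUT.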
Note
Here are some security best practices:
Regular Updates: Keep the system and nvdebug tool updated to the latest versions.
Access Control: Restrict access to the nvdebug tool and its configuration files.
Logging: Monitor sudo logs for any unauthorized or suspicious activities.
Network Security: When using remote collection, ensure secure network connections.
Data Protection: The collected logs might contain sensitive information. Handle them according to your security policies.
Directory Permissions: Ensure that the /tmp/nvdebug_*, /tmp/nvdebug_transfer*, and /var/log/nvidia* directories have appropriate permissions and are regularly cleaned up.
System Information: The tool needs to read system information from /proc/* and /sys/* for diagnostics. These are read-only operations.
Command Arguments: To prevent command injection, and ensure proper access control, each command's arguments are explicitly specified.
Enable the Passwordless SSH Flag: Within the dut_config.yaml file, enable the passwordless SSH flag (HOST_SSH_PASSWORDLESS) by setting it as shown under the DUT Object Options.
Additional Considerations#
Ensure that the DUT’s firewall or SSH configuration allows connections from the host system.
Test the passwordless SSH connection and sudo functionality before running nvdebug using commands like:
ssh <DUT_HOST_USERNAME>@<DUT_HOST_IP> 'sudo ls'
If managing multiple DUTs, consider using an automation tool (e.g., Ansible) to streamline key distribution and sudo configuration.
By following these steps, you can ensure a smooth setup of passwordless SSH for nvdebug.
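The pre-flight test mentioned above can be made non-interactive so that it fails fast instead of waiting at a prompt. The following is a sketch, assuming OpenSSH's `BatchMode=yes` option and sudo's `-n` flag, both of which suppress password prompts; the helper names and the default remote command are hypothetical choices for this example, and the remote command must be one that the sudoers entries permit.

```python
import subprocess
from typing import List


def ssh_sudo_check_argv(
    user: str, host: str, remote_cmd: str = "sudo -n ls"
) -> List[str]:
    """Build an ssh command that fails rather than prompting for a password."""
    # BatchMode=yes disables ssh password prompts; sudo -n disables sudo's.
    return ["ssh", "-o", "BatchMode=yes", f"{user}@{host}", remote_cmd]


def passwordless_ok(user: str, host: str, timeout_s: float = 15.0) -> bool:
    """Return True when key-based SSH and passwordless sudo both succeed."""
    try:
        result = subprocess.run(
            ssh_sudo_check_argv(user, host),
            capture_output=True,
            timeout=timeout_s,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


print(ssh_sudo_check_argv("ubuntu", "192.0.2.20"))
```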
The Parse Option#
The --parse option in the CLI takes an nvdebug log dump as input and creates a new dump in which the binary files are decoded into plain text and the archive files are extracted into subdirectories.
The following binary files will be decoded:
IRoT/ERoT Dumps (Collector R6)
FPGA Register Dumps (Collector R5)
Note
To decode FPGA dumps, specify the baseboard type (hopper-hgx-8-gpu or blackwell-hgx-8-gpu).
Example:
$ ./nvdebug --parse /tmp/nvdebug_logs_01_10_2024_15_58_09.zip -b blackwell-hgx-8-gpu
Extracting the Zip File:
$ ls /tmp/*.zip
/tmp/nvdebug_logs_01_10_2024_15_58_09.zip /tmp/nvdebug_logs_01_10_2024_15_58_09_decoded.zip
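Judging from the listing above, the decoded archive appears next to the input with a `_decoded` suffix before the extension. A script that post-processes the result could derive the expected path as sketched below; the naming rule is inferred from this single example and the function name is hypothetical, so treat it as an assumption rather than a guaranteed convention.

```python
from pathlib import PurePosixPath


def decoded_archive_path(dump: str) -> PurePosixPath:
    """Derive the expected output path, e.g. x.zip -> x_decoded.zip."""
    src = PurePosixPath(dump)
    # Insert "_decoded" between the file stem and its extension.
    return src.with_name(f"{src.stem}_decoded{src.suffix}")


print(decoded_archive_path("/tmp/nvdebug_logs_01_10_2024_15_58_09.zip"))
```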
FPGA Register Dump Folder:
$ ls /path/to/decoded/logs
Page1_2_0x0b_0x00.txt Page3_3_0x0a_0x00.txt fpga_dump.txt
ERoT/IRoT Dump Folder:
$ ls /path/to/decoded/logs
FPGA_0_dump.txt GPU_SXM_5_dump.txt
FPGA_0_query_boot_status.log GPU_SXM_5_query_boot_status.log
GPU_SXM_1.log GPU_SXM_6.log
GPU_SXM_1_CMS.txt GPU_SXM_6_CMS.txt
GPU_SXM_1_dump.txt GPU_SXM_6_dump.txt
GPU_SXM_1_query_boot_status.log GPU_SXM_6_query_boot_status.log
GPU_SXM_2.log GPU_SXM_7.log
GPU_SXM_2_CMS.txt GPU_SXM_7_CMS.txt
GPU_SXM_2_dump.txt GPU_SXM_7_dump.txt
GPU_SXM_2_query_boot_status.log GPU_SXM_7_query_boot_status.log
GPU_SXM_3.log GPU_SXM_8.log
GPU_SXM_3_CMS.txt GPU_SXM_8_CMS.txt
GPU_SXM_3_dump.txt GPU_SXM_8_dump.txt
GPU_SXM_3_query_boot_status.log GPU_SXM_8_query_boot_status.log
GPU_SXM_4.log HMC_0_dump.txt
GPU_SXM_4_CMS.txt HMC_0_query_boot_status.log
NVLinkManagementNIC_0_query_boot_status.log
GPU_SXM_4_dump.txt NVSwitch_0_dump.txt
GPU_SXM_4_query_boot_status.log NVSwitch_0_query_boot_status.log
GPU_SXM_5.log NVSwitch_1_query_boot_status.log
GPU_SXM_5_CMS.txt