Getting Started with nvdebug
The NVIDIA® NVDebug tool, nvdebug, runs on server platforms or from remote client machines.
This binary tool, which is available for x86_64 or arm64-SBSA architecture systems, collects the
following information:
Out-of-band (OOB) BMC logs and information for troubleshooting server issues
Logs from the host
Requirements
Requirement | Client Host | Server Host |
---|---|---|
Linux-based operating system: Linux kernel 4.4 or later supported (version 4.15 or later recommended) | X | X |
GNU C Library glibc-2.7 or later | X | X |
OS: Ubuntu 18.04 or later supported (Ubuntu 22.04 recommended) | X | X |
Python 3.10 | X | X |
 | X | X |
The | X | X |
A server device under test (DUT) accessible by the BMC from the client host using Redfish and IPMI-over-LAN. | X | X |
The | X | |
BMC Management and Server Host Management networks are in the same subnet. | X | |
BMC
Before running the NVDebug one-click script on the BMC, ensure the following tools are installed:
i2cset
i2ctransfer
ipmitool
i2cdump
curl
Redfish API
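The prerequisite check above can be automated. The following is a minimal sketch, assuming a Python interpreter is available on the system where you run the check. The function name missing_tools is hypothetical and is not part of nvdebug. "Redfish API" is a service rather than an executable, so the sketch checks only the five command-line tools.

```python
import shutil

# Command-line tools named in the list above. "Redfish API" is a service,
# not an executable, so it is not checked here.
REQUIRED_TOOLS = ["i2cset", "i2ctransfer", "ipmitool", "i2cdump", "curl"]


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

The `which` parameter exists only so that the lookup can be replaced in tests. In normal use, call missing_tools() with no arguments.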
The nvdebug Command-Line Interface
The high-level syntax of the nvdebug command supports the collection of debug logs over OOB.
You can run the tool in either of the following ways:
From a remote machine with access to the BMC and host.
Directly on the host machine if the host can access the BMC.
If the host IP is passed through the configuration file or the command-line interface (CLI)
using -I/--hostip, nvdebug assumes the tool runs on a remote machine. Otherwise, nvdebug
assumes the tool runs on the host and collects the host logs locally.
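The rule above reduces to a single check on whether a host IP was supplied. The following sketch only illustrates that rule. The function name resolve_mode and the "remote"/"local" labels are hypothetical, and it does not represent how nvdebug is implemented internally.

```python
def resolve_mode(cli_hostip=None, config_hostip=None):
    """Return "remote" when a host IP is supplied by either source, else "local"."""
    host_ip = cli_hostip or config_hostip
    return "remote" if host_ip else "local"


if __name__ == "__main__":
    # 192.0.2.20 is a documentation-only example address.
    print(resolve_mode(cli_hostip="192.0.2.20"))
    print(resolve_mode())
```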
Syntax
$ nvdebug -i <BMCIP> -u <BMCUSER> -p <BMCPASS> -t <PLATFORM>
Mandatory variables:
-i/--ip is the BMC IP address.
-u/--user is the BMC username with administrative privileges.
-p/--password is the BMC administrative user password.
-t/--platform is the platform type of the DUT, and it accepts DGX, HGX-Baseboard, or G+H.
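If you call nvdebug from a script, you can assemble the mandatory arguments as a list. The following is a minimal sketch. The helper name build_command, the address 192.0.2.10, and the credentials are placeholders. Only the flags and platform values come from the syntax above.

```python
import shlex

# Platform values accepted by -t/--platform, as listed above.
PLATFORMS = ("DGX", "HGX-Baseboard", "G+H")


def build_command(bmc_ip, user, password, platform):
    """Build an argument list that contains only the mandatory nvdebug options."""
    if platform not in PLATFORMS:
        raise ValueError(f"platform must be one of {PLATFORMS}, got {platform!r}")
    return ["nvdebug", "-i", bmc_ip, "-u", user, "-p", password, "-t", platform]


if __name__ == "__main__":
    argv = build_command("192.0.2.10", "admin", "example-password", "DGX")
    # Print the command instead of running it, so the sketch has no side effects.
    print(shlex.join(argv))
```

You can pass the returned list to subprocess.run without a shell, which avoids quoting problems when a password contains special characters.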
Additional credentials:
-r/--sshuser is the BMC SSH username.
-w/--sshpass is the BMC SSH password.
-R/--rfuser is the BMC Redfish username.
-W/--rfpass is the BMC Redfish password.
Host options:
-I/--hostip is the host IP address.
-U/--hostuser is the host username with administrative privileges.
-H/--hostpass is the host password.
Additional options:
-b/--baseboard <baseboard> is the baseboard type.
The accepted values are hopper-hgx-8-gpu and blackwell-hgx-8-gpu.
-c/--common collects the common logs using the included common.json file.
-C/--config <file path> is the path to the config file. Default is ./config.yaml.
-v/--verbose displays the detailed output and error messages.
-o/--outdir <output dir> is the directory where the output is generated.
The default location is /tmp.
-P/--port <fw_port> is the port number that is used for forwarding.
The --port option applies only to HGX-Baseboard based platforms,
and the default value is 18888.
Additional log collection options:
-S/--cids CID [CID ...] runs the log collectors that correspond to the CIDs that were passed.
-g/--loggroup <Redfish|IPMI|SSH|Host|HealthCheck> runs all log collectors of a specific type
that are supported on the current platform. Only one collector
group can be specified.
-j/--vendor_file <vendor.json> is a vendor-defined JSON file that uses proprietary methods
and tools as defined by the user.
The -S and -g options cannot be used together.
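A wrapper script can enforce the same constraint before it calls the tool. The following sketch uses argparse from the Python standard library to model the two options as a mutually exclusive group. It illustrates the rule only and is not nvdebug's own parser. The program name nvdebug-sketch and the CID values in the example are hypothetical.

```python
import argparse

# Log group names accepted by -g/--loggroup, as listed above.
LOG_GROUPS = ["Redfish", "IPMI", "SSH", "Host", "HealthCheck"]


def make_parser():
    """Build a parser that rejects -S and -g when both are given."""
    parser = argparse.ArgumentParser(prog="nvdebug-sketch")
    selection = parser.add_mutually_exclusive_group()
    selection.add_argument("-S", "--cids", nargs="+", metavar="CID")
    selection.add_argument("-g", "--loggroup", choices=LOG_GROUPS)
    return parser


if __name__ == "__main__":
    args = make_parser().parse_args(["-g", "Redfish"])
    print(args.loggroup, args.cids)
```

When both options are given, argparse prints a usage error and exits with a nonzero status.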
Utility options:
-h/--help and --version are standalone options. -l/--list requires the platform
type to be specified using -t/--platform.
--parse <log dump> parses an nvdebug log dump and decodes the binary data.
-h/--help provides information about tool usage.
--version displays the current version of the tool.
-l/--list [Redfish|IPMI|SSH|Host|HealthCheck] lists the log collectors that are supported by the
platform with their collector IDs (CIDs).
If a type is passed, only the log collectors of that type are listed.
The -l/--list option requires the target
platform type to be specified with -t/--platform.
By default, if option -c is not included, the nvdebug tool collects logs based on the common.json
and platform_xyz.json files. At the end of the run, the tool generates the output log xyz.zip
file in the directory that was provided using the -o option. If no directory is provided, the log
is generated in the /tmp directory.
The Configuration File
You can use the config.yaml file, which is in the same folder as the executable file by default,
to provide additional (but optional) configuration setup. When both the CLI and configuration file
specify an argument, the value provided by the CLI takes priority.
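The precedence rule can be pictured as a dictionary merge in which CLI values are applied last. The following sketch is illustrative only. The function name merge_settings is hypothetical, and it assumes that unset CLI options are represented as None.

```python
def merge_settings(config, cli):
    """Overlay CLI values on configuration-file values; the CLI wins when both are set."""
    merged = dict(config)
    # Skip CLI options that were not given (None) so they do not erase file values.
    merged.update({key: value for key, value in cli.items() if value is not None})
    return merged


if __name__ == "__main__":
    file_values = {"PLATFORM": "DGX", "OUTPUT_DIR": "tmp/dgx/sys1"}
    cli_values = {"PLATFORM": None, "OUTPUT_DIR": "/tmp/run2"}
    print(merge_settings(file_values, cli_values))
```

In this example, OUTPUT_DIR comes from the CLI, and PLATFORM falls back to the value in the file.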
DGX H100/H200 Example
To run the tool using the command-line parameters, but not the configuration file:
nvdebug -i <bmc ip> -u <bmc user> -p <bmc pass> -t DGX -I <host IP> -U <host user> -H <host pass> -r <bmc ssh username> -w <bmc ssh password>
The -I, -U, -H, -r, and -w options are optional.
To run the tool using the configuration file, provide all the required parameter settings including
BMC_IP, BMC_USERNAME, BMC_PASSWORD, and PLATFORM. You can set the output directory
by providing a value for the OUTPUT_DIR parameter (or by using the -o/--outdir CLI option).
The following example shows the BMC credentials in the configuration setup:
PLATFORM: "DGX"
OUTPUT_DIR: "tmp/dgx/sys1"
DUT:
  BMC_IP: "bmc_ip"
  BMC_USERNAME: "bmc_user"
  BMC_PASSWORD: "bmc_pass"
  BMC_SSH_USERNAME: "ssh_user"
  BMC_SSH_PASSWORD: "ssh_pass"
  RF_User: "Redfish_user"
  RF_Pass: "Redfish_pass"
  ipmi_cipher: "-C17"
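Before you run the tool, it can help to confirm that the four required settings are present. The following sketch works on a dictionary such as a YAML parser would produce from config.yaml. The function name missing_settings is hypothetical. Because the exact nesting of keys under DUT is an assumption here, the sketch accepts each key either at the top level or under DUT.

```python
# Settings that the text above names as required.
REQUIRED = ("PLATFORM", "BMC_IP", "BMC_USERNAME", "BMC_PASSWORD")


def missing_settings(config):
    """Return the required keys that are absent or empty in a parsed configuration."""
    # DUT may be missing or empty, depending on how the file is laid out.
    dut = config.get("DUT") or {}
    return [key for key in REQUIRED if not (config.get(key) or dut.get(key))]


if __name__ == "__main__":
    parsed = {
        "PLATFORM": "DGX",
        "OUTPUT_DIR": "tmp/dgx/sys1",
        "DUT": {"BMC_IP": "bmc_ip", "BMC_USERNAME": "bmc_user"},
    }
    print(missing_settings(parsed))
```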
After configuring the NVDebug tool, run the nvdebug command.
Example output:
$ nvdebug
Log directory created at tmp/dgx/sys1/log_14_02_2024_12_01_54
[12:01:54] Identified system as Model: DGXH100, Partno: 965-24387-0002-000,
Serialno:1572221000391
[12:01:54] BMC IP: XXXX
Log collection has started
[12:06:16] Log collection is now complete
[12:06:16] Log collection took 4m 22.08s
Log zip created at /tmp/dgx/sys1/log_14_02_2024_12_01_54.zip
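A script that runs nvdebug can read the archive location from the final line of this output. The following sketch assumes that the line format matches the sample above, which may differ between tool versions. The function name find_log_zip is hypothetical.

```python
import re

# Matches the last line of the sample output, for example:
# "Log zip created at /tmp/dgx/sys1/log_14_02_2024_12_01_54.zip"
ZIP_LINE = re.compile(r"^Log zip created at (?P<path>\S+\.zip)\s*$", re.MULTILINE)


def find_log_zip(output):
    """Return the zip path reported in the captured output, or None if it is absent."""
    match = ZIP_LINE.search(output)
    return match.group("path") if match else None


if __name__ == "__main__":
    sample = (
        "[12:06:16] Log collection took 4m 22.08s\n"
        "Log zip created at /tmp/dgx/sys1/log_14_02_2024_12_01_54.zip\n"
    )
    print(find_log_zip(sample))
```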
To show the verbose output of all log collections and the elapsed time for each log,
run the nvdebug -v command.
Example output:
$ nvdebug -v
Log directory created at tmp/dgx/sys1/log_14_02_2024_12_01_54
[12:01:54] Identified system as Model: DGXH100, Partno: 965-24387-0002-000,
Serialno:1572221000391
[12:01:54] BMC IP: XXXX
Log collection has started
[12:02:05]
[12:02:05] #####################################
[12:02:05]
[12:02:05] Collecting redfish logs:
[12:02:05]
[12:02:05] Log collection was initiated for: system_event_log
[12:02:06] Log collection for system_event_log took 0m 1.03s
[12:02:06] Log collection was initiated for: firmware_inventory
[12:02:26] Log collection for firmware_inventory took 0m 20.08s
[12:02:26] Log collection was initiated for: firmware_inventory_expand_query
[12:03:05] Log collection for firmware_inventory_expand_query took 0m 39.06s
[12:03:05] Log collection was initiated for: chassis_info
. . .
[12:04:13] Log collection was initiated for: dgx_manager_oem_log_dump
[12:04:15] Task state is New. Rechecking in 30 seconds...
[12:04:45] Task state is Running. Rechecking in 30 seconds...
[12:05:15] Task state is Running. Rechecking in 30 seconds...
[12:05:46] Task state is Running. Rechecking in 30 seconds...
[12:06:16] Log collection for dgx_manager_oem_log_dump took 2m 3.52s
[12:06:16] Log collection is now complete
[12:06:16] Log collection took 4m 22.08s
Log zip created at /tmp/dgx/sys1/log_14_02_2024_12_01_54.zip
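The per-collector timings in the verbose output can be summarized to find slow collectors. The following sketch assumes that the "Log collection for <name> took <minutes>m <seconds>s" line format matches the sample above, which may differ between tool versions. The function name collector_durations is hypothetical.

```python
import re

# Matches per-collector lines such as:
# "[12:02:06] Log collection for system_event_log took 0m 1.03s"
# The overall "Log collection took ..." line has no collector name and is skipped.
TOOK = re.compile(
    r"Log collection for (?P<name>\S+) took (?P<min>\d+)m (?P<sec>\d+(?:\.\d+)?)s"
)


def collector_durations(output):
    """Map each collector name to its elapsed time in seconds."""
    durations = {}
    for match in TOOK.finditer(output):
        seconds = int(match.group("min")) * 60 + float(match.group("sec"))
        durations[match.group("name")] = seconds
    return durations


if __name__ == "__main__":
    sample = (
        "[12:02:06] Log collection for system_event_log took 0m 1.03s\n"
        "[12:06:16] Log collection for dgx_manager_oem_log_dump took 2m 3.52s\n"
        "[12:06:16] Log collection took 4m 22.08s\n"
    )
    timings = collector_durations(sample)
    # Report the collector with the longest elapsed time.
    print(max(timings, key=timings.get))
```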