Getting Started with nvdebug

The NVIDIA® NVDebug tool, nvdebug, runs on server platforms or from remote client machines. This binary tool, which is available for x86_64 or arm64-SBSA architecture systems, collects the following information:

  • Out-of-band (OOB) BMC logs and information for troubleshooting server issues

  • Logs from the host

Requirements

Requirements for Client Host and Server Host

Requirement                                                 Client Host   Server Host
----------------------------------------------------------  -----------   -----------
Linux-based operating system: Linux kernel 4.4 or later     X             X
(version 4.15 or later recommended)
GNU C Library glibc-2.7 or later                            X             X
OS: Ubuntu 18.04 or later supported                         X             X
(Ubuntu 22.04 recommended)
Python 3.10                                                 X             X
ipmitool 1.8.18 or later                                    X             X
The sshpass command                                         X             X
A server device under test (DUT) accessible by the BMC      X             X
from the client host using Redfish and IPMI-over-LAN.
The nvme-cli tool                                                         X
BMC Management and Server Host Management networks                        X
are in the same subnet.
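
The tool requirements above can be checked before a run with a short script. The following is a minimal sketch and is not part of nvdebug: the function name check_prerequisites is hypothetical, the executable name nvme is assumed to be the binary installed by nvme-cli, and "Python 3.10" is treated as a minimum version. It checks only that the executables are on the PATH; it does not verify the ipmitool, kernel, or glibc versions.

```python
import shutil
import sys

# Executables named in the requirements table. "nvme" is assumed to be the
# binary provided by the nvme-cli package.
REQUIRED_TOOLS = ("ipmitool", "sshpass", "nvme")


def check_prerequisites(tools=REQUIRED_TOOLS, which=shutil.which,
                        python_version=None):
    """Return a list of human-readable problems; an empty list means OK."""
    version = tuple(python_version or sys.version_info[:2])
    problems = []
    # Assumption: the table's "Python 3.10" is treated as a minimum version.
    if version < (3, 10):
        problems.append(
            f"Python 3.10 required, found {version[0]}.{version[1]}")
    # Report every executable that cannot be located on the PATH.
    for tool in tools:
        if which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print(problem)
```

The which and python_version parameters exist only so that the check can be exercised without changing the local system; on a client host that does not need nvme-cli, pass a shorter tools tuple.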

BMC

Before running the NVDebug one-click script on the BMC, ensure the following tools are installed:

  • i2cset

  • i2ctransfer

  • ipmitool

  • i2cdump

  • curl

  • Redfish API

The nvdebug Command-Line Interface

The nvdebug command collects debug logs over OOB.

You can run the tool in either of the following ways:

  • From a remote machine with access to the BMC and host.

  • Directly on the host machine if the host can access the BMC.

If the host IP is passed through the configuration file or the command-line interface (CLI) using -I/--hostip, nvdebug assumes the tool runs on a remote machine. Otherwise, nvdebug assumes the tool runs on the host and collects the host logs locally.
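
The rule above can be written as a small decision function. This is an illustrative sketch of the documented behavior, not nvdebug's actual code; the function name resolve_run_mode and the "remote"/"local" labels are hypothetical.

```python
from typing import Optional


def resolve_run_mode(cli_host_ip: Optional[str],
                     config_host_ip: Optional[str]) -> str:
    """Return "remote" if a host IP was supplied by either source, else "local"."""
    # A host IP from the CLI (-I/--hostip) or from the configuration file
    # means the tool is treated as running on a remote machine.
    host_ip = cli_host_ip or config_host_ip
    return "remote" if host_ip else "local"
```

With no host IP from either source, the function returns "local", which corresponds to collecting the host logs on the machine where the tool runs.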

Syntax

$ nvdebug -i <BMCIP> -u <BMCUSER> -p <BMCPASS> -t <PLATFORM>

Mandatory variables:
    -i/--ip is the BMC IP address.
    -u/--user is the BMC username with administrative privileges.
    -p/--password is the password of the BMC administrative user.
    -t/--platform is the platform type of the DUT, and it accepts DGX, HGX-Baseboard, or G+H.

Additional credentials:
    -r/--sshuser is BMC SSH username.
    -w/--sshpass is BMC SSH password.
    -R/--rfuser is BMC Redfish username.
    -W/--rfpass is BMC Redfish password.

Host options:
    -I/--hostip is the Host IP Address.
    -U/--hostuser is the Host username with administrative privileges.
    -H/--hostpass is the Host password.

Additional options:
    -b/--baseboard <baseboard> is the baseboard type.
                        The accepted values are hopper-hgx-8-gpu and blackwell-hgx-8-gpu.
    -c/--common collects the common logs using the included common.json file.
    -C/--config <file path> is the path to the config file. Default is ./config.yaml.
    -v/--verbose displays the detailed output and error messages.
    -o/--outdir <output dir> is the directory where the output is generated.
                        The default location is /tmp.
    -P/--port <fw_port> is the port number that will be used for forwarding.
                        The --port variable applies only to HGX-Baseboard based platforms,
                        and the default value is 18888.

Additional log collection options:
    -S/--cids CID [CID ...] runs the log collectors that correspond to the CIDs that were passed.
    -g/--loggroup <Redfish|IPMI|SSH|Host|HealthCheck> runs all log collectors of a specific type
                                   that is supported on the current platform. Only one collector
                                   group can be specified.
    -j/--vendor_file <vendor.json> is a vendor-defined JSON file that uses proprietary methods
                                   and tools as defined by the user.

The -S and -g options cannot be used together.

Utility options: -h/--help and --version are standalone options. -l/--list requires the platform
                 type to be specified using -t/--platform.
    --parse <log dump> parses an nvdebug log dump and decodes the binary data.
    -h/--help provides information about tool usage.
    --version displays the current version of the tool.
    -l/--list [Redfish|IPMI|SSH|Host|HealthCheck] lists the log collectors that the platform supports,
                                                  with their collector IDs (CIDs).
                                                  If a type is passed, only log collectors of that
                                                  type are listed.

By default, if option -c is not included, the nvdebug tool collects logs based on the common.json
and platform_xyz.json files. At the end of the run, the tool generates the output log file xyz.zip
in the directory that was provided using the -o option. If no directory is provided, the log
is generated in the /tmp directory.
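
When nvdebug is driven from a script, the options above can be assembled into an argument list. The following is a minimal sketch that uses only flags listed in the syntax section; the function name build_nvdebug_command and its parameter names are hypothetical, and the only rule it enforces is the documented one that -S and -g cannot be combined.

```python
from typing import List, Optional, Sequence


def build_nvdebug_command(bmc_ip: str, bmc_user: str, bmc_pass: str,
                          platform: str, *,
                          host_ip: Optional[str] = None,
                          host_user: Optional[str] = None,
                          host_pass: Optional[str] = None,
                          out_dir: Optional[str] = None,
                          log_group: Optional[str] = None,
                          cids: Optional[Sequence[str]] = None,
                          verbose: bool = False) -> List[str]:
    """Build an nvdebug argument list from the documented options."""
    # The documentation states that -S and -g cannot be used together.
    if cids and log_group:
        raise ValueError("-S/--cids and -g/--loggroup cannot be used together")

    # Mandatory options.
    cmd = ["nvdebug", "-i", bmc_ip, "-u", bmc_user,
           "-p", bmc_pass, "-t", platform]

    # Optional single-value options are added only when a value is given.
    optional = (("-I", host_ip), ("-U", host_user), ("-H", host_pass),
                ("-o", out_dir), ("-g", log_group))
    for flag, value in optional:
        if value:
            cmd += [flag, value]

    # -S accepts one or more collector IDs.
    if cids:
        cmd += ["-S", *cids]
    if verbose:
        cmd.append("-v")
    return cmd
```

The returned list can be passed to subprocess.run(cmd, check=True). Building a list instead of a single string avoids shell quoting problems with passwords that contain special characters.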

The Configuration File

You can use the config.yaml file, which by default is in the same folder as the executable file, to provide optional configuration settings. When both the CLI and the configuration file specify an argument, the value provided by the CLI takes priority.
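
The precedence rule can be illustrated with a small merge function. This is a sketch of the documented behavior under simplifying assumptions, not nvdebug's implementation: the name merge_settings is hypothetical, both sources are represented as flat dictionaries, and a CLI value of None stands for an option that was not passed.

```python
def merge_settings(config_values: dict, cli_values: dict) -> dict:
    """Combine settings so that CLI values override configuration file values."""
    merged = dict(config_values)
    for key, value in cli_values.items():
        # None means the option was not given on the command line,
        # so the configuration file value (if any) is kept.
        if value is not None:
            merged[key] = value
    return merged
```

For example, if the configuration file sets OUTPUT_DIR and the CLI also passes an output directory, the merged result holds the CLI value, and every setting that the CLI leaves out keeps its configuration file value.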

DGX H100/H200 Example

To run the tool using the command-line parameters, but not the configuration file:

nvdebug -i <bmc ip> -u <bmc user> -p <bmc pass> -t DGX -I <host IP> -U <host user> -H <host pass> -r <bmc ssh username> -w <bmc ssh password>

The -I, -U, -H, -r, and -w options are optional.

To run the tool using the configuration file, provide all the required parameter settings, including BMC_IP, BMC_USERNAME, BMC_PASSWORD, and PLATFORM. You can set the output directory by providing a value for the OUTPUT_DIR parameter (or by using the -o/--outdir CLI option). The following example shows the BMC credentials in the configuration setup:

PLATFORM: "DGX"
OUTPUT_DIR: "tmp/dgx/sys1"
DUT:
  BMC_IP: "bmc_ip"
  BMC_USERNAME: "bmc_user"
  BMC_PASSWORD: "bmc_pass"
  BMC_SSH_USERNAME: "ssh_user"
  BMC_SSH_PASSWORD: "ssh_pass"
  RF_User: "Redfish_user"
  RF_Pass: "Redfish_pass"
  ipmi_cipher: "-C17"
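
Before a run, a configuration like this one can be checked for the required settings. The following is a minimal sketch that assumes the file has already been loaded into a Python dictionary (for example, with a YAML parser such as PyYAML, which is a third-party package) and that the nesting matches the example, with PLATFORM at the top level and the BMC settings under DUT. The function name missing_required_settings is hypothetical.

```python
# Required settings named in the text, placed as in the example file.
REQUIRED_TOP_LEVEL = ("PLATFORM",)
REQUIRED_DUT = ("BMC_IP", "BMC_USERNAME", "BMC_PASSWORD")


def missing_required_settings(config: dict) -> list:
    """Return the names of required settings that are absent or empty."""
    missing = [key for key in REQUIRED_TOP_LEVEL if not config.get(key)]
    # A missing or empty DUT section means every DUT setting is missing.
    dut = config.get("DUT") or {}
    missing += [f"DUT.{key}" for key in REQUIRED_DUT if not dut.get(key)]
    return missing
```

An empty result means that all four required settings are present; any setting that is missing from the file must then be passed on the command line instead.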

After configuring the NVDebug tool, run the nvdebug command.

Example output:

$ nvdebug

Log directory created at tmp/dgx/sys1/log_14_02_2024_12_01_54
[12:01:54] Identified system as Model: DGXH100, Partno: 965-24387-0002-000,
Serialno:1572221000391
[12:01:54] BMC IP: XXXX
Log collection has started
[12:06:16] Log collection is now complete
[12:06:16] Log collection took 4m 22.08s
Log zip created at /tmp/dgx/sys1/log_14_02_2024_12_01_54.zip

To show the verbose output of all log collections and the elapsed time for each log, run the nvdebug -v command.

Example output:

$ nvdebug -v

Log directory created at tmp/dgx/sys1/log_14_02_2024_12_01_54
[12:01:54] Identified system as Model: DGXH100, Partno: 965-24387-0002-000,
Serialno:1572221000391
[12:01:54] BMC IP: XXXX
Log collection has started
[12:02:05]
[12:02:05] #####################################
[12:02:05]
[12:02:05] Collecting redfish logs:
[12:02:05]
[12:02:05] Log collection was initiated for: system_event_log
[12:02:06] Log collection for system_event_log took 0m 1.03s
[12:02:06] Log collection was initiated for: firmware_inventory
[12:02:26] Log collection for firmware_inventory took 0m 20.08s
[12:02:26] Log collection was initiated for: firmware_inventory_expand_query
[12:03:05] Log collection for firmware_inventory_expand_query took 0m 39.06s
[12:03:05] Log collection was initiated for: chassis_info
 . . .

[12:04:13] Log collection was initiated for: dgx_manager_oem_log_dump
[12:04:15] Task state is New. Rechecking in 30 seconds...
[12:04:45] Task state is Running. Rechecking in 30 seconds...
[12:05:15] Task state is Running. Rechecking in 30 seconds...
[12:05:46] Task state is Running. Rechecking in 30 seconds...
[12:06:16] Log collection for dgx_manager_oem_log_dump took 2m 3.52s
[12:06:16] Log collection is now complete
[12:06:16] Log collection took 4m 22.08s
Log zip created at /tmp/dgx/sys1/log_14_02_2024_12_01_54.zip
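
The per-collector timings in the verbose output can be extracted with a short script, for example to find the slowest collectors. The following sketch assumes the line format shown in the example output above ("Log collection for <name> took <M>m <S>s"); the function name collector_durations is hypothetical, and the format may differ in other nvdebug versions.

```python
import re

# Matches per-collector lines such as
# "[12:02:26] Log collection for firmware_inventory took 0m 20.08s".
# The final "Log collection took ..." summary line has no "for <name>"
# part and is therefore not matched.
_TIMING = re.compile(
    r"Log collection for (\S+) took (\d+)m (\d+(?:\.\d+)?)s")


def collector_durations(output: str) -> dict:
    """Map each collector name to its elapsed time in seconds."""
    durations = {}
    for match in _TIMING.finditer(output):
        name, minutes, seconds = match.groups()
        durations[name] = int(minutes) * 60 + float(seconds)
    return durations
```

Given the captured output as a string, max(durations, key=durations.get) returns the name of the collector that took the longest.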