Getting Started with nvdebug#

The NVIDIA® NVDebug tool, nvdebug, runs on server platforms or from remote client machines. This binary tool, which is available for x86_64 and arm64-SBSA architecture systems, collects the following information:

  • Out-of-band (OOB) BMC logs and information for troubleshooting server issues

  • Logs from the host

Requirements#

Requirements for Client Host and Server Host#

Requirement                                             Client Host   Server Host

Linux-based operating system: Linux kernel 4.4 or       X             X
later supported (version 4.15 or later recommended)

GNU C Library glibc-2.7 or later                        X             X

OS: Ubuntu 18.04 or later supported                     X             X
(Ubuntu 22.04 recommended)

Python 3.10                                             X             X

ipmitool 1.8.18 or later                                X             X

The sshpass command                                     X             X

A server device under test (DUT) accessible by the BMC  X             X
from the client host using Redfish and IPMI-over-LAN.

The nvme-cli tool

X

BMC Management and Server Host Management networks
are in the same subnet.

X

NVSwitch tray host requires NVOS version 2.

The nvdebug Command-Line Interface#

The high-level syntax of the nvdebug command supports the collection of debug logs over OOB.

You can run the tool in either of the following ways:

  • From a remote machine with access to the BMC and host.

  • Directly on the host machine if the host can access the BMC.

If the host IP is passed through the configuration file or the command-line interface (CLI) using -I/--hostip, nvdebug assumes the tool runs on a remote machine. Otherwise, nvdebug assumes the tool runs on the host and collects the host logs locally.

Syntax#

$ nvdebug -i <BMCIP> -u <BMCUSER> -p <BMCPASS> -t <PLATFORM>

Mandatory options:
    -i/--ip is the BMC IP address.
    -u/--user is the BMC username with administrative privileges.
    -p/--password is the BMC administrative user's password.
    -t/--platform is the platform type of the DUT, and it accepts DGX, HGX-HMC, arm64, x86_64, and NVSwitch.

Additional credentials:
    -r/--sshuser is the BMC SSH username.
    -w/--sshpass is the BMC SSH password.
    -R/--rfuser is the BMC Redfish username.
    -W/--rfpass is the BMC Redfish password.

Host options:
    -I/--hostip is the Host IP Address.
     If the IP address is not provided, the tool assumes it is running on the host machine.
    -U/--hostuser is the Host username with administrative privileges.
    -H/--hostpass is the Host password.

Additional options:
    -b/--baseboard <baseboard> is the baseboard type, such as Hopper-HGX-8-GPU and Blackwell-HGX-8-GPU.
    -C/--config <file path> is the path to the config file. The default is ./config.yaml.
    -d/--dutconfig <dut config path> is the path to the DUT specific config file.
     The default path is ./dut_config.yaml.
    -c/--common collects the common logs using the included common.json file.
    -v/--verbose displays the detailed output and error messages.
    -o/--outdir <output dir> is the directory where the output is generated.
     The default location is /tmp.
    -P/--port <fw_port> is the port number that will be used for forwarding.
     The --port variable applies only to HGX-Baseboard based platforms,
     and the default value is 18888.
    --local enables Local Execution mode.
    -z/--skipzip skips zipping individual DUT folders.


Log collection options:
    -S/--cids CID [CID ...] runs the log collectors that correspond to the CIDs that were passed.
    -g/--loggroup <Redfish|IPMI|SSH|Host|HealthCheck> runs all log collectors of a specific type
     that is supported on the current platform.
     Only one collector group can be specified.
    -j/--vendor_file <vendor.json> is a vendor-defined JSON file that uses proprietary methods
     and tools as defined by the user.
    The -S and -g options cannot be used together.

Utility options:
    -h/--help and --version are standalone options, and -l/--list requires the platform
     type to be specified using -t/--platform.
    --parse <log dump> parses an nvdebug log dump and decodes the binary data.
    -h/--help provides information about tool usage.
    --version displays the current version of the tool.
    -l/--list [Redfish|IPMI|SSH|Host|HealthCheck] lists the log collectors that are supported by the platform
     with their collector IDs (CID). If a type is passed, only log collectors of that type
     are listed. The -l/--list option requires the target platform type to be specified with -t/--platform.

    By default, if option -c is not included, the nvdebug tool will collect logs based on the common.json
    and platform_xyz.json files. At the end of the run, the tool will generate the output log xyz.zip
    file in the directory specified by the -o option. If no directory is provided, the log
    will be generated in the /tmp directory.
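
The options above can be assembled programmatically when nvdebug is driven from a script. The following is a minimal sketch, not part of the tool: the helper name build_nvdebug_args and the example addresses and credentials are hypothetical, and only flags documented above are used. It adds the host options only when a host IP is given, which mirrors the remote-versus-local rule described earlier; attaching -U and -H only in the remote case is a choice made for this sketch.

```python
from typing import List, Optional


def build_nvdebug_args(
    bmc_ip: str,
    bmc_user: str,
    bmc_pass: str,
    platform: str,
    host_ip: Optional[str] = None,
    host_user: Optional[str] = None,
    host_pass: Optional[str] = None,
    outdir: Optional[str] = None,
) -> List[str]:
    """Assemble an nvdebug argument list from the documented options (hypothetical helper)."""
    # The four mandatory options: BMC IP, BMC user, BMC password, platform type.
    args = ["nvdebug", "-i", bmc_ip, "-u", bmc_user, "-p", bmc_pass, "-t", platform]
    if host_ip:
        # Passing -I signals that the tool runs on a remote machine.
        args += ["-I", host_ip]
        if host_user:
            args += ["-U", host_user]
        if host_pass:
            args += ["-H", host_pass]
    if outdir:
        # Without -o, the documented default output location is /tmp.
        args += ["-o", outdir]
    return args


# Placeholder addresses and credentials, for illustration only.
remote = build_nvdebug_args("192.0.2.10", "admin", "secret", "DGX",
                            host_ip="192.0.2.20", host_user="root", host_pass="hostpw")
local = build_nvdebug_args("192.0.2.10", "admin", "secret", "DGX")
print(" ".join(remote))
print(" ".join(local))
```

The resulting list can be passed to subprocess.run on a machine where nvdebug is installed.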

The Configuration Files#

The NVDebug tool has two configuration files in the same folder as the executable:

  • The DUT configuration file: The default is dut_config.yaml.

  • The NVDebug-specific configuration file: The default is config.yaml.

These files can be used to provide additional (but optional) configuration data. If an argument is provided by both the CLI and the configuration file, the value provided through the CLI takes precedence.
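
The precedence rule can be illustrated with a small sketch. This is not how nvdebug is implemented internally; the function name resolve_settings and the sample values are hypothetical, and the keys BMC_IP and OUTPUT_DIR are borrowed from the configuration parameters named in this document. A value of None stands for an option that was not passed on the command line.

```python
from typing import Any, Dict, Optional


def resolve_settings(
    file_values: Dict[str, Any],
    cli_values: Dict[str, Optional[Any]],
) -> Dict[str, Any]:
    """Merge settings so that any value given on the CLI overrides the file value."""
    merged = dict(file_values)
    for key, value in cli_values.items():
        if value is not None:  # None means the option was not given on the CLI
            merged[key] = value
    return merged


file_values = {"BMC_IP": "192.0.2.10", "OUTPUT_DIR": "/tmp"}
cli_values = {"BMC_IP": None, "OUTPUT_DIR": "/tmp/dgx/sys1"}
settings = resolve_settings(file_values, cli_values)
print(settings)
```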

HGX H100/H200 8-GPU Example#

To communicate with the HGX baseboard, you need the BMC SSH credentials to set up SSH tunneling through the BMC. By default, the SSH credentials are assumed to be the same as the BMC credentials. To use different credentials, specify the -r and -w CLI options for the SSH username and password, respectively.

$ nvdebug -i $BMCIP -u $BMCUSER -p $BMCPASS -r $SSHUSER -w $SSHPASS -t HGX-HMC -P <port_num>

Example output:

Log directory created at /tmp/nvdebug_logs_30_09_2024_12_27_46
Starting a collection for DUT dut-1
hgx-h100-node2: [12:28:13] Identified system as Model: P2312-A04, Partno: 692-22312-0001-000, Serialno:1324623011823
hgx-h100-node2: [12:28:13] User provided platform type: HGX-HMC
hgx-h100-node2: [12:28:13] BMC IP: XXXX

Log collection has started for dut-1
hgx-h100-node2: [12:45:43] Log collection is now complete
hgx-h100-node2: [12:45:43] Log collection took 17m 30.29s
DUT hgx-h100-node2 completed.

The log zip file (nvdebug_logs_30_09_2024_12_27_46.zip) will be created in the /tmp directory.

The tool sets up the SSH tunnel automatically on the specified port (the default is 18888). To use an existing SSH tunnel instead, disable automatic port forwarding by setting SETUP_PORT_FORWARDING to false, as shown in the following dut_config file:

hgx-h100-node2:
  <<: *dut_defaults
  BMC_IP: "bmc_ip"
  BMC_USERNAME: "bmc_user"
  BMC_PASSWORD: "bmc_pass"
  BMC_SSH_USERNAME: "ssh_user"
  BMC_SSH_PASSWORD: "ssh_pass"
  TUNNEL_TCP_PORT: "port_num"

  SETUP_PORT_FORWARDING: false

Note

The host BMC must support port forwarding.

After configuring the NVDebug tool, run the nvdebug command:

$ nvdebug

Example output:

Log directory created at /tmp/nvdebug_logs_30_09_2024_12_27_46
Starting a collection for DUT hgx-h100-node2
hgx-h100-node2: [12:28:13] Identified system as Model: P2312-A04, Partno: 692-22312-0001-000, Serialno:1324623011823
hgx-h100-node2: [12:28:13] User provided platform type: HGX-HMC
hgx-h100-node2: [12:28:13] BMC IP: XXXX

Log collection has started for hgx-h100-node2
hgx-h100-node2: [12:45:43] Log collection is now complete
hgx-h100-node2: [12:45:43] Log collection took 17m 30.29s
DUT hgx-h100-node2 completed.

The log zip file (nvdebug_logs_30_09_2024_12_27_46.zip) will be created in the /tmp directory.
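
When several runs accumulate in one output directory, the timestamp in the archive name can be used to sort them. The sketch below is an illustration only: the helper parse_log_timestamp is hypothetical, and the DD_MM_YYYY_HH_MM_SS layout is an assumption inferred from the sample names in this document (for example, 30_09_2024_12_27_46), not a documented format.

```python
import re
from datetime import datetime
from typing import Optional

# Assumed naming scheme: nvdebug_logs_DD_MM_YYYY_HH_MM_SS, optionally followed by ".zip".
_NAME_RE = re.compile(r"nvdebug_logs_(\d{2}_\d{2}_\d{4}_\d{2}_\d{2}_\d{2})(?:\.zip)?$")


def parse_log_timestamp(path: str) -> Optional[datetime]:
    """Return the timestamp embedded in an nvdebug log path, or None if it does not match."""
    match = _NAME_RE.search(path)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%d_%m_%Y_%H_%M_%S")


stamp = parse_log_timestamp("/tmp/nvdebug_logs_30_09_2024_12_27_46.zip")
print(stamp)
```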

DGX H100/H200 Example#

To run the tool using the command-line parameters, but not the configuration file:

nvdebug -i <bmc ip> -u <bmc user> -p <bmc pass> -t DGX -I <host IP> -U <host user> -H <host pass> -r <bmc ssh username> -w <bmc ssh password>

The -I, -U, -H, -r, and -w options are optional.

To run the tool using the configuration file, provide all the required parameter settings including BMC_IP, BMC_USERNAME, BMC_PASSWORD, and PLATFORM. You can set the output directory by providing a value for the OUTPUT_DIR parameter (or by using the -o/--outdir CLI option). The following example shows the BMC credentials in the configuration setup:

DUT_Defaults: &dut_defaults
  NodeType: "Compute"
  ipmi_cipher: "-C17"

# Create a dut object and inherit the default values.
# For any specific configuration details, add them below.
dgx-h100-node2:
  <<: *dut_defaults
  BMC_IP: "bmc_ip"
  BMC_USERNAME: "bmc_user"
  BMC_PASSWORD: "bmc_pass"
  BMC_SSH_USERNAME: "ssh_user"
  BMC_SSH_PASSWORD: "ssh_pass"
  RF_User: "redfish_user"
  RF_Pass: "redfish_pass"

After configuring the NVDebug tool, run the nvdebug command.

$ nvdebug -o /tmp/dgx/sys1

To show the verbose output of all log collections and the elapsed time for each log, specify the -v option.

$ nvdebug -o /tmp/dgx/sys1 -v

Example output:

Read DUT config file details from ..../dut_config.yaml
Read config file details from ..../config.yaml
Log directory created at /tmp/dgx/sys1/nvdebug_logs_30_09_2024_12_53_27
Starting a collection for DUT dgx-h100-node2
dgx-h100-node2: [12:53:33] All preflight checks passed
dgx-h100-node2: [12:53:35] Identified system as Model: DGXH100, Partno: 965-24387-0002-003, Serialno:1660224000069
dgx-h100-node2: [12:53:35] User provided platform type: DGX
dgx-h100-node2: [12:53:35] BMC IP: XXXX
Log collection has started for dgx-h100-node2
dgx-h100-node2: [12:53:35]
dgx-h100-node2: [12:53:35] #####################################
dgx-h100-node2: [12:53:35]
dgx-h100-node2: [12:53:35] Collecting redfish logs:
dgx-h100-node2: [12:53:35]
dgx-h100-node2: [12:53:35] Log collection was initiated for: r8_firmware_inventory
dgx-h100-node2: [12:53:45] Log collection for r8_firmware_inventory took 0m 10.32s
dgx-h100-node2: [12:53:45] Log collection was initiated for: r9_firmware_inventory_expand_query
dgx-h100-node2: [12:56:49] Log collection for r9_firmware_inventory_expand_query took 3m 4.07s
dgx-h100-node2: [12:56:49] Log collection was initiated for: r10_chassis_info
dgx-h100-node2: [12:56:51] Log collection for r10_chassis_info took 0m 1.32s
dgx-h100-node2: [12:56:51] Log collection was initiated for: r11_chassis_expand_query
dgx-h100-node2: [12:57:06] Log collection for r11_chassis_expand_query took 0m 15.13s

...

dgx-h100-node2: [12:59:32] #####################################
dgx-h100-node2: [12:59:32]
dgx-h100-node2: [12:59:32] Collecting IPMI logs:
dgx-h100-node2: [12:59:32]
dgx-h100-node2: [12:59:32] Log collection was initiated for: i1_mc_info
dgx-h100-node2: [12:59:32] Log collection for i1_mc_info took 0m 0.2s
dgx-h100-node2: [12:59:32] Log collection was initiated for: i2_lan_info
dgx-h100-node2: [12:59:33] Log collection for i2_lan_info took 0m 0.85s

...

dgx-h100-node2: [13:02:59] #####################################
dgx-h100-node2: [13:02:59]
dgx-h100-node2: [13:02:59] Collecting host logs:
dgx-h100-node2: [13:02:59]
dgx-h100-node2: [13:02:59] Log collection was initiated for: h7_node_crash_dump
dgx-h100-node2: [13:03:00] Log collection for h7_node_crash_dump took 0m 1.02s
dgx-h100-node2: [13:03:00] Log collection was initiated for: h2_node_lspci
dgx-h100-node2: [13:03:04] Log collection for h2_node_lspci took 0m 3.57s

...

dgx-h100-node2: [13:15:42] Task state is Running. Rechecking in 30 seconds...
dgx-h100-node2: [13:16:18] Log collection for r17_dgx_manager_oem_log_dump took 12m 34.53s
dgx-h100-node2: [13:16:18] Log collection is now complete
dgx-h100-node2: [13:16:18] Log collection took 22m 44.08s
DUT dgx-h100-node2 completed.
Log zip created at /tmp/dgx/sys1/nvdebug_logs_30_09_2024_12_53_27.zip
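
The per-collector timings in the verbose output can be extracted to find the slowest collectors. The following sketch assumes the line format shown in the example output above ("Log collection for <name> took <M>m <S>s"); the function name collector_durations is hypothetical, and the pattern may need adjusting if a different tool version prints the lines differently.

```python
import re
from typing import Dict

# Matches per-collector lines only; the final "Log collection took ..." summary
# line has no " for <name>" part and is therefore skipped.
_TOOK_RE = re.compile(r"Log collection for (\S+) took (\d+)m (\d+(?:\.\d+)?)s")


def collector_durations(output: str) -> Dict[str, float]:
    """Map each collector name to its elapsed time in seconds."""
    durations: Dict[str, float] = {}
    for line in output.splitlines():
        match = _TOOK_RE.search(line)
        if match:
            name, minutes, seconds = match.groups()
            durations[name] = int(minutes) * 60 + float(seconds)
    return durations


# Lines copied from the example output above.
sample = """\
dgx-h100-node2: [12:53:45] Log collection for r8_firmware_inventory took 0m 10.32s
dgx-h100-node2: [12:56:49] Log collection for r9_firmware_inventory_expand_query took 3m 4.07s
dgx-h100-node2: [13:16:18] Log collection took 22m 44.08s
"""
durations = collector_durations(sample)
slowest = max(durations, key=durations.get)
print(slowest, durations[slowest])
```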

DGX Platform Example#

To list the log collectors that are available on a DGX platform, specify the -l option together with -t DGX:

$ nvdebug -l -t DGX

Example output:

Redfish
  CID    Collector Name                          Log Location
   R8    firmware_inventory                      Redfish_R8_firmware_inventory.json
   R9    firmware_inventory_expand_query         Redfish_R9_firmware_inventory_expand_query.json
  R10    chassis_info                            Redfish_R10_chassis_info.json
  R11    chassis_expand_query                    Redfish_R11_chassis_expand_query.json
  R12    system_info                             Redfish_R12_system_info.json
  R13    system_expand_query                     Redfish_R13_system_expand_query.json
  R14    manager_info                            Redfish_R14_manager_info.json
  R15    manager_expand_query                    Redfish_R15_manager_expand_query.json
  R17    dgx_manager_oem_log_dump                Redfish_R17_dgx_oem_dump_{manager_id}_{task_id}.tar.xz
  R18    telemetry_metric_reports                Redfish_R18_report_{metric_report}.json
  R19    chassis_thermal_metrics                 Redfish_R19_chassis_{chassis}_thermal_metrics.json
  R20    firmware_inventory_table                Redfish_R20_firmware_inventory_table.txt
  R22    task_details                            Redfish_R22_task_{task_id}.json
  R23    nvlink_oob_logs                         Redfish_R23_NVLINK_OOB_Log_{id}.json
  R25    additional_oob_logs                     Redfish_R25_OOB_Log_{id}.json
  R26    chassis_certificates                    Redfish_R26_chassis_{chassis_id}_certificate.json
  R29    background_copy_status                  Redfish_R29_{chassis_id}_copy_status.json
  R30    software_inventory                      Redfish_R30_software_inventory
  R32    system_post_codes                       Redfish_R32_system_post_codes

IPMI
  CID    Collector Name                          Log Location
   I1    mc_info                                 IPMI_I1_mc_info.txt
   I2    lan_info                                IPMI_I2_lan_info.txt
   I3    session_info                            IPMI_I3_session_info.txt
   I4    fru_info                                IPMI_I4_fru_info.txt
   I5    sdr_info                                IPMI_I5_sdr_info.txt
   I6    sel_info                                IPMI_I6_sel_info.txt
   I7    sensor_list                             IPMI_I7_sensor_list.txt
   I8    sel_list                                IPMI_I8_sel_list.txt
   I9    sel_raw_dump                            IPMI_I9_sel_raw_dump.txt
  I10    chassis_status                          IPMI_I10_chassis_status.txt
  I11    chassis_restart_cause                   IPMI_I11_chassis_restart_cause.txt
  I12    user_list                               IPMI_I12_user_list.txt
  I13    channel_info                            IPMI_I13_channel_info.txt
  I14    sdr_elist                               IPMI_I14_sdr_elist.txt

SSH
  CID    Collector Name                          Log Location
   S2    bmc_dmesg                               BMC_SSH_S2_bmc_dmesg.txt
   S3    network_info                            BMC_SSH_S3_network_info/...
   S5    bmc_list_kernel_modules                 BMC_SSH_S5_bmc_list_kernel_modules.txt
   S8    bmc_mem_cpu_utilization                 BMC_SSH_S8_bmc_mem_cpu_utilization/...
  S11    uptime                                  BMC_SSH_S11_uptime.txt
  S12    fpga_register_table                     BMC_SSH_S12_fpga_register_table.txt
  S13    hmc_boot_status                         BMC_SSH_S13_hmc_boot_status.txt
  S15    bmc_power_status                        BMC_SSH_S15_bmc_power_status/...

Host
  CID    Collector Name                          Log Location
   H1    node_dmesg                              Host_H1_node_dmesg.tar.gz
   H2    node_lspci                              Host_H2_node_lspci*.txt
   H3    node_smbios                             Host_H3_dmidecode*.txt
   H4    node_lshw                               Host_H4_lshw*.txt
   H5    node_nvidia_smi                         Host_H5_nvidia-smi*.txt
   H6    node_kern_log                           Host_H6_node_kern_log.tar.gz
   H7    node_crash_dump                         Host_H7_node_crash_dump.tar.gz
   H8    node_nvme_list                          Host_H8_nvme_list_-v.txt
   H9    node_fabric_manager_log                 Host_H9_fabricmanager.log
  H10    node_nvflash_log                        Host_H10_nvflash_--check_-i_{num}.txt
  H11    nvidia_bug_report                       Host_H11_nvidia_bug_report_op.log.gz
  H15    node_subnet_manager                     Host_H15_node_subnet_manager/
  H16    one_diag_dump                           Host_H16_one_diag_dump/
  H17    node_nvme_log_dump                      Host_H17_nvos_tech_support_dump/

HealthCheck
  CID    Collector Name                          Log Location
   C1    out_of_band_health_check                HealthCheck_C1_out_of_band_health_check.json
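
The listing can be turned into a lookup table, for example to check that a CID exists before passing it to -S. This is a sketch under the assumption that the output keeps the layout shown above: a one-word group heading, a header row starting with CID, and three whitespace-separated fields per collector. The names Collector and parse_collector_list are hypothetical.

```python
from typing import Dict, List, NamedTuple, Optional


class Collector(NamedTuple):
    cid: str
    name: str
    location: str


def parse_collector_list(text: str) -> Dict[str, List[Collector]]:
    """Group the rows of an `nvdebug -l` listing by collector type."""
    groups: Dict[str, List[Collector]] = {}
    current: Optional[str] = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # blank separator line
        if len(fields) == 1:
            # A single word on its own line is a group heading such as "Redfish".
            current = fields[0]
            groups[current] = []
        elif fields[0] == "CID":
            continue  # column header row
        elif current is not None and len(fields) == 3:
            groups[current].append(Collector(*fields))
    return groups


# Rows copied from the example output above.
sample = """\
Redfish
  CID    Collector Name                          Log Location
   R8    firmware_inventory                      Redfish_R8_firmware_inventory.json
  R12    system_info                             Redfish_R12_system_info.json

IPMI
  CID    Collector Name                          Log Location
   I1    mc_info                                 IPMI_I1_mc_info.txt
"""
groups = parse_collector_list(sample)
for group, collectors in groups.items():
    print(group, [c.cid for c in collectors])
```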

Redfish Collectors#

To run only specific collectors, pass their CIDs to the -S option. The following example collects the firmware inventory (R8), the IPMI management controller information (I1), and the system information (R12):

nvdebug -i <bmc_ip> -u <bmc_user> -p <bmc_pass> ... -t DGX -v -S R8 I1 R12

Example output:

Log directory created at /tmp/nvdebug_logs_06_11_2024_15_40_27
Starting a collection for DUT dut-1
dut-1: [15:40:34] All preflight checks passed
dut-1: [15:40:34] Identified system as Model: DGXH100, Partno: 965-24387-0002-003, Serialno:1660224000069
dut-1: [15:40:34] User provided platform type: DGX
dut-1: [15:40:34] BMC IP: XXXX
Log collection has started for dut-1
dut-1: [15:40:34]
dut-1: [15:40:34] #####################################
dut-1: [15:40:34]
dut-1: [15:40:34] Collecting custom logs:
dut-1: [15:40:34]
dut-1: [15:40:34] Log collection was initiated for: r8_firmware_inventory
dut-1: [15:40:36] Log collection for r8_firmware_inventory took 0m 1.71s
dut-1: [15:40:36] Log collection was initiated for: r12_system_info
dut-1: [15:40:36] Log collection for r12_system_info took 0m 0.06s
dut-1: [15:40:36] Log collection was initiated for: i1_mc_info
dut-1: [15:40:36] Log collection for i1_mc_info took 0m 0.14s
dut-1: [15:40:36] Log collection is now complete
dut-1: [15:40:36] Log collection took 0m 2.16s
DUT dut-1 completed.
Log zip created at /tmp/nvdebug_logs_06_11_2024_15_40_27.zip

To run the Redfish log collectors, specify the -g option for the Redfish log group:

$ nvdebug -i $BMC_IP -u $BMC_USER -p $BMC_PASS -t DGX -g Redfish
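
Because -S and -g cannot be combined and only one collector group can be specified, a wrapper script may want to validate the selection before invoking the tool. The sketch below is one possible check, not the tool's own validation: the helper selection_args is hypothetical, and the group names are taken from the option description in the syntax section.

```python
from typing import List, Optional, Sequence

# Group names accepted by -g, as listed in the syntax section.
LOG_GROUPS = ("Redfish", "IPMI", "SSH", "Host", "HealthCheck")


def selection_args(
    cids: Optional[Sequence[str]] = None,
    group: Optional[str] = None,
) -> List[str]:
    """Return the -S or -g arguments for a run, rejecting combinations the tool disallows."""
    if cids and group:
        raise ValueError("-S and -g cannot be used together")
    if group is not None:
        if group not in LOG_GROUPS:
            raise ValueError(f"unknown log group: {group}")
        return ["-g", group]
    if cids:
        return ["-S", *cids]
    return []  # no selection: the tool falls back to its default collectors


print(selection_args(cids=["R8", "I1", "R12"]))
print(selection_args(group="Redfish"))
```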

IPv6 Configuration#

By default, the nvdebug tool uses IPv4. For IPv6, set IP_NETWORK to ipv6 in the DUT configuration. When providing IPv6 addresses for the BMC/Host, do not use square brackets.
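
Addresses copied from URLs or browser fields often carry square brackets, so a script that prepares the DUT configuration can strip them first. The following sketch uses the Python standard library ipaddress module; the helper normalize_address is hypothetical, and returning "ipv4" as the counterpart of the documented ipv6 value is an assumption.

```python
import ipaddress
from typing import Tuple


def normalize_address(raw: str) -> Tuple[str, str]:
    """Return (address, ip_network) with any surrounding square brackets removed."""
    stripped = raw.strip()
    if stripped.startswith("[") and stripped.endswith("]"):
        stripped = stripped[1:-1]
    parsed = ipaddress.ip_address(stripped)  # raises ValueError for invalid input
    # "ipv6" is the documented IP_NETWORK value; "ipv4" is assumed for the default case.
    network = "ipv6" if parsed.version == 6 else "ipv4"
    return str(parsed), network


print(normalize_address("[2001:db8::10]"))
print(normalize_address("192.0.2.10"))
```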