Getting Started with nvdebug#
The NVIDIA® NVDebug tool, nvdebug, runs on server platforms or from remote client machines.
This binary tool, which is available for x86_64 or arm64-SBSA architecture systems, collects the
following information:
Out-of-band (OOB) BMC logs and information for troubleshooting server issues
Logs from the host
Requirements#
Requirement | Client Host | Server Host |
---|---|---|
Linux-based operating system: Linux kernel 4.4 or later supported (version 4.15 or later recommended) | X | X |
GNU C Library glibc-2.7 or later | X | X |
OS: Ubuntu 18.04 or later supported (Ubuntu 22.04 recommended) | X | X |
Python 3.10 | X | X |
 | X | X |
The | X | X |
A server device under test (DUT) accessible by the BMC from the client host using Redfish and IPMI-over-LAN. | X | X |
The | X | |
BMC Management and Server Host Management networks are in the same subnet. | X | |
NVSwitch tray host requires NVOS version 2.
The nvdebug Command-Line Interface#
The nvdebug command collects debug logs over OOB.
You can run the tool in either of the following ways:
From a remote machine with access to the BMC and host.
Directly on the host machine if the host can access the BMC.
If the host IP is passed through the configuration file or the command-line interface (CLI)
using -I/--hostip, nvdebug assumes the tool runs on a remote machine. Otherwise,
nvdebug assumes the tool runs on the host and collects the host logs locally.
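The rule above can be sketched in a few lines of Python. This is an illustration of the documented behavior only: execution_mode is a hypothetical helper, not part of nvdebug, and the address is a placeholder.

```python
from typing import Optional


def execution_mode(host_ip: Optional[str]) -> str:
    """Return "remote" when a host IP is supplied, otherwise "local".

    Illustrative helper that mirrors the documented rule; it is not
    part of nvdebug itself.
    """
    # An empty string is treated the same as a missing value.
    return "remote" if host_ip else "local"


print(execution_mode("192.0.2.10"))  # host IP given: remote machine assumed
print(execution_mode(None))          # no host IP: running on the host assumed
```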
Syntax#
$ nvdebug -i <BMCIP> -u <BMCUSER> -p <BMCPASS> -t <PLATFORM>
Mandatory options:
-i/--ip is the BMC IP address.
-u/--user is the BMC username with administrative privileges.
-p/--password is the BMC administrative user password.
-t/--platform is the platform type of the DUT, and it accepts DGX, HGX-HMC, arm64, x86_64, and NVSwitch.
Additional credentials:
-r/--sshuser is the BMC SSH username.
-w/--sshpass is the BMC SSH password.
-R/--rfuser is the BMC Redfish username.
-W/--rfpass is the BMC Redfish password.
Host options:
-I/--hostip is the host IP address.
If the IP address is not provided, the tool assumes it is running on the host machine.
-U/--hostuser is the host username with administrative privileges.
-H/--hostpass is the host password.
Additional options:
-b/--baseboard <baseboard> is the baseboard type, such as Hopper-HGX-8-GPU and Blackwell-HGX-8-GPU.
-C/--config <file path> is the path to the config file. The default is ./config.yaml.
-d/--dutconfig <dut config path> is the path to the DUT specific config file.
The default path is ./dut_config.yaml.
-c/--common collects the common logs using the included common.json file.
-v/--verbose displays the detailed output and error messages.
-o/--outdir <output dir> is the directory where the output is generated.
The default location is /tmp.
-P/--port <fw_port> is the port number that is used for forwarding.
The --port option applies only to HGX-Baseboard based platforms,
and the default value is 18888.
--local enables Local Execution mode.
-z/--skipzip skips zipping individual DUT folders.
Log collection options:
-S/--cids CID [CID ...] runs the log collectors that correspond to the CIDs that were passed.
-g/--loggroup <Redfish|IPMI|SSH|Host|HealthCheck> runs all log collectors of a specific type
that is supported on the current platform.
Only one collector group can be specified.
-j/--vendor_file <vendor.json> is a vendor-defined JSON file that uses proprietary methods
and tools as defined by the user.
The -S and -g options cannot be used together.
Utility options:
-h/--help and --version are standalone options, and -l/--list requires the platform
type to be specified using -t/--platform.
--parse <log dump> parses an nvdebug log dump and decodes the binary data.
-h/--help provides information about tool usage.
--version displays the current version of the tool.
-l/--list [Redfish|IPMI|SSH|Host|HealthCheck] lists the log collectors that are supported by the platform,
with their collector IDs (CIDs). If a type is passed, only log collectors of that type are listed.
The -l/--list option requires the target platform type to be specified with -t/--platform.
By default, if the -c option is not included, the nvdebug tool collects logs based on the common.json
and platform_xyz.json files. At the end of the run, the tool generates the output log xyz.zip
file in the directory specified by the -o option. If no directory is provided, the log
is generated in the /tmp directory.
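When nvdebug is driven from a script, the mandatory options can be assembled programmatically. The following Python sketch is illustrative only: build_command is a hypothetical helper, the address and credentials are placeholders, and treating platform names as case-sensitive is an assumption. It uses only flags documented in this section.

```python
from typing import List, Optional

# Platform names accepted by -t/--platform, as listed in this section.
PLATFORMS = {"DGX", "HGX-HMC", "arm64", "x86_64", "NVSwitch"}


def build_command(bmc_ip: str, user: str, password: str, platform: str,
                  host_ip: Optional[str] = None,
                  outdir: Optional[str] = None,
                  verbose: bool = False) -> List[str]:
    """Assemble an nvdebug argument list (hypothetical helper)."""
    # Assumption: platform names are matched exactly as documented.
    if platform not in PLATFORMS:
        raise ValueError(f"unsupported platform: {platform!r}")
    cmd = ["nvdebug", "-i", bmc_ip, "-u", user, "-p", password, "-t", platform]
    if host_ip:
        cmd += ["-I", host_ip]   # remote mode: host logs come from this address
    if outdir:
        cmd += ["-o", outdir]    # otherwise the documented default /tmp applies
    if verbose:
        cmd.append("-v")
    return cmd


# Placeholder values; pass the list to subprocess.run() to execute it.
print(" ".join(build_command("192.0.2.20", "admin", "secret", "DGX",
                             outdir="/tmp/dgx/sys1")))
```

Building a list rather than a single string avoids shell quoting problems when a password contains special characters.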
The Configuration Files#
The NVDebug tool has two configuration files in the same folder as the executable:
The DUT configuration file: The default is dut_config.yaml.
The NVDebug-specific configuration file: The default is config.yaml.
These files can be used to provide additional (but optional) configuration data. If an argument is provided by both the CLI and the configuration file, the value provided through the CLI takes precedence.
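The precedence rule can be pictured as a simple overlay. The sketch below is an assumption about the behavior described here, not the tool's actual implementation: merge_settings is a hypothetical helper, and None stands for an option that was not given on the command line.

```python
from typing import Any, Dict


def merge_settings(file_cfg: Dict[str, Any], cli_args: Dict[str, Any]) -> Dict[str, Any]:
    """Overlay CLI values on configuration-file values; the CLI wins."""
    merged = dict(file_cfg)
    for key, value in cli_args.items():
        if value is not None:  # None means "not given on the command line"
            merged[key] = value
    return merged


# Placeholder values for illustration.
file_cfg = {"BMC_IP": "192.0.2.20", "OUTPUT_DIR": "/tmp"}
cli_args = {"BMC_IP": None, "OUTPUT_DIR": "/tmp/dgx/sys1"}
print(merge_settings(file_cfg, cli_args))
```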
HGX H100/H200 8-GPU Example#
To communicate with the HGX baseboard, you need the BMC SSH credentials to set up SSH tunneling
through the BMC. By default, the SSH credentials are assumed to be the same as the BMC credentials.
To use different credentials, specify the -r and -w CLI options for the SSH username
and password, respectively.
nvdebug -i $BMCIP -u $BMCUSER -p $BMCPASS -r $SSHUSER -w $SSHPASS -t HGX-HMC -P port_num
Example output:
Log directory created at /tmp/nvdebug_logs_30_09_2024_12_27_46
Starting a collection for DUT dut-1
hgx-h100-node2: [12:28:13] Identified system as Model: P2312-A04, Partno: 692-22312-0001-000, Serialno:1324623011823
hgx-h100-node2: [12:28:13] User provided platform type: HGX-HMC
hgx-h100-node2: [12:28:13] BMC IP: XXXX
Log collection has started for dut-1
hgx-h100-node2: [12:45:43] Log collection is now complete
hgx-h100-node2: [12:45:43] Log collection took 17m 30.29s
DUT hgx-h100-node2 completed.
The log zip file (nvdebug_logs_30_09_2024_12_27_46.zip) will be created in the /tmp directory.
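The log directory and archive names in this sample appear to embed the start time of the run. The Python sketch below recovers that time from a path; the DD_MM_YYYY_HH_MM_SS layout is inferred from the sample name only and is an assumption, and run_timestamp is a hypothetical helper.

```python
import re
from datetime import datetime

# Assumed layout: nvdebug_logs_DD_MM_YYYY_HH_MM_SS (inferred from the sample).
_NAME = re.compile(r"nvdebug_logs_(\d{2}_\d{2}_\d{4}_\d{2}_\d{2}_\d{2})")


def run_timestamp(path: str) -> datetime:
    """Extract the run start time embedded in a log directory or zip name."""
    match = _NAME.search(path)
    if match is None:
        raise ValueError(f"no nvdebug timestamp in {path!r}")
    return datetime.strptime(match.group(1), "%d_%m_%Y_%H_%M_%S")


print(run_timestamp("/tmp/nvdebug_logs_30_09_2024_12_27_46.zip"))
```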
The tool sets up the SSH tunnel automatically on the specified port (the default value is 18888).
To use an existing SSH tunnel, disable tunnel setup in the configuration file by setting
SETUP_PORT_FORWARDING to false, as shown in the following dut_config file:
hgx-h100-node2:
  <<: *dut_defaults
  BMC_IP: "bmc_ip"
  BMC_USERNAME: "bmc_user"
  BMC_PASSWORD: "bmc_pass"
  BMC_SSH_USERNAME: "ssh_user"
  BMC_SSH_PASSWORD: "ssh_pass"
  TUNNEL_TCP_PORT: "port_num"
  SETUP_PORT_FORWARDING: false
After configuring the NVDebug tool, run the nvdebug command.
Note
The host BMC must support port forwarding.
Example output:
$ nvdebug
Log directory created at /tmp/nvdebug_logs_30_09_2024_12_27_46
Starting a collection for DUT hgx-h100-node2
hgx-h100-node2: [12:28:13] Identified system as Model: P2312-A04, Partno: 692-22312-0001-000, Serialno:1324623011823
hgx-h100-node2: [12:28:13] User provided platform type: HGX-HMC
hgx-h100-node2: [12:28:13] BMC IP: XXXX
Log collection has started for hgx-h100-node2
hgx-h100-node2: [12:45:43] Log collection is now complete
hgx-h100-node2: [12:45:43] Log collection took 17m 30.29s
DUT hgx-h100-node2 completed.
The log zip file (nvdebug_logs_30_09_2024_12_27_46.zip) will be created in the /tmp directory.
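The defaults described in this section (SSH credentials falling back to the BMC credentials, and the forwarding port falling back to 18888) can be summarized in a short Python sketch. resolve_tunnel_settings and its return keys are hypothetical names used for illustration; the sketch does not reflect how nvdebug stores these values internally.

```python
from typing import Dict, Optional

DEFAULT_FORWARD_PORT = 18888  # default documented for -P/--port


def resolve_tunnel_settings(bmc_user: str, bmc_pass: str,
                            ssh_user: Optional[str] = None,
                            ssh_pass: Optional[str] = None,
                            port: Optional[int] = None) -> Dict[str, object]:
    """Apply the documented fallbacks for SSH tunneling through the BMC."""
    return {
        # Without -r/-w, the BMC credentials are reused for SSH.
        "ssh_user": ssh_user if ssh_user is not None else bmc_user,
        "ssh_pass": ssh_pass if ssh_pass is not None else bmc_pass,
        # Without -P, the default forwarding port is used.
        "port": port if port is not None else DEFAULT_FORWARD_PORT,
    }


print(resolve_tunnel_settings("bmc_user", "bmc_pass"))
```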
DGX H100/H200 Example#
To run the tool using the command-line parameters, but not the configuration file:
nvdebug -i <bmc ip> -u <bmc user> -p <bmc pass> -t DGX -I <host IP> -U <host user> -H <host pass> -r <bmc ssh username> -w <bmc ssh password>
The -I, -U, -H, -r, and -w options are optional.
To run the tool using the configuration file, provide all the required parameter settings, including
BMC_IP, BMC_USERNAME, BMC_PASSWORD, and PLATFORM. You can set the output directory
by providing a value for the OUTPUT_DIR parameter (or by using the -o/--outdir CLI option).
The following example shows the BMC credentials in the configuration setup:
DUT_Defaults: &dut_defaults
  NodeType: "Compute"
  ipmi_cipher: "-C17"
# Create a dut object and inherit the default values.
# For any specific configuration details, add them below.
dgx-h100-node2:
  <<: *dut_defaults
  BMC_IP: "bmc_ip"
  BMC_USERNAME: "bmc_user"
  BMC_PASSWORD: "bmc_pass"
  BMC_SSH_USERNAME: "ssh_user"
  BMC_SSH_PASSWORD: "ssh_pass"
  RF_User: "redfish_user"
  RF_Pass: "redfish_pass"
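The `<<: *dut_defaults` line is a YAML merge key: the DUT entry inherits every default and overrides any key it sets itself. The Python sketch below mimics that merge with plain dictionaries and then checks for the required parameters named above. effective_dut and missing_required are hypothetical helpers, and where PLATFORM is normally supplied (configuration file or -t) is not shown in this sample.

```python
from typing import Any, Dict, List

# Required parameters named in this section.
REQUIRED = ("BMC_IP", "BMC_USERNAME", "BMC_PASSWORD", "PLATFORM")


def effective_dut(defaults: Dict[str, Any], dut: Dict[str, Any]) -> Dict[str, Any]:
    """Mimic the YAML merge key: DUT entries override inherited defaults."""
    return {**defaults, **dut}


def missing_required(settings: Dict[str, Any]) -> List[str]:
    """Return the required parameter names that are absent or empty."""
    return [key for key in REQUIRED if not settings.get(key)]


defaults = {"NodeType": "Compute", "ipmi_cipher": "-C17"}
dut = {"BMC_IP": "bmc_ip", "BMC_USERNAME": "bmc_user", "BMC_PASSWORD": "bmc_pass"}
settings = effective_dut(defaults, dut)
print(missing_required(settings))  # PLATFORM is not set in this sample
```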
After configuring the NVDebug tool, run the nvdebug command.
$ nvdebug -o /tmp/dgx/sys1
To show the verbose output of all log collections and the elapsed time for each log,
specify the -v option.
$ nvdebug -o /tmp/dgx/sys1 -v
Example output:
Read DUT config file details from ..../dut_config.yaml
Read config file details from ..../config.yaml
Log directory created at /tmp/dgx/sys1/nvdebug_logs_30_09_2024_12_53_27
Starting a collection for DUT dgx-h100-node2
dgx-h100-node2: [12:53:33] All preflight checks passed
dgx-h100-node2: [12:53:35] Identified system as Model: DGXH100, Partno: 965-24387-0002-003, Serialno:1660224000069
dgx-h100-node2: [12:53:35] User provided platform type: DGX
dgx-h100-node2: [12:53:35] BMC IP: XXXX
Log collection has started for dgx-h100-node2
dgx-h100-node2: [12:53:35]
dgx-h100-node2: [12:53:35] #####################################
dgx-h100-node2: [12:53:35]
dgx-h100-node2: [12:53:35] Collecting redfish logs:
dgx-h100-node2: [12:53:35]
dgx-h100-node2: [12:53:35] Log collection was initiated for: r8_firmware_inventory
dgx-h100-node2: [12:53:45] Log collection for r8_firmware_inventory took 0m 10.32s
dgx-h100-node2: [12:53:45] Log collection was initiated for: r9_firmware_inventory_expand_query
dgx-h100-node2: [12:56:49] Log collection for r9_firmware_inventory_expand_query took 3m 4.07s
dgx-h100-node2: [12:56:49] Log collection was initiated for: r10_chassis_info
dgx-h100-node2: [12:56:51] Log collection for r10_chassis_info took 0m 1.32s
dgx-h100-node2: [12:56:51] Log collection was initiated for: r11_chassis_expand_query
dgx-h100-node2: [12:57:06] Log collection for r11_chassis_expand_query took 0m 15.13s
...
dgx-h100-node2: [12:59:32] #####################################
dgx-h100-node2: [12:59:32]
dgx-h100-node2: [12:59:32] Collecting IPMI logs:
dgx-h100-node2: [12:59:32]
dgx-h100-node2: [12:59:32] Log collection was initiated for: i1_mc_info
dgx-h100-node2: [12:59:32] Log collection for i1_mc_info took 0m 0.2s
dgx-h100-node2: [12:59:32] Log collection was initiated for: i2_lan_info
dgx-h100-node2: [12:59:33] Log collection for i2_lan_info took 0m 0.85s
...
dgx-h100-node2: [13:02:59] #####################################
dgx-h100-node2: [13:02:59]
dgx-h100-node2: [13:02:59] Collecting host logs:
dgx-h100-node2: [13:02:59]
dgx-h100-node2: [13:02:59] Log collection was initiated for: h7_node_crash_dump
dgx-h100-node2: [13:03:00] Log collection for h7_node_crash_dump took 0m 1.02s
dgx-h100-node2: [13:03:00] Log collection was initiated for: h2_node_lspci
dgx-h100-node2: [13:03:04] Log collection for h2_node_lspci took 0m 3.57s
...
dgx-h100-node2: [13:15:42] Task state is Running. Rechecking in 30 seconds...
dgx-h100-node2: [13:16:18] Log collection for r17_dgx_manager_oem_log_dump took 12m 34.53s
dgx-h100-node2: [13:16:18] Log collection is now complete
dgx-h100-node2: [13:16:18] Log collection took 22m 44.08s
DUT dgx-h100-node2 completed.
Log zip created at /tmp/dgx/sys1/nvdebug_logs_30_09_2024_12_53_27.zip
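Because the verbose output reports the elapsed time of each collector, it can be post-processed to find slow collectors. The following Python sketch parses lines in the format shown in the sample output; the line format is taken from this sample and may differ between tool versions, and collector_durations is a hypothetical helper.

```python
import re
from typing import Dict

# Matches per-collector lines such as:
#   "... Log collection for r8_firmware_inventory took 0m 10.32s"
# The overall "Log collection took ..." summary line is not matched.
_TOOK = re.compile(r"Log collection for (\S+) took (\d+)m (\d+(?:\.\d+)?)s")


def collector_durations(output: str) -> Dict[str, float]:
    """Map each collector name to its elapsed time in seconds."""
    durations: Dict[str, float] = {}
    for line in output.splitlines():
        match = _TOOK.search(line)
        if match:
            name, minutes, seconds = match.groups()
            durations[name] = int(minutes) * 60 + float(seconds)
    return durations


sample = """\
dgx-h100-node2: [12:53:45] Log collection for r8_firmware_inventory took 0m 10.32s
dgx-h100-node2: [12:56:49] Log collection for r9_firmware_inventory_expand_query took 3m 4.07s
dgx-h100-node2: [13:16:18] Log collection took 22m 44.08s
"""
durations = collector_durations(sample)
slowest = max(durations, key=durations.get)
print(slowest, durations[slowest])
```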
DGX Platform Example#
To list the collectors that are available on a DGX platform, specify the -l
option together with -t DGX:
$ nvdebug -l -t DGX
Example output:
Redfish
CID Collector Name Log Location
R8 firmware_inventory Redfish_R8_firmware_inventory.json
R9 firmware_inventory_expand_query Redfish_R9_firmware_inventory_expand_query.json
R10 chassis_info Redfish_R10_chassis_info.json
R11 chassis_expand_query Redfish_R11_chassis_expand_query.json
R12 system_info Redfish_R12_system_info.json
R13 system_expand_query Redfish_R13_system_expand_query.json
R14 manager_info Redfish_R14_manager_info.json
R15 manager_expand_query Redfish_R15_manager_expand_query.json
R17 dgx_manager_oem_log_dump Redfish_R17_dgx_oem_dump_{manager_id}_{task_id}.tar.xz
R18 telemetry_metric_reports Redfish_R18_report_{metric_report}.json
R19 chassis_thermal_metrics Redfish_R19_chassis_{chassis}_thermal_metrics.json
R20 firmware_inventory_table Redfish_R20_firmware_inventory_table.txt
R22 task_details Redfish_R22_task_{task_id}.json
R23 nvlink_oob_logs Redfish_R23_NVLINK_OOB_Log_{id}.json
R25 additional_oob_logs Redfish_R25_OOB_Log_{id}.json
R26 chassis_certificates Redfish_R26_chassis_{chassis_id}_certificate.json
R29 background_copy_status Redfish_R29_{chassis_id}_copy_status.json
R30 software_inventory Redfish_R30_software_inventory
R32 system_post_codes Redfish_R32_system_post_codes
IPMI
CID Collector Name Log Location
I1 mc_info IPMI_I1_mc_info.txt
I2 lan_info IPMI_I2_lan_info.txt
I3 session_info IPMI_I3_session_info.txt
I4 fru_info IPMI_I4_fru_info.txt
I5 sdr_info IPMI_I5_sdr_info.txt
I6 sel_info IPMI_I6_sel_info.txt
I7 sensor_list IPMI_I7_sensor_list.txt
I8 sel_list IPMI_I8_sel_list.txt
I9 sel_raw_dump IPMI_I9_sel_raw_dump.txt
I10 chassis_status IPMI_I10_chassis_status.txt
I11 chassis_restart_cause IPMI_I11_chassis_restart_cause.txt
I12 user_list IPMI_I12_user_list.txt
I13 channel_info IPMI_I13_channel_info.txt
I14 sdr_elist IPMI_I14_sdr_elist.txt
SSH
CID Collector Name Log Location
S2 bmc_dmesg BMC_SSH_S2_bmc_dmesg.txt
S3 network_info BMC_SSH_S3_network_info/...
S5 bmc_list_kernel_modules BMC_SSH_S5_bmc_list_kernel_modules.txt
S8 bmc_mem_cpu_utilization BMC_SSH_S8_bmc_mem_cpu_utilization/...
S11 uptime BMC_SSH_S11_uptime.txt
S12 fpga_register_table BMC_SSH_S12_fpga_register_table.txt
S13 hmc_boot_status BMC_SSH_S13_hmc_boot_status.txt
S15 bmc_power_status BMC_SSH_S15_bmc_power_status/...
Host
CID Collector Name Log Location
H1 node_dmesg Host_H1_node_dmesg.tar.gz
H2 node_lspci Host_H2_node_lspci*.txt
H3 node_smbios Host_H3_dmidecode*.txt
H4 node_lshw Host_H4_lshw*.txt
H5 node_nvidia_smi Host_H5_nvidia-smi*.txt
H6 node_kern_log Host_H6_node_kern_log.tar.gz
H7 node_crash_dump Host_H7_node_crash_dump.tar.gz
H8 node_nvme_list Host_H8_nvme_list_-v.txt
H9 node_fabric_manager_log Host_H9_fabricmanager.log
H10 node_nvflash_log Host_H10_nvflash_--check_-i_{num}.txt
H11 nvidia_bug_report Host_H11_nvidia_bug_report_op.log.gz
H15 node_subnet_manager Host_H15_node_subnet_manager/
H16 one_diag_dump Host_H16_one_diag_dump/
H17 node_nvme_log_dump Host_H17_nvos_tech_support_dump/
HealthCheck
CID Collector Name Log Location
C1 out_of_band_health_check HealthCheck_C1_out_of_band_health_check.json
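A listing like the one above can be turned into a lookup table, for example to validate CIDs before passing them to -S. The Python sketch below assumes the layout shown here (a group name on its own line, a header row starting with CID, then whitespace-separated CID, collector name, and log location); parse_collector_list is a hypothetical helper, and the real output may be formatted differently.

```python
from typing import Dict, List, Optional, Tuple

GROUPS = {"Redfish", "IPMI", "SSH", "Host", "HealthCheck"}


def parse_collector_list(output: str) -> Dict[str, List[Tuple[str, str, str]]]:
    """Group (CID, collector name, log location) rows by collector type."""
    result: Dict[str, List[Tuple[str, str, str]]] = {}
    current: Optional[str] = None
    for raw in output.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line in GROUPS:            # start of a new collector group
            current = line
            result[current] = []
            continue
        if line.startswith("CID "):   # column header row
            continue
        parts = line.split(None, 2)   # CID, name, remainder as location
        if current is not None and len(parts) == 3:
            result[current].append((parts[0], parts[1], parts[2]))
    return result


sample = """\
Redfish
CID Collector Name Log Location
R8 firmware_inventory Redfish_R8_firmware_inventory.json
IPMI
CID Collector Name Log Location
I1 mc_info IPMI_I1_mc_info.txt
"""
collectors = parse_collector_list(sample)
print(collectors["IPMI"])
```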
Redfish Collectors#
To run only specific collectors, pass their CIDs to the -S option. The following example
collects the firmware inventory (R8), the IPMI management controller information (I1),
and the system information (R12).
nvdebug -i <bmc_ip> -u <bmc_user> -p <bmc_pass> ... -t DGX -v -S R8 I1 R12
Example output:
Log directory created at /tmp/nvdebug_logs_06_11_2024_15_40_27
Starting a collection for DUT dut-1
dut-1: [15:40:34] All preflight checks passed
dut-1: [15:40:34] Identified system as Model: DGXH100, Partno: 965-24387-0002-003, Serialno:1660224000069
dut-1: [15:40:34] User provided platform type: DGX
dut-1: [15:40:34] BMC IP: XXXX
Log collection has started for dut-1
dut-1: [15:40:34]
dut-1: [15:40:34] #####################################
dut-1: [15:40:34]
dut-1: [15:40:34] Collecting custom logs:
dut-1: [15:40:34]
dut-1: [15:40:34] Log collection was initiated for: r8_firmware_inventory
dut-1: [15:40:36] Log collection for r8_firmware_inventory took 0m 1.71s
dut-1: [15:40:36] Log collection was initiated for: r12_system_info
dut-1: [15:40:36] Log collection for r12_system_info took 0m 0.06s
dut-1: [15:40:36] Log collection was initiated for: i1_mc_info
dut-1: [15:40:36] Log collection for i1_mc_info took 0m 0.14s
dut-1: [15:40:36] Log collection is now complete
dut-1: [15:40:36] Log collection took 0m 2.16s
DUT dut-1 completed.
Log zip created at /tmp/nvdebug_logs_06_11_2024_15_40_27.zip
To run the Redfish log collectors, specify the -g option with the Redfish log group:
$ nvdebug -i $BMC_IP -u $BMC_USER -p $BMC_PASS -t DGX -g Redfish
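Because -S and -g cannot be combined and -g accepts a single group, a wrapper script may want to validate the selection before calling nvdebug. The Python sketch below shows one way to do that; selection_args is a hypothetical helper, and treating group names as case-sensitive is an assumption.

```python
from typing import List, Optional, Sequence

# Group names accepted by -g/--loggroup, as documented.
LOG_GROUPS = ("Redfish", "IPMI", "SSH", "Host", "HealthCheck")


def selection_args(cids: Optional[Sequence[str]] = None,
                   loggroup: Optional[str] = None) -> List[str]:
    """Return the -S or -g arguments, enforcing that only one is used."""
    if cids and loggroup:
        raise ValueError("-S and -g cannot be used together")
    if loggroup is not None:
        if loggroup not in LOG_GROUPS:
            raise ValueError(f"unknown log group: {loggroup!r}")
        return ["-g", loggroup]
    if cids:
        return ["-S", *cids]
    return []  # no selection: the tool's default collection applies


print(selection_args(cids=["R8", "I1", "R12"]))
print(selection_args(loggroup="Redfish"))
```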
IPv6 Configuration#
By default, the nvdebug tool uses IPv4. For IPv6, set IP_NETWORK to ipv6 in the DUT
configuration. When providing IPv6 addresses for the BMC/Host, do not use square brackets.
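Addresses copied from URLs often carry square brackets, so it can help to normalize them before writing the DUT configuration. The Python sketch below uses the standard ipaddress module; normalize_address is a hypothetical helper and the addresses are documentation placeholders. When the returned version is 6, IP_NETWORK would be set to ipv6 as described above.

```python
import ipaddress
from typing import Tuple


def normalize_address(value: str) -> Tuple[str, int]:
    """Strip optional square brackets and return (address, IP version)."""
    text = value.strip()
    if text.startswith("[") and text.endswith("]"):
        text = text[1:-1]
    address = ipaddress.ip_address(text)  # raises ValueError if invalid
    return str(address), address.version


print(normalize_address("[2001:db8::10]"))  # ('2001:db8::10', 6)
print(normalize_address("192.0.2.20"))      # ('192.0.2.20', 4)
```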