Appendix#
Appendix A: Use Cases, IPMI, SSH APIs, and SSH Commands#
This section summarizes the instructions, APIs, and commands used to gather debug information and logs for troubleshooting, and describes the northbound APIs, commands, and standard tools.
Use Case | API Call | Details |
---|---|---|
Default (all known platforms) | nvdebug -i <BMCIP> -u <username> -p <password> -t <platform> | Runs the common and platform-specific log collectors using the built-in common and platform-specific JSON files. Runs on all known platforms that support the standard Redfish and IPMI commands, including NVIDIA HGX-H100x8 and MGX. |
Common only on any platform | nvdebug -i <BMCIP> -u <username> -p <password> -t <platform> -c | Runs only the common log collectors using the built-in common JSON file. Runs on any platform that supports the standard Redfish and IPMI commands. |
nvdebug tool as a service | nvdebug -i <BMCIP> -u <username> -p <password> -t <platform> -j <vendor_custom.json> | Collects all logs using user/vendor-defined JSON, functions, and proprietary tools. This option allows vendors and nvdebug users to extend log collection based on OEM-specific implementations and proprietary tools. |
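The three invocations above differ only in their trailing option. As a minimal sketch, the following Python helper assembles the argument list for each use case. The function name `build_nvdebug_command` and its parameter names are hypothetical; only the flags `-i`, `-u`, `-p`, `-t`, `-c`, and `-j` come from the table.

```python
from typing import List, Optional


def build_nvdebug_command(
    bmc_ip: str,
    username: str,
    password: str,
    platform: str,
    common_only: bool = False,
    vendor_json: Optional[str] = None,
) -> List[str]:
    """Assemble an nvdebug argument list (hypothetical helper, not part of nvdebug)."""
    # Options shared by all three use cases in the table.
    cmd = ["nvdebug", "-i", bmc_ip, "-u", username, "-p", password, "-t", platform]
    if common_only:
        # "Common only" use case: run only the common log collectors.
        cmd.append("-c")
    if vendor_json is not None:
        # "Tool as a service" use case: add a user/vendor-defined JSON file.
        cmd += ["-j", vendor_json]
    return cmd
```

The returned list can be handed to `subprocess.run(cmd, check=True)`; building a list rather than a single string avoids shell quoting issues with passwords that contain special characters.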
CID | Collector Name | Collection Level | Description | Platforms |
---|---|---|---|---|
R1 | system_event_log | | EventLogs from the Redfish computer system collection from the Host BMC. | All* |
R2 | manager_existing_log_dump | | Existing log dumps that were created by the HMC or the Host BMC, or that a previous user initiated from the HMC or the Host BMC. | All* |
R3 | hgx_manager_on_demand_log_dump | | User-initiated on-demand log dumps with a new request from the HMC. | NVIDIA HGX-Baseboard |
R4 | manager_journal_log | | Journal logs from a BMC, such as the Host BMC or an HMC, that is running OpenBMC. | All** |
R5 | manager_fpga_register_dump | | FPGA register dump of the NVIDIA HGX-Hopper/Blackwell x8 GPU baseboards through the HMC or Grace-based platforms. | All* |
R6 | manager_erot_dump | | IRoT/ERoT debug log dump of the NVIDIA HGX-Hopper/Blackwell x8 GPU baseboards through the HMC or Grace-based platforms. | All* |
R7 | hgx_manager_self_test_report | | Self-test logs collected on demand. Users initiate the self-test of the NVIDIA HGX-Hopper/Blackwell x8 GPU baseboards through the HMC. | NVIDIA HGX-Baseboard |
R8 | firmware_inventory | | Base firmware inventory from the HMC and/or the Host BMC. | All |
R9 | firmware_inventory_expand_query | | Expanded firmware inventory from the HMC and/or the Host BMC. | All |
R10 | chassis_info | | Base Redfish chassis collection from the HMC and/or the Host BMC. | All |
R11 | chassis_expand_query | | Expanded Redfish chassis collection from the HMC and/or the Host BMC. | All |
R12 | system_info | | Base Redfish computer system collection from the HMC and/or the Host BMC. | All |
R13 | system_expand_query | | Expanded Redfish computer system collection from the HMC and/or the Host BMC. | All |
R14 | manager_info | | Base Redfish manager collection of the HMC and/or the Host BMC. | All |
R15 | manager_expand_query | | Expanded Redfish manager collection of the HMC and/or the Host BMC. | All |
R16 | hgx_manager_retimer_dump | -VV | Retimer dump of the NVIDIA HGX-Hopper/Blackwell x8 GPU baseboards using the HMC. | NVIDIA HGX-Baseboard |
R17 | dgx_manager_oem_log_dump | -VV | OEM log dump of DGX systems using the BMC. | NVIDIA DGX |
R18 | telemetry_metric_reports | | Telemetry metric reports from the HMC and/or the Host BMC telemetry service. | All |
R19 | chassis_thermal_metrics | | Thermal metrics from the HMC and/or the Host BMC. | All |
R20 | firmware_inventory_table | | Gets specific fields of the firmware inventory as listed in the config file (under | All |
R21 | system_cper_logs | | Collects the CPER logs for the system. | All arm64 platforms |
R22 | task_details | | Collects details of all tasks in the Redfish Task Service. | All |
R23 | nvlink_oob_logs | -VV | Collects OOB logs related to NVLink. By default, collects System Event logs. In the config file, URIs can be specified under NVLINK_OOB_URI. | All |
R24 | hgx_system_fw_attributes_dump | -VV | Firmware attribute dump of the NVIDIA baseboard through the HMC and/or the Host BMC. | All*** |
R25 | additional_oob_logs | -VV | Collects user-defined OOB logs from the BMC/HMC. In the config file, URIs can be specified under ADDITIONAL_OOB_URI_COLLECTION. | All |
R26 | chassis_certificates | -VV | Collects the certificates from each iRoT/ERoT component on the chassis. | All |
R27 | spdm_erot_measurements | -VV | Collects SPDM measurements of iRoT/ERoT at indices 26 and 50. | NVIDIA HGX-Baseboard |
R28 | hgx_system_hardware_checkout_dump | -VV | Hardware checkout dump of the NVIDIA baseboard using the BMC/HMC. | All*** |
R29 | background_copy_status | -VV | Checks the Root-of-Trust background copy status. | All |
R30 | software_inventory | -VV | Collects the software inventory from the Redfish Update Service. | All |
R31 | hmc_fdr_log_dump | -VV | Collects the FDR log dump from the HMC. | NVIDIA HGX-Baseboard |
R32 | system_post_codes | -VV | Collects system POST codes. | All |
R33 | power_equipment_info | -VV | Collects the power equipment information. | PowerShelf |
R34 | power_equipment_expand_query | -VV | Collects the expanded power equipment information. | PowerShelf |
R35 | network_device_debug_dump | -VV | Collects the debug dump from the network devices. | NVIDIA HGX-Baseboard |
R36 | network_switch_debug_dump | -VV | Collects the debug dump from the network switches. | NVIDIA HGX-Baseboard |
R37 | gpu_debug_dump | -VV | Collects the debug dump from GPUs. | NVIDIA HGX-Baseboard |
R38 | gpu_diagnostic_dump | -VV | Collects the diagnostic dump from GPUs. | NVIDIA HGX-Baseboard |
R39 | sma_debug_dump | -VV | Collects the debug dump from System Management Agent (SMA) devices. | NVIDIA HGX-Baseboard |
R40 | custom_dump_service | -VV | Collects the dump from the custom URI and the payloads. | NVIDIA HGX-Baseboard |
R41 | chassis_thermal_subsystem_leak_detection | -VV | Collects the thermal subsystem leak detection data for the chassis. | NVIDIA HGX-Baseboard |
Note
* - All platforms except NVIDIA DGX™.
** - Only OpenBMC firmware-based platforms.
*** - Only platforms with an HMC.
-VV - Collects optional logs that take longer to collect.
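Several collectors above come in a base form and an "expand query" form. As a rough illustration of the difference, the sketch below builds a URL for a Redfish collection with and without the standard `$expand` query parameter. The collection paths are the standard Redfish ones; which exact URIs and expand levels nvdebug requests is an assumption, and `collection_url` is a hypothetical helper.

```python
# Standard Redfish collection paths. Whether nvdebug queries exactly these
# URIs is an assumption made for illustration.
COLLECTIONS = {
    "chassis": "/redfish/v1/Chassis",
    "system": "/redfish/v1/Systems",
    "manager": "/redfish/v1/Managers",
    "firmware_inventory": "/redfish/v1/UpdateService/FirmwareInventory",
}


def collection_url(bmc_ip: str, name: str, expand: bool = False, levels: int = 1) -> str:
    """Return the URL of a Redfish collection, optionally with $expand."""
    url = f"https://{bmc_ip}{COLLECTIONS[name]}"
    if expand:
        # "*" asks the service to expand all hyperlinks, "$levels" limits depth.
        url += f"?$expand=*($levels={levels})"
    return url
```

A base query returns only links to the collection members, while an expanded query asks the service to inline the member resources in one response, which is why the expanded collectors produce larger output.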
CID | Collector Name | Description | Platforms |
---|---|---|---|
I1 | mc_info | BMC information. | All |
I2 | lan_info | BMC network setting information. | All |
I3 | session_info | BMC session details. | All |
I4 | fru_info | Server component FRU information using the BMC. | All |
I5 | sdr_info | Basic sensor data records information. | All |
I6 | sel_info | System event log information. | All |
I7 | sensor_list | Sensor list information using the BMC. | All |
I8 | sel_list | SEL logs in text. | All |
I9 | sel_raw_dump | SEL logs in hex. | All |
I10 | chassis_status | Server chassis status and information. | All |
I11 | chassis_restart_cause | System Restart Cause stored by the BMC. | All |
I12 | user_list | IPMI user list. | All |
I13 | channel_info | IPMI channel information. | All |
I14 | sdr_elist | List of Sensor Data Records. | All |
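Most of the IPMI collector names above resemble standard `ipmitool` subcommands. The sketch below shows one way such a mapping could look. The pairing of collector names to subcommands is inferred from the names and is an assumption, not a statement of how nvdebug runs them; `session_info` and `sel_raw_dump` are left out because their exact invocation is not clear from the table.

```python
from typing import Dict, List

# Assumed mapping from collector name to ipmitool subcommand. The subcommands
# are standard ipmitool ones; the pairing itself is inferred from the names.
IPMI_SUBCOMMANDS: Dict[str, List[str]] = {
    "mc_info": ["mc", "info"],
    "lan_info": ["lan", "print"],
    "fru_info": ["fru", "print"],
    "sdr_info": ["sdr", "info"],
    "sel_info": ["sel", "info"],
    "sensor_list": ["sensor", "list"],
    "sel_list": ["sel", "list"],
    "chassis_status": ["chassis", "status"],
    "chassis_restart_cause": ["chassis", "restart_cause"],
    "user_list": ["user", "list"],
    "channel_info": ["channel", "info"],
    "sdr_elist": ["sdr", "elist"],
}


def ipmitool_command(collector: str, bmc_ip: str, username: str, password: str) -> List[str]:
    """Build an out-of-band ipmitool argument list for one collector."""
    # -I lanplus selects the IPMI v2.0 RMCP+ LAN interface.
    base = ["ipmitool", "-I", "lanplus", "-H", bmc_ip, "-U", username, "-P", password]
    return base + IPMI_SUBCOMMANDS[collector]
```

Running one of these commands by hand against the BMC is a quick way to check that IPMI over LAN is reachable before starting a full collection.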
CID | Collector Name | Collection Level | Description | Platforms |
---|---|---|---|---|
H1 | node_dmesg | -V | dmesg from the host node. | All |
H2 | node_lspci | | lspci details from the host node. | All |
H3 | node_smbios | | General SMBIOS and SMBIOS type 9 details from the host node. | All |
H4 | node_lshw | | Hardware and network details using lshw from the host node. | All Compute Trays |
H5 | node_nvidia_smi | -V | GPU information using nvidia-smi (if present). Not applicable for compute trays without GPUs (for example, MGX C2). | All Compute Trays |
H6 | node_kern_log | | Kernel information and events on the node. | All |
H7 | node_crash_dump | | Kernel crash dumps on the node. | All |
H8 | node_nvme_list | | List of NVMe SSDs (if present). | All Compute Trays |
H9 | node_fabric_manager_log | | Node fabric manager log (if present). | All |
H10 | node_nvflash_log* | | Logs from the nvflash command (if present). Not applicable for compute trays without GPUs (for example, MGX C2). | All Compute Trays |
H11 | nvidia_bug_report | | Runs the nvidia-bug-report.sh script on the host with the --safe-mode option and collects the output. If EXTRA_LOG_COLLECTION is True, also runs it with --extra-system-data. | All Compute Trays |
H12 | nvos_inventory | | (NVOS Only) Gets the platform software, firmware, and hardware inventory. | NVIDIA NVSwitch™ Tray |
H13 | nvos_gnmi_config | | (NVOS Only) Gets the gNMI status and configuration. | NVIDIA NVSwitch™ Tray |
H14 | nvos_tech_support_dump | | (NVOS Only) Generates and collects the tech-support dump on the system. | NVIDIA NVSwitch™ Tray |
H15 | node_subnet_manager | | Collects the subnet manager information from the host node. | All |
H16 | one_diag_dump | | Collects the diagnostic dump information. | All |
H17 | node_nvme_log_dump | | Collects the NVMe log dumps from the host node. | All |
H18 | node_os_info | | Collects the OS information from the host node. | All |
H19 | node_memory_info | -V | Collects the memory information from the host node. | All |
H20 | sos_report | -V | Collects the SOS report for the host. | All Compute Trays |
H21 | nvos_cli_dumps | -V | Collects the CLI dumps from the NVOS system. | NVIDIA NVSwitch™ Tray |
Note
* Requires nvflash to be in the /bin directory and to be added to PATH. The NVIDIA kernel drivers (nvidia_uvm, nvidia_drm, nvidia_modeset, nvidia_peermem, nvidia, and nvidia_fs) should also be unloaded before you run the log collector.
-V - Collects logs with greater verbosity.
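The nvflash prerequisites in the note can be checked on the host before a run. The following is a minimal sketch, assuming a Linux host where loaded modules are listed in `/proc/modules`; the helper names are hypothetical and the module list is copied from the note.

```python
import shutil
from typing import Iterable, List

# Kernel modules that the note says should be unloaded before collection.
NVIDIA_MODULES = (
    "nvidia_uvm", "nvidia_drm", "nvidia_modeset",
    "nvidia_peermem", "nvidia", "nvidia_fs",
)


def loaded_nvidia_modules(proc_modules_text: str,
                          names: Iterable[str] = NVIDIA_MODULES) -> List[str]:
    """Return the listed modules that appear in /proc/modules content."""
    # The first whitespace-separated field of each line is the module name.
    loaded = {line.split()[0] for line in proc_modules_text.splitlines() if line.strip()}
    return [name for name in names if name in loaded]


def nvflash_on_path() -> bool:
    """Report whether an nvflash executable can be found through PATH."""
    return shutil.which("nvflash") is not None
```

On the host, the first function would typically be called with the text read from `/proc/modules`; an empty result together with `nvflash_on_path()` returning `True` suggests that the prerequisites for `node_nvflash_log` are met.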
CID | Collector Name | Description | Platforms |
---|---|---|---|
S1 | bmc_status | The OpenBMC firmware status, which applies only to the Host BMC that is running the OpenBMC firmware. | All** |
S2 | bmc_dmesg | The BMC OS kernel dmesg information only through the Host BMC. | All |
S3 | network_info | The BMC network and routing settings only through the Host BMC. | All |
S4 | openbmc_stack_info | The OpenBMC OS and firmware stack information, which applies only to the Host BMC that is running the OpenBMC firmware. | All** |
S5 | bmc_list_kernel_modules | The BMC OS kernel loadable module information only through the Host BMC. | All |
S6 | openbmc_pldm_journal_log | The OpenBMC PLDM logs, which apply only to the Host BMC that is running the OpenBMC firmware. | All** |
S7 | i2c_device_list | The Host BMC I2C device list and its information only through the Host BMC. | All |
S8 | bmc_mem_cpu_utilization | The Host BMC memory and CPU usage only through the Host BMC. | All |
S9 | openbmc_boot_status | The OpenBMC boot status, which applies only to the Host BMC that is running the OpenBMC firmware. | All** |
S10 | var_log_dir_zip | The | All |
S11 | uptime | Gets the uptime through the Host BMC. | All |
S12 | fpga_register_table | The FPGA register table contents. | DGX |
S13 | hmc_boot_status | The HMC boot status, which applies only to the platforms with the HMC. | DGX |
S14 | hmc_boot_progress | The HMC boot progress status, which applies only to the platforms with the HMC. | All*** |
S15 | bmc_power_status | Gets the power fault register data from the FPGA through the Host BMC, only for platforms with an NVIDIA FPGA. | All |
S16 | virtual_eeprom_data | Retrieves the telemetry data from the virtual EEPROM by using I2C commands. | All |
S17 | smbus_power_temperature_telemetry | Collects SMBus temperature, power, and staleness data. | All |
S18 | smbpbi_system_status | Collects the information for the SMBPBI telemetry, firmware, hardware status, PCIe link/error counts, fencing, and so on. | All |
Note
** Only OpenBMC firmware-based platforms (same as Table 3).
*** Only platforms with an HMC (same as Table 3).
CID | Collector Name | Description |
---|---|---|
C1 | out_of_band_health_check | Checks the health of various components and the presence of mandatory fields using Redfish and IPMI. |
Appendix B: The JSON Files#
This section provides information about the tool’s vendor JSON option.
Optional: DebugLog JSON-Vendor or User-Defined APIs and Proprietary Tools#
This option allows OEM/ODM partners to use nvdebug to add the required support by using proprietary tools and OEM APIs:
Vendors (OEM/ODM) need to copy the proprietary tools and the OEM-specific vendor_defined.json file to the directory in which nvdebug is located.
The vendor_defined.json file should contain the list of commands with the required parameters to run the tools/APIs from the command line.
Here is an example:
{
"LogCollectorXYZ": "./xyz -u <user-name> -i <bmc-ip> ...",
"LogCollectorABC": "abctool <param1> <param2> ..."
}
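To make the file format concrete, the sketch below reads a JSON object of this shape and splits each command string into an argument list that could be passed to a process launcher. How nvdebug itself parses and executes these entries is not described here, so `load_vendor_commands` is a hypothetical helper and the splitting rule (POSIX shell-style via `shlex`) is an assumption.

```python
import json
import shlex
from typing import Dict, List


def load_vendor_commands(text: str) -> Dict[str, List[str]]:
    """Parse vendor JSON text into {collector name: argument list}."""
    raw = json.loads(text)
    if not isinstance(raw, dict):
        raise ValueError("expected a JSON object of name/command pairs")
    commands: Dict[str, List[str]] = {}
    for name, command in raw.items():
        if not isinstance(command, str):
            raise ValueError(f"command for {name!r} must be a string")
        # Split like a POSIX shell would, honoring quoted arguments.
        commands[name] = shlex.split(command)
    return commands
```

Validating the file this way before a run catches malformed JSON, such as a missing quotation mark around a collector name, early rather than in the middle of a collection.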
Appendix C: Estimated Output Size#
The output zip file that contains the logs can vary in size, and the estimates by platform are listed in Table 8.
Platform | Size of the Zip | Size of the Unzipped Logs | Run Time |
---|---|---|---|
x86_64 | L1: 7-10 MB, L2: 60-70 MB | L1: 20-25 MB, L2: 200-210 MB | 10-15 minutes |
arm64 | L1: 7-10 MB, L2: 60-70 MB | L1: 20-25 MB, L2: 200-210 MB | 10-15 minutes |
HGX-HMC | L1: 14-18 MB, L2: 120-130 MB | L1: 70-80 MB, L2: 280-290 MB | 15-20 minutes |
DGX | L1: 10-12 MB, L2: 68-70 MB | L1: 17-18 MB, L2: 68-70 MB | 10-15 minutes |
NVSwitch | L1: 120-130 MB, L2: 170-180 MB | L1: 160-165 MB, L2: 170-180 MB | 7-10 minutes |
Note
The output of the host-based node_dmesg and node_kern_log collectors can be arbitrarily large, depending on how long the host has been up and running. This can drastically increase the overall nvdebug output size.
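The estimates can be turned into a rough free-space check before a run. The sketch below uses the upper bounds from the table and adds the zip and unzipped sizes together; the helper names and the safety margin of 2.0 are assumptions, chosen only because host logs can exceed the estimates.

```python
import shutil

# Upper-bound estimates in MB taken from the table: (zip, unzipped).
ESTIMATES_MB = {
    "x86_64": {"L1": (10, 25), "L2": (70, 210)},
    "arm64": {"L1": (10, 25), "L2": (70, 210)},
    "HGX-HMC": {"L1": (18, 80), "L2": (130, 290)},
    "DGX": {"L1": (12, 18), "L2": (70, 70)},
    "NVSwitch": {"L1": (130, 165), "L2": (180, 180)},
}


def required_mb(platform: str, level: str, keep_zip: bool = True) -> int:
    """Estimated space for the unzipped logs, plus the zip if it is kept."""
    zip_mb, unzipped_mb = ESTIMATES_MB[platform][level]
    return unzipped_mb + (zip_mb if keep_zip else 0)


def has_room(path: str, platform: str, level: str, margin: float = 2.0) -> bool:
    """Check free space at path against the estimate times a safety margin."""
    free_mb = shutil.disk_usage(path).free / (1024 * 1024)
    return free_mb >= margin * required_mb(platform, level)
```

For example, `required_mb("HGX-HMC", "L2")` gives 420, so with the default margin the check asks for about 840 MB of free space in the output directory.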