Appendix#

Appendix A: Use Cases, IPMI, SSH APIs, and SSH Commands#

This section summarizes instructions, APIs, and commands used to gather debug information and logs for troubleshooting and provides information about the northbound APIs, commands, and standard tools.

Table 5 Use Cases#

Use Case

API Call

Details

Default (all known platforms)

nvdebug -i <BMCIP> -u <username> -p <password> -t <platform>

Runs common and platform-specific log collectors using the built-in common and platform-specific JSON files.

Runs on all known platforms, including NVIDIA HGX-H100x8, MGX, and so on, that have Redfish and IPMI standard command support.

Common only on any platform

nvdebug -i <BMCIP> -u <username> -p <password> -t <platform> -c

Runs only common log collectors using the built-in common JSON file.

Runs on any platform that has Redfish and IPMI standard command support.

nvdebug tool as a service

nvdebug -i <BMCIP> -u <username> -p <password> -t <platform> -j <vendor_custom.json>

Collects all logs using user/vendor-defined JSON, functions, and proprietary tools.

This option allows vendors and nvdebug users to extend the log collection based on OEM-specific implementations and proprietary tools.
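
The three use cases in Table 5 differ only in the trailing option, so a wrapper script can assemble them from one helper. The following is a minimal sketch, not part of nvdebug: the function name, the example IP address, the credentials, and the platform value are illustrative assumptions, and only the -i, -u, -p, -t, -c, and -j options are taken from the table.

```python
from typing import List, Optional


def build_nvdebug_command(
    bmc_ip: str,
    username: str,
    password: str,
    platform: str,
    common_only: bool = False,
    vendor_json: Optional[str] = None,
) -> List[str]:
    """Build an nvdebug argument list for one of the Table 5 use cases."""
    # Options shared by all three use cases.
    cmd = ["nvdebug", "-i", bmc_ip, "-u", username, "-p", password, "-t", platform]
    if common_only:
        # "Common only" use case: run only the common log collectors.
        cmd.append("-c")
    if vendor_json is not None:
        # Vendor/user-defined use case: pass a custom JSON file.
        cmd.extend(["-j", vendor_json])
    return cmd


# Hypothetical values; replace them with the real BMC address and platform.
default_cmd = build_nvdebug_command("192.0.2.10", "admin", "secret", "example-platform")
```

The list form can be passed directly to subprocess.run, which avoids shell quoting problems with passwords that contain special characters.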
Table 6 Logs Collected Using the Redfish APIs#

CID

Collector Name

Collection Level

Description

Platforms

R1

system_event_log

EventLogs from the Redfish computer system collection from the Host BMC.

All*

R2

manager_existing_log_dump

Existing log dumps that were created by the HMC/Host BMC or were initiated by a previous user from a BMC, such as the HMC or the Host BMC.

All*

R3

hgx_manager_on_demand_log_dump

User-initiated on-demand log dumps created with a new request from the HMC.

The NVIDIA HGX-Baseboard

R4

manager_journal_log

Journal logs from a BMC, such as the Host BMC or an HMC, that is running OpenBMC.

All**

R5

manager_fpga_register_dump

FPGA register dump of the NVIDIA HGX-Hopper/Blackwell x8 GPU baseboards through the HMC or Grace-based platforms.

All*

R6

manager_erot_dump

IRoT/ERoT debug log dump of the NVIDIA HGX-Hopper/Blackwell x8 GPU baseboards through the HMC or Grace-based platforms.

All*

R7

hgx_manager_self_test_report

Self-test logs collected on demand.

Users initiate the self-test of the NVIDIA HGX-Hopper/Blackwell x8 GPU baseboards through the HMC.

The NVIDIA HGX-Baseboard

R8

firmware_inventory

Base firmware inventory from the HMC and/or the Host BMC.

All

R9

firmware_inventory_expand_query

Expanded firmware inventory from the HMC and/or the Host BMC.

All

R10

chassis_info

Base Redfish chassis collection from the HMC and/or the Host BMC.

All

R11

chassis_expand_query

Expanded Redfish chassis collection from the HMC and/or the Host BMC.

All

R12

system_info

Base Redfish computer system collection from the HMC and/or the Host BMC.

All

R13

system_expand_query

Expanded Redfish computer system collection from the HMC and/or the Host BMC.

All

R14

manager_info

Base Redfish manager collection of the HMC and/or the Host BMC.

All

R15

manager_expand_query

Expanded Redfish manager collection of the HMC and/or the Host BMC.

All

R16

hgx_manager_retimer_dump

-VV

Retimer dump of the NVIDIA HGX-Hopper/Blackwell x8 GPU baseboards using the HMC.

NVIDIA HGX-Baseboard

R17

dgx_manager_oem_log_dump

-VV

OEM log dump of DGX systems using the BMC.

The NVIDIA DGX

R18

telemetry_metric_reports

Telemetry metric reports from the HMC and/or the Host BMC telemetry service.

All

R19

chassis_thermal_metrics

Thermal metrics from the HMC and/or the Host BMC.

All

R20

firmware_inventory_table

Gets specific fields of the firmware inventory as listed in the config file (under FW_INVENTORY_TABLE_PROPERTIES) for all components and tabulates the fields.

All

R21

system_cper_logs

Collects the CPER logs for the system.

All arm64 platforms

R22

task_details

Collects details of all tasks in Redfish Task Service.

All

R23

nvlink_oob_logs

-VV

Collects OOB logs related to NVLink.

By default, collects System Event logs.

In the config file, URIs can be specified under NVLINK_OOB_URI.

All

R24

hgx_system_fw_attributes_dump

-VV

Firmware attribute dump of the NVIDIA baseboard through the HMC and/or the Host BMC.

All***

R25

additional_oob_logs

-VV

Collects user-defined OOB logs from the BMC/HMC.

In the config file, URIs can be specified under ADDITIONAL_OOB_URI_COLLECTION.

All

R26

chassis_certificates

-VV

Collects the certificates from each iRoT/ERoT component on the chassis.

All

R27

spdm_erot_measurements

-VV

Collects SPDM measurements of iRoT/ERoT at indices 26 and 50.

NVIDIA HGX-Baseboard

R28

hgx_system_hardware_checkout_dump

-VV

Hardware checkout dump of the NVIDIA baseboard using the BMC/HMC.

All***

R29

background_copy_status

-VV

Checks the Root-of-Trust background copy status.

All

R30

software_inventory

-VV

Collects the software inventory from Redfish Update Service.

All

R31

hmc_fdr_log_dump

-VV

Collects the FDR log dump from the HMC.

NVIDIA HGX-Baseboard

R32

system_post_codes

-VV

Collects system POST codes.

All

R33

power_equipment_info

-VV

Collects the power equipment information.

PowerShelf

R34

power_equipment_expand_query

-VV

Collects the expanded power equipment information.

PowerShelf

R35

network_device_debug_dump

-VV

Collects the debug dump from the network devices.

NVIDIA HGX-Baseboard

R36

network_switch_debug_dump

-VV

Collects the debug dump from the network switches.

NVIDIA HGX-Baseboard

R37

gpu_debug_dump

-VV

Collects the debug dump from GPUs.

NVIDIA HGX-Baseboard

R38

gpu_diagnostic_dump

-VV

Collects the diagnostic dump from GPUs.

NVIDIA HGX-Baseboard

R39

sma_debug_dump

-VV

Collects the debug dump from System Management Agent (SMA) devices.

NVIDIA HGX-Baseboard

R40

custom_dump_service

-VV

Collects the dump from the custom URI and the payloads.

NVIDIA HGX-Baseboard

R41

chassis_thermal_subsystem_leak_detection

-VV

Collects the thermal subsystem leak detection data for the chassis.

NVIDIA HGX-Baseboard

Note

* - All platforms except NVIDIA DGX™.

** - Only OpenBMC firmware-based platforms.

*** - Only platforms with an HMC.

-VV - Collects optional logs that take longer to collect.
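
The firmware_inventory_table collector (R20) reads selected fields for every component and tabulates them. The following is a minimal sketch of that kind of tabulation, not the tool's implementation. It assumes that FW_INVENTORY_TABLE_PROPERTIES is a list of Redfish property names and that each inventory member is available as a dictionary; the component names and versions are made up for illustration.

```python
from typing import Dict, Iterable, List

# Assumed shape of the FW_INVENTORY_TABLE_PROPERTIES config entry: a list of
# Redfish property names. The real config format may differ.
FW_INVENTORY_TABLE_PROPERTIES = ["Id", "Version", "Updateable"]


def tabulate_inventory(members: Iterable[Dict[str, object]], properties: List[str]) -> str:
    """Render the selected properties of each inventory member as a text table."""
    # Missing properties are shown as "N/A" instead of raising an error.
    rows = [[str(member.get(prop, "N/A")) for prop in properties] for member in members]
    # Each column is as wide as its header or its longest cell.
    widths = [
        max([len(prop)] + [len(row[i]) for row in rows])
        for i, prop in enumerate(properties)
    ]

    def fmt(cells: List[str]) -> str:
        return "  ".join(cell.ljust(width) for cell, width in zip(cells, widths)).rstrip()

    lines = [fmt(properties), fmt(["-" * width for width in widths])]
    lines.extend(fmt(row) for row in rows)
    return "\n".join(lines)


# Hypothetical inventory members, for illustration only.
sample_members = [
    {"Id": "component_a", "Version": "1.0.0", "Updateable": True},
    {"Id": "component_b", "Version": "2.1"},
]
table = tabulate_inventory(sample_members, FW_INVENTORY_TABLE_PROPERTIES)
```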

Table 7 Logs Collected Using IPMI-Over-LAN#

CID

Collector Name

Description

Platforms

I1

mc_info

BMC information.

All

I2

lan_info

BMC network setting information.

All

I3

session_info

BMC session details.

All

I4

fru_info

Server component FRU information using the BMC.

All

I5

sdr_info

Basic sensor data records information.

All

I6

sel_info

System event log information.

All

I7

sensor_list

Sensor list information using the BMC.

All

I8

sel_list

SEL logs in text.

All

I9

sel_raw_dump

SEL logs in hex.

All

I10

chassis_status

Server chassis status and information.

All

I11

chassis_restart_cause

System Restart Cause stored by the BMC.

All

I12

user_list

IPMI user list.

All

I13

channel_info

IPMI channel information.

All

I14

sdr_elist

List of Sensor Data Records.

All
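
Most collectors in Table 7 correspond closely to standard IPMI commands. As a rough illustration, the sketch below maps several collector names to ipmitool subcommands and builds an IPMI-over-LAN command line. This appendix does not state which IPMI client or interface nvdebug uses, so the use of ipmitool, the lanplus interface, and the mapping itself are assumptions.

```python
from typing import Dict, List

# Assumed mapping from Table 7 collector names to ipmitool subcommands.
# The commands nvdebug actually issues may differ.
IPMI_SUBCOMMANDS: Dict[str, List[str]] = {
    "mc_info": ["mc", "info"],
    "lan_info": ["lan", "print"],
    "fru_info": ["fru", "print"],
    "sdr_info": ["sdr", "info"],
    "sel_info": ["sel", "info"],
    "sensor_list": ["sensor", "list"],
    "sel_list": ["sel", "list"],
    "chassis_status": ["chassis", "status"],
    "chassis_restart_cause": ["chassis", "restart_cause"],
    "user_list": ["user", "list"],
    "channel_info": ["channel", "info"],
    "sdr_elist": ["sdr", "elist"],
}


def build_ipmitool_command(collector: str, bmc_ip: str, username: str, password: str) -> List[str]:
    """Build an ipmitool IPMI-over-LAN argument list for one collector."""
    subcommand = IPMI_SUBCOMMANDS[collector]  # KeyError for unmapped collectors
    return ["ipmitool", "-I", "lanplus", "-H", bmc_ip, "-U", username, "-P", password, *subcommand]


# Hypothetical BMC address and credentials.
sel_cmd = build_ipmitool_command("sel_list", "192.0.2.10", "admin", "secret")
```

Running one of these commands by hand is a quick way to confirm that IPMI-over-LAN is reachable before starting a full collection.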

Table 8 Collected Server Host Logs#

CID

Collector Name

Collection Level

Description

Platforms

H1

node_dmesg

-V

dmesg from the host node.

All

H2

node_lspci

lspci details from the host node.

All

H3

node_smbios

General SMBIOS and SMBIOS type 9 details from the host node.

All

H4

node_lshw

Hardware and network details using lshw from the host node.

All Compute Trays

H5

node_nvidia_smi

-V

GPU information using nvidia-smi (if present). Not applicable for compute trays without GPUs (for example, MGX C2).

All Compute Trays

H6

node_kern_log

Kernel information and events on the node.

All

H7

node_crash_dump

Kernel crash dumps on the node.

All

H8

node_nvme_list

List of NVMe SSDs (if present).

All Compute Trays

H9

node_fabric_manager_log

Node fabric manager log (if present).

All

H10

node_nvflash_log*

Logs from the nvflash command (if present).

Not applicable for compute trays without GPUs (for example, MGX C2).

All Compute Trays

H11

nvidia_bug_report

Runs the nvidia-bug-report.sh script on the host with the --safe-mode option and collects the output.

If EXTRA_LOG_COLLECTION is True, also runs the script with --extra-system-data.

All Compute Trays

H12

nvos_inventory

(NVOS Only) Gets the platform software, firmware, and hardware inventory.

The NVIDIA NVSwitch™ Tray

H13

nvos_gnmi_config

(NVOS Only) Gets the gNMI status and configuration.

The NVIDIA NVSwitch™ Tray

H14

nvos_tech_support_dump

(NVOS Only) Generates and collects the tech-support dump on the system.

The NVIDIA NVSwitch™ Tray

H15

node_subnet_manager

Collects the subnet manager information from the host node.

All

H16

one_diag_dump

Collects the diagnostic dump information.

All

H17

node_nvme_log_dump

Collects the NVMe log dumps from the host node.

All

H18

node_os_info

Collects the OS information from the host node.

All

H19

node_memory_info

-V

Collects the memory information from the host node.

All

H20

sos_report

-V

Collects the SOS report for the host.

All Compute Trays

H21

nvos_cli_dumps

-V

Collects the CLI dumps from the NVOS system.

The NVIDIA NVSwitch™ Tray

Note

* Requires nvflash to be in the /bin directory and to be added to PATH. The NVIDIA kernel drivers (nvidia_uvm, nvidia_drm, nvidia_modeset, nvidia_peermem, nvidia, and nvidia_fs) should also be unloaded before you run the log collector.

-V - Collects logs with greater verbosity.
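
The prerequisites for node_nvflash_log can be checked on the host before a run. The sketch below is one possible pre-flight check and is not part of nvdebug; the function names are illustrative. It relies on the standard Linux /proc/modules format, in which the module name is the first field on each line.

```python
import shutil
from typing import List, Set

# Kernel modules that the note says should be unloaded before collection.
NVIDIA_MODULES = (
    "nvidia_uvm", "nvidia_drm", "nvidia_modeset",
    "nvidia_peermem", "nvidia", "nvidia_fs",
)


def loaded_modules(proc_modules_text: str) -> Set[str]:
    """Return the module names listed in the content of /proc/modules."""
    return {line.split()[0] for line in proc_modules_text.splitlines() if line.strip()}


def blocking_modules(proc_modules_text: str) -> List[str]:
    """Return the NVIDIA modules from the note that are still loaded."""
    loaded = loaded_modules(proc_modules_text)
    return [name for name in NVIDIA_MODULES if name in loaded]


def nvflash_on_path() -> bool:
    """Report whether an nvflash executable can be found through PATH."""
    return shutil.which("nvflash") is not None


# Example input in /proc/modules format; on a real host, read the file instead.
sample = (
    "nvidia_uvm 1540096 0 - Live 0x0000000000000000\n"
    "nvidia 56823808 2 nvidia_uvm, Live 0x0000000000000000\n"
    "ext4 1 1 - Live 0x0000000000000000\n"
)
still_loaded = blocking_modules(sample)
```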

Table 9 Logs Collected Using SSH to the Host BMC#

CID

Collector Name

Description

Platforms

S1

bmc_status

The OpenBMC firmware status, which applies only to the Host BMC that is running the OpenBMC firmware.

All**

S2

bmc_dmesg

The BMC OS kernel dmesg information only through the Host BMC.

All

S3

network_info

The BMC network and routing settings only through the Host BMC.

All

S4

openbmc_stack_info

The OpenBMC OS and firmware stack information, which applies only to the Host BMC that is running OpenBMC firmware.

All**

S5

bmc_list_kernel_modules

The BMC OS Kernel loadable module information only through the Host BMC.

All

S6

openbmc_pldm_journal_log

The OpenBMC PLDM logs apply only to the Host BMC that is running the OpenBMC firmware.

All**

S7

i2c_device_list

The Host BMC i2c device list and its information only through the Host BMC.

All

S8

bmc_mem_cpu_utilization

The Host BMC Memory and CPU usage only through the Host BMC.

All

S9

openbmc_boot_status

The OpenBMC boot status, which applies only to the Host BMC that is running OpenBMC firmware.

All**

S10

var_log_dir_zip

The /var/log directory zip from the Host BMC.

All

S11

uptime

Gets the uptime through the Host BMC.

All

S12

fpga_register_table

The FPGA register table contents.

DGX

S13

hmc_boot_status

HMC Boot Status, which applies only to the platforms with the HMC.

DGX

S14

hmc_boot_progress

HMC Boot progress status, which applies only to the platforms with the HMC.

All***

S15

bmc_power_status

Gets the power fault register data from the FPGA through the Host BMC, only for platforms with an NVIDIA FPGA.

All

S16

virtual_eeprom_data

Retrieves the telemetry data from the virtual EEPROM by using I2C commands.

All

S17

smbus_power_temperature_telemetry

Collects SMBus temperature, power, and staleness data.

All

S18

smbpbi_system_status

Collects the information for the SMBPBI telemetry, firmware, hardware status, PCIe link/error counts, fencing, and so on.

All

Note

** Only OpenBMC firmware-based platforms (same as Table 6).

*** Only platforms with an HMC (same as Table 6).

Table 10 System Health Check#

CID

Collector Name

Description

C1

out_of_band_health_check

Checks the health of various components and presence of mandatory fields using Redfish and IPMI.

Appendix B: The JSON Files#

This section provides information about the tool’s vendor JSON option.

Optional: DebugLog JSON-Vendor or User-Defined APIs and Proprietary Tools#

This option allows OEM/ODM partners to use nvdebug to add the required support by using proprietary tools and OEM APIs:

  • Vendors (OEM/ODM) need to copy the proprietary tools and the OEM-specific vendor_defined.json file to the directory in which nvdebug is located.

  • The vendor_defined.json file should contain the list of commands with the required parameters to run the tools/APIs from the command line.

Here is an example:

{
  "LogCollectorXYZ": "./xyz -u <user-name> -i <bmc-ip> ...",
  "LogCollectorABC": "abctool <param1> <param2> ..."
}
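
Because a malformed file is easy to produce by hand, it can help to validate the vendor JSON before passing it to nvdebug. The following is a minimal sketch of such a check, assuming the file is a flat JSON object that maps each collector name to one command string, as in the example above. The function name and the sample values are illustrative, and nvdebug itself may parse the file differently.

```python
import json
import shlex
from typing import Dict, List


def load_vendor_collectors(text: str) -> Dict[str, List[str]]:
    """Parse vendor JSON text into a mapping of collector name to argument list."""
    data = json.loads(text)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object mapping collector names to commands")
    collectors: Dict[str, List[str]] = {}
    for name, command in data.items():
        if not isinstance(command, str) or not command.strip():
            raise ValueError(f"collector {name!r} must map to a non-empty command string")
        # Split the command line the way a POSIX shell would.
        collectors[name] = shlex.split(command)
    return collectors


# Hypothetical file content with the placeholders already filled in.
sample_json = """
{
  "LogCollectorXYZ": "./xyz -u admin -i 192.0.2.10",
  "LogCollectorABC": "abctool value1 value2"
}
"""
collectors = load_vendor_collectors(sample_json)
```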

Appendix C: Estimated Output Size#

The output zip file that contains the logs can vary in size, and the estimates by platform are listed in Table 11.

Table 11 Estimated nvdebug Output Size#

Platform

Size of the Zip

Size of the Unzipped Logs

Run time

x86_64

L1 - 7-10MB, L2 - 60-70MB

L1 - 20-25MB, L2 - 200-210MB

10-15 minutes

arm64

L1 - 7-10MB, L2 - 60-70MB

L1 - 20-25MB, L2 - 200-210MB

10-15 minutes

HGX-HMC

L1 - 14-18MB, L2 - 120-130MB

L1 - 70-80MB, L2 - 280-290MB

15-20 minutes

DGX

L1 - 10-12MB, L2 - 68-70MB

L1 - 17-18MB, L2 - 68-70MB

10-15 minutes

NVSwitch

L1 - 120-130MB, L2 - 170-180MB

L1 - 160-165MB, L2 - 170-180MB

7-10 minutes

Note

The host-based node_dmesg and node_kern_log logs can be arbitrarily large, depending on how long the host has been running. This can drastically increase the overall nvdebug output size.
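
The estimates in Table 11 can be turned into a free-space check before a run. The sketch below copies the upper bounds from the table and adds the zip and unzipped sizes together. The safety factor is an assumption intended to leave headroom for large host logs, and the function names are illustrative.

```python
import shutil
from typing import Dict, Tuple

# Upper bounds in MB copied from Table 11: (zip size, unzipped size).
ESTIMATED_MAX_MB: Dict[str, Dict[str, Tuple[int, int]]] = {
    "x86_64": {"L1": (10, 25), "L2": (70, 210)},
    "arm64": {"L1": (10, 25), "L2": (70, 210)},
    "HGX-HMC": {"L1": (18, 80), "L2": (130, 290)},
    "DGX": {"L1": (12, 18), "L2": (70, 70)},
    "NVSwitch": {"L1": (130, 165), "L2": (180, 180)},
}


def required_mb(platform: str, level: str, safety_factor: float = 2.0) -> float:
    """Estimate the space needed for the zip plus the unzipped logs, in MB."""
    zip_mb, unzipped_mb = ESTIMATED_MAX_MB[platform][level]
    # safety_factor is an assumed margin, not a value from the table.
    return (zip_mb + unzipped_mb) * safety_factor


def has_enough_space(path: str, platform: str, level: str, safety_factor: float = 2.0) -> bool:
    """Compare the free space at path with the estimated requirement."""
    free_mb = shutil.disk_usage(path).free / (1024 * 1024)
    return free_mb >= required_mb(platform, level, safety_factor)
```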