Appendix#

Appendix A: Use Cases, IPMI, SSH APIs, and SSH Commands#

This section summarizes the instructions, APIs, and commands used to gather debug information and logs for troubleshooting. It also provides information about the northbound APIs, commands, and standard tools.

Table 5 Use Cases#
Use Case	API Call	Details
Default (all known platforms)	nvdebug -i <BMCIP> -u <username> -p <password> -t <platform>	Runs common and platform-specific log collectors using the built-in common and platform-specific JSON files. Runs on all known platforms, including NVIDIA HGX-H100x8, MGX, and others, that support standard Redfish and IPMI commands.
Common only on any platform	nvdebug -i <BMCIP> -u <username> -p <password> -t <platform> -c	Runs only common log collectors using the built-in common JSON file. Runs on any platform that has Redfish and IPMI standard command support.
nvdebug tool as a service	nvdebug -i <BMCIP> -u <username> -p <password> -t <platform> -j <vendor_custom.json>	Collects all logs using user/vendor-defined JSON, functions, and proprietary tools. This option allows vendors and nvdebug users to extend the log collection based on OEM-specific implementation and using proprietary tools.
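
The three invocations in Table 5 differ only in the trailing `-c` and `-j` options, so a wrapper script can assemble them from one helper. The following is a minimal sketch, not part of nvdebug: the function name `build_nvdebug_command` and the sample values are hypothetical, and only the `-i`, `-u`, `-p`, `-t`, `-c`, and `-j` options shown in the table are used.

```python
import shlex


def build_nvdebug_command(bmc_ip, username, password, platform,
                          common_only=False, vendor_json=None):
    """Return an argument list for one of the nvdebug use cases in Table 5.

    Hypothetical helper: it only assembles the options documented in the
    table and does not run or validate anything.
    """
    argv = ["nvdebug", "-i", bmc_ip, "-u", username, "-p", password,
            "-t", platform]
    if common_only:
        # "Common only on any platform" use case.
        argv.append("-c")
    if vendor_json is not None:
        # "nvdebug tool as a service" use case with a vendor-defined JSON file.
        argv += ["-j", vendor_json]
    return argv


if __name__ == "__main__":
    # Placeholder values; substitute the real BMC address, credentials,
    # and platform name for your system.
    cmd = build_nvdebug_command("192.0.2.10", "admin", "secret",
                                "<platform>", common_only=True)
    # shlex.join quotes each argument so the line can be pasted into a shell.
    print(shlex.join(cmd))
```

Returning a list rather than a string lets the caller pass the result directly to `subprocess.run` without shell quoting concerns.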

Table 6 Logs Collected Using the Redfish APIs#
CID	Collector Name	Collection Level	Description	Platforms
R1	system_event_log		EventLogs from the Redfish computer system collection from the Host BMC.	All*
R2	manager_existing_log_dump		Existing log dumps that were created by the HMC/Host BMC or were initiated by a previous user from the BMC (the HMC and/or the Host BMC).	All*
R3	hgx_manager_on_demand_log_dump		User-initiated on-demand log dumps created with a new request from the HMC.	The NVIDIA HGX-Baseboard
R4	manager_journal_log		Journal logs from a BMC (the Host BMC or an HMC) that is running OpenBMC.	All**
R5	manager_fpga_register_dump		FPGA register dump of the NVIDIA HGX-Hopper/Blackwell x8 GPU baseboards through the HMC or Grace-based platforms.	All*
R6	manager_erot_dump		IRoT/ERoT debug log dump of the NVIDIA HGX-Hopper/Blackwell x8 GPU baseboards through the HMC or Grace-based platforms.	All*
R7	hgx_manager_self_test_report		Self-test logs collected on demand. Users initiate the self-test of the NVIDIA HGX-Hopper/Blackwell x8 GPU baseboards through the HMC.	The NVIDIA HGX-Baseboard
R8	firmware_inventory		Base firmware inventory from the HMC and/or the Host BMC.	All
R9	firmware_inventory_expand_query		Expanded firmware inventory from the HMC and/or the Host BMC.	All
R10	chassis_info		Base Redfish chassis collection from the HMC and/or the Host BMC.	All
R11	chassis_expand_query		Expanded Redfish chassis collection from the HMC and/or the Host BMC.	All
R12	system_info		Base Redfish computer system collection from the HMC and/or the Host BMC.	All
R13	system_expand_query		Expanded Redfish computer system collection from the HMC and/or the Host BMC.	All
R14	manager_info		Base Redfish manager collection of the HMC and/or the Host BMC.	All
R15	manager_expand_query		Expanded Redfish manager collection of the HMC and/or the Host BMC.	All
R16	hgx_manager_retimer_dump	-VV	Retimer dump of the NVIDIA HGX-Hopper/Blackwell x8 GPU baseboards using the HMC.	NVIDIA HGX-Baseboard
R17	dgx_manager_oem_log_dump	-VV	OEM log dump of DGX systems using the BMC.	The NVIDIA DGX
R18	telemetry_metric_reports		Telemetry metric reports from the HMC and/or the Host BMC telemetry service.	All
R19	chassis_thermal_metrics		Thermal metrics from the HMC and/or the Host BMC.	All
R20	firmware_inventory_table		Gets specific fields of the firmware inventory as listed in the config file (under `FW_INVENTORY_TABLE_PROPERTIES`) for all components and tabulates the fields.	All
R21	system_cper_logs		Collects the CPER logs for the system.	All arm64 platforms
R22	task_details		Collects details of all tasks in Redfish Task Service.	All
R23	nvlink_oob_logs	-VV	Collects OOB logs related to NVLink. By default, collects System Event logs. In the config file, URIs can be specified under `NVLINK_OOB_URI`.	All
R24	hgx_system_fw_attributes_dump	-VV	Firmware attribute dump of the NVIDIA baseboard through the HMC and/or the Host BMC.	All***
R25	additional_oob_logs	-VV	Collects user-defined OOB logs from the BMC/HMC. In the config file, URIs can be specified under `ADDITIONAL_OOB_URI_COLLECTION`.	All
R26	chassis_certificates	-VV	Collects the certificates from each iRoT/ERoT component on the chassis.	All
R27	spdm_erot_measurements	-VV	Collects SPDM measurements of iRoT/ERoT at indices 26 and 50.	NVIDIA HGX-Baseboard
R28	hgx_system_hardware_checkout_dump	-VV	Hardware checkout dump of the NVIDIA baseboard using the BMC/HMC.	All***
R29	background_copy_status	-VV	Checks the Root-of-Trust background copy status.	All
R30	software_inventory	-VV	Collects the software inventory from Redfish Update Service.	All
R31	hmc_fdr_log_dump	-VV	Collects the FDR log dump from the HMC.	NVIDIA HGX-Baseboard
R32	system_post_codes	-VV	Collects system POST codes.	All
R33	power_equipment_info	-VV	Collects the power equipment information.	PowerShelf
R34	power_equipment_expand_query	-VV	Collects the expanded power equipment information.	PowerShelf
R35	network_device_debug_dump	-VV	Collects the debug dump from the network devices.	NVIDIA HGX-Baseboard
R36	network_switch_debug_dump	-VV	Collects the debug dump from the network switches.	NVIDIA HGX-Baseboard
R37	gpu_debug_dump	-VV	Collects the debug dump from GPUs.	NVIDIA HGX-Baseboard
R38	gpu_diagnostic_dump	-VV	Collects the diagnostic dump from GPUs.	NVIDIA HGX-Baseboard
R39	sma_debug_dump	-VV	Collects the debug dump from System Management Agent (SMA) devices.	NVIDIA HGX-Baseboard
R40	custom_dump_service	-VV	Collects the dump from the custom URI and the payloads.	NVIDIA HGX-Baseboard
R41	chassis_thermal_subsystem_leak_detection	-VV	Collects the thermal subsystem leak detection data for the chassis.	NVIDIA HGX-Baseboard

Note

* - All platforms except NVIDIA DGX™.

** - Only OpenBMC firmware-based platforms.

*** - Only platforms with an HMC.

-VV - Collects optional logs that take longer to collect.

Table 7 Logs Collected Using the IPMI-Over-LAN#
CID	Collector Name	Description	Platforms
I1	mc_info	BMC information.	All
I2	lan_info	BMC network setting information.	All
I3	session_info	BMC session details.	All
I4	fru_info	Server component FRU information using the BMC.	All
I5	sdr_info	Basic sensor data records information.	All
I6	sel_info	System event log information.	All
I7	sensor_list	Sensor list information using the BMC.	All
I8	sel_list	SEL logs in text.	All
I9	sel_raw_dump	SEL logs in hex.	All
I10	chassis_status	Server chassis status and information.	All
I11	chassis_restart_cause	System Restart Cause stored by the BMC.	All
I12	user_list	IPMI user list.	All
I13	channel_info	IPMI channel information.	All
I14	sdr_elist	List of Sensor Data Records.	All

Table 8 Collected Server Host Logs#
CID	Collector Name	Collection Level	Description	Platforms
H1	node_dmesg	-V	dmesg from the host node.	All
H2	node_lspci		lspci details from the host node.	All
H3	node_smbios		General SMBIOS and SMBIOS type 9 details from the host node.	All
H4	node_lshw		Hardware and network details using lshw from the host node.	All Compute Trays
H5	node_nvidia_smi	-V	GPU information using nvidia-smi (if present). Not applicable for compute trays without GPUs (for example, MGX C2).	All Compute Trays
H6	node_kern_log		Kernel information and events on the node.	All
H7	node_crash_dump		Kernel crash dumps on the node.	All
H8	node_nvme_list		List of NVMe SSDs (if present).	All Compute Trays
H9	node_fabric_manager_log		Node fabric manager log (if present).	All
H10	node_nvflash_log*		Logs from the nvflash command (if present). Not applicable for compute trays without GPUs (for example, MGX C2).	All Compute Trays
H11	nvidia_bug_report		Runs the nvidia-bug-report.sh script on the host with the `--safe-mode` option and collects the output. If `EXTRA_LOG_COLLECTION` is True, also runs it with `--extra-system-data`.	All Compute Trays
H12	nvos_inventory		(NVOS Only) Gets the platform software, firmware, and hardware inventory.	The NVIDIA NVSwitch™ Tray
H13	nvos_gnmi_config		(NVOS Only) Gets the gNMI status and configuration.	The NVIDIA NVSwitch™ Tray
H14	nvos_tech_support_dump		(NVOS Only) Generates and collects the tech-support dump on the system.	The NVIDIA NVSwitch™ Tray
H15	node_subnet_manager		Collects the subnet manager information from the host node.	All
H16	one_diag_dump		Collects the diagnostic dump information.	All
H17	node_nvme_log_dump		Collects the NVMe log dumps from the host node.	All
H18	node_os_info		Collects the OS information from the host node.	All
H19	node_memory_info	-V	Collects the memory information from the host node.	All
H20	sos_report	-V	Collects the SOS report for the host.	All Compute Trays
H21	nvos_cli_dumps	-V	Collects the CLI dumps from the NVOS system.	The NVIDIA NVSwitch™ Tray

Note

* Requires nvflash to be in the /bin directory and to be added to PATH. The NVIDIA kernel drivers (nvidia_uvm, nvidia_drm, nvidia_modeset, nvidia_peermem, nvidia, and nvidia_fs) should also be unloaded before you run the log collector.

-V - Collects logs with greater verbosity.

Table 9 Logs Collected Using SSH to the Host BMC#
CID	Collector Name	Description	Platforms
S1	bmc_status	The OpenBMC firmware status, which applies only to the Host BMC that is running the OpenBMC firmware.	All**
S2	bmc_dmesg	The BMC OS kernel dmesg information only through the Host BMC.	All
S3	network_info	The BMC network and routing settings only through the Host BMC.	All
S4	openbmc_stack_info	The OpenBMC OS and firmware stack information, which applies only to the Host BMC that is running OpenBMC firmware.	All**
S5	bmc_list_kernel_modules	The BMC OS Kernel loadable module information only through the Host BMC.	All
S6	openbmc_pldm_journal_log	The OpenBMC PLDM logs apply only to the Host BMC that is running the OpenBMC firmware.	All**
S7	i2c_device_list	The Host BMC i2c device list and its information only through the Host BMC.	All
S8	bmc_mem_cpu_utilization	The Host BMC Memory and CPU usage only through the Host BMC.	All
S9	openbmc_boot_status	The OpenBMC boot status, which applies only to the Host BMC that is running OpenBMC firmware.	All**
S10	var_log_dir_zip	The `/var/log` directory zip from the Host BMC.	All
S11	uptime	Gets the uptime through the Host BMC.	All
S12	fpga_register_table	The FPGA register table contents.	DGX
S13	hmc_boot_status	HMC Boot Status, which applies only to the platforms with the HMC.	DGX
S14	hmc_boot_progress	HMC Boot progress status, which applies only to the platforms with the HMC.	All***
S15	bmc_power_status	Gets the power fault register data from the FPGA through the Host BMC, only for platforms with an NVIDIA FPGA.	All
S16	virtual_eeprom_data	Retrieves the telemetry data from the virtual EEPROM by using I2C commands.	All
S17	smbus_power_temperature_telemetry	Collects SMBus temperature, power, and staleness data.	All
S18	smbpbi_system_status	Collects the information for the SMBPBI telemetry, firmware, hardware status, PCIe link/error counts, fencing, and so on.	All

Note

** Only OpenBMC firmware-based platforms (same as Table 6).

*** Only platforms with an HMC (same as Table 6).

Table 10 System Health Check#
CID	Collector Name	Description
C1	out_of_band_health_check	Checks the health of various components and presence of mandatory fields using Redfish and IPMI.

Appendix B: The JSON Files#

This section provides information about the tool’s vendor JSON option.

Optional: DebugLog JSON-Vendor or User-Defined APIs and Proprietary Tools#

This option allows OEM/ODM partners to use nvdebug to add the required support by using proprietary tools and OEM APIs:

Vendors (OEM/ODM) need to copy the proprietary tools and the OEM-specific vendor_defined.json file to the directory in which nvdebug is located.
The vendor_defined.json file should contain the list of commands with the required parameters to run the tools/APIs from the command line.

Here is an example:

{

"LogCollectorXYZ": "./xyz -u <user-name> -i <bmc-ip> ...",

"LogCollectorABC": "abctool <param1> <param2> ..."

}
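
Because a malformed vendor file (for example, one with unbalanced quotes) only surfaces when the tool is run, it can help to check the file first. The sketch below assumes the file is a flat JSON object that maps each collector name to a single command string, as the example above suggests; the schema that nvdebug actually accepts may be broader, and `load_vendor_collectors` is a hypothetical helper, not an nvdebug API.

```python
import json


def load_vendor_collectors(text):
    """Parse vendor JSON text and return a {collector_name: command} dict.

    Assumption: the file is a flat object of name -> command string.
    Raises ValueError if the text is not valid JSON or has another shape
    (json.JSONDecodeError is itself a subclass of ValueError).
    """
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("top-level JSON value must be an object")
    for name, command in data.items():
        if not isinstance(command, str) or not command.strip():
            raise ValueError(
                f"collector {name!r} must map to a non-empty command string")
    return data


if __name__ == "__main__":
    sample = """
    {
      "LogCollectorXYZ": "./xyz -u <user-name> -i <bmc-ip> ...",
      "LogCollectorABC": "abctool <param1> <param2> ..."
    }
    """
    for name, command in load_vendor_collectors(sample).items():
        print(f"{name}: {command}")
```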

Appendix C: Estimated Output Size#

The output zip file that contains the logs can vary in size, and the estimates by platform are listed in Table 11.

Table 11 Estimated nvdebug Output Size#
Platform	Size of the Zip	Size of the Unzipped Logs	Run time
x86_64	L1 - 7-10MB, L2 - 60-70MB	L1 - 20-25MB, L2 - 200-210MB	10-15 minutes
arm64	L1 - 7-10MB, L2 - 60-70MB	L1 - 20-25MB, L2 - 200-210MB	10-15 minutes
HGX-HMC	L1 - 14-18MB, L2 - 120-130MB	L1 - 70-80MB, L2 - 280-290MB	15-20 minutes
DGX	L1 - 10-12MB, L2 - 68-70MB	L1 - 17-18MB, L2 - 68-70MB	10-15 minutes
NVSwitch	L1 - 120-130MB, L2 - 170-180MB	L1 - 160-165MB, L2 - 170-180MB	7-10 minutes

Note

The host-based node_dmesg and node_kern_log collectors can produce arbitrarily large logs, depending on how long the host has been up and running. This can drastically increase the overall nvdebug output size.
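
The estimates in Table 11 can be turned into a rough free-space check before a collection run. The following is a sketch under stated assumptions: it uses the upper L2 bounds from the table, the dictionary keys are the table's row labels (not necessarily valid `-t` values), and the `headroom` multiplier is an arbitrary illustrative allowance for large host logs, not a documented recommendation.

```python
import shutil

# Upper bounds of the L2 estimates from Table 11, in MB.
ZIPPED_L2_MB = {"x86_64": 70, "arm64": 70, "HGX-HMC": 130,
                "DGX": 70, "NVSwitch": 180}
UNZIPPED_L2_MB = {"x86_64": 210, "arm64": 210, "HGX-HMC": 290,
                  "DGX": 70, "NVSwitch": 180}


def required_bytes(platform, headroom=2.0):
    """Estimate the bytes needed to hold both the zip and the unzipped logs.

    headroom is a hypothetical safety factor for oversized dmesg/kernel logs.
    """
    total_mb = ZIPPED_L2_MB[platform] + UNZIPPED_L2_MB[platform]
    return int(total_mb * headroom * 1024 * 1024)


def has_enough_space(path, platform, headroom=2.0):
    """Return True if the filesystem holding path has the estimated free space."""
    return shutil.disk_usage(path).free >= required_bytes(platform, headroom)


if __name__ == "__main__":
    for name in ZIPPED_L2_MB:
        verdict = "ok" if has_enough_space(".", name) else "low on space"
        print(f"{name}: needs ~{required_bytes(name) // (1024 * 1024)} MB, {verdict}")
```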