Collector Reference

This section provides a comprehensive reference for all NVDebug collectors organized by collection type.

Collection Levels

NVDebug supports the following collection levels to control the scope of log collection:

  • L1 (Default): All necessary collectors. Always included.

  • L2 (-V): All L1 collectors, plus the collectors at the increased log collection level.

  • L3 (-VV): All L1 and L2 collectors, plus additional collectors that can take a very long time (potentially hours) to run.
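
The level rules above can be sketched in a few lines of Python. This is only an illustration of the cumulative behavior: the function names and the `COLLECTOR_MIN_LEVEL` sample are hypothetical, not NVDebug internals, and the sample values are copied from the collector tables in this reference.

```python
import argparse

# Minimum collection level for a few collectors from this reference (sample only).
COLLECTOR_MIN_LEVEL = {"R1": 1, "H1": 2, "H7": 3}

# R1 (System Event Log) entry limits per level, as listed in the Redfish table.
R1_ENTRY_LIMITS = {1: 10_000, 2: 15_000, 3: 30_000}


def parse_level(argv: list[str]) -> int:
    """Map the number of -V flags to a level: none -> L1, -V -> L2, -VV -> L3."""
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("-V", dest="verbosity", action="count", default=0)
    args, _unknown = parser.parse_known_args(argv)
    return min(args.verbosity, 2) + 1


def selected_collectors(level: int) -> list[str]:
    """Levels are cumulative: a collector runs if its minimum level <= chosen level."""
    return sorted(cid for cid, low in COLLECTOR_MIN_LEVEL.items() if low <= level)
```

For example, `selected_collectors(parse_level(["-V"]))` includes R1 and H1 but not H7, because H7 is an L3 collector.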

Redfish Collectors (R-Series)

Redfish collectors gather data from BMCs using the Redfish REST API.

Table 6 Redfish Collectors

| CID | Name | Collection Level | Description | Platforms |
| --- | --- | --- | --- | --- |
| R1 | System Event Log | L1 (Default) | EventLogs from the Redfish computer system collection from the Host BMC; L1 - 10k entries (default); L2 - 15k entries (-V); L3 - 30k entries (-VV) | All (except DGX) |
| R2 | Manager Existing Log Dump | L1 (Default) | Existing log dumps that were created by the HMC/Host BMC or were initiated by a previous user from the BMC | All (except DGX) |
| R3 | HGX Manager On Demand Log Dump | L1 (Default) | User-initiated on-demand log dumps, created by a new request to the HMC | HGX-HMC, arm64, x86_64, DGX (specific baseboards) |
| R4 | Manager Journal Log | L1 (Default) | Journal logs from a BMC (such as the Host BMC or an HMC) that is running OpenBMC | HGX-HMC, arm64, x86_64 (specific baseboards) |
| R5 | Manager FPGA Register Dump | L1 (Default) | FPGA register dump of the NVIDIA HGX-Hopper/Blackwell x8 GPU baseboards through the HMC or Grace-based platforms | HGX-HMC, arm64, x86_64 (specific baseboards) |
| R6 | Manager ERoT Dump | L1 (Default) | IRoT/ERoT debug log dump of the NVIDIA HGX-Hopper/Blackwell x8 GPU baseboards through the HMC or Grace-based platforms | HGX-HMC, arm64, x86_64 (specific baseboards) |
| R7 | HGX Manager Self Test Report | L1 (Default) | Self-test logs collected on demand. Users initiate the self-test of the NVIDIA HGX-Hopper/Blackwell x8 GPU baseboards through the HMC | HGX-HMC (specific baseboards) |
| R8 | Firmware Inventory | L1 (Default) | Base firmware inventory from the HMC and/or the Host BMC | All |
| R9 | Firmware Inventory Expand Query | L1 (Default) | Expanded firmware inventory from the HMC and/or the Host BMC | All |
| R10 | Chassis Info | L1 (Default) | Base Redfish chassis collection from the HMC and/or the Host BMC | All |
| R11 | Chassis Expand Query | L1 (Default) | Expanded Redfish chassis collection from the HMC and/or the Host BMC | All |
| R12 | System Info | L1 (Default) | Base Redfish computer system collection from the HMC and/or the Host BMC | All (except PowerShelf) |
| R13 | System Expand Query | L1 (Default) | Expanded Redfish computer system collection from the HMC and/or the Host BMC | All (except PowerShelf) |
| R14 | Manager Info | L1 (Default) | Base Redfish manager collection of the HMC and/or the Host BMC | All |
| R15 | Manager Expand Query | L1 (Default) | Expanded Redfish manager collection of the HMC and/or the Host BMC | All |
| R16 | HGX Manager Retimer Dump | L3 (-VV) | Retimer dump of the NVIDIA HGX-Hopper/Blackwell x8 GPU baseboards using the HMC | HGX-HMC, arm64, x86_64 (specific baseboards) |
| R17 | DGX Manager OEM Log Dump | L3 (-VV) | OEM log dump of DGX systems using the BMC | DGX |
| R18 | Telemetry Metric Reports | L1 (Default) | Telemetry metric reports from the HMC and/or the Host BMC telemetry service | All (except NVSwitch, PowerShelf) |
| R19 | Chassis Thermal Metrics | L1 (Default) | Thermal metrics from the HMC and/or the Host BMC | All |
| R20 | Firmware Inventory Table | L1 (Default) | Gets specific fields of the firmware inventory as listed in the config file (under FW_INVENTORY_TABLE_PROPERTIES) for all components and tabulates the fields | All |
| R21 | System CPER Logs | L1 (Default) | Collects the CPER logs for the system | arm64 platforms |
| R22 | Task Details | L1 (Default) | Collects details of all tasks in the Redfish Task Service | All |
| R23 | NVLink OOB Logs | L3 (-VV) | Collects OOB logs related to NVLink. By default, collects System Event logs. In the config file, URIs can be specified under NVLINK_OOB_URI | All (except NVSwitch, PowerShelf) |
| R24 | HGX System FW Attributes Dump | L3 (-VV) | Firmware attribute dump of the NVIDIA baseboard through the HMC and/or the Host BMC | HGX-HMC, arm64, x86_64 |
| R25 | Additional OOB Logs | L3 (-VV) | Collects user-defined OOB logs from the BMC/HMC. In the config file, URIs can be specified under ADDITIONAL_OOB_URI_COLLECTION | All |
| R26 | Chassis Certificates | L3 (-VV) | Collects the certificates from each iRoT/ERoT component on the chassis | All |
| R27 | SPDM ERoT Measurements | L3 (-VV) | Collects SPDM measurements of iRoT/ERoT at indices 26 and 50 | HGX-HMC, arm64, x86_64 (specific baseboards) |
| R28 | HGX System Hardware Checkout Dump | L3 (-VV) | Hardware checkout dump of the NVIDIA baseboard using the BMC/HMC | HGX-HMC, arm64, x86_64 |
| R29 | Background Copy Status | L3 (-VV) | Checks the Root-of-Trust background copy status | All (except DGX, PowerShelf) |
| R30 | Software Inventory | L3 (-VV) | Collects the software inventory from the Redfish Update Service | All (except PowerShelf) |
| R31 | HMC FDR Log Dump | L3 (-VV) | Collects the FDR log dump from the HMC | HGX-HMC, arm64, x86_64 (specific baseboards) |
| R32 | System POST Codes | L3 (-VV) | Collects system POST codes | All (except NVSwitch, PowerShelf, HGX-HMC) |
| R33 | Power Equipment Info | L1 (Default) | Collects the power equipment information | PowerShelf |
| R34 | Power Equipment Expand Query | L1 (Default) | Collects the expanded power equipment information | PowerShelf |
| R35 | Network Device Debug Dump | L3 (-VV) | Collects the debug dump from the network devices | HGX-HMC, arm64, x86_64 (specific baseboards) |
| R36 | Network Switch Debug Dump | L3 (-VV) | Collects the debug dump from the network switches | HGX-HMC, arm64, x86_64 (specific baseboards) |
| R37 | GPU Debug Dump | L3 (-VV) | Collects the debug dump from GPUs | HGX-HMC, arm64, x86_64 (specific baseboards) |
| R38 | GPU Diagnostic Dump | L3 (-VV) | Collects the diagnostic dump from GPUs | HGX-HMC, arm64, x86_64 (specific baseboards) |
| R39 | SMA Debug Dump | L3 (-VV) | Collects the debug dump from System Management Agent (SMA) devices | HGX-HMC, arm64, x86_64 (specific baseboards) |
| R40 | Custom Dump Service | L3 (-VV) | Collects the dump from the custom URI and the payloads | All |
| R41 | Chassis Thermal Subsystem Leak Detection | L3 (-VV) | Collects the thermal subsystem leak detection data for the chassis | HGX-HMC, arm64, x86_64 (specific baseboards) |
| R42 | NSM Dump | L1 (Default) | Collects NSM (Network System Manager) dump information | HGX-HMC, arm64, x86_64 (specific baseboards) |
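
To make the difference between the base and "Expand Query" collectors (for example R10 and R11) more concrete, the sketch below builds request URLs for a few standard Redfish collections. The collection paths and the `$expand` query parameter come from the DMTF Redfish specification. Whether NVDebug issues exactly these requests is an assumption, and `collection_url` and the example host name are hypothetical.

```python
# Standard DMTF Redfish collection paths (not NVDebug-specific).
REDFISH_COLLECTIONS = {
    "firmware_inventory": "/redfish/v1/UpdateService/FirmwareInventory",
    "chassis": "/redfish/v1/Chassis",
    "systems": "/redfish/v1/Systems",
    "managers": "/redfish/v1/Managers",
}


def collection_url(bmc_host: str, name: str, expand: bool = False, levels: int = 1) -> str:
    """Build the URL for a Redfish collection, optionally with a $expand query.

    Without expand, the BMC returns only links to the collection members.
    With expand, member resources are inlined up to `levels` levels deep.
    """
    url = f"https://{bmc_host}{REDFISH_COLLECTIONS[name]}"
    if expand:
        url += f"?$expand=*($levels={levels})"
    return url
```

A base query such as `collection_url("bmc.example.com", "chassis")` returns member links only, while `collection_url("bmc.example.com", "chassis", expand=True)` asks the service to inline each chassis resource in a single response.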

IPMI Collectors (I-Series)

IPMI collectors gather data using the Intelligent Platform Management Interface (IPMI).

Table 7 IPMI Collectors

| CID | Name | Collection Level | Description | Platforms |
| --- | --- | --- | --- | --- |
| I1 | MC Info | L1 (Default) | BMC information | All (except NVSwitch, PowerShelf) |
| I2 | LAN Info | L1 (Default) | BMC network setting information | All (except NVSwitch, PowerShelf) |
| I3 | Session Info | L1 (Default) | BMC session details | All (except NVSwitch, PowerShelf) |
| I4 | FRU Info | L1 (Default) | Server component FRU information using the BMC | All (except NVSwitch, PowerShelf) |
| I5 | SDR Info | L1 (Default) | Basic sensor data records information | All (except NVSwitch, PowerShelf) |
| I6 | SEL Info | L1 (Default) | System event log information | All (except NVSwitch, PowerShelf) |
| I7 | Sensor List | L1 (Default) | Sensor list information using the BMC | All (except NVSwitch, PowerShelf) |
| I8 | SEL List | L1 (Default) | SEL logs in text | All (except NVSwitch, PowerShelf) |
| I9 | SEL Raw Dump | L1 (Default) | SEL logs in hex | All (except NVSwitch, PowerShelf) |
| I10 | Chassis Status | L1 (Default) | Server chassis status and information | All (except NVSwitch, PowerShelf) |
| I11 | Chassis Restart Cause | L1 (Default) | System Restart Cause stored by the BMC | All (except NVSwitch, PowerShelf) |
| I12 | User List | L1 (Default) | IPMI user list | All (except NVSwitch, PowerShelf) |
| I13 | Channel Info | L1 (Default) | IPMI channel information | All (except NVSwitch, PowerShelf) |
| I14 | SDR Elist | L1 (Default) | List of Sensor Data Records | All (except NVSwitch, PowerShelf) |
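
Most of the collector names in this table line up with subcommands of the widely used `ipmitool` CLI. The sketch below builds (but does not run) the corresponding command lines, which can help when reproducing a single collector by hand. The mapping is an assumption based on the collector names, not a statement of how NVDebug invokes IPMI; it covers a subset of the table, and `ipmitool_argv` is a hypothetical helper.

```python
# Assumed mapping from collector ID to ipmitool subcommand (subset of Table 7).
IPMITOOL_SUBCOMMANDS = {
    "I1": ["mc", "info"],
    "I2": ["lan", "print"],
    "I4": ["fru", "print"],
    "I5": ["sdr", "info"],
    "I6": ["sel", "info"],
    "I7": ["sensor", "list"],
    "I8": ["sel", "list"],
    "I10": ["chassis", "status"],
    "I11": ["chassis", "restart_cause"],
    "I12": ["user", "list"],
    "I13": ["channel", "info"],
    "I14": ["sdr", "elist"],
}


def ipmitool_argv(cid: str, bmc_host: str, user: str) -> list[str]:
    """Return the argv for a remote ipmitool call over the lanplus interface.

    -E makes ipmitool read the password from the IPMI_PASSWORD environment
    variable, so the password never appears on the command line.
    """
    base = ["ipmitool", "-I", "lanplus", "-H", bmc_host, "-U", user, "-E"]
    return base + IPMITOOL_SUBCOMMANDS[cid]
```

The returned list can be passed to `subprocess.run` once `IPMI_PASSWORD` is set in the environment.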

SSH Collectors (S-Series)

SSH collectors gather data by executing commands on the BMC via SSH.

Table 8 SSH Collectors

| CID | Name | Collection Level | Description | Platforms |
| --- | --- | --- | --- | --- |
| S1 | BMC Status | L1 (Default) | The OpenBMC firmware status, which applies only to the Host BMC that is running the OpenBMC firmware | arm64, x86_64, NVSwitch |
| S2 | BMC Dmesg | L1 (Default) | The BMC OS kernel dmesg information, collected only through the Host BMC | All (except HGX-HMC, PowerShelf) |
| S3 | Network Info | L1 (Default) | The BMC network and routing settings, collected only through the Host BMC | All (except HGX-HMC, PowerShelf) |
| S4 | OpenBMC Stack Info | L1 (Default) | The OpenBMC OS and firmware stack information, which applies only to the Host BMC that is running the OpenBMC firmware | arm64, x86_64, NVSwitch |
| S5 | BMC List Kernel Modules | L1 (Default) | The BMC OS kernel loadable module information, collected only through the Host BMC | All (except HGX-HMC, PowerShelf) |
| S6 | OpenBMC PLDM Journal Log | L1 (Default) | The OpenBMC PLDM logs, which apply only to the Host BMC that is running the OpenBMC firmware | arm64, x86_64 |
| S7 | I2C Device List | L1 (Default) | The I2C device list and its information, collected only through the Host BMC | arm64, x86_64, NVSwitch |
| S8 | BMC Mem CPU Utilization | L1 (Default) | The Host BMC memory and CPU usage, collected only through the Host BMC | All (except HGX-HMC, PowerShelf) |
| S9 | OpenBMC Boot Status | L1 (Default) | The OpenBMC boot status, which applies only to the Host BMC that is running the OpenBMC firmware | arm64, x86_64, NVSwitch |
| S10 | Var Log Dir Zip | L1 (Default) | The /var/log directory zip from the Host BMC | arm64, x86_64, NVSwitch |
| S11 | Uptime | L1 (Default) | Gets the uptime through the Host BMC | All (except HGX-HMC, PowerShelf) |
| S12 | FPGA Register Table | L1 (Default) | The FPGA register table contents | DGX |
| S13 | HMC Boot Status | L1 (Default) | The HMC boot status, which applies only to platforms with the HMC | DGX |
| S14 | HMC Boot Progress | L1 (Default) | The HMC boot progress status, which applies only to platforms with the HMC. For Hopper generation platforms, set HMC_I2C_ADDRESS to 0x54; for Blackwell generation platforms, set HMC_I2C_ADDRESS to 0x4F | HGX-HMC, arm64 (specific baseboards) |
| S15 | BMC Power Status | L1 (Default) | Gets power fault register data from the FPGA through the Host BMC; applies only to platforms with an NVIDIA FPGA | All (except NVSwitch, PowerShelf) |
| S16 | Virtual EEPROM Data | L1 (Default) | Retrieves the telemetry data from the virtual EEPROM by using I2C commands | HGX-HMC, arm64, x86_64 (specific baseboards) |
| S17 | SMBus Power Temperature Telemetry | L1 (Default) | Collects SMBus temperature, power, and staleness data | arm64, x86_64 (specific baseboards) |
| S18 | SMBPBI System Status | L1 (Default) | Collects the information for the SMBPBI telemetry, firmware, hardware status, PCIe link/error counts, fencing, and so on | arm64, x86_64 (specific baseboards) |
| S19 | One Click | L1 (Default) | One-click OOB log collection with timestamp | All |
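
The S14 (HMC Boot Progress) collector depends on HMC_I2C_ADDRESS matching the platform generation. As a small aid, the sketch below looks up the documented value for a generation name. The two addresses are the ones given in the S14 description; the helper name and the lowercase generation keys are hypothetical, and the sketch does not model the config file format.

```python
# Documented HMC_I2C_ADDRESS values for the S14 collector.
HMC_I2C_ADDRESS_BY_GENERATION = {
    "hopper": 0x54,
    "blackwell": 0x4F,
}


def hmc_i2c_address(generation: str) -> str:
    """Return the HMC_I2C_ADDRESS value (as a hex string) for a platform generation."""
    key = generation.strip().lower()
    if key not in HMC_I2C_ADDRESS_BY_GENERATION:
        raise ValueError(f"no documented HMC_I2C_ADDRESS for generation {generation!r}")
    return f"0x{HMC_I2C_ADDRESS_BY_GENERATION[key]:02X}"
```

Failing loudly on an unknown generation is deliberate: guessing an I2C address would produce a misleading boot progress reading rather than an obvious error.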

Host Collectors (H-Series)

Host collectors gather data directly from the host system.

Table 9 Host Collectors

| CID | Name | Collection Level | Description | Platforms |
| --- | --- | --- | --- | --- |
| H1 | Node Dmesg | L2 (-V) | dmesg from the host node | All (except NVSwitch, PowerShelf) |
| H2 | Node Lspci | L1 (Default) | lspci details from the host node | All (except NVSwitch, PowerShelf) |
| H3 | Node SMBIOS | L1 (Default) | General SMBIOS and SMBIOS type 9 details from the host node | All (except NVSwitch, PowerShelf) |
| H4 | Node Lshw | L1 (Default) | Hardware and network details using lshw from the host node | All (except NVSwitch, PowerShelf) |
| H5 | Node NVIDIA SMI | L1 (Default) | GPU information using nvidia-smi (if present). Not applicable for compute trays without GPUs (for example, MGX C2) | All (except NVSwitch, PowerShelf) |
| H6 | Node Kern Log | L2 (-V) | Kernel information and events on the node | All (except NVSwitch, PowerShelf) |
| H7 | Node Crash Dump | L3 (-VV) | Kernel crash dumps on the node | All (except NVSwitch, PowerShelf) |
| H8 | Node NVMe List | L1 (Default) | List of NVMe SSDs (if present) | All (except NVSwitch, PowerShelf) |
| H9 | Node Fabric Manager Log | L1 (Default) | Node fabric manager log (if present) | HGX-HMC, x86_64 (specific baseboards) |
| H10 | Node NVFlash Log | L3 (-VV) | Logs from the nvflash command (if present). Not applicable for compute trays without GPUs (for example, MGX C2) | All (except NVSwitch, PowerShelf) |
| H11 | NVIDIA Bug Report | L1 (Default) | Runs the nvidia-bug-report.sh script on the host with --safe-mode and collects the output. If EXTRA_LOG_COLLECTION is True, also runs it with --extra-system-data | All (except NVSwitch, PowerShelf) |
| H12 | NVOS Inventory | L1 (Default) | (NVOS Only) Gets the platform software, firmware, and hardware inventory | NVSwitch |
| H13 | NVOS GNMI Config | L1 (Default) | (NVOS Only) Gets the gNMI status and configuration | NVSwitch |
| H14 | NVOS Tech Support Dump | L1 (Default) | (NVOS Only) Generates and collects the tech-support dump on the system | NVSwitch |
| H15 | Node Subnet Manager | L3 (-VV) | Collects the subnet manager information from the host node | HGX-HMC, arm64, x86_64 (specific baseboards) |
| H16 | One Diag Dump | L3 (-VV) | Collects the diagnostic dump information | All (except NVSwitch, PowerShelf) |
| H17 | Node NVMe Log Dump | L3 (-VV) | Collects the NVMe log dumps from the host node | All (except NVSwitch, PowerShelf) |
| H18 | Node OS Info | L1 (Default) | Collects the OS information from the host node | All (except NVSwitch, PowerShelf) |
| H19 | Node Memory Info | L2 (-V) | Collects the memory information from the host node | All (except PowerShelf) |
| H20 | SOS Report | L2 (-V) | Collects the SOS report for the host | All (except NVSwitch, PowerShelf) |
| H21 | NVOS CLI Dumps | L2 (-V) | Collects the CLI dumps from the NVOS system | NVSwitch |
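
The H11 (NVIDIA Bug Report) description implies one or two runs of `nvidia-bug-report.sh`, depending on EXTRA_LOG_COLLECTION. A minimal sketch of that decision follows. The two flags are the ones named in the H11 description; the function name is hypothetical, and whether the second run also passes `--safe-mode` is not stated in this reference, so the sketch assumes it does not.

```python
def bug_report_commands(extra_log_collection: bool) -> list[list[str]]:
    """Return the nvidia-bug-report.sh invocations implied by the H11 description."""
    # The safe-mode run always happens.
    commands = [["nvidia-bug-report.sh", "--safe-mode"]]
    if extra_log_collection:
        # Assumption: the extra run uses only --extra-system-data.
        commands.append(["nvidia-bug-report.sh", "--extra-system-data"])
    return commands
```

Each inner list is an argv that could be handed to `subprocess.run` on the host node.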

HealthCheck Collectors (C-Series)

HealthCheck collectors perform system health diagnostics.

Table 10 HealthCheck Collectors

| CID | Name | Description | Platforms |
| --- | --- | --- | --- |
| C1 | Out Of Band Health Check | Checks the health of various components and the presence of mandatory fields using Redfish and IPMI | All (except NVSwitch, PowerShelf) |

Collection Strategies

Single Node Collection:

Collects logs from a single server node. Use this for individual system diagnostics.

Multi-Node Collection:

Collects logs from multiple nodes in a cluster. Use this for cluster-wide diagnostics.

Rack-Level Collection:

Collects logs from all nodes in a rack. Use this for rack-level diagnostics. See the Configuration Guide for rack configuration details.

Best Practices

Collection Level Selection:
  • Start with the L1 (Default) level for most diagnostics

  • Use L2 (-V) for more detailed analysis

  • Use L3 (-VV) only when extensive logging is required

Timeout Considerations:
  • Higher collection levels may take significantly longer; L3 (-VV) collectors can potentially run for hours

  • Monitor collection progress and adjust timeouts as needed

  • Consider network stability for remote collections

Storage Requirements:
  • L1 level: ~100 MB to 500 MB per node

  • L2 level: ~500 MB to 1 GB per node

  • L3 level: ~1 GB to 5 GB per node
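
For multi-node or rack-level collections, the per-node figures above scale with the node count. The sketch below multiplies the listed ranges by the number of nodes to give a rough disk budget. It assumes 1 GB = 1000 MB and that every node falls inside the listed range; the function name is hypothetical.

```python
# Approximate per-node storage ranges in MB, taken from the list above.
STORAGE_RANGE_MB = {
    1: (100, 500),
    2: (500, 1000),
    3: (1000, 5000),
}


def estimate_storage_mb(level: int, nodes: int) -> tuple[int, int]:
    """Return the (low, high) storage estimate in MB for a collection run."""
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    low, high = STORAGE_RANGE_MB[level]
    return low * nodes, high * nodes
```

For example, an L3 collection across 8 nodes works out to roughly 8,000 MB to 40,000 MB (about 8 GB to 40 GB).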

Network Considerations:
  • Ensure stable network connectivity

  • Consider bandwidth limitations for large collections

  • Plan for potential network interruptions
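
One common way to plan for network interruptions in wrapper scripts is to retry a failed step with increasing delays. The sketch below shows that generic pattern. It is not an NVDebug feature: `run_with_retries` and its parameters are hypothetical, and the `task` callable stands in for whatever remote step the script performs.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(
    task: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run task, retrying on connection or timeout errors with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except (ConnectionError, TimeoutError):
            if attempt == attempts:
                raise  # Out of attempts: surface the last error to the caller.
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("attempts must be at least 1")
```

The `sleep` parameter exists so that the delay can be replaced in tests; in normal use the default `time.sleep` applies.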