Collector Reference
This section lists every NVDebug collector, grouped by collection type.
Collection Levels
NVDebug supports the following collection levels to control the scope of log collection:
L1 (Default): All necessary collectors. Always included.
L2 (-V): Everything in L1, plus collectors at the increased log collection level.
L3 (-VV): Everything in L2, plus additional collectors that can take a very long time (potentially hours) to run.
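Because the levels are cumulative, selecting collectors amounts to a threshold check on each collector's minimum level. The following is a minimal sketch of that idea, not NVDebug's actual implementation; the `Collector` class, `LEVEL_FOR_VERBOSITY`, and `selected_collectors` are hypothetical names, and the three sample rows are taken from the tables in this reference.

```python
from dataclasses import dataclass

# Hypothetical model of the cumulative levels; not NVDebug's actual code.
# Number of -V flags -> collection level (no flag -> L1, -V -> L2, -VV -> L3).
LEVEL_FOR_VERBOSITY = {0: 1, 1: 2, 2: 3}


@dataclass(frozen=True)
class Collector:
    cid: str
    name: str
    level: int  # minimum collection level at which the collector runs


def selected_collectors(collectors, verbosity):
    """Return the collectors that run for a given count of -V flags."""
    # Assumption: anything beyond -VV is treated the same as -VV.
    level = LEVEL_FOR_VERBOSITY[max(0, min(verbosity, 2))]
    return [c for c in collectors if c.level <= level]


SAMPLE = [
    Collector("R1", "System Event Log", 1),
    Collector("H1", "Node Dmesg", 2),
    Collector("H7", "Node Crash Dump", 3),
]

if __name__ == "__main__":
    for flags in (0, 1, 2):
        print(flags, [c.cid for c in selected_collectors(SAMPLE, flags)])
```

With this model, an L1 collector such as R1 appears at every level, while an L3 collector such as H7 appears only with -VV.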
Redfish Collectors (R-Series)
Redfish collectors gather data from BMCs using the Redfish REST API.
CID | Name | Collection Level | Description | Platforms |
---|---|---|---|---|
R1 | System Event Log | L1 (Default) | Event logs from the Redfish computer system collection of the Host BMC. L1: 10k entries (default); L2: 15k entries (-V); L3: 30k entries (-VV) | All (except DGX) |
R2 | Manager Existing Log Dump | L1 (Default) | Existing log dumps that were created by the HMC/Host BMC or were initiated by a previous user from the BMC | All (except DGX) |
R3 | HGX Manager On Demand Log Dump | L1 (Default) | User-initiated on-demand log dumps, created with a new request from the HMC | HGX-HMC, arm64, x86_64, DGX (specific baseboards) |
R4 | Manager Journal Log | L1 (Default) | Journal logs from a BMC, such as the Host BMC or an HMC, that is running OpenBMC | HGX-HMC, arm64, x86_64 (specific baseboards) |
R5 | Manager FPGA Register Dump | L1 (Default) | FPGA register dump of the NVIDIA HGX-Hopper/Blackwell x8 GPU baseboards through the HMC or Grace-based platforms | HGX-HMC, arm64, x86_64 (specific baseboards) |
R6 | Manager ERoT Dump | L1 (Default) | iRoT/ERoT debug log dump of the NVIDIA HGX-Hopper/Blackwell x8 GPU baseboards through the HMC or Grace-based platforms | HGX-HMC, arm64, x86_64 (specific baseboards) |
R7 | HGX Manager Self Test Report | L1 (Default) | Self-test logs collected on demand. Users initiate the self-test of the NVIDIA HGX-Hopper/Blackwell x8 GPU baseboards through the HMC | HGX-HMC (specific baseboards) |
R8 | Firmware Inventory | L1 (Default) | Base firmware inventory from the HMC and/or the Host BMC | All |
R9 | Firmware Inventory Expand Query | L1 (Default) | Expanded firmware inventory from the HMC and/or the Host BMC | All |
R10 | Chassis Info | L1 (Default) | Base Redfish chassis collection from the HMC and/or the Host BMC | All |
R11 | Chassis Expand Query | L1 (Default) | Expanded Redfish chassis collection from the HMC and/or the Host BMC | All |
R12 | System Info | L1 (Default) | Base Redfish computer system collection from the HMC and/or the Host BMC | All (except PowerShelf) |
R13 | System Expand Query | L1 (Default) | Expanded Redfish computer system collection from the HMC and/or the Host BMC | All (except PowerShelf) |
R14 | Manager Info | L1 (Default) | Base Redfish manager collection of the HMC and/or the Host BMC | All |
R15 | Manager Expand Query | L1 (Default) | Expanded Redfish manager collection of the HMC and/or the Host BMC | All |
R16 | HGX Manager Retimer Dump | L3 (-VV) | Retimer dump of the NVIDIA HGX-Hopper/Blackwell x8 GPU baseboards using the HMC | HGX-HMC, arm64, x86_64 (specific baseboards) |
R17 | DGX Manager OEM Log Dump | L3 (-VV) | OEM log dump of DGX systems using the BMC | DGX |
R18 | Telemetry Metric Reports | L1 (Default) | Telemetry metric reports from the HMC and/or the Host BMC telemetry service | All (except NVSwitch, PowerShelf) |
R19 | Chassis Thermal Metrics | L1 (Default) | Thermal metrics from the HMC and/or the Host BMC | All |
R20 | Firmware Inventory Table | L1 (Default) | Gets specific fields of the firmware inventory as listed in the config file (under FW_INVENTORY_TABLE_PROPERTIES) for all components and tabulates the fields | All |
R21 | System CPER Logs | L1 (Default) | Collects the CPER logs for the system | arm64 platforms |
R22 | Task Details | L1 (Default) | Collects details of all tasks in the Redfish Task Service | All |
R23 | NVLink OOB Logs | L3 (-VV) | Collects OOB logs related to NVLink. By default, collects System Event logs. In the config file, URIs can be specified under NVLINK_OOB_URI | All (except NVSwitch, PowerShelf) |
R24 | HGX System FW Attributes Dump | L3 (-VV) | Firmware attribute dump of the NVIDIA baseboard through the HMC and/or the Host BMC | HGX-HMC, arm64, x86_64 |
R25 | Additional OOB Logs | L3 (-VV) | Collects user-defined OOB logs from the BMC/HMC. In the config file, URIs can be specified under ADDITIONAL_OOB_URI_COLLECTION | All |
R26 | Chassis Certificates | L3 (-VV) | Collects the certificates from each iRoT/ERoT component on the chassis | All |
R27 | SPDM ERoT Measurements | L3 (-VV) | Collects SPDM measurements of iRoT/ERoT at indices 26 and 50 | HGX-HMC, arm64, x86_64 (specific baseboards) |
R28 | HGX System Hardware Checkout Dump | L3 (-VV) | Hardware checkout dump of the NVIDIA baseboard using the BMC/HMC | HGX-HMC, arm64, x86_64 |
R29 | Background Copy Status | L3 (-VV) | Checks the Root-of-Trust background copy status | All (except DGX, PowerShelf) |
R30 | Software Inventory | L3 (-VV) | Collects the software inventory from the Redfish Update Service | All (except PowerShelf) |
R31 | HMC FDR Log Dump | L3 (-VV) | Collects the FDR log dump from the HMC | HGX-HMC, arm64, x86_64 (specific baseboards) |
R32 | System POST Codes | L3 (-VV) | Collects system POST codes | All (except NVSwitch, PowerShelf, HGX-HMC) |
R33 | Power Equipment Info | L1 (Default) | Collects the power equipment information | PowerShelf |
R34 | Power Equipment Expand Query | L1 (Default) | Collects the expanded power equipment information | PowerShelf |
R35 | Network Device Debug Dump | L3 (-VV) | Collects the debug dump from the network devices | HGX-HMC, arm64, x86_64 (specific baseboards) |
R36 | Network Switch Debug Dump | L3 (-VV) | Collects the debug dump from the network switches | HGX-HMC, arm64, x86_64 (specific baseboards) |
R37 | GPU Debug Dump | L3 (-VV) | Collects the debug dump from GPUs | HGX-HMC, arm64, x86_64 (specific baseboards) |
R38 | GPU Diagnostic Dump | L3 (-VV) | Collects the diagnostic dump from GPUs | HGX-HMC, arm64, x86_64 (specific baseboards) |
R39 | SMA Debug Dump | L3 (-VV) | Collects the debug dump from System Management Agent (SMA) devices | HGX-HMC, arm64, x86_64 (specific baseboards) |
R40 | Custom Dump Service | L3 (-VV) | Collects the dump from the custom URI and the payloads | All |
R41 | Chassis Thermal Subsystem Leak Detection | L3 (-VV) | Collects the thermal subsystem leak detection data for the chassis | HGX-HMC, arm64, x86_64 (specific baseboards) |
R42 | NSM Dump | L1 (Default) | Collects NSM (Network System Manager) dump information | HGX-HMC, arm64, x86_64 (specific baseboards) |
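The "Expand Query" collectors (R9, R11, R13, R15, R34) return a collection with its members inlined rather than as links. In standard Redfish this is done with the `$expand` query parameter; whether NVDebug uses exactly this form, and which URIs it queries, are assumptions here. The sketch below builds such a URI and records the R1 entry counts per level from the table above. The helper name `expand_uri` is hypothetical.

```python
# Entry counts for R1 (System Event Log) per collection level, from the table above.
SEL_ENTRIES_BY_LEVEL = {1: 10_000, 2: 15_000, 3: 30_000}


def expand_uri(collection_uri, levels=1):
    """Append a standard Redfish $expand query to a collection URI.

    "*" expands all hyperlinks in the resource; $levels sets how many
    levels of links are followed.
    """
    return f"{collection_uri}?$expand=*($levels={levels})"


if __name__ == "__main__":
    # /redfish/v1/Chassis is a standard Redfish collection path. Treating it
    # as the URI behind R10/R11 is an assumption made for illustration.
    print("base:    ", "/redfish/v1/Chassis")
    print("expanded:", expand_uri("/redfish/v1/Chassis"))
    print("R1 entries at L2:", SEL_ENTRIES_BY_LEVEL[2])
```

A BMC advertises which expand options it supports in the `ProtocolFeaturesSupported` property of its service root, so a client would normally check that before relying on `$expand`.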
IPMI Collectors (I-Series)
IPMI collectors gather data using the Intelligent Platform Management Interface (IPMI).
CID | Name | Collection Level | Description | Platforms |
---|---|---|---|---|
I1 | MC Info | L1 (Default) | BMC information | All (except NVSwitch, PowerShelf) |
I2 | LAN Info | L1 (Default) | BMC network setting information | All (except NVSwitch, PowerShelf) |
I3 | Session Info | L1 (Default) | BMC session details | All (except NVSwitch, PowerShelf) |
I4 | FRU Info | L1 (Default) | Server component FRU information using the BMC | All (except NVSwitch, PowerShelf) |
I5 | SDR Info | L1 (Default) | Basic sensor data records information | All (except NVSwitch, PowerShelf) |
I6 | SEL Info | L1 (Default) | System event log information | All (except NVSwitch, PowerShelf) |
I7 | Sensor List | L1 (Default) | Sensor list information using the BMC | All (except NVSwitch, PowerShelf) |
I8 | SEL List | L1 (Default) | SEL logs in text | All (except NVSwitch, PowerShelf) |
I9 | SEL Raw Dump | L1 (Default) | SEL logs in hex | All (except NVSwitch, PowerShelf) |
I10 | Chassis Status | L1 (Default) | Server chassis status and information | All (except NVSwitch, PowerShelf) |
I11 | Chassis Restart Cause | L1 (Default) | System Restart Cause stored by the BMC | All (except NVSwitch, PowerShelf) |
I12 | User List | L1 (Default) | IPMI user list | All (except NVSwitch, PowerShelf) |
I13 | Channel Info | L1 (Default) | IPMI channel information | All (except NVSwitch, PowerShelf) |
I14 | SDR Elist | L1 (Default) | List of Sensor Data Records | All (except NVSwitch, PowerShelf) |
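Most of the collector names above line up with subcommands of the `ipmitool` utility, which is useful when reproducing a single collector by hand. The mapping below is an assumed correspondence for illustration only; this reference does not state which commands NVDebug actually runs, and I3 and I9 are left out because their equivalents are less obvious. The function name `ipmitool_command` and the sample host and user are hypothetical.

```python
# Assumed mapping from collector IDs to ipmitool subcommands (illustrative only;
# not confirmed as the commands NVDebug runs).
IPMI_SUBCOMMANDS = {
    "I1": ["mc", "info"],
    "I2": ["lan", "print"],
    "I4": ["fru", "print"],
    "I5": ["sdr", "info"],
    "I6": ["sel", "info"],
    "I7": ["sensor", "list"],
    "I8": ["sel", "list"],
    "I10": ["chassis", "status"],
    "I11": ["chassis", "restart_cause"],
    "I12": ["user", "list"],
    "I13": ["channel", "info"],
    "I14": ["sdr", "elist"],
}


def ipmitool_command(cid, host, user):
    """Build an out-of-band ipmitool argv list for one collector ID."""
    # -I lanplus selects the IPMI v2.0 LAN interface; -E makes ipmitool read
    # the password from the IPMI_PASSWORD environment variable.
    base = ["ipmitool", "-I", "lanplus", "-H", host, "-U", user, "-E"]
    return base + IPMI_SUBCOMMANDS[cid]


if __name__ == "__main__":
    # 192.0.2.10 is a documentation-only address; replace it with the BMC IP.
    print(" ".join(ipmitool_command("I11", "192.0.2.10", "admin")))
```

The sketch only builds the argument list. Passing it to `subprocess.run` would execute the command against a real BMC.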
SSH Collectors (S-Series)
SSH collectors gather data by executing commands on the BMC via SSH.
CID | Name | Collection Level | Description | Platforms |
---|---|---|---|---|
S1 | BMC Status | L1 (Default) | The OpenBMC firmware status, which applies only to the Host BMC that is running the OpenBMC firmware | arm64, x86_64, NVSwitch |
S2 | BMC Dmesg | L1 (Default) | The BMC OS kernel dmesg information, only through the Host BMC | All (except HGX-HMC, PowerShelf) |
S3 | Network Info | L1 (Default) | The BMC network and routing settings, only through the Host BMC | All (except HGX-HMC, PowerShelf) |
S4 | OpenBMC Stack Info | L1 (Default) | The OpenBMC OS and firmware stack information, which applies only to the Host BMC that is running the OpenBMC firmware | arm64, x86_64, NVSwitch |
S5 | BMC List Kernel Modules | L1 (Default) | The BMC OS kernel loadable module information, only through the Host BMC | All (except HGX-HMC, PowerShelf) |
S6 | OpenBMC PLDM Journal Log | L1 (Default) | The OpenBMC PLDM logs, which apply only to the Host BMC that is running the OpenBMC firmware | arm64, x86_64 |
S7 | I2C Device List | L1 (Default) | The Host BMC I2C device list and its information, only through the Host BMC | arm64, x86_64, NVSwitch |
S8 | BMC Mem CPU Utilization | L1 (Default) | The Host BMC memory and CPU usage, only through the Host BMC | All (except HGX-HMC, PowerShelf) |
S9 | OpenBMC Boot Status | L1 (Default) | The OpenBMC boot status, which applies only to the Host BMC that is running the OpenBMC firmware | arm64, x86_64, NVSwitch |
S10 | Var Log Dir Zip | L1 (Default) | The /var/log directory zip from the Host BMC | arm64, x86_64, NVSwitch |
S11 | Uptime | L1 (Default) | Gets the uptime through the Host BMC | All (except HGX-HMC, PowerShelf) |
S12 | FPGA Register Table | L1 (Default) | The FPGA register table contents | DGX |
S13 | HMC Boot Status | L1 (Default) | HMC boot status, which applies only to platforms with the HMC | DGX |
S14 | HMC Boot Progress | L1 (Default) | HMC boot progress status, which applies only to platforms with the HMC. For Hopper-generation platforms, set HMC_I2C_ADDRESS to 0x54; for Blackwell-generation platforms, set HMC_I2C_ADDRESS to 0x4F | HGX-HMC, arm64 (specific baseboards) |
S15 | BMC Power Status | L1 (Default) | Gets the power fault register data from the FPGA through the Host BMC, only for platforms with an NVIDIA FPGA | All (except NVSwitch, PowerShelf) |
S16 | Virtual EEPROM Data | L1 (Default) | Retrieves the telemetry data from the virtual EEPROM by using I2C commands | HGX-HMC, arm64, x86_64 (specific baseboards) |
S17 | SMBus Power Temperature Telemetry | L1 (Default) | Collects SMBus temperature, power, and staleness data | arm64, x86_64 (specific baseboards) |
S18 | SMBPBI System Status | L1 (Default) | Collects the information for the SMBPBI telemetry, firmware, hardware status, PCIe link/error counts, fencing, and so on | arm64, x86_64 (specific baseboards) |
S19 | One Click | L1 (Default) | One-click OOB log collection with timestamp | All |
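The S14 collector depends on HMC_I2C_ADDRESS being set correctly for the platform generation. As a small illustration, the lookup below encodes the two values given in the S14 description. The function and dictionary names are hypothetical, and how the value is written into the NVDebug config file is not covered here.

```python
# Values from the S14 description; the names below are hypothetical helpers.
HMC_I2C_ADDRESS_BY_GENERATION = {
    "hopper": 0x54,
    "blackwell": 0x4F,
}


def hmc_i2c_address(generation):
    """Return the HMC_I2C_ADDRESS value for a platform generation."""
    key = generation.strip().lower()
    if key not in HMC_I2C_ADDRESS_BY_GENERATION:
        raise ValueError(f"no known HMC_I2C_ADDRESS for {generation!r}")
    return HMC_I2C_ADDRESS_BY_GENERATION[key]


if __name__ == "__main__":
    for gen in ("Hopper", "Blackwell"):
        # hex() renders the address in the 0x.. form used in the table.
        print(gen, hex(hmc_i2c_address(gen)))
```

Raising an error for an unknown generation, rather than falling back to a default, avoids silently probing the wrong I2C address.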
Host Collectors (H-Series)
Host collectors gather data directly from the host system.
CID | Name | Collection Level | Description | Platforms |
---|---|---|---|---|
H1 | Node Dmesg | L2 (-V) | dmesg from the host node | All (except NVSwitch, PowerShelf) |
H2 | Node Lspci | L1 (Default) | lspci details from the host node | All (except NVSwitch, PowerShelf) |
H3 | Node SMBIOS | L1 (Default) | General SMBIOS and SMBIOS type 9 details from the host node | All (except NVSwitch, PowerShelf) |
H4 | Node Lshw | L1 (Default) | Hardware and network details using lshw from the host node | All (except NVSwitch, PowerShelf) |
H5 | Node NVIDIA SMI | L1 (Default) | GPU information using nvidia-smi (if present). Not applicable for compute trays without GPUs (for example, MGX C2) | All (except NVSwitch, PowerShelf) |
H6 | Node Kern Log | L2 (-V) | Kernel information and events on the node | All (except NVSwitch, PowerShelf) |
H7 | Node Crash Dump | L3 (-VV) | Kernel crash dumps on the node | All (except NVSwitch, PowerShelf) |
H8 | Node NVMe List | L1 (Default) | List of NVMe SSDs (if present) | All (except NVSwitch, PowerShelf) |
H9 | Node Fabric Manager Log | L1 (Default) | Node fabric manager log (if present) | HGX-HMC, x86_64 (specific baseboards) |
H10 | Node NVFlash Log | L3 (-VV) | Logs from the nvflash command (if present). Not applicable for compute trays without GPUs (for example, MGX C2) | All (except NVSwitch, PowerShelf) |
H11 | NVIDIA Bug Report | L1 (Default) | Runs the nvidia-bug-report.sh script on the host with --safe-mode and collects the output. If EXTRA_LOG_COLLECTION is True, also runs it with --extra-system-data | All (except NVSwitch, PowerShelf) |
H12 | NVOS Inventory | L1 (Default) | (NVOS only) Gets the platform software, firmware, and hardware inventory | NVSwitch |
H13 | NVOS GNMI Config | L1 (Default) | (NVOS only) Gets the gNMI status and configuration | NVSwitch |
H14 | NVOS Tech Support Dump | L1 (Default) | (NVOS only) Generates and collects the tech-support dump on the system | NVSwitch |
H15 | Node Subnet Manager | L3 (-VV) | Collects the subnet manager information from the host node | HGX-HMC, arm64, x86_64 (specific baseboards) |
H16 | One Diag Dump | L3 (-VV) | Collects the diagnostic dump information | All (except NVSwitch, PowerShelf) |
H17 | Node NVMe Log Dump | L3 (-VV) | Collects the NVMe log dumps from the host node | All (except NVSwitch, PowerShelf) |
H18 | Node OS Info | L1 (Default) | Collects the OS information from the host node | All (except NVSwitch, PowerShelf) |
H19 | Node Memory Info | L2 (-V) | Collects the memory information from the host node | All (except PowerShelf) |
H20 | SOS Report | L2 (-V) | Collects the SOS report for the host | All (except NVSwitch, PowerShelf) |
H21 | NVOS CLI Dumps | L2 (-V) | Collects the CLI dumps from the NVOS system | NVSwitch |
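The H11 description implies one or two runs of nvidia-bug-report.sh, depending on EXTRA_LOG_COLLECTION. The sketch below expresses that rule as a function that returns the command lines. It is an illustration under assumptions: the function name is hypothetical, and the table does not say whether the second run also passes --safe-mode, so the sketch passes only --extra-system-data.

```python
def bug_report_commands(extra_log_collection=False):
    """Return the nvidia-bug-report.sh invocations described for H11."""
    script = "nvidia-bug-report.sh"
    # The safe-mode run always happens.
    commands = [[script, "--safe-mode"]]
    if extra_log_collection:
        # EXTRA_LOG_COLLECTION is True: add a second run with extra system data.
        # Assumption: this run does not also pass --safe-mode.
        commands.append([script, "--extra-system-data"])
    return commands


if __name__ == "__main__":
    for argv in bug_report_commands(extra_log_collection=True):
        print(" ".join(argv))
```

The script typically needs root privileges on the host, so a manual reproduction would usually prefix each command with `sudo`.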
HealthCheck Collectors (C-Series)
HealthCheck collectors perform system health diagnostics.
CID | Name | Description | Platforms |
---|---|---|---|
C1 | Out Of Band Health Check | Checks the health of various components and presence of mandatory fields using Redfish and IPMI | All (except NVSwitch, PowerShelf) |
Collection Strategies
- Single Node Collection: Collects logs from a single server node. Use this for individual system diagnostics.
- Multi-Node Collection: Collects logs from multiple nodes in a cluster. Use this for cluster-wide diagnostics.
- Rack-Level Collection: Collects logs from all nodes in a rack. Use this for rack-level diagnostics. See the Configuration Guide for rack configuration details.
Best Practices
- Collection Level Selection:
  - Start with L1 (Default) for most diagnostics.
  - Use L2 (-V) for more detailed analysis.
  - Use L3 (-VV) only when extensive logging is required.
- Timeout Considerations:
  - Higher collection levels can take significantly longer; some L3 collectors can potentially run for hours.
  - Monitor collection progress and adjust timeouts as needed.
  - Consider network stability for remote collections.
- Storage Requirements:
  - L1: ~100 MB to 500 MB per node
  - L2: ~500 MB to 1 GB per node
  - L3: ~1 GB to 5 GB per node
- Network Considerations:
  - Ensure stable network connectivity.
  - Consider bandwidth limitations for large collections.
  - Plan for potential network interruptions.
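For multi-node or rack-level collections, the per-node storage figures above scale with the node count. The following is a rough planning sketch, assuming 1 GB = 1000 MB and that every node falls within the listed range; the function name and the 18-node example are hypothetical.

```python
# Approximate per-node sizes in MB, taken from the storage guidance above.
# Assumption: 1 GB is treated as 1000 MB.
SIZE_RANGE_MB = {1: (100, 500), 2: (500, 1000), 3: (1000, 5000)}


def estimate_storage_mb(level, nodes):
    """Return the (low, high) storage estimate in MB for a collection run."""
    if nodes < 0:
        raise ValueError("nodes must be non-negative")
    low, high = SIZE_RANGE_MB[level]
    return low * nodes, high * nodes


if __name__ == "__main__":
    # Example: an L3 (-VV) collection across a hypothetical 18-node rack.
    low, high = estimate_storage_mb(level=3, nodes=18)
    print(f"L3, 18 nodes: about {low / 1000:.0f} GB to {high / 1000:.0f} GB")
```

These are approximate ranges, not guarantees, so leave headroom above the upper bound when sizing the destination volume.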