System Requirements#

This section outlines the system requirements necessary to run NVDebug, including client, server, NVSwitch, and baseboard management controller requirements as well as performance requirements. It also highlights known limitations that may affect functionality or compatibility.

Client Host Requirements#

The client host must meet the following requirements:

Operating System:

  • Linux-based operating system

  • Linux Kernel 4.4 or later (4.15 or later recommended)

  • GNU C Library (glibc) 2.17 or later

Recommended Operating Systems:

  • Ubuntu 24.04 (recommended)

  • Ubuntu 20.04 or later (supported)

Supported Operating Systems:

  • RHEL/CentOS 8+

  • SLES 15+

Hardware Requirements:

  • Minimum 4GB RAM

  • 2GB free disk space

  • Network connectivity to target systems

Required Packages:

  • Python 3.12

  • ipmitool

  • sshpass

Network Requirements:

  • Access to server device under test (DUT) via BMC using Redfish and IPMI-over-LAN

  • Out-of-band (OOB) access to BMC

  • SSH access to host systems (for remote mode)

  • Firewall rules for BMC and host communication

  • Segmented network access is supported through split log collection

  • Stable network connectivity to target systems

Server Host Requirements#

The server host requirements are the same as those listed under “Client Host Requirements” with the following additional packages:

Required Packages:

  • nvme-cli

  • pciutils

  • dmidecode

  • lshw

  • opensm

  • nvidia-fabricmanager

  • nvidia-subnet-manager

  • mft-tools

  • NVIDIA Graphics Driver (RM Driver)

  • nvlsm

  • doca-sosreport >= 4.8.0

Important Notes:

Note

NVIDIA Graphics Drivers: Must be installed on the target system to run the following collectors:
  • nvidia-bug-report collector H11

  • nvidia-smi collector H6

nvidia-fabricmanager: Required for nvidia-fabricmanager collector (H12). This collector is specific to InfiniBand network hardware.

nvlsm: Required for subnet manager collector (H10) and a section of the nvidia-bug-report collector (H11).

doca-sosreport: Required for sos-report (H20). The recommended version is v4.8.0 or later. Do not use the Ubuntu system sosreport as it lacks required plugins for NVIDIA hardware.

Network Configuration: BMC Management and Server Host Management networks must reside in the same subnet.

Tip

For example, the server BMC should be accessible from the server host OS using IPMI-over-LAN and the Redfish API using the BMC’s IP address and credentials.

NVSwitch Tray Host Requirements#

The NVSwitch tray host must be running NVOS version 2.

Baseboard Management Controller Requirements#

Before running NVDebug on the BMC, ensure the following tools are installed:

Required Tools:

  • i2cset

  • i2ctransfer

  • ipmitool

  • i2cdump

  • curl

  • Redfish

Network Access:

  • BMC must be accessible via IPMI-over-LAN

  • Redfish API must be enabled and accessible

  • SSH access to BMC (if SSH collection is enabled)

Performance Requirements#

Storage:

  • Minimum 2GB free disk space for collection archives

  • Additional space for heavily used rack systems

  • Monitor archive sizes to avoid disk space exhaustion

Memory:

  • Sufficient RAM for parallel collection operations

  • Consider collection duration and number of nodes

CPU:

  • Adequate CPU resources for collection processing

  • Number of available threads correlates to collection speed

  • NVDebug parallelizes collection across multiple threads

Network:

  • Stable network connectivity between collector and targets

  • Consider bandwidth limitations for large collections

  • Account for firewall rules that might block collection

Known Limitations#

Platform Support:

  • Limited to NVIDIA platforms (DGX, HGX, MGX, GB series)

  • Some collectors are platform-specific

  • New platforms added over time

Collection Limitations:

  • Some collectors may take hours to complete with -VV flag

  • Large log files may require significant time and storage

  • Network timeouts may occur with unstable connections

Security Limitations:

  • Credentials stored in plain text in configuration files

  • No built-in encryption for stored credentials

  • Network security depends on underlying protocols (SSH, IPMI, Redfish)

Performance Limitations:

  • Collection speed depends on network bandwidth

  • Large systems may require extended collection times

  • Parallel collection limited by system resources

Compatibility Limitations:

  • Requires specific Python version (3.12)

  • Limited to Linux operating systems

  • Some collectors require specific hardware configurations