B. DGX Software Stack

This section lists the software packages and kernel parameters that make up the DGX Software Stack.

NVIDIA DGX Software Packages

The following table lists the packages that are installed as part of each platform's meta package. A dash (-) means the package is not installed on that platform:

DGX A100                        DGX-2                       DGX-1
dgx-a100-system-configurations  dgx2-system-configurations  dgx1-system-configurations
dgx-release                     dgx-release                 dgx-release
-                               -                           nv-ast-modeset
nvidia-crashdump                nvidia-crashdump            nvidia-crashdump
-                               nv-enable-nvme-hot-plug     -
nv-hugepage                     nv-hugepage                 nv-hugepage
nv-iommu-pt                     -                           -
nv-ipmi-devintf                 nv-ipmi-devintf             nv-ipmi-devintf
nv-limits                       nv-limits                   nv-limits
nv-update-disable               nv-update-disable           nv-update-disable
nvidia-acs-disable              nvidia-acs-disable          -
nvidia-kernel-defaults          nvidia-kernel-defaults      nvidia-kernel-defaults
nvidia-nvme-smartd              nvidia-nvme-smartd          -
nvidia-pci-bridge-power         nvidia-pci-bridge-power     nvidia-pci-bridge-power
nvidia-redfish-config           -                           -
nvidia-relaxed-ordering-gpu     -                           -
nvidia-relaxed-ordering-nvme    -                           -
nvgpu-services-list             nvgpu-services-list         nvgpu-services-list

dgx-a100-system-tools           dgx2-system-tools           dgx1-system-tools
dgx-release                     dgx-release                 dgx-release
ipmitool                        ipmitool                    ipmitool
nv-common-apis                  nv-common-apis              nv-common-apis
nv-env-paths                    nv-env-paths                nv-env-paths
nvidia-mig-manager              -                           -
nvidia-raid-config              nvidia-raid-config          -
nvme-cli                        nvme-cli                    -
tpm2-tools                      tpm-tools                   -

dgx-a100-system-tools-extra     dgx2-system-tools-extra     dgx1-system-tools-extra
msecli                          msecli                      nvidia-raid-config
-                               -                           storcli

nvidia-mlnx-ofed-misc

mlnx-fw-updater

mlnx-pxe-setup

nvidia-mlnx-config

nvidia-peer-memory | nvidia-peer-memory-dkms

Additional packages

nv-docker-options

nvidia-logrotate

nvidia-motd

nvidia-ipmisol
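As an illustration only, the package lists above can be compared against what is actually installed on a system. The sketch below is not part of the DGX tooling: the function names and the `DGX_A100_SYSTEM_TOOLS` constant are hypothetical, and it assumes a Debian-based system where `dpkg-query -W` is available. The list is copied from the DGX A100 system tools column.

```python
import subprocess

# Copied from the dgx-a100-system-tools column of the table above.
DGX_A100_SYSTEM_TOOLS = [
    "dgx-release", "ipmitool", "nv-common-apis", "nv-env-paths",
    "nvidia-mig-manager", "nvidia-raid-config", "nvme-cli", "tpm2-tools",
]


def query_dpkg():
    """Return one 'name<TAB>status' line per package known to dpkg."""
    result = subprocess.run(
        ["dpkg-query", "-W", "-f", "${Package}\t${Status}\n"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_installed(dpkg_output):
    """Return the names of packages whose status is 'install ok installed'."""
    installed = set()
    for line in dpkg_output.splitlines():
        name, _, status = line.partition("\t")
        if status.strip() == "install ok installed":
            installed.add(name.strip())
    return installed


def missing_packages(expected, installed):
    """Return the expected packages that are not installed, in list order."""
    return [pkg for pkg in expected if pkg not in installed]
```

On a DGX A100, `missing_packages(DGX_A100_SYSTEM_TOOLS, parse_installed(query_dpkg()))` would be expected to return an empty list if the meta package is fully installed.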

The following table describes the packages that are installed as part of the system configuration package in more detail:

Package Description 1 2 A
dgx-release Release information R R R
nv-ast-modeset Disable the Aspeed display driver. It can cause issues with connected monitors. The AST2xxx is the BMC used in our servers. [DGX-1, DGX-2, DGX A100, DGX Station A100] R R R
nv-enable-nvme-hot-plug Configure kernel parameters for NVMe hot plug (see also kernel section below). - R -
nv-hugepage Sets the "transparent_hugepage=madvise" kernel parameter. R R R
nv-iommu-pt Sets iommu=pt for AMD Rome platforms. - - R
nv-ipmi-devintf Adds the ipmi_devintf module for accessing the BMC using ipmitool. R R R
nv-limits Increases the process resource limits for users (ulimit nofile 50000). R R R
nv-update-disable Disable automatic system upgrades. Users need to explicitly upgrade their systems using apt. R R R
nvgpu-services-list Lists GPU-consuming services, such as DCGM or NVSM, in JSON format. Required by the firmware update mechanism. R R R
nvidia-acs-disable Disables the PCIe ACS capability to allow for better GPU-direct performance in bare-metal use cases on DGX A100. - - R
nvidia-crashdump Tools to manage kernel crash dumps. They are disabled by default. R R R
nv-docker-options Increases SHMEM and other resources. R R R
nvidia-ipmisol [optional] Enables serial output through the BMC (SOL: Serial over LAN). O O O
nvidia-kernel-defaults Restricts ARP announcements and replies for improved security: net.ipv4.conf.all.arp_announce = 2, net.ipv4.conf.all.arp_ignore = 1, net.ipv4.conf.default.arp_announce = 2, net.ipv4.conf.default.arp_ignore = 1. R R R
nvidia-logrotate Modify the logrotate configuration O O O
nvidia-motd Modify message-of-the-day (MOTD) to display NVSM health monitoring alerts and release information. O O O
nvidia-nvme-smartd Enables SMART monitoring on NVMe devices. By default, smartd skips NVMe devices. - R R
nvidia-pci-bridge-power Sets the bridge power control setting to “on” for all PCI bridges. R R R
nvidia-relaxed-ordering-gpu Sets a reg-key to enable PCIe relaxed ordering in the GPUs. - - R
nvidia-relaxed-ordering-nvme Installs a script that users can call to enable relaxed ordering on NVMe devices. - - R
nvidia-redfish-config Configures the Redfish interface with an interface name and IP address. The interface name is “bmc_redfish0”, while the IP address is read from DMI type 42. - - R
Legend
  • 1: DGX-1
  • 2: DGX-2
  • A: DGX A100
  • R: Required package
  • O: Optional package
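The four ARP values listed for nvidia-kernel-defaults can be read back from procfs to confirm they are in effect. The following is a minimal sketch, not an NVIDIA tool: `check_arp_settings` and `EXPECTED_ARP_SETTINGS` are hypothetical names, and it assumes the standard Linux layout in which each `net.ipv4.conf.<scope>.<key>` sysctl is exposed as a file under `/proc/sys/net/ipv4/conf/<scope>/<key>`.

```python
from pathlib import Path

# Values copied from the nvidia-kernel-defaults row of the table above.
EXPECTED_ARP_SETTINGS = {
    "all/arp_announce": "2",
    "all/arp_ignore": "1",
    "default/arp_announce": "2",
    "default/arp_ignore": "1",
}


def check_arp_settings(conf_root="/proc/sys/net/ipv4/conf"):
    """Return {setting: (expected, actual)} for every value that differs.

    actual is None when the file cannot be read.
    """
    mismatches = {}
    for key, expected in EXPECTED_ARP_SETTINGS.items():
        try:
            actual = (Path(conf_root) / key).read_text().strip()
        except OSError:
            actual = None
        if actual != expected:
            mismatches[key] = (expected, actual)
    return mismatches
```

An empty result means all four settings match the table. The `conf_root` parameter exists only so the check can be pointed at a copy of the directory for testing.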

DGX Kernel Parameters

Kernel Parameter Description Package
ast.modeset=0 Disable the Aspeed display driver. The AST2xxx is the BMC used in our servers. [DGX-1, DGX-2, DGX A100, DGX Station A100] nv-ast-modeset
crashkernel=1G-:0M Don't reserve any memory for crash dumps (when crash dumps are disabled, which is the default) nvidia-crashdump
crashkernel=1G-:512M Reserve 512MB for crash dumps (when crash dumps are enabled) nvidia-crashdump
pci=realloc=on Allows the kernel to reallocate PCI resources if the allocations done by the BIOS are insufficient. This and pcie_ports=native are both required for NVMe hot-plug on DGX-2. nv-enable-nvme-hot-plug
pcie_ports=native Use Linux native services for PME, AER, DPC, and PCIe hotplug (that is, not firmware-first). This and pci=realloc=on are both required for NVMe hot-plug on DGX-2. nv-enable-nvme-hot-plug
transparent_hugepage=madvise Disable huge pages system-wide and only enable them inside MADV_HUGEPAGE madvise regions to prevent applications from allocating more memory resources than necessary. nv-hugepage
iommu=pt Enable pass through mode only and disable DMA translations. This enables optimizations for the CPU inside the DGX A100. nv-iommu-pt
console=ttyS1,115200n8 Set console to serial port 1, using 115200 baud, no parity, 8 data bits [DGX-2] nvidia-ipmisol
console=ttyS0,115200n8 Set console to serial port 0, using 115200 baud, no parity, 8 data bits nvidia-ipmisol
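To see which of these parameters are active on a running system, the kernel command line in `/proc/cmdline` can be matched against the table. The sketch below is illustrative only: `find_dgx_params`, `read_cmdline`, and `DGX_KERNEL_PARAMS` are hypothetical names, and the parser assumes a simple whitespace-separated command line (quoted values containing spaces are not handled).

```python
# Kernel parameters and their owning packages, copied from the table above.
DGX_KERNEL_PARAMS = {
    "ast.modeset=0": "nv-ast-modeset",
    "crashkernel=1G-:0M": "nvidia-crashdump",
    "crashkernel=1G-:512M": "nvidia-crashdump",
    "pci=realloc=on": "nv-enable-nvme-hot-plug",
    "pcie_ports=native": "nv-enable-nvme-hot-plug",
    "transparent_hugepage=madvise": "nv-hugepage",
    "iommu=pt": "nv-iommu-pt",
    "console=ttyS1,115200n8": "nvidia-ipmisol",
    "console=ttyS0,115200n8": "nvidia-ipmisol",
}


def read_cmdline(path="/proc/cmdline"):
    """Return the kernel command line of the running system."""
    with open(path) as f:
        return f.read()


def find_dgx_params(cmdline):
    """Map each DGX parameter found on the command line to its package."""
    return {
        token: DGX_KERNEL_PARAMS[token]
        for token in cmdline.split()
        if token in DGX_KERNEL_PARAMS
    }
```

Calling `find_dgx_params(read_cmdline())` returns only the parameters from the table that are present; any other boot options are ignored.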