Appendix C: DGX Software Stack

Base OS - DGX OS 5

The following table lists the packages installed as part of each meta package. Meta package names end with a colon:

| DGX A100 | DGX-2 | DGX-1 |
| --- | --- | --- |
| dgx-a100-system-configurations: | dgx2-system-configurations: | dgx1-system-configurations: |
| dgx-release | dgx-release | dgx-release |
| - | - | nv-ast-modeset |
| nvidia-crashdump | nvidia-crashdump | nvidia-crashdump |
| - | nv-enable-nvme-hot-plug | - |
| nv-hugepage | nv-hugepage | nv-hugepage |
| nv-iommu-pt | - | - |
| nv-ipmi-devintf | nv-ipmi-devintf | nv-ipmi-devintf |
| nv-limits | nv-limits | nv-limits |
| nv-update-disable | nv-update-disable | nv-update-disable |
| nvidia-acs-disable | nvidia-acs-disable | - |
| nvidia-kernel-defaults | nvidia-kernel-defaults | nvidia-kernel-defaults |
| nvidia-nvme-smartd | nvidia-nvme-smartd | - |
| nvidia-pci-bridge-power | nvidia-pci-bridge-power | nvidia-pci-bridge-power |
| nvidia-redfish-config | - | - |
| nvidia-relaxed-ordering-gpu | - | - |
| nvidia-relaxed-ordering-nvme | - | - |
| nvgpu-services-list | - | - |
| dgx-a100-system-tools: | dgx2-system-tools: | dgx1-system-tools: |
| dgx-release | dgx-release | dgx-release |
| ipmitool | ipmitool | ipmitool |
| nv-common-apis | nv-common-apis | nv-common-apis |
| nv-env-paths | nv-env-paths | nv-env-paths |
| nvidia-mig-manager | - | - |
| nvidia-raid-config | nvidia-raid-config | nvidia-raid-config |
| nvme-cli | nvme-cli | - |
| tpm2-tools | tpm-tools | - |
| dgx-a100-system-tools-extra: | dgx2-system-tools-extra: | dgx1-system-tools-extra: |
| msecli | msecli | storcli |

nvidia-mlnx-ofed-misc:

mlnx-fw-updater

mlnx-pxe-setup

nvidia-mlnx-config

nvidia-peer-memory

nvidia-peer-memory-dkms

Additional NVIDIA packages

nv-docker-options

nvidia-logrotate

nvidia-motd

nvidia-ipmisol
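To check a system against the package lists above, the installed packages can be compared with the expected names. The following is a minimal Python sketch, not an NVIDIA tool. It assumes a Debian-based system where `dpkg-query` is available. The function names and the `EXPECTED_A100_TOOLS` list (copied from the dgx-a100-system-tools column) are illustrative only.

```python
import subprocess

# Illustrative list, copied from the dgx-a100-system-tools column above.
EXPECTED_A100_TOOLS = [
    "dgx-release",
    "ipmitool",
    "nv-common-apis",
    "nv-env-paths",
    "nvidia-mig-manager",
    "nvidia-raid-config",
    "nvme-cli",
    "tpm2-tools",
]


def parse_dpkg_status(output):
    """Return the names of packages whose dpkg status is fully installed.

    Expects one "<package>\t<status>" pair per line.
    """
    installed = set()
    for line in output.splitlines():
        name, _, status = line.partition("\t")
        if status.strip() == "install ok installed":
            installed.add(name.strip())
    return installed


def missing_packages(expected, dpkg_output):
    """Return the expected package names that are not fully installed, sorted."""
    return sorted(set(expected) - parse_dpkg_status(dpkg_output))


def query_dpkg():
    """Run dpkg-query and return its output (requires a Debian-based system)."""
    result = subprocess.run(
        ["dpkg-query", "-W", "-f=${Package}\t${Status}\n"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

On a live system, `missing_packages(EXPECTED_A100_TOOLS, query_dpkg())` would return the names that are not fully installed. Parsing is kept separate from the `dpkg-query` call so that the comparison logic can be exercised with captured output.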

The following table lists all packages that will be installed as part of the system configuration package with more details:

| Package | Description | 1 | 2 | A |
| --- | --- | --- | --- | --- |
| dgx-release | Release information. | R | R | R |
| nv-ast-modeset | Disables the Aspeed display driver, which can cause issues with connected monitors. The AST2xxx is the BMC used in our servers. [DGX-1, DGX-2, DGX A100, DGX Station A100] | R | R | R |
| nv-enable-nvme-hot-plug | Configures kernel parameters for NVMe hot plug (see also the kernel parameter table below). | | R | |
| nv-hugepage | Sets the “transparent_hugepage=madvise” kernel parameter. | R | R | R |
| nv-iommu-pt | Sets iommu=pt for AMD Rome platforms. | | | R |
| nv-ipmi-devintf | Adds the ipmi_devintf module for accessing the BMC using ipmitool. | R | R | R |
| nv-limits | Increases the process resource limits for users (ulimit nofile 50000). | R | R | R |
| nv-update-disable | Disables automatic system upgrades. Users need to explicitly upgrade their systems using apt. | R | R | R |
| nvgpu-services-list | Lists GPU-consuming services, such as DCGM or NVSM, in .json format. Required by the firmware update mechanism. | R | R | R |
| nvidia-acs-disable | Disables the PCIe ACS capability to allow for better GPU-direct performance in bare-metal use cases on DGX A100. | | | R |
| nvidia-crashdump | Tools to manage kernel crash dumps. Crash dumps are disabled by default. | R | R | R |
| nv-docker-options | Increases SHMEM and other resources. | R | R | R |
| nvidia-ipmisol [optional] | Enables serial output through the BMC (SOL, Serial over LAN). | O | O | O |
| nvidia-kernel-defaults | Restricts ARP announcements and replies for security improvements: net.ipv4.conf.all.arp_announce = 2, net.ipv4.conf.all.arp_ignore = 1, net.ipv4.conf.default.arp_announce = 2, net.ipv4.conf.default.arp_ignore = 1. | R | R | R |
| nvidia-logrotate | Modifies the logrotate configuration. | O | O | O |
| nvidia-motd | Modifies the message-of-the-day (MOTD) to display NVSM health monitoring alerts and release information. | O | O | O |
| nvidia-nvme-smartd | Enables SMART monitoring on NVMe devices. By default, smartd skips NVMe devices. | | R | R |
| nvidia-pci-bridge-power | Sets the bridge power control setting to “on” for all PCI bridges. | R | R | R |
| nvidia-relaxed-ordering-gpu | Sets a reg-key to enable PCIe relaxed ordering in the GPUs. | | | R |
| nvidia-relaxed-ordering-nvme | Installs a script that users can call to enable relaxed ordering in NVMe devices. | | | R |
| nvidia-redfish-config | Configures the Redfish interface with an interface name and IP address. The interface name is “bmc_redfish0”, while the IP address is read from DMI type 42. | | | R |

Legend: 1 = DGX-1, 2 = DGX-2, A = DGX A100, R = required package, O = optional package.
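The sysctl values listed for nvidia-kernel-defaults can be compared with the running system by reading the corresponding files under /proc/sys. The following is a minimal Python sketch under that assumption. The names `EXPECTED_SYSCTLS`, `sysctl_path`, and `check_sysctls` are hypothetical and not part of any NVIDIA package.

```python
from pathlib import Path

# Values taken from the nvidia-kernel-defaults description above.
EXPECTED_SYSCTLS = {
    "net.ipv4.conf.all.arp_announce": "2",
    "net.ipv4.conf.all.arp_ignore": "1",
    "net.ipv4.conf.default.arp_announce": "2",
    "net.ipv4.conf.default.arp_ignore": "1",
}


def sysctl_path(key, root=Path("/proc/sys")):
    """Map a dotted sysctl key to its file under root.

    Splitting on "." is sufficient for the keys above, but it would not
    handle interface names that themselves contain a dot.
    """
    return root.joinpath(*key.split("."))


def check_sysctls(expected=EXPECTED_SYSCTLS, root=Path("/proc/sys")):
    """Return {key: actual_value} for every key that differs from expected.

    A key whose file cannot be read is reported with the value None.
    """
    mismatches = {}
    for key, wanted in expected.items():
        try:
            actual = sysctl_path(key, root).read_text().strip()
        except OSError:
            actual = None
        if actual != wanted:
            mismatches[key] = actual
    return mismatches
```

Calling `check_sysctls()` with no arguments reads the live values. An empty dictionary means all four settings match. The `root` parameter exists only so the check can be pointed at a test directory.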

| Kernel Parameter | Description | Package |
| --- | --- | --- |
| ast.modeset=0 | Disables the Aspeed display driver. The AST2xxx is the BMC used in our servers. [DGX-1, DGX-2, DGX A100, DGX Station A100] | nv-ast-modeset |
| crashkernel=1G-:0M | Does not reserve any memory for crash dumps (when crash dumps are disabled, which is the default). | nvidia-crashdump |
| crashkernel=1G-:512M | Reserves 512 MB for crash dumps (when crash dumps are enabled). | nvidia-crashdump |
| pci=realloc=on | Allows the kernel to reallocate PCI resources if the allocations done by the BIOS are insufficient. This and pcie_ports=native are both required for NVMe hot plug on DGX-2. | nv-enable-nvme-hot-plug |
| pcie_ports=native | Uses Linux native services for PME, AER, DPC, and PCIe hot plug, that is, not firmware first. This and pci=realloc=on are both required for NVMe hot plug on DGX-2. | nv-enable-nvme-hot-plug |
| transparent_hugepage=madvise | Disables transparent huge pages system-wide and enables them only inside MADV_HUGEPAGE madvise regions, to prevent applications from allocating more memory resources than necessary. | nv-hugepage |
| iommu=pt | Enables pass-through mode only and disables DMA translations. This enables optimizations for the CPU inside the DGX A100. | nv-iommu-pt |
| console=ttyS1,115200n8 | Sets the console to serial port 1, using 115200 baud, no parity, 8 data bits. [DGX-2] | nvidia-ipmisol |
| console=ttyS0,115200n8 | Sets the console to serial port 0, using 115200 baud, no parity, 8 data bits. | nvidia-ipmisol |
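Whether a given parameter from the table above is active can be checked against the kernel command line that Linux exposes in /proc/cmdline. The following is a minimal Python sketch. The function names are illustrative, and the parser assumes whitespace-separated tokens without quoted values.

```python
from pathlib import Path


def parse_cmdline(cmdline):
    """Parse a kernel command line into {name: [values]}.

    A parameter may appear more than once (for example, console=), so
    every value is kept. Flags without "=" are stored as None. Only the
    first "=" splits a token, so pci=realloc=on becomes name "pci" with
    value "realloc=on".
    """
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params.setdefault(name, []).append(value if sep else None)
    return params


def has_param(params, expected):
    """Return True if the exact parameter (name and value) is present."""
    name, sep, value = expected.partition("=")
    return (value if sep else None) in params.get(name, [])


def read_cmdline(path=Path("/proc/cmdline")):
    """Read the command line of the running kernel."""
    return path.read_text().strip()
```

For example, `has_param(parse_cmdline(read_cmdline()), "transparent_hugepage=madvise")` would indicate whether the nv-hugepage setting is present on the running kernel.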

© Copyright 2020-2023, NVIDIA. Last updated on Mar 24, 2023.