B. DGX Software Stack
NVIDIA DGX Software Packages
This table lists all packages that are installed as part of the corresponding meta package:
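| | DGX A100 | DGX-2 | DGX-1 |
|---|---|---|---|
| Meta package | dgx-a100-system-configurations | dgx2-system-configurations | dgx1-system-configurations |
| | dgx-release | dgx-release | dgx-release |
| | - | - | nv-ast-modeset |
| | nvidia-crashdump | nvidia-crashdump | nvidia-crashdump |
| | - | nv-enable-nvme-hot-plug | - |
| | nv-hugepage | nv-hugepage | nv-hugepage |
| | nv-iommu-pt | - | - |
| | nv-ipmi-devintf | nv-ipmi-devintf | nv-ipmi-devintf |
| | nv-limits | nv-limits | nv-limits |
| | nv-update-disable | nv-update-disable | nv-update-disable |
| | nvidia-acs-disable | nvidia-acs-disable | - |
| | nvidia-kernel-defaults | nvidia-kernel-defaults | nvidia-kernel-defaults |
| | nvidia-nvme-smartd | nvidia-nvme-smartd | - |
| | nvidia-pci-bridge-power | nvidia-pci-bridge-power | nvidia-pci-bridge-power |
| | nvidia-redfish-config | - | - |
| | nvidia-relaxed-ordering-gpu | - | - |
| | nvidia-relaxed-ordering-nvme | - | - |
| | nvgpu-services-list | nvgpu-services-list | nvgpu-services-list |
| Meta package | dgx-a100-system-tools | dgx2-system-tools | dgx1-system-tools |
| | dgx-release | dgx-release | dgx-release |
| | ipmitool | ipmitool | ipmitool |
| | nv-common-apis | nv-common-apis | nv-common-apis |
| | nv-env-paths | nv-env-paths | nv-env-paths |
| | nvidia-mig-manager | - | - |
| | nvidia-raid-config | nvidia-raid-config | - |
| | nvme-cli | nvme-cli | - |
| | tpm2-tools | tpm-tools | - |
| Meta package | dgx-a100-system-tools-extra | dgx2-system-tools-extra | dgx1-system-tools-extra |
| | msecli | msecli | - |
| | - | - | nvidia-raid-config |
| | - | - | storcli |
| Meta package | nvidia-mlnx-ofed-misc | nvidia-mlnx-ofed-misc | nvidia-mlnx-ofed-misc |
| | mlnx-fw-updater | mlnx-fw-updater | mlnx-fw-updater |
| | mlnx-pxe-setup | mlnx-pxe-setup | mlnx-pxe-setup |
| | nvidia-mlnx-config | nvidia-mlnx-config | nvidia-mlnx-config |
| | nvidia-peer-memory \| nvidia-peer-memory-dkms | nvidia-peer-memory \| nvidia-peer-memory-dkms | nvidia-peer-memory \| nvidia-peer-memory-dkms |
| Additional packages | nv-docker-options | nv-docker-options | nv-docker-options |
| | nvidia-logrotate | nvidia-logrotate | nvidia-logrotate |
| | nvidia-motd | nvidia-motd | nvidia-motd |
| | nvidia-ipmisol | nvidia-ipmisol | nvidia-ipmisol |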
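To check which packages from one of these lists are actually present on a system, the list can be compared against the set of installed packages. The following is a minimal sketch rather than an NVIDIA-provided tool: the function names `installed_packages` and `missing_packages` are hypothetical, the sample list is copied from the DGX A100 system-tools column of the table above, and the code assumes a Debian-based system on which `dpkg-query` is available.

```python
import subprocess

# Members of dgx-a100-system-tools, copied from the table above.
DGX_A100_SYSTEM_TOOLS = [
    "dgx-release", "ipmitool", "nv-common-apis", "nv-env-paths",
    "nvidia-mig-manager", "nvidia-raid-config", "nvme-cli", "tpm2-tools",
]


def installed_packages():
    """Return the set of package names that dpkg reports as installed."""
    out = subprocess.run(
        ["dpkg-query", "-W",
         "--showformat=${Package} ${db:Status-Abbrev}\n"],
        check=True, capture_output=True, text=True,
    ).stdout
    names = set()
    for line in out.splitlines():
        parts = line.split()
        # "ii" means: selected for installation and fully installed.
        if len(parts) == 2 and parts[1] == "ii":
            names.add(parts[0])
    return names


def missing_packages(expected, installed):
    """Return the expected packages that are absent, in list order."""
    return [name for name in expected if name not in installed]
```

Calling `missing_packages(DGX_A100_SYSTEM_TOOLS, installed_packages())` returns the listed packages that are not installed. Keeping the comparison separate from the `dpkg-query` call lets the same function be reused with any of the other columns.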
The following table describes in more detail the packages that are installed as part of the system configuration package:
| Package | Description | DGX-1 | DGX-2 | DGX A100 |
|---|---|---|---|---|
| dgx-release | Release information | R | R | R |
| nv-ast-modeset | Disable the Aspeed display driver. It can cause issues with connected monitors. The AST2xxx is the BMC used in our servers. [DGX-1, DGX-2, DGX A100, DGX Station A100] | R | R | R |
| nv-enable-nvme-hot-plug | Configure kernel parameters for NVMe hot plug (see also the DGX Kernel Parameters section below). | | R | |
| nv-hugepage | Sets the "transparent_hugepage=madvise" kernel parameter. | R | R | R |
| nv-iommu-pt | Sets iommu=pt for AMD Rome platforms. | | | R |
| nv-ipmi-devintf | Add the ipmi_devintf module for accessing the BMC using the ipmi tool. | R | R | R |
| nv-limits | Increase the process resource limits for users (ulimits nofile 50000) | R | R | R |
| nv-update-disable | Disable automatic system upgrades. Users need to explicitly upgrade their systems using apt. | R | R | R |
| nvgpu-services-list | Lists GPU-consuming services in .json format, such as DCGM or NVSM, and required by the firmware update mechanism. | R | R | R |
| nvidia-acs-disable | Disables the PCIe ACS capability to allow for better GPUDirect performance in bare-metal use cases on DGX A100. | | | R |
| nvidia-crashdump | Tools to manage kernel crash dumps. They are disabled by default. | R | R | R |
| nv-docker-options | Increases SHMEM and other resources. | R | R | R |
| nvidia-ipmisol [optional] | Enables serial output through the BMC (SOL, Serial over LAN). | O | O | O |
| nvidia-kernel-defaults | Restricts ARP announcements and replies for improved security by setting net.ipv4.conf.all.arp_announce = 2, net.ipv4.conf.all.arp_ignore = 1, net.ipv4.conf.default.arp_announce = 2, and net.ipv4.conf.default.arp_ignore = 1. | R | R | R |
| nvidia-logrotate | Modify the logrotate configuration | O | O | O |
| nvidia-motd | Modify message-of-the-day (MOTD) to display NVSM health monitoring alerts and release information. | O | O | O |
| nvidia-nvme-smartd | Enables SMART monitoring on NVMe devices. By default, smartd skips NVMe devices. | | R | R |
| nvidia-pci-bridge-power | Sets the bridge power control setting to “on” for all PCI bridges. | R | R | R |
| nvidia-relaxed-ordering-gpu | Sets a reg-key to enable PCIe relaxed ordering in the GPUs. | | | R |
| nvidia-relaxed-ordering-nvme | Installs a script that users can call to enable relaxed ordering on NVMe devices. | | | R |
| nvidia-redfish-config | Configures the Redfish interface with an interface name and IP address. The interface name is “bmc_redfish0”, while the IP address is read from DMI type 42. | | | R |
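The ARP settings listed for `nvidia-kernel-defaults` can be read back from `/proc/sys` to confirm they are in effect. The sketch below assumes the usual Linux mapping of a dotted sysctl key to a path under `/proc/sys`; the helper names `sysctl_path` and `mismatched_settings` are hypothetical, and the expected values are the ones given in the table above.

```python
from pathlib import Path

# Values listed for nvidia-kernel-defaults in the table above.
EXPECTED_ARP_SETTINGS = {
    "net.ipv4.conf.all.arp_announce": "2",
    "net.ipv4.conf.all.arp_ignore": "1",
    "net.ipv4.conf.default.arp_announce": "2",
    "net.ipv4.conf.default.arp_ignore": "1",
}


def sysctl_path(key, root="/proc/sys"):
    """Map a dotted sysctl key to its file under /proc/sys."""
    return Path(root).joinpath(*key.split("."))


def mismatched_settings(expected, root="/proc/sys"):
    """Return {key: actual} for each key whose value differs from expected.

    A key whose file cannot be read is reported with the value None.
    """
    result = {}
    for key, want in expected.items():
        try:
            actual = sysctl_path(key, root).read_text().strip()
        except OSError:
            actual = None
        if actual != want:
            result[key] = actual
    return result
```

An empty result from `mismatched_settings(EXPECTED_ARP_SETTINGS)` indicates that all four values match. The simple dot-to-slash mapping is sufficient for these keys but would not handle interface names that themselves contain dots.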
DGX Kernel Parameters
| Kernel Parameter | Description | Package |
|---|---|---|
| ast.modeset=0 | Disable the Aspeed display driver. The AST2xxx is the BMC used in our servers. [DGX-1, DGX-2, DGX A100, DGX Station A100] | nv-ast-modeset |
| crashkernel=1G-:0M | Does not reserve any memory for crash dumps (the default, with crash dumps disabled). | nvidia-crashdump |
| crashkernel=1G-:512M | Reserves 512 MB for crash dumps (when crash dumps are enabled). | nvidia-crashdump |
| pci=realloc=on | Allows the kernel to reallocate PCI resources if the allocations done by the BIOS are insufficient. This and pcie_ports=native are both required for NVMe hot-plug on DGX-2. | nv-enable-nvme-hot-plug |
| pcie_ports=native | Use Linux native services for PME, AER, DPC, and PCIe hot-plug, that is, not firmware-first. This and pci=realloc=on are both required for NVMe hot-plug on DGX-2. | nv-enable-nvme-hot-plug |
| transparent_hugepage=madvise | Disable transparent huge pages system-wide and enable them only inside MADV_HUGEPAGE madvise regions, to prevent applications from allocating more memory resources than necessary. | nv-hugepage |
| iommu=pt | Enable pass through mode only and disable DMA translations. This enables optimizations for the CPU inside the DGX A100. | nv-iommu-pt |
| console=ttyS1,115200n8 | Set console to serial port 1, using 115200 baud, no parity, 8 data bits [DGX-2] | nvidia-ipmisol |
| console=ttyS0,115200n8 | Set console to serial port 0, using 115200 baud, no parity, 8 data bits | nvidia-ipmisol |
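Whether a given parameter from this table is active on a running system can be checked against the kernel command line, which Linux exposes in `/proc/cmdline`. The following is a minimal sketch, not part of the DGX software: the function names `parse_cmdline` and `has_parameter` are hypothetical, and the parser splits on whitespace only, so it does not handle quoted values that contain spaces.

```python
def parse_cmdline(text):
    """Split a kernel command line into a {name: [values]} mapping.

    Flags without "=" map to [None]. A parameter given more than once
    (for example console=) keeps every value in order of appearance.
    Only the first "=" separates name and value, so "pci=realloc=on"
    is stored as name "pci" with value "realloc=on".
    """
    params = {}
    for token in text.split():
        name, sep, value = token.partition("=")
        params.setdefault(name, []).append(value if sep else None)
    return params


def has_parameter(params, setting):
    """Return True if "name=value" (or a bare flag) appears in params."""
    name, sep, value = setting.partition("=")
    return (value if sep else None) in params.get(name, [])


def read_cmdline(path="/proc/cmdline"):
    """Read the command line of the running kernel."""
    with open(path) as handle:
        return handle.read()
```

For example, `has_parameter(parse_cmdline(read_cmdline()), "transparent_hugepage=madvise")` reports whether the `nv-hugepage` setting is present. The comparison is an exact match on the full value, so a parameter that combines several comma-separated options would need additional handling.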