Appendix C: DGX Software Stack
This table lists all packages that are installed as part of the corresponding meta package (highlighted in bold):
DGX A100 | DGX-2 | DGX-1 |
---|---|---|
**dgx-a100-system-configurations:** | **dgx2-system-configurations:** | **dgx1-system-configurations:** |
dgx-release | dgx-release | dgx-release |
- | - | nv-ast-modeset |
nvidia-crashdump | nvidia-crashdump | nvidia-crashdump |
- | nv-enable-nvme-hot-plug | - |
nv-hugepage | nv-hugepage | nv-hugepage |
nv-iommu-pt | - | - |
nv-ipmi-devintf | nv-ipmi-devintf | nv-ipmi-devintf |
nv-limits | nv-limits | nv-limits |
nv-update-disable | nv-update-disable | nv-update-disable |
nvidia-acs-disable | nvidia-acs-disable | - |
nvidia-kernel-defaults | nvidia-kernel-defaults | nvidia-kernel-defaults |
nvidia-nvme-smartd | nvidia-nvme-smartd | - |
nvidia-pci-bridge-power | nvidia-pci-bridge-power | nvidia-pci-bridge-power |
nvidia-redfish-config | - | - |
nvidia-relaxed-ordering-gpu | - | - |
nvidia-relaxed-ordering-nvme | - | - |
nvgpu-services-list | - | - |
**dgx-a100-system-tools:** | **dgx2-system-tools:** | **dgx1-system-tools:** |
dgx-release | dgx-release | dgx-release |
ipmitool | ipmitool | ipmitool |
nv-common-apis | nv-common-apis | nv-common-apis |
nv-env-paths | nv-env-paths | nv-env-paths |
nvidia-mig-manager | - | - |
nvidia-raid-config | nvidia-raid-config | nvidia-raid-config |
nvme-cli | nvme-cli | - |
tpm2-tools | tpm-tools | - |
**dgx-a100-system-tools-extra:** | **dgx2-system-tools-extra:** | **dgx1-system-tools-extra:** |
msecli | msecli | storcli |
**nvidia-mlnx-ofed-misc:** | | |
mlnx-fw-updater | | |
mlnx-pxe-setup | | |
nvidia-mlnx-config | | |
nvidia-peer-memory, nvidia-peer-memory-dkms | | |
Additional NVIDIA packages | | |
nv-docker-options | | |
nvidia-logrotate | | |
nvidia-motd | | |
nvidia-ipmisol | | |
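The per-platform columns above can be treated as sets, which makes differences between systems easy to spot. The following is a minimal sketch that transcribes only the system-tools rows from the table; the `SYSTEM_TOOLS` dictionary and the `only_on` and `common_to_all` helpers are illustrative names, not part of any DGX tooling.

```python
# Packages listed under the *-system-tools meta packages, transcribed from
# the table above. The dictionary layout is an illustrative assumption.
SYSTEM_TOOLS = {
    "DGX A100": {
        "dgx-release", "ipmitool", "nv-common-apis", "nv-env-paths",
        "nvidia-mig-manager", "nvidia-raid-config", "nvme-cli", "tpm2-tools",
    },
    "DGX-2": {
        "dgx-release", "ipmitool", "nv-common-apis", "nv-env-paths",
        "nvidia-raid-config", "nvme-cli", "tpm-tools",
    },
    "DGX-1": {
        "dgx-release", "ipmitool", "nv-common-apis", "nv-env-paths",
        "nvidia-raid-config",
    },
}


def only_on(platform, table=SYSTEM_TOOLS):
    """Return packages listed for `platform` and for no other platform."""
    others = set().union(
        *(pkgs for name, pkgs in table.items() if name != platform)
    )
    return sorted(table[platform] - others)


def common_to_all(table=SYSTEM_TOOLS):
    """Return packages listed for every platform in the table."""
    return sorted(set.intersection(*table.values()))
```

With the rows as transcribed, `only_on("DGX A100")` returns `['nvidia-mig-manager', 'tpm2-tools']` and `only_on("DGX-2")` returns `['tpm-tools']`, which reflects the different TPM tool packages listed for the two systems.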
The following table describes in more detail the packages that are installed as part of the system configuration package:
Package |
Description |
1 |
2 |
A |
---|---|---|---|---|
dgx-release |
Release information |
R |
R |
R |
nv-ast-modeset |
Disable the Aspeed display driver, which can cause issues with connected monitors. The AST2xxx is the BMC used in our servers. [DGX-1, DGX-2, DGX A100, DGX Station A100] |
R |
R |
R |
nv-enable-nvme-hot-plug |
Configure kernel parameters for NVMe hot plug (see also kernel section below). |
R |
||
nv-hugepage |
Sets the “transparent_hugepage=madvise” kernel parameter. |
R |
R |
R |
nv-iommu-pt |
Sets iommu=pt for AMD Rome platforms. |
R |
||
nv-ipmi-devintf |
Add the ipmi_devintf module for accessing the BMC using ipmitool. |
R |
R |
R |
nv-limits |
Increase the process resource limits for users (ulimit nofile 50000). |
R |
R |
R |
nv-update-disable |
Disable automatic system upgrades. Users need to explicitly upgrade their systems using apt. |
R |
R |
R |
nvgpu-services-list |
Lists GPU-consuming services, such as DCGM or NVSM, in JSON format. Required by the firmware update mechanism. |
R |
R |
R |
nvidia-acs-disable |
Disables the PCIe ACS capability to allow for better GPU-direct performance in bare-metal use cases on DGX A100. |
R |
||
nvidia-crashdump |
Tools to manage kernel crash dumps. Crash dumps are disabled by default. |
R |
R |
R |
nv-docker-options |
Increases SHMEM and other resources. |
R |
R |
R |
nvidia-ipmisol [optional] |
Enables serial output through the BMC using Serial over LAN (SOL). |
O |
O |
O |
nvidia-kernel-defaults |
Restrict ARP behavior for security improvements: net.ipv4.conf.all.arp_announce = 2, net.ipv4.conf.all.arp_ignore = 1, net.ipv4.conf.default.arp_announce = 2, net.ipv4.conf.default.arp_ignore = 1 |
R |
R |
R |
nvidia-logrotate |
Modify the logrotate configuration. |
O |
O |
O |
nvidia-motd |
Modify the message-of-the-day (MOTD) to display NVSM health monitoring alerts and release information. |
O |
O |
O |
nvidia-nvme-smartd |
Enables SMART monitoring on NVMe devices. By default, smartd skips NVMe devices. |
R |
R |
|
nvidia-pci-bridge-power |
Sets the bridge power control setting to “on” for all PCI bridges. |
R |
R |
R |
nvidia-relaxed-ordering-gpu |
Sets a reg-key to enable PCIe relaxed ordering in the GPUs. |
R |
||
nvidia-relaxed-ordering-nvme |
Installs a script that users can call to enable relaxed ordering in NVMe devices. |
R |
||
nvidia-redfish-config |
Configures the Redfish interface with an interface name and IP address. The interface name is “bmc_redfish0”, while the IP address is read from DMI type 42. |
R |
Legend:
- 1: DGX-1
- 2: DGX-2
- A: DGX A100
- R: Required package
- O: Optional package
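The ARP settings listed for nvidia-kernel-defaults are ordinary sysctl values, so they can be compared against a running system. The sketch below is one possible way to do that, assuming the four settings expand to the full `net.ipv4.conf.*` names shown; `EXPECTED_ARP_SYSCTLS`, `sysctl_path`, `parse_sysctl_output`, and `mismatches` are hypothetical helper names, not something the package provides.

```python
# Expected values, taken from the nvidia-kernel-defaults row in the table.
# The fully expanded sysctl names are an assumption based on that row.
EXPECTED_ARP_SYSCTLS = {
    "net.ipv4.conf.all.arp_announce": "2",
    "net.ipv4.conf.all.arp_ignore": "1",
    "net.ipv4.conf.default.arp_announce": "2",
    "net.ipv4.conf.default.arp_ignore": "1",
}


def sysctl_path(name):
    """Map a dotted sysctl name to its file under /proc/sys.

    This simple replacement is fine for the names above; it would be wrong
    for interface names that themselves contain dots (for example VLANs).
    """
    return "/proc/sys/" + name.replace(".", "/")


def parse_sysctl_output(text):
    """Parse 'key = value' lines, the format printed by `sysctl -a`."""
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def mismatches(actual, expected=EXPECTED_ARP_SYSCTLS):
    """Return {name: (expected, actual_or_None)} for settings that differ."""
    return {
        name: (want, actual.get(name))
        for name, want in expected.items()
        if actual.get(name) != want
    }
```

On a live system, the text passed to `parse_sysctl_output` could come from running `sysctl -a`, or each value could be read directly from the file returned by `sysctl_path`. An empty result from `mismatches` means all four settings match the table.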
Kernel Parameter | Description | Package |
---|---|---|
ast.modeset=0 | Disable the Aspeed display driver. The AST2xxx is the BMC used in our servers. [DGX-1, DGX-2, DGX A100, DGX Station A100] | nv-ast-modeset |
crashkernel=1G-:0M | Don’t reserve any memory for crash dumps (the default, when crash dumps are disabled). | nvidia-crashdump |
crashkernel=1G-:512M | Reserve 512 MB for crash dumps (when crash dumps are enabled). | nvidia-crashdump |
pci=realloc=on | Allows the kernel to reallocate PCI resources if the allocations done by the BIOS are insufficient. This and pcie_ports=native are both required for NVMe hot-plug on DGX-2. | nv-enable-nvme-hot-plug |
pcie_ports=native | Use Linux native services for PME, AER, DPC, and PCIe hotplug, that is, not firmware first. This and pci=realloc=on are both required for NVMe hot-plug on DGX-2. | nv-enable-nvme-hot-plug |
transparent_hugepage=madvise | Disable transparent huge pages system-wide and only enable them inside MADV_HUGEPAGE madvise regions to prevent applications from allocating more memory resources than necessary. | nv-hugepage |
iommu=pt | Enable pass-through mode only and disable DMA translations. This enables optimizations for the CPU inside the DGX A100. | nv-iommu-pt |
console=ttyS1,115200n8 | Set console to serial port 1, using 115200 baud, no parity, 8 data bits [DGX-2] | nvidia-ipmisol |
console=ttyS0,115200n8 | Set console to serial port 0, using 115200 baud, no parity, 8 data bits | nvidia-ipmisol |
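To confirm that the parameters in this table are active, the running kernel's command line can be compared against the expected entries. The following is a minimal sketch, assuming a simple whitespace-separated command line with no quoted values; `parse_cmdline` and `missing_parameters` are illustrative helper names, and the sample command line in the usage note is made up for demonstration.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {parameter: [values]}.

    Only the first '=' separates name and value, so 'pci=realloc=on'
    becomes {'pci': ['realloc=on']}. Flags without '=' map to [None].
    A parameter may appear more than once (for example 'console'),
    so every value is kept in a list.
    """
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params.setdefault(name, []).append(value if sep else None)
    return params


def missing_parameters(cmdline, expected):
    """Return the entries from `expected` that are absent from `cmdline`."""
    params = parse_cmdline(cmdline)
    missing = []
    for entry in expected:
        name, sep, value = entry.partition("=")
        wanted = value if sep else None
        if wanted not in params.get(name, []):
            missing.append(entry)
    return missing
```

For example, given the hypothetical command line `"root=/dev/md0 ro pci=realloc=on pcie_ports=native console=tty0 console=ttyS1,115200n8"`, calling `missing_parameters` with `["pci=realloc=on", "pcie_ports=native", "transparent_hugepage=madvise"]` returns `['transparent_hugepage=madvise']`. On a live system the command line text can be read from `/proc/cmdline`.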