DGX Software Stack#
NVIDIA DGX Software Packages#
The following tables list the packages installed as part of the DGX Software Stack, grouped by metapackage name.
nvidia-system-core#
Package Name |
Description |
---|---|
cuda-compute-repo |
CUDA compute repository configuration files. |
dgx-release |
Updates the DGX OS release information. |
dgx-repo |
DGX repository configuration files. |
hpc-sdk-repo |
NVIDIA HPC SDK repository configuration files. |
msecli |
Micron Storage Executive CLI. |
nv-common-apis |
Install scripts commonly used by NVIDIA systems. |
nv-cpu-governor |
Set CPU governor to performance. |
nv-env-paths |
Configure the PATH variable. |
nv-grubmenu |
Make Grub menu visible. |
nv-grubserial |
Display GRUB menu over a serial console. |
nv-iommu |
Enable the IOMMU in passthrough mode; enable intel_iommu on systems with an Emerald Rapids CPU. |
nv-ipmi-devintf |
Load the ipmi_devintf module. |
nv-limits |
Increase the file limit. |
nv-update-disable |
Disable OS update prompt. |
nvgpu-services-list |
List all GPU-related services. |
nvidia-acs-disable |
Disable the PCIe ACS capability. |
nvidia-crashdump |
NVIDIA crash dump policy. |
nvidia-disable-init-on-alloc |
Disable heap memory zeroing on allocation. |
nvidia-disable-numa-balancing |
Disable automatic page fault NUMA memory balancing. |
nvidia-earlycon |
Set up the early console with no options. |
nvidia-enable-power-meter-cap |
Enable power capping functionality in ACPI power meter. |
nvidia-esm-hook-epilogue |
NVIDIA package to clarify ESM policy. |
nvidia-fs-loader |
Load the nvidia-fs module. |
nvidia-ipmisol |
Enable IPMI Serial-over-LAN. |
nvidia-kernel-defaults |
The sysctl default kernel settings for DGX. |
nvidia-mig-manager |
NVIDIA MIG Partition Editor and Systemd Service. |
nvidia-nvme-options |
Automatically enables NVMe Interrupt Coalescing at bootup on all Samsung and Kioxia drives. |
nvidia-pci-bridge-power |
Set the PCI bridge power control to on. |
nvidia-pci-realloc |
Force PCI resource reallocation. |
nvidia-raid-config |
DGX RAID Configuration. |
nvidia-redfish-config |
Configure Redfish Host Interface. |
nvidia-relaxed-ordering-gpu |
Configure PCIe Relaxed Ordering. |
nvidia-relaxed-ordering-nvme |
Configure PCIe Relaxed Ordering. |
nvidia-repo-keys |
Add keys to apt trusted.gpg database. |
nvidia-system-utils#
Package Name |
Description |
---|---|
nv-persistence-mode |
Enable persistence mode. |
nvidia-conf-cachefilesd |
Systemd settings for cachefilesd. |
nvidia-fs-loader |
Load the nvidia-fs module. |
nvidia-logrotate |
NVIDIA logrotate policy. |
nvidia-motd |
Custom motd files for NVIDIA platforms. |
nvsm |
REST API services for DGX System Management. |
nvidia-system-mlnx-drivers#
Package Name |
Description |
---|---|
doca-ofed |
The doca-ofed metapackage. |
doca-repo |
DOCA repository configuration files. |
mlnx-nfsrdma-dkms |
DKMS support for NFS RDMA kernel module. |
mlnx-nvme-dkms |
DKMS support for nvme kernel module. |
mlnx-pxe-setup |
Provide a script to enable PXE booting using Mellanox cards. |
nvidia-ib-umad-loader |
Load the ib_umad module. |
nvidia-mlnx-config |
Configure the MLNX devices. |
mlnx-fw-updater |
Firmware update binaries and utility. |
DGX Kernel Parameters, System Configuration Settings, and Runtime Commands#
Kernel Parameters#
Parameter Name |
Description |
Package |
Location |
---|---|---|---|
crashkernel |
Amount of memory to use for crash dumps. |
nvidia-crashdump |
/etc/default/grub.d/kdump-tools.cfg |
console=ttyS[0-1],115200n8 |
Set the console to serial port 0 or 1, using 115200 baud, no parity, and 8 data bits. For dgx-h100 and dgx-h800: console=ttyS0,115200n8. Other system types: console=ttyS1,115200n8. |
nvidia-ipmisol |
/etc/default/grub.d/ipmisol.cfg |
iommu=pt |
Enable pass through mode only and disable DMA translations. This enables optimizations for the CPU inside the DGX A100. |
nv-iommu |
/etc/default/grub.d/iommu.cfg |
pci=realloc=on |
Allow the kernel to reallocate PCI resources if the allocations made by the BIOS are insufficient. This and pcie_ports=native are both required for NVMe hot-plug on DGX2. |
nv-enable-nvme-hot-plug |
/etc/default/grub.d/enable-nvme-hot-plug.cfg |
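To confirm which of these parameters are active on a running system, the kernel command line can be inspected. The following is a minimal sketch, not part of the DGX tooling: the helper name `parse_cmdline` and the sample string are illustrative assumptions. On a live system, the same function could be applied to the contents of `/proc/cmdline`.

```python
def parse_cmdline(cmdline: str) -> dict[str, list[str]]:
    """Map each kernel parameter name to the list of values it was given.

    A parameter may appear more than once (for example, console=), so every
    value is kept. Only the first "=" splits name from value, which keeps
    nested forms such as pci=realloc=on intact.
    """
    params: dict[str, list[str]] = {}
    for token in cmdline.split():
        name, _, value = token.partition("=")
        params.setdefault(name, []).append(value)
    return params


# Illustrative command line only; a real one comes from /proc/cmdline.
sample = (
    "BOOT_IMAGE=/boot/vmlinuz root=/dev/md0 ro "
    "console=tty0 console=ttyS0,115200n8 iommu=pt pci=realloc=on"
)
params = parse_cmdline(sample)

print(params["console"])   # ['tty0', 'ttyS0,115200n8']
print(params["iommu"])     # ['pt']
print(params["pci"])       # ['realloc=on']
```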
System Configuration Settings#
Parameter Name |
Description |
Package |
Location |
---|---|---|---|
net.ipv4.conf.all.arp_announce = 2 |
Always use the best local address for this target. |
nvidia-kernel-defaults |
/etc/sysctl.d/20-nvidia-defaults.conf |
net.ipv4.conf.default.arp_announce = 2 |
Always use the best local address for this target. |
nvidia-kernel-defaults |
/etc/sysctl.d/20-nvidia-defaults.conf |
net.ipv4.conf.all.arp_ignore = 1 |
Only reply to ARP requests on the interface that contains the target IP address. |
nvidia-kernel-defaults |
/etc/sysctl.d/20-nvidia-defaults.conf |
net.ipv4.conf.default.arp_ignore = 1 |
Only reply to ARP requests on the interface that contains the target IP address. |
nvidia-kernel-defaults |
/etc/sysctl.d/20-nvidia-defaults.conf |
NVreg_EnablePCIERelaxedOrderingMode=1 |
Set a reg-key to enable PCIe relaxed-ordering in the GPUs. |
nvidia-relaxed-ordering-gpu |
/etc/modprobe.d/nvidia-relaxed-ordering.conf |
kernel.panic_on_unrecovered_nmi = 1 |
Configure the system to panic on unrecoverable NMI (Non-Maskable Interrupt). |
nvidia-crashdump |
/etc/sysctl.d/90-dgx-crashdump.conf |
kernel.unknown_nmi_panic = 1 |
Configure the system to panic on an unknown NMI. |
nvidia-crashdump |
/etc/sysctl.d/90-dgx-crashdump.conf |
kernel.hardlockup_panic = 1 |
Configure the system to panic when a hard lockup is detected. |
nvidia-crashdump |
/etc/sysctl.d/90-dgx-crashdump.conf |
kernel.panic_on_io_nmi = 1 |
Configure the system to panic on I/O NMI. |
nvidia-crashdump |
/etc/sysctl.d/90-dgx-crashdump.conf |
kernel.softlockup_panic = 1 |
Configure the system to panic when a soft lockup is detected. |
nvidia-crashdump |
/etc/sysctl.d/90-dgx-crashdump.conf |
kernel.panic_on_oops = 1 |
Configure the system to panic when an Oops occurs. |
nvidia-crashdump |
/etc/sysctl.d/90-dgx-crashdump.conf |
kernel.hung_task_panic = 1 |
Configure the system to panic when a hung task is detected. |
nvidia-crashdump |
/etc/sysctl.d/90-dgx-crashdump.conf |
kernel.panic_on_rcu_stall = 1 |
Configure the system to panic when an RCU stall is detected. |
nvidia-crashdump |
/etc/sysctl.d/90-dgx-crashdump.conf |
kernel.panic = 30 |
Configure the system to reboot after 30 seconds if a panic occurs. |
nvidia-crashdump |
/etc/sysctl.d/90-dgx-crashdump.conf |
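The sysctl entries in this table use the usual `key = value` file format, and each key corresponds to a path under `/proc/sys` with the dots replaced by slashes. The sketch below illustrates that mapping; the function names are hypothetical, the sample text is copied from a few rows of the table rather than from an actual file, and the simple dot-to-slash conversion assumes no key component (such as an interface name) itself contains a dot.

```python
def parse_sysctl_conf(text: str) -> dict[str, str]:
    """Parse sysctl.d-style 'key = value' lines, skipping blanks and comments."""
    settings: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def proc_path(key: str) -> str:
    """Return the /proc/sys path that holds the live value of a sysctl key."""
    return "/proc/sys/" + key.replace(".", "/")


# Sample built from rows of the table above, for illustration only.
sample = """
# crash dump policy
kernel.panic_on_oops = 1
kernel.panic = 30
net.ipv4.conf.all.arp_ignore = 1
"""

for key, expected in parse_sysctl_conf(sample).items():
    print(f"{proc_path(key)} should contain {expected}")
```

On a DGX system, comparing each expected value with the contents of the corresponding `/proc/sys` file would show whether the packaged defaults are in effect.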
Runtime Commands#
Command |
Description |
Package |
Location |
---|---|---|---|
setpci -d ::207 68.w=5000:f000 |
Set MaxReadReq size to 4 KB for ConnectX-6 Network (2) InfiniBand (07) devices. Only needed for DGX A100/A800. |
nvidia-mlnx-config |
/etc/systemd/system/nvidia-mlnx-config.service |
mlxconfig -y -d <device> set ADVANCED_PCI_SETTINGS=1 |
Configure Mellanox network device PCI settings. Enables advanced PCI settings. Only needed for DGX A100/A800. |
nvidia-mlnx-config |
/usr/bin/nvidia-mlnx-config.sh |
mlxconfig -y -d <device> set MAX_ACC_OUT_READ=44 |
Configure Mellanox network device PCI settings. Sets maximum accumulated outbound read requests to 44. Only needed for DGX A100/A800. |
nvidia-mlnx-config |
/usr/bin/nvidia-mlnx-config.sh |
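The `setpci` argument `68.w=5000:f000` uses the `value:mask` form: only the bits set in the mask (`f000`, bits 15:12) are changed, and they take the corresponding bits of the value (`5000`). Assuming offset `0x68` is the PCI Express Device Control register on these devices, as the description implies, bits 14:12 encode the Max_Read_Request_Size as 128 bytes shifted left by the field value. The arithmetic can be sketched as follows; the starting register contents are a made-up example.

```python
def apply_masked_write(current: int, value: int, mask: int) -> int:
    """Mimic setpci's value:mask write: only bits set in mask are replaced."""
    return (current & ~mask) | (value & mask)


def max_read_request_bytes(devctl: int) -> int:
    """Decode bits 14:12 of the PCIe Device Control register (128 B << field)."""
    return 128 << ((devctl >> 12) & 0x7)


before = 0x2936  # hypothetical register contents; field value 2 means 512 bytes
after = apply_masked_write(before, 0x5000, 0xF000)

print(hex(before), max_read_request_bytes(before))  # 0x2936 512
print(hex(after), max_read_request_bytes(after))    # 0x5936 4096
```

The field value 5 gives 128 << 5 = 4096 bytes, which matches the 4 KB stated in the table, and the low 12 bits of the register are left untouched.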
DGX Platform JSON Configuration#
{
"dgx_a800":
{
"PlatformType": "DGX A800",
"GrepStr": "^.*920-23687-2535-.*",
"ConfigureDGXA100Raid": "True",
"ConfigureDGXStationA100Raid": "False",
"NVMERelaxedOrdering": "True",
"GpuRelaxedOrdering": "True",
"EnablePowerMeterCap": "False",
"BMCPasswordMinLength": "13",
"BMCPasswordMaxLength": "20",
"BMCPasswordSupportsZerofill": "True",
"BMCPasswordComplexityReq": "False",
"NVSMAlertsSupported": "True",
"NeedsMRRSConfig": "True",
"NeedsAccBytesTuning": "True",
"IPMIDefaultSerialTTY": "ttyS1",
"NeedsOEMXconfigOverride": "False",
"NeedsInitialNvidiaXconfig": "False",
"NeedsAdaptiveNvidiaXconfig": "False",
"NeedsContainerdOverride": "False",
"NeedsOemConfigPostActNetplanApply": "False",
"UsesFabricManager": "True",
"IsDgxServerType": "True",
"IsDgxDesktopType": "False",
"NeedsDisableNumaBalance": "False",
"NeedsIommuPt": "True",
"NeedsDisableInitOnAlloc": "False",
"NeedsEarlycon": "False",
"PciRealloc": "",
"CrashdumpMem": "2048M,high"
},
"dgx_a100":
{
"PlatformType": "DGX A100",
"GrepStr": "^.*920-23687-.*\\|^.*675-23287-.*\\|^.*920-23287-.*",
"ConfigureDGXA100Raid": "True",
"ConfigureDGXStationA100Raid": "False",
"NVMERelaxedOrdering": "True",
"GpuRelaxedOrdering": "True",
"EnablePowerMeterCap": "False",
"BMCPasswordMinLength": "13",
"BMCPasswordMaxLength": "20",
"BMCPasswordSupportsZerofill": "True",
"BMCPasswordComplexityReq": "False",
"NVSMAlertsSupported": "True",
"NeedsMRRSConfig": "True",
"NeedsAccBytesTuning": "True",
"IPMIDefaultSerialTTY": "ttyS1",
"NeedsOEMXconfigOverride": "False",
"NeedsInitialNvidiaXconfig": "False",
"NeedsAdaptiveNvidiaXconfig": "False",
"NeedsContainerdOverride": "False",
"NeedsOemConfigPostActNetplanApply": "False",
"UsesFabricManager": "True",
"IsDgxServerType": "True",
"IsDgxDesktopType": "False",
"NeedsDisableNumaBalance": "False",
"NeedsIommuPt": "True",
"NeedsDisableInitOnAlloc": "False",
"NeedsEarlycon": "False",
"PciRealloc": "",
"CrashdumpMem": "1G-:1024M"
},
"dgx_h100":
{
"PlatformType": "DGX H100",
"GrepStr": "^.*DGXH100.*\\|^.*DGX H100.*",
"ConfigureDGXA100Raid": "True",
"ConfigureDGXStationA100Raid": "False",
"NVMERelaxedOrdering": "True",
"GpuRelaxedOrdering": "False",
"EnablePowerMeterCap": "False",
"BMCPasswordMinLength": "13",
"BMCPasswordMaxLength": "20",
"BMCPasswordSupportsZerofill": "True",
"BMCPasswordComplexityReq": "True",
"NVSMAlertsSupported": "True",
"NeedsMRRSConfig": "True",
"NeedsAccBytesTuning": "False",
"IPMIDefaultSerialTTY": "ttyS0",
"NeedsOEMXconfigOverride": "False",
"NeedsInitialNvidiaXconfig": "False",
"NeedsAdaptiveNvidiaXconfig": "False",
"NeedsContainerdOverride": "False",
"NeedsOemConfigPostActNetplanApply": "False",
"UsesFabricManager": "True",
"IsDgxServerType": "True",
"IsDgxDesktopType": "False",
"NeedsDisableNumaBalance": "False",
"NeedsIommuPt": "True",
"NeedsDisableInitOnAlloc": "False",
"NeedsEarlycon": "False",
"PciRealloc": "off",
"CrashdumpMem": "1G-:1024M"
},
"dgx_h200":
{
"PlatformType": "DGX H200",
"GrepStr": "^.*DGXH200.*\\|^.*DGX H200.*",
"ConfigureDGXA100Raid": "True",
"ConfigureDGXStationA100Raid": "False",
"NVMERelaxedOrdering": "True",
"GpuRelaxedOrdering": "False",
"EnablePowerMeterCap": "False",
"BMCPasswordMinLength": "13",
"BMCPasswordMaxLength": "20",
"BMCPasswordSupportsZerofill": "True",
"BMCPasswordComplexityReq": "True",
"NVSMAlertsSupported": "True",
"NeedsMRRSConfig": "True",
"NeedsAccBytesTuning": "False",
"IPMIDefaultSerialTTY": "ttyS0",
"NeedsOEMXconfigOverride": "False",
"NeedsInitialNvidiaXconfig": "False",
"NeedsAdaptiveNvidiaXconfig": "False",
"NeedsContainerdOverride": "False",
"NeedsOemConfigPostActNetplanApply": "False",
"UsesFabricManager": "True",
"IsDgxServerType": "True",
"IsDgxDesktopType": "False",
"NeedsDisableNumaBalance": "False",
"NeedsIommuPt": "True",
"NeedsDisableInitOnAlloc": "False",
"NeedsEarlycon": "False",
"PciRealloc": "off",
"CrashdumpMem": "1G-:1024M"
},
"dgx_h800":
{
"PlatformType": "DGX H100",
"GrepStr": "^.*DGXH800.*\\|^.*DGX H800.*",
"ConfigureDGXA100Raid": "True",
"ConfigureDGXStationA100Raid": "False",
"NVMERelaxedOrdering": "True",
"GpuRelaxedOrdering": "False",
"EnablePowerMeterCap": "False",
"BMCPasswordMinLength": "13",
"BMCPasswordMaxLength": "20",
"BMCPasswordSupportsZerofill": "True",
"BMCPasswordComplexityReq": "True",
"NVSMAlertsSupported": "True",
"NeedsMRRSConfig": "True",
"NeedsAccBytesTuning": "False",
"IPMIDefaultSerialTTY": "ttyS0",
"NeedsOEMXconfigOverride": "False",
"NeedsInitialNvidiaXconfig": "False",
"NeedsAdaptiveNvidiaXconfig": "False",
"NeedsContainerdOverride": "False",
"NeedsOemConfigPostActNetplanApply": "False",
"UsesFabricManager": "True",
"IsDgxServerType": "True",
"IsDgxDesktopType": "False",
"NeedsDisableNumaBalance": "False",
"NeedsIommuPt": "True",
"NeedsDisableInitOnAlloc": "False",
"NeedsEarlycon": "False",
"PciRealloc": "off",
"CrashdumpMem": "1G-:1024M"
},
"dgx_b200":
{
"PlatformType": "DGX B200",
"GrepStr": "^.*DGXB200.*\\|^.*DGX B200.*",
"ConfigureDGXA100Raid": "True",
"ConfigureDGXStationA100Raid": "False",
"NVMERelaxedOrdering": "True",
"GpuRelaxedOrdering": "False",
"EnablePowerMeterCap": "False",
"BMCPasswordMinLength": "13",
"BMCPasswordMaxLength": "20",
"BMCPasswordSupportsZerofill": "True",
"BMCPasswordComplexityReq": "True",
"NVSMAlertsSupported": "True",
"NeedsMRRSConfig": "True",
"NeedsAccBytesTuning": "False",
"IPMIDefaultSerialTTY": "ttyS0",
"NeedsOEMXconfigOverride": "False",
"NeedsInitialNvidiaXconfig": "False",
"NeedsAdaptiveNvidiaXconfig": "False",
"NeedsContainerdOverride": "False",
"NeedsOemConfigPostActNetplanApply": "False",
"UsesFabricManager": "True",
"IsDgxServerType": "True",
"IsDgxDesktopType": "False",
"NeedsDisableNumaBalance": "False",
"NeedsIommuPt": "True",
"NeedsDisableInitOnAlloc": "False",
"NeedsEarlycon": "False",
"CrashdumpMem": "2048M,high"
},
"dgxstation_a100":
{
"PlatformType": "DGXSTATION A100",
"GrepStr": "^.*920-23487-.*\\|^.*675-23487-.*\\|^DGX Station A100",
"ConfigureDGXA100Raid": "False",
"ConfigureDGXStationA100Raid": "True",
"NVMERelaxedOrdering": "True",
"GpuRelaxedOrdering": "True",
"HasRedfishIntf": "True",
"UsesNetplan": "False",
"EnablePowerMeterCap": "False",
"UsesNetworkManager": "True",
"BMCPasswordMinLength": "13",
"BMCPasswordMaxLength": "20",
"BMCPasswordSupportsZerofill": "True",
"BMCPasswordComplexityReq": "False",
"NVSMAlertsSupported": "True",
"NeedsMRRSConfig": "True",
"NeedsAccBytesTuning": "True",
"IPMIDefaultSerialTTY": "ttyS1",
"NeedsOEMXconfigOverride": "True",
"NeedsInitialNvidiaXconfig": "True",
"NeedsAdaptiveNvidiaXconfig": "True",
"NeedsContainerdOverride": "True",
"UsesFabricManager": "False",
"IsDgxServerType": "False",
"IsDgxDesktopType": "True",
"NeedsDisableNumaBalance": "False",
"NeedsIommuPt": "True",
"NeedsDisableInitOnAlloc": "False",
"NeedsEarlycon": "False",
"PciRealloc": "",
"CrashdumpMem": "1G-:1024M"
},
"dgxstation_a800":
{
"PlatformType": "DGXSTATION A800",
"GrepStr": "^.*920-23487-2535.*\\|^.*675-23487-0200.*",
"ConfigureDGXA100Raid": "False",
"ConfigureDGXStationA100Raid": "True",
"NVMERelaxedOrdering": "True",
"GpuRelaxedOrdering": "True",
"HasRedfishIntf": "True",
"UsesNetplan": "False",
"EnablePowerMeterCap": "False",
"UsesNetworkManager": "True",
"BMCPasswordMinLength": "13",
"BMCPasswordMaxLength": "20",
"BMCPasswordSupportsZerofill": "True",
"BMCPasswordComplexityReq": "False",
"NVSMAlertsSupported": "True",
"NeedsMRRSConfig": "True",
"NeedsAccBytesTuning": "True",
"IPMIDefaultSerialTTY": "ttyS1",
"NeedsOEMXconfigOverride": "True",
"NeedsInitialNvidiaXconfig": "True",
"NeedsAdaptiveNvidiaXconfig": "True",
"NeedsContainerdOverride": "True",
"UsesFabricManager": "False",
"IsDgxServerType": "False",
"IsDgxDesktopType": "True",
"NeedsDisableNumaBalance": "False",
"NeedsIommuPt": "True",
"NeedsDisableInitOnAlloc": "False",
"NeedsEarlycon": "False",
"PciRealloc": "",
"CrashdumpMem": "1G-:1024M"
},
"dgx_gb200":
{
"PlatformType": "DGX GB200",
"GrepStr": "^.*DGXGB200.*\\|^.*DGX GB200.*\\|^ *GB200 NVL.*",
"ConfigureDGXA100Raid": "True",
"ConfigureDGXStationA100Raid": "False",
"NVMERelaxedOrdering": "True",
"GpuRelaxedOrdering": "False",
"EnablePowerMeterCap": "True",
"BMCPasswordMinLength": "13",
"BMCPasswordMaxLength": "20",
"BMCPasswordSupportsZerofill": "True",
"BMCPasswordComplexityReq": "True",
"NVSMAlertsSupported": "True",
"NeedsMRRSConfig": "True",
"NeedsAccBytesTuning": "False",
"IPMIDefaultSerialTTY": "ttyS0",
"NeedsOEMXconfigOverride": "False",
"NeedsInitialNvidiaXconfig": "False",
"NeedsAdaptiveNvidiaXconfig": "False",
"NeedsContainerdOverride": "False",
"NeedsOemConfigPostActNetplanApply": "False",
"UsesFabricManager": "False",
"IsDgxServerType": "True",
"IsDgxDesktopType": "False",
"NeedsDisableNumaBalance": "True",
"NeedsIommuPt": "False",
"NeedsDisableInitOnAlloc": "True",
"NeedsEarlycon": "True",
"PciRealloc": "",
"CrashdumpMem": "2048M,high"
}
}
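The `GrepStr` values above look like grep basic regular expressions, where `\|` separates alternatives. The sketch below shows one way such an entry could be matched against a product name string in Python. It is not the mechanism the DGX packages use: the helper names, the abridged JSON, the sample product strings, and the first-match-wins ordering are all assumptions made for illustration.

```python
import json
import re
from typing import Optional

# Three entries abridged from the configuration above (only two fields kept).
PLATFORMS_JSON = r'''
{
  "dgx_a800": {"PlatformType": "DGX A800", "GrepStr": "^.*920-23687-2535-.*"},
  "dgx_a100": {"PlatformType": "DGX A100",
               "GrepStr": "^.*920-23687-.*\\|^.*675-23287-.*\\|^.*920-23287-.*"},
  "dgx_h100": {"PlatformType": "DGX H100",
               "GrepStr": "^.*DGXH100.*\\|^.*DGX H100.*"}
}
'''


def bre_to_python(pattern: str) -> str:
    """Convert grep-style alternation (backslash-pipe) to Python's plain pipe."""
    return pattern.replace(r"\|", "|")


def detect_platform(product_name: str, platforms: dict) -> Optional[str]:
    """Return the key of the first entry whose GrepStr matches, else None."""
    for key, entry in platforms.items():
        if re.search(bre_to_python(entry["GrepStr"]), product_name):
            return key
    return None


platforms = json.loads(PLATFORMS_JSON)

# Hypothetical product name strings, chosen only to exercise the patterns.
print(detect_platform("DGXH100", platforms))             # dgx_h100
print(detect_platform("920-23687-2535-000", platforms))  # dgx_a800
print(detect_platform("920-23687-2530-000", platforms))  # dgx_a100
print(detect_platform("Some Other Server", platforms))   # None
```

Because the `dgx_a800` pattern is a more specific form of the first `dgx_a100` alternative, this sketch relies on `dgx_a800` being checked first, which is the order in which the entries appear in the file.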
DGX Platform JSON Configuration Definitions#
Name |
Definition |
---|---|
PlatformType |
Printable string representation of the platform type (for example, DGX H100). |
GrepStr |
Regex matching the product name as reported by |
ConfigureDGXA100Raid |
Used by |
ConfigureDGXStationA100Raid |
Used to create RAID array with a DGX Station A100-like disk arrangement: single U.2, no RAID. |
NVMERelaxedOrdering |
Package installs the |
GpuRelaxedOrdering |
The nvidia-relaxed-ordering-gpus package calls this function to change GPU driver settings based on platform. |
EnablePowerMeterCap |
The package nvidia-enable-power-meter-cap configures a server to enable power capping in ACPI power meter for Grace based platforms. Setting this to |
BMCPasswordMinLength |
The nvidia-oem-config-plugins package creates the oem-config screens (EULA, BMC, and so on), which use this attribute to set BMC password requirements during ISO installation. |
BMCPasswordMaxLength |
The nvidia-oem-config-plugins package creates the oem-config screens (EULA, BMC, and so on), which use this attribute to set BMC password requirements during ISO installation. |
BMCPasswordSupportsZerofill |
The nvidia-oem-config-plugins package creates the oem-config screens (EULA, BMC, and so on), which use this attribute to set BMC password requirements during ISO installation. |
BMCPasswordComplexityReq |
The nvidia-oem-config-plugins package creates the oem-config screens (EULA, BMC, and so on), which use this attribute to set BMC password requirements during ISO installation. |
NVSMAlertsSupported |
NVSM is only supported on DGX Platforms. If NVSM is installed, |
NeedsMRRSConfig |
The nvidia-mlnx-config package uses this attribute to run mlxconfig and set various PCI settings. Sets MaxReadReq size to 4 KB for all Network (2) InfiniBand (07) devices on DGX A100, DGX A800, and DGX2 only. |
NeedsAccBytesTuning |
The nvidia-mlnx-config package uses this attribute to run mlxconfig and set various PCI settings. |
IPMIDefaultSerialTTY |
Sets a default IPMI serial console port in grub kernel parameters. |
NeedsOEMXconfigOverride |
For dgxstation_a100 or dgxstation_a800, nvidia-conf-xconfig creates an oem-config override service. |
NeedsInitialNvidiaXconfig |
For dgxstation_a100 or dgxstation_a800, nvidia-conf-xconfig calls nvidia-xconfig and creates an empty initial configuration. |
NeedsAdaptiveNvidiaXconfig |
For dgxstation_a100 or dgxstation_a800, nvidia-conf-xconfig calls nvidia-xconfig and creates an empty initial configuration. |
NeedsContainerdOverride |
The package nv-docker-gpus checks this for dgxstation_a100 or dgxstation_a800. In these cases, this package limits nvidia docker to use 3D Controller class GPUs. |
NeedsOemConfigPostActNetplanApply |
The nvidia-oem-config-postact package checks for DCS and DCS legacy platforms and forces a “netplan apply” after OEM ISO installation. |
UsesFabricManager |
For platforms from DGX2 up to DGX B200, the dgx-release-upgrade package checks this to install the proper nvidia-fabricmanager package for the GPU driver. |
IsDgxServerType |
During a release upgrade, |
IsDgxDesktopType |
During a release upgrade, |
NeedsDisableNumaBalance |
On Grace-based platforms, configures the system to disable automatic page fault NUMA memory balancing. |
NeedsIommuPt |
Set |
NeedsDisableInitOnAlloc |
Adds the option |
NeedsEarlycon |
Adds the option |
PciRealloc |
Determine whether to set grub to |
CrashdumpMem |
The kdump service uses this value to reserve the crash kernel memory for each kernel. The minimum size of the crash kernel can vary depending on the hardware and machine specifications. |
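The two `CrashdumpMem` forms that appear in the platform configuration, `1G-:1024M` and `2048M,high`, follow the kernel's `crashkernel=` syntax: the first is a `start-[end]:size` range rule that applies when system RAM falls in the range (start inclusive, end exclusive), and the second is a fixed size with the `high` qualifier. The sketch below interprets both forms to show how much memory each would reserve. It is an illustration only: the function names are hypothetical, and the optional `@offset` suffix is not handled.

```python
import re

_UNITS = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30, "T": 1 << 40}


def parse_size(text: str) -> int:
    """Convert a size such as '1024M' or '1G' to bytes."""
    match = re.fullmatch(r"(\d+)([KMGT]?)", text.strip().upper())
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit]


def crash_reservation(spec: str, ram_bytes: int) -> int:
    """Return the bytes reserved by a crashkernel spec for a given RAM size."""
    if ":" not in spec:
        # Fixed form: "size" or "size,high" (the qualifier affects placement only).
        return parse_size(spec.split(",")[0])
    for rule in spec.split(","):
        range_part, size = rule.split(":")
        start, _, end = range_part.partition("-")
        low = parse_size(start)
        high = parse_size(end) if end else None
        if ram_bytes >= low and (high is None or ram_bytes < high):
            return parse_size(size)
    return 0  # no range matched, so nothing is reserved


GIB = 1 << 30
MIB = 1 << 20

print(crash_reservation("1G-:1024M", 512 * GIB) // MIB)   # 1024
print(crash_reservation("1G-:1024M", 512 * MIB) // MIB)   # 0
print(crash_reservation("2048M,high", 512 * GIB) // MIB)  # 2048
```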