Customizing Ubuntu Installation with DGX Software#
This section explains the steps for installing and configuring Ubuntu and the NVIDIA DGX Software Stack on DGX systems.
DGX OS provides a customized Ubuntu installation with additional software from NVIDIA to provide a turnkey solution for running AI and analytics workloads. The additional software, the NVIDIA DGX Software Stack, comprises platform-specific configurations, diagnostic and monitoring tools, and drivers that are required for a stable, tested, and supported OS to run AI, machine learning, and analytics applications on DGX systems.
You also have the option to install the NVIDIA DGX Software Stack on top of a vanilla Ubuntu distribution while still benefiting from the advanced DGX features. This installation method supports more flexibility, such as custom partition schemes.
Cluster deployments also benefit from this installation method by taking advantage of Ubuntu’s standardized automated and non-interactive installation process. Starting with Ubuntu 20.04, the installer introduced a new mechanism for automating the installation, allowing system administrators to install a system unattended and non-interactively. You can find information for creating such a cloud-init configuration file in Cloud-init Configuration File.
For more information, refer to Ubuntu Automated Server Installation.
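As a rough illustration of what such a configuration file can look like, the sketch below writes a minimal autoinstall user-data file. The function name, hostname, username, and password hash are placeholders invented for this example, and the file deliberately omits a storage section, so it would fall back to the installer's default layout rather than the RAID-1 layout discussed later. Treat it as a starting point under those assumptions, not as the configuration NVIDIA ships.

```shell
#!/bin/sh
# Write a minimal autoinstall user-data file (sketch only).
# write_user_data and all values passed to it are hypothetical.
write_user_data() {
    out_file=$1      # where to write the file
    host_name=$2     # placeholder hostname
    user_name=$3     # placeholder admin user
    pw_hash=$4       # crypted password hash, e.g. from `openssl passwd -6`
    cat > "$out_file" <<EOF
#cloud-config
autoinstall:
  version: 1
  identity:
    hostname: $host_name
    username: $user_name
    password: "$pw_hash"
  ssh:
    install-server: true
EOF
}

# Example call (placeholder values):
#   write_user_data ./user-data dgx-node01 dgxadmin "$(openssl passwd -6)"
```

A real deployment would extend this file with network and storage sections that match the considerations described in the rest of this document.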
The intended audience is IT professionals managing a cluster of DGX systems and integration partners.
Prerequisites#
The following prerequisites are required or recommended, where indicated.
Ubuntu Software Requirements#
The DGX Software Stack requires the following software versions:
Ubuntu 24.04
Linux Kernel 6.8
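Before installing the DGX Software Stack, it can be useful to confirm that a host meets these two requirements. The following is a minimal sketch; the function name check_requirements is an assumption made for this example, and the kernel test only checks that the release string starts with 6.8.

```shell
#!/bin/sh
# Check an os-release file and a kernel release string against the
# versions listed above (Ubuntu 24.04, Linux kernel 6.8).
check_requirements() {
    os_release_file=$1   # normally /etc/os-release
    kernel_release=$2    # normally the output of `uname -r`
    # Source the file in a subshell so its variables do not leak.
    os_id=$( . "$os_release_file" && printf '%s/%s' "$ID" "$VERSION_ID" )
    if [ "$os_id" != "ubuntu/24.04" ]; then
        echo "unsupported OS: $os_id"
        return 1
    fi
    case $kernel_release in
        6.8.*) echo "ok" ;;
        *) echo "unsupported kernel: $kernel_release"; return 1 ;;
    esac
}

# On the target system:
#   check_requirements /etc/os-release "$(uname -r)"
```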
Access to Software Repositories#
The DGX Software Stack is available from repositories that can be accessed from the internet. If your installation does not allow connection to the internet, see appendix Air-Gapped Installations for information about installing and upgrading software on “air-gapped” systems.
If you are using a proxy server, follow the instructions in the section Network Configuration for setting up a proxy configuration.
Installation Considerations#
Installing the NVIDIA DGX Software Stack on Ubuntu allows you to select from additional configuration options that would otherwise not be available with the preconfigured DGX OS installer. This includes drive partitioning, filesystem choices, and software selection.
Before you start installing Ubuntu and the NVIDIA DGX Software Stack, you should evaluate the following options. The installation and configuration instructions will be covered in the respective section of this document.
System Drive Mirroring (RAID-1) [recommended]#
The DGX B200, DGX H100/H200, and DGX A100 systems embed two system drives for mirroring the OS partitions (RAID-1). This ensures data resiliency if one drive fails. To enable mirroring, you need to enable it during the drive configuration of the Ubuntu installation. It cannot be enabled after the installation.
Data Drive RAID-0 or RAID-5#
DGX systems are equipped with multiple data drives that can be configured as RAID-0 for performance or RAID-5 for resiliency. RAID-0 provides the maximum storage capacity and performance but does not provide any redundancy. If a single SSD in the array fails, all data stored on the array is lost.
RAID-0 is recommended for data caching. You can use cachefilesd to provide caching for NFS shares. The network file system (NFS) is required to take advantage of the cache file system. RAID-5 should be used for persistent data storage.
You have the option to configure RAID and data caching after the initial Ubuntu installation using the nvidia-config-raid tool or during the Ubuntu installation. The nvidia-config-raid tool is recommended for manual installation.
System Drive Encryption [optional]#
Root filesystem encryption is a software-based method to protect the content stored in the system partition(s) from unauthorized access by encrypting the data on-the-fly. It requires users to unlock the filesystem on every boot, either manually by entering a passphrase or automatically using a centralized key server.
System drive encryption can only be enabled during the installation of Ubuntu.
Data Drive Encryption [optional]#
Data drive encryption is only supported on DGX B200, DGX H100/H200, and DGX A100 systems equipped with self-encrypting drives (SED). It can be enabled after Ubuntu is installed using the nv-encrypt tool. The keys must be stored either in the TPM or in an external key-management system.
System Drive Partitioning#
Ubuntu uses only a single partition for the entire filesystem by default. This can be changed during the Ubuntu installation for deployments that require a more granular partition scheme for security reasons. The recommended partitioning scheme is to use only a single partition for the Linux root partition with the ext4 filesystem.
Installing Ubuntu#
There are several methods for installing Ubuntu as described in the Ubuntu Server Guide.
For convenience, this section provides additional instructions that are specific to DGX for installing Ubuntu following the Basic Installation. If you have a preferred method for installing Ubuntu, skip this section.
Steps that are covered in this section:
Connecting to the DGX system
Booting from the installation media
Running the Ubuntu installer (including network and storage configuration steps)
Booting from the Installation Media#
During the installation and configuration steps, you need to connect to the console of the DGX system. Refer to Connecting to the DGX System for more details.
Boot the Ubuntu ISO image in one of the following ways:
Remotely through the BMC for systems that provide a BMC
For instructions, refer to Installing the DGX OS Image Remotely Through the BMC.
Locally from a UEFI-bootable USB flash drive or DVD-ROM
For instructions, refer to Installing the DGX OS Image from a USB Flash Drive or DVD-ROM.
Running the Ubuntu Installer#
After booting the ISO image, the Ubuntu installer should start and guide you through the installation process.
Note
The screenshots in the following steps are taken from a DGX A100. Other DGX systems have differences in drive partitioning and networking.
During the boot process of the ISO image, you might see some error messages due to older drivers, etc. They can be safely ignored.

Select your language at the welcome screen, and then follow the instructions to select whether to update the installer (if offered) and to choose your keyboard.
At the Network connections screen, configure your network.
The installer tries to automatically retrieve a DHCP address for all network interfaces, so you should be able to continue without any changes. However, you also have the option to manually configure the interface(s).
At the Guided storage configuration screen, configure the partitioning and file systems. All DGX systems are shipped preinstalled with DGX OS. The drives are, therefore, already partitioned and formatted. The DGX OS installer configures a single ext4 partition for the root partition in addition to the EFI boot partition. You have the following options:
Keep the current partition layout and formatting [recommended].
Create a custom partition scheme [advanced].
Use a single disk with the default Ubuntu partition scheme.
Creating a new custom partition scheme with a RAID configuration is a more involved process and out of the scope for this document. Refer to the Ubuntu installation guide for more information. When you choose the option to use an entire disk, Ubuntu will only use one of the two redundant boot drives.
Note
The RAID level for the data drive can be changed after the installation of Ubuntu.
The following instructions describe the steps for keeping the current partition layout. It still requires you to re-create and reformat the partitions.
Select Custom storage layout, and then click Done.
Identify the system drive.
The system drive on the DGX A100, DGX H100/H200, and DGX B200 is a RAID 1 array and you should find it easily.
Select the system drive, and then click Format.
Set Format to ext4. Do not select Leave formatted as <filesystem>.
Set Mount to /.
Set the boot flag on the raw devices.
Identify the system drives under AVAILABLE DEVICES (not the RAID array) and select Use as Boot Device for the first device. On DGX A100, DGX H100/H200, and DGX B200 systems, which have two drives, repeat this process for the second drive and select Use as another Boot Device.
Complete the configuration.
RAID 0 array: In most cases, the RAID 0 array for the data drives will have already been created at the factory. If it has not been created, you can either create the array in the Storage Configurations dialog or by using the config_raid_array tool after completing the Ubuntu installation.
Enable drive encryption (Optional): Encryption can only be enabled during the Storage configuration; it cannot be changed after the installation. To change the encryption state again, you need to reinstall the OS. To enable drive encryption, you have to create a virtual group and volume. This is out of the scope for this document. Refer to the Ubuntu documentation for more details.
Swap partition: The default installation does not define a swap partition. Linux uses any configured swap partition for temporarily storing data when the system memory is full, incurring a performance hit. With the large memory of DGX systems, swapping is not recommended.
The FILE SYSTEM SUMMARY at the top of the page should display the root partition on the RAID 1 drive for / and a boot/efi partition (the two drives will only show up as a single entry).
Select Done and accept all changes.
Follow the instructions for the remaining tasks.
Create a default user in the Profile setup dialog and choose any additional SNAP package you want to install in the Featured Server Snaps screen.
Wait for the installation to complete.
Log messages are presented while the installation is running.
Select Reboot Now when the installation is complete to restart the system.
After reboot, you can log in using the username and password for the user you have created above.
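After the first login, you may want to confirm that the root filesystem really sits on a mirrored array. The sketch below assumes the mirror is a Linux software RAID (md) array that is reported in /proc/mdstat; the function name md_levels is invented for this example.

```shell
#!/bin/sh
# Print "<array> <level>" for each active md array in an mdstat file.
md_levels() {
    mdstat_file=${1:-/proc/mdstat}
    awk '$2 == ":" && $3 == "active" {
        # An optional state such as "(auto-read-only)" may precede the level.
        lvl = ($4 ~ /^\(/) ? $5 : $4
        print $1, lvl
    }' "$mdstat_file"
}

# On the installed system:
#   md_levels                  # e.g. lists each array with raid1 or raid0
#   findmnt -n -o SOURCE /     # shows which device backs the root filesystem
```

If the device reported for / corresponds to an array listed as raid1, the system drives are mirrored as intended.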
Installing the DGX Software Stack#
This section requires that you have already installed Ubuntu on the DGX system and rebooted the system.
Attention
By installing the DGX Software Stack you are confirming that you have read and agree to be bound by the DGX Software License Agreement. You are also confirming that you understand that any pre-release software and materials available that you elect to install in a DGX may not be fully functional, may contain errors or design flaws, and may have reduced or different security, privacy, availability, and reliability standards relative to commercial versions of NVIDIA software and materials, and that you use pre-release versions at your risk.
Installing DGX System Configurations and Tools#
The NVIDIA DGX Software Stack includes system-specific configurations and tools to take advantage of the advanced DGX features. They are provided from NVIDIA repositories in the form of software packages that can be installed on top of a typical Ubuntu installation. All system-specific software components are bundled into metapackages specific to a system:
system-configurations
system-tools
system-tools-extra
For details about the content of these packages, refer to the Release Notes.
The following steps enable the NVIDIA repositories and install the system-specific packages.
Enable the NVIDIA repositories by extracting the repository information.
This step adds the URIs and the configuration preferences that control which package versions are installed to the /etc/apt directory, and the GPG keys for the NVIDIA repositories to the /usr/share/keyrings directory.
curl https://repo.download.nvidia.com/baseos/ubuntu/noble/x86_64/dgx-repo-files.tgz | sudo tar xzf - -C /
Update the internal APT database with the latest version information of all packages.
sudo apt update
Recommended: Upgrade all software packages with the latest versions.
sudo apt upgrade
Install the DGX system tools and configurations.
Install Core DGX system packages required for performance.
sudo apt install nvidia-system-core
Install DGX system utility packages, such as NVSM.
sudo apt install nvidia-system-utils
Install DGX system packages, such as automake, build-essential, and vim for development.
sudo apt install nvidia-system-extra
For DGX Station A100 and DGX Station A800 systems, install the nvidia-system-station package for a complete desktop, including GNOME desktop, Xorg, and other desktop packages.
sudo apt install nvidia-system-station
Install the linux-tools-nvidia package.
sudo apt install -y linux-tools-nvidia
Install the NVIDIA peermem loader package.
sudo apt install -y nvidia-peermem-loader
Recommended: Disable unattended upgrades.
Ubuntu periodically checks for security and other bug fixes and automatically installs updated software packages, typically overnight. Because this may be disruptive, you should instead check for updates regularly and install them manually.
sudo apt purge -y unattended-upgrades
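The repository-enablement step above pipes the download straight into tar running as root. A more cautious variant is to download the archive first, list its members, and extract only if nothing falls outside the two directories named in that step. The sketch below assumes exactly that layout (etc/apt and usr/share/keyrings); the function name unexpected_paths is hypothetical.

```shell
#!/bin/sh
# List archive members that would land outside the expected directories.
# Prints nothing when every path is under etc/apt/ or usr/share/keyrings/.
unexpected_paths() {
    archive=$1
    tar tzf "$archive" \
        | sed 's|^\./||' \
        | grep -v -e '^etc/apt/' -e '^usr/share/keyrings/' \
        | grep -v -e '^$' -e '^etc/$' -e '^usr/$' -e '^usr/share/$' \
        || true
}

# Possible use (the /tmp path is a placeholder):
#   curl -fsSL -o /tmp/dgx-repo-files.tgz \
#       https://repo.download.nvidia.com/baseos/ubuntu/noble/x86_64/dgx-repo-files.tgz
#   unexpected_paths /tmp/dgx-repo-files.tgz    # review any output first
#   sudo tar xzf /tmp/dgx-repo-files.tgz -C /
```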
Configuring Data Drives#
The data drives in the DGX systems can be configured as RAID 0 or RAID 5. RAID 0 provides the maximum storage capacity and performance but does not provide any redundancy.
RAID 0 is often used for data caching. You can use cachefilesd to provide a cache for NFS shares.
Important
You can change the RAID level later but this will destroy the data on those drives.
The RAID array can also be configured during the Ubuntu installation. If you have already configured the RAID array during the Ubuntu installation, you can skip the first step and go to step 2.
Configure the /raid partition.
All DGX systems support RAID 0 and RAID 5 arrays.
To create a RAID 0 array:
sudo /usr/bin/configure_raid_array.py -c -f
To create a RAID 5 array:
sudo /usr/bin/configure_raid_array.py -c -f -5
The command creates the /raid mount point and RAID array and adds a corresponding entry in /etc/fstab.
(Optional) Install tools for managing the self-encrypting drives (SED) for the data drives on the DGX A100, DGX H100/H200, or DGX B200.
Refer to Managing Self-Encrypting Drives for more information.
(Optional) To use your RAID array for read caching of NFS mounts, you can install cachefilesd and set the cachefs option for an NFS share.
Install cachefilesd and nvidia-conf-cachefilesd.
This will update the cachefilesd configuration to use the /raid partition.
sudo apt install -y cachefilesd nvidia-conf-cachefilesd
Both cachefilesd and nvidia-conf-cachefilesd might already be installed with the nvidia-system-utils metapackage.
Enable caching on all NFS shares you want to cache by setting the fsc flag.
Edit /etc/fstab and add the fsc flag to the mount options as shown in this example.
<nfs_server>:<export_path> /mnt nfs rw,noatime,rsize=32768,wsize=32768,nolock,tcp,intr,fsc,nofail 0 0
Mount the NFS share.
If the share is already mounted, use the remount option.
sudo mount -o remount <mount-point>
To validate that caching is enabled, issue the following.
cat /proc/fs/nfsfs/volumes
Look for the text FSC=yes in the output of the command. The NFS share will be mounted with caching enabled upon subsequent reboot cycles.
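When several NFS shares are defined, it is easy to miss one while editing /etc/fstab. The following sketch lists the mount point of every NFS entry whose option list lacks the fsc flag; the function name nfs_without_fsc is an assumption for this example, and the check only looks at the fstab text, not at the live mounts.

```shell
#!/bin/sh
# Print the mount point of every NFS entry in an fstab file whose
# options do not include the fsc flag.
nfs_without_fsc() {
    fstab_file=${1:-/etc/fstab}
    awk '
        /^[[:space:]]*#/ || NF < 4 { next }   # skip comments and short lines
        $3 ~ /^nfs4?$/ {
            n = split($4, opts, ",")
            found = 0
            for (i = 1; i <= n; i++) if (opts[i] == "fsc") found = 1
            if (!found) print $2
        }
    ' "$fstab_file"
}

# On the system:
#   nfs_without_fsc            # no output means every NFS entry has fsc set
```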
Installing the GPU Driver#
You can choose between different GPU driver branches for your DGX system. The latest driver release includes new features but might not provide the same support duration as an older release. Consult the Data Center Driver Release Notes for more details and the minimum required driver release for the GPU architecture.
The DGX B200 system includes the fifth generation of NVIDIA NVLink® and the NVLink Switch technology. With this version of NVLink, additional packages are included with Base OS 7 to enable the full NVLink functionality. These packages include nvlsm and libnvsdm, among others. When performing GPU driver updates, you must update the driver and the corresponding NVLink stack packages simultaneously. The procedure for updating the DGX B200 system is included in the following steps for the NVIDIA open GPU kernel module drivers. When updating DGX B200, you should also update the DOCA packages.
Use the following commands to display a list of available drivers.
Ensure you have the latest version of the package database.
sudo apt update
Display a list of all available drivers.
sudo apt list 'nvidia-driver*server'
Example Output:
nvidia-driver-570-server/noble-updates,noble-security 570.86.19-0ubuntu0.24.04.1 amd64
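To see at a glance which driver branches the repositories offer, the package names in that listing can be reduced to their branch numbers. This is a small sketch; driver_branches is a name chosen for this example, and it only recognizes package names of the form shown in the example output.

```shell
#!/bin/sh
# Read `apt list` output on stdin and print each distinct driver branch
# number found in package names of the form nvidia-driver-<N>-server.
driver_branches() {
    sed -n 's|^nvidia-driver-\([0-9][0-9]*\)-server/.*|\1|p' | sort -n -u
}

# On the system:
#   apt list 'nvidia-driver*server' 2>/dev/null | driver_branches
```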
The following steps install the NVIDIA CUDA driver and configure the system. Replace the release version used as an example (570) with the release you want to install. Ensure that the driver release you intend to install is supported by the GPU in the system.
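The install commands in the steps that follow differ only in the branch number and in a few packages that depend on the system type. For scripted deployments, that choice can be captured in one helper. In the sketch below, the function name driver_packages and the class labels nvlink5, fabric, and station are invented for illustration; the package names are the ones used in the commands in this section.

```shell
#!/bin/sh
# Print the driver package list for a branch and a system class.
# Class labels (hypothetical): nvlink5 = DGX B200,
# fabric = DGX A100 and DGX H100/H200, station = DGX Station A100/A800.
driver_packages() {
    branch=$1
    class=$2
    base="nvidia-driver-$branch-open libnvidia-nscq-$branch nvidia-modprobe"
    tail_pkgs="datacenter-gpu-manager-4 nv-persistence-mode"
    case $class in
        nvlink5) echo "$base nvidia-fabricmanager-$branch $tail_pkgs nvlsm libnvsdm-$branch" ;;
        fabric)  echo "$base nvidia-fabricmanager-$branch $tail_pkgs" ;;
        station) echo "$base $tail_pkgs" ;;
        *) echo "unknown system class: $class" >&2; return 1 ;;
    esac
}

# Possible use (word splitting of the result is intentional):
#   sudo apt install $(driver_packages 570 fabric)
```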
Ensure you have the latest version of the package database.
sudo apt update
By default, the NVIDIA open GPU kernel module drivers and generic Linux kernel should be installed.
Ensure you have the latest kernel version installed.
The driver package has a dependency on the kernel and updating the database might have updated the version information.
sudo apt install -y linux-generic
The NVIDIA open GPU kernel module drivers are supported and should be installed.
For Fabric Manager systems with fifth-generation NVLink, such as DGX B200:
sudo apt install nvidia-driver-570-open libnvidia-nscq-570 nvidia-modprobe nvidia-fabricmanager-570 datacenter-gpu-manager-4 nv-persistence-mode nvlsm libnvsdm-570
sudo apt upgrade doca-ofed
For Fabric Manager systems without fifth-generation NVLink, such as DGX A100 and DGX H100/H200:
sudo apt install nvidia-driver-570-open libnvidia-nscq-570 nvidia-modprobe nvidia-fabricmanager-570 datacenter-gpu-manager-4 nv-persistence-mode
For non-Fabric Manager systems, such as DGX Station A100 and DGX Station A800, do not install the NVIDIA Fabric Manager service:
sudo apt install nvidia-driver-570-open libnvidia-nscq-570 nvidia-modprobe datacenter-gpu-manager-4 nv-persistence-mode
Enable the persistenced daemon and other services:
For non-NVSwitch systems, such as DGX Station A100 and DGX Station A800:
sudo systemctl enable nvidia-persistenced nvidia-dcgm
For NVSwitch systems, such as DGX A100, DGX H100/H200, and DGX B200, make sure to also enable the NVIDIA Fabric Manager service:
sudo systemctl enable nvidia-fabricmanager nvidia-persistenced nvidia-dcgm
Reboot the system to load the drivers and to update system configurations.
Issue reboot.
sudo reboot
After the system has rebooted, verify that the drivers have been loaded and are handling the NVIDIA devices.
nvidia-smi
The output should show all available GPUs and the Persistence-Mode On:
+---------------------------------------------------------------------------+
| NVIDIA-SMI 570.86.10    Driver Version: 570.86.10    CUDA Version: 12.8   |
|----------------------------+----------------------+-----------------------+
| GPU  Name     Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC  |
| Fan Temp Perf Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M.  |
|                            |                      |                MIG M. |
|============================+======================+=======================|
|   0  Tesla V100-SXM2... On | 00000000:06:00.0 Off |                    0  |
| N/A  35C  P0    42W / 300W |      0MiB / 16160MiB |      0%       Default |
|                            |                      |                   N/A |
+----------------------------+----------------------+-----------------------+
|   1  Tesla V100-SXM2... On | 00000000:07:00.0 Off |                    0  |
| N/A  35C  P0    44W / 300W |      0MiB / 16160MiB |      0%       Default |
|                            |                      |                   N/A |
+----------------------------+----------------------+-----------------------+
...
+----------------------------+----------------------+-----------------------+
|   7  Tesla V100-SXM2... On | 00000000:8A:00.0 Off |                    0  |
| N/A  35C  P0    43W / 300W |      0MiB / 16160MiB |      0%       Default |
|                            |                      |                   N/A |
+----------------------------+----------------------+-----------------------+

+---------------------------------------------------------------------------+
| Processes:                                                                |
|  GPU   GI   CI        PID   Type   Process name                GPU Memory |
|        ID   ID                                                 Usage      |
|===========================================================================|
|  No running processes found                                               |
+---------------------------------------------------------------------------+
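Reading the Persistence-M column by eye becomes tedious across many nodes. As a sketch, the same check can be scripted with the query interface of nvidia-smi, which reports persistence mode as Enabled or Disabled. The function name gpus_without_persistence is an assumption for this example.

```shell
#!/bin/sh
# Read "index, persistence_mode" CSV rows on stdin and print the index of
# every GPU whose persistence mode is not "Enabled".
gpus_without_persistence() {
    awk -F', *' '$2 != "Enabled" { print $1 }'
}

# On a DGX system (requires the driver to be loaded):
#   nvidia-smi --query-gpu=index,persistence_mode --format=csv,noheader \
#       | gpus_without_persistence
```

No output means persistence mode is enabled on every GPU that the driver reports.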
Installing the DOCA-OFED Package#
DGX systems include high-performance network cards to connect to other systems over InfiniBand or Ethernet. The NVIDIA DOCA™ OFED software provides the same functionality as MLNX_OFED, including kernel drivers, user space libraries, and management tools for NVIDIA networking products. For more information about DOCA-OFED, refer to the What Is DOCA-OFED section in the MLNX_OFED to DOCA-OFED Transition Guide.
Note
The DOCA-OFED package is not required for DGX Station A100 and DGX Station A800 systems.
To install the DOCA-OFED driver and associated packages, install the metapackage nvidia-system-mlnx-drivers, which installs doca-ofed, mlnx-pxe-setup, nvidia-mlnx-config, nvidia-mstflint-loader, and nvidia-ib-umad-loader.
sudo apt install doca-repo
sudo apt install nvidia-system-mlnx-drivers
Reboot the system.
Installing Docker and the NVIDIA Container Toolkit#
Containers provide isolated environments with a full filesystem of the required software for specific applications. To use the NVIDIA-provided containers for AI and other frameworks on the DGX and GPUs, you need to install Docker and the NVIDIA Container Toolkit. The NVIDIA Container Toolkit gives the software running inside the container access to the GPUs.
Note that these tools are also required by the Firmware Update Containers for upgrading the system firmware.
Install docker-ce, the NVIDIA Container Toolkit, and optimizations for typical deep learning (DL) workloads.
sudo apt install -y docker-ce nvidia-container-toolkit nv-docker-options
Restart the docker daemon.
sudo systemctl restart docker
To validate the installation, run a container and check that it can access the GPUs. The following instructions assume that the NVIDIA GPU driver has been installed and loaded.
Note
This validation downloads a container from the NGC registry and requires that the system has internet access.
Execute the following command to start a container and run the nvidia-smi tool inside the container:
sudo docker run --gpus=all --rm nvcr.io/nvidia/cuda:12.6.2-base-ubuntu24.04 nvidia-smi
Verify that the output shows all available GPUs and has Persistence-Mode set to On.
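One way to make "all available GPUs" concrete is to compare the number of GPUs listed on the host with the number listed inside the container. The sketch below counts the GPU lines printed by nvidia-smi -L; the function name count_gpus is hypothetical, and the comparison shown in the comments assumes the same container image as the command above.

```shell
#!/bin/sh
# Count GPUs in `nvidia-smi -L` output read from stdin.
# Each GPU is listed on a line that starts with "GPU <index>:".
count_gpus() {
    grep -c '^GPU [0-9][0-9]*:' || true
}

# On a DGX system:
#   host=$(nvidia-smi -L | count_gpus)
#   ctr=$(sudo docker run --gpus=all --rm \
#       nvcr.io/nvidia/cuda:12.6.2-base-ubuntu24.04 nvidia-smi -L | count_gpus)
#   [ "$host" -eq "$ctr" ] && echo "container sees all $host GPUs"
```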
Installing the NVIDIA System Management (NVSM) Tool [Recommended]#
The NVIDIA System Management (NVSM) is a software framework for monitoring NVIDIA DGX nodes in a data center. It allows customers to get a quick system health report and is typically required by the NVIDIA support team to resolve issues.
The following steps install and configure NVSM.
Install the NVIDIA System Management tool (NVSM):
sudo apt install -y nvsm
(Optional) Modify message-of-the-day (MOTD) to display NVSM health monitoring alerts and release information.
sudo apt install -y nvidia-motd
Additional Software Installed By DGX OS#
The Ubuntu and NVIDIA repositories provide many additional software packages for a variety of applications. The DGX OS Installer, for example, installs several additional software packages that are not installed by default, to aid system administrators and developers.
The following steps install the additional software packages that get installed by the DGX OS Installer:
Install additional software for system administration tasks:
sudo apt install -y chrpath cifs-utils fping gdisk iperf ipmitool lsscsi net-tools nfs-common quota rasdaemon pm-utils samba-common samba-libs sysstat vlan
Install additional software for development tasks:
sudo apt install -y build-essential automake bison cmake dpatch flex gcc-multilib gdb g++-multilib libelf-dev libltdl-dev linux-tools-generic m4 swig
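To check afterwards whether any of these packages is still missing, for example after a partial or interrupted installation, the installed package names can be compared against the wanted list. This is a minimal sketch; missing_packages is a name invented for this example.

```shell
#!/bin/sh
# Print each wanted package (arguments) that does not appear in the list of
# installed package names read from stdin, one name per line.
missing_packages() {
    installed=$(cat)
    for pkg in "$@"; do
        printf '%s\n' "$installed" | grep -qxF -- "$pkg" || echo "$pkg"
    done
}

# On the system, feed it the names of fully installed packages:
#   dpkg-query -W -f='${Status} ${Package}\n' \
#       | awk '$3 == "installed" { print $4 }' \
#       | missing_packages chrpath cifs-utils fping gdisk iperf ipmitool
```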
The NVIDIA CUDA Developer repository provides an easy mechanism to deploy NVIDIA tools and libraries, such as the CUDA toolkit, cuDNN, or NCCL.
Next Steps and Additional Information#
For further installation and configuration options, refer also to these chapters:
Managing OS and Software Updates - installing additional software and changing driver branches
Network Configuration - additional network options and configurations
Data Storage Configuration - RAID configurations and encryption information
Running NGC Containers - running NGC containers on the system