Installing DGX Software on Ubuntu#
This section explains the steps for installing and configuring Ubuntu and the NVIDIA DGX Software Stack on DGX systems.
DGX OS provides a customized installation of Ubuntu with additional software from NVIDIA to provide a turnkey solution for running AI and analytics workloads. The additional software, the NVIDIA DGX Software Stack, comprises platform-specific configurations, diagnostic and monitoring tools, and drivers that are required for a stable, tested, and supported OS to run AI, machine learning, and analytics applications on DGX systems.
You also have the option to install the NVIDIA DGX Software Stack on top of a vanilla Ubuntu distribution while still benefiting from the advanced DGX features. This installation method supports more flexibility, such as custom partition schemes.
Cluster deployments also benefit from this installation method by taking advantage of Ubuntu's standardized, automated, and non-interactive installation process. Starting with Ubuntu 20.04, the installer introduced a new mechanism for automating the installation, which allows system administrators to install a system unattended and non-interactively. You can find information for creating such a cloud-init configuration file in Cloud-init Configuration File. Refer to Ubuntu Automated Server Installation for more details.
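For reference, a minimal autoinstall user-data file can look like the sketch below; the hostname, username, and password hash are illustrative placeholders, and the full schema is described in the Ubuntu autoinstall documentation:
#cloud-config
autoinstall:
  version: 1
  identity:
    hostname: dgx-node01
    username: dgxadmin
    password: "<crypted-password-hash>"
  ssh:
    install-server: true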
The intended audience is IT professionals who manage a cluster of DGX systems, as well as integration partners.
Prerequisites#
The following prerequisites are required or recommended, where indicated.
Ubuntu Software Requirements#
The DGX Software Stack requires the following software versions:
Ubuntu 22.04
Linux Kernel 5.15 LTS
Access to Software Repositories#
The DGX Software Stack is available from repositories that can be accessed from the internet. If your installation does not allow connection to the internet, see appendix Air-Gapped Installations for information about installing and upgrading software on “air-gapped” systems.
If you are using a proxy server, then follow the instructions in the section Network Configuration for setting up a proxy configuration.
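For illustration only, APT can be pointed at a proxy through a drop-in file such as /etc/apt/apt.conf.d/proxy.conf; the proxy host and port below are placeholders, and the Network Configuration section remains the authoritative reference:
Acquire::http::Proxy "http://proxy.example.com:3128/";
Acquire::https::Proxy "http://proxy.example.com:3128/";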
Installation Considerations#
Installing the NVIDIA DGX Software Stack on Ubuntu allows you to select from additional configuration options that would otherwise not be available with the preconfigured DGX OS installer. This includes drive partitioning, filesystem choices, and software selection.
Before you start installing Ubuntu and the NVIDIA DGX Software Stack, you should evaluate the following options. The installation and configuration instructions will be covered in the respective section of this document.
System Drive Mirroring (RAID-1) [recommended]#
The DGX H100/H200, DGX A100 and DGX-2 systems embed two system drives for mirroring the OS partitions (RAID-1). This ensures data resiliency if one drive fails. If you want to enable mirroring, you need to enable it during the drive configuration of the Ubuntu installation. It cannot be enabled after the installation.
Data Drive RAID-0 or RAID-5#
DGX systems are equipped with multiple data drives that can be configured as RAID-0 for performance or RAID-5 for resiliency. RAID-0 provides the maximum storage capacity and performance, but does not provide any redundancy. If a single SSD in the array fails, all data stored on the array is lost.
RAID-0 is recommended for data caching. You can use cachefilesd to provide caching for NFS shares; a network file system (NFS) mount is required to take advantage of the cache file system. RAID-5 should be used for persistent data storage.
You have the option to configure RAID and data caching after the initial Ubuntu installation using the nvidia-config-raid tool, or during the Ubuntu installation. The nvidia-config-raid tool is recommended for manual installations.
Note
The DGX-1 uses a hardware RAID controller that cannot be configured during the Ubuntu installation. You can still use the nvidia-config-raid tool or change the configuration in the BIOS.
System Drive Encryption [optional]#
Root filesystem encryption is a software-based method to protect the content stored in the system partition(s) from unauthorized access by encrypting the data on-the-fly. It requires users to unlock the filesystem on every boot, either manually by entering a passphrase or automatically using a centralized key server.
System drive encryption can only be enabled during the installation of Ubuntu.
Data Drive Encryption [optional]#
Data drive encryption is only supported on DGX H100/H200 and DGX A100 systems equipped with self-encrypting drives (SED). It can be enabled after Ubuntu is installed using the nv-encrypt tool. It requires the keys to be stored either in the TPM or in an external key-management system.
System Drive Partitioning#
Ubuntu uses only a single partition for the entire filesystem by default. This can be changed during the Ubuntu installation for deployments that require a more granular partition scheme for security reasons. The recommended partitioning scheme is to use only a single partition for the Linux root partition with the ext4 filesystem.
Installing Ubuntu#
There are several methods for installing Ubuntu as described in the Ubuntu Server Guide.
For convenience, this section provides additional instructions that are specific to DGX for installing Ubuntu following the Basic Installation. If you have a preferred method for installing Ubuntu, then you can skip this section.
Steps that are covered in this section:
Connecting to the DGX system
Booting from the install media
Running the Ubuntu installer (including network and storage configuration steps)
Booting from the Installation Media#
During the installation and configuration steps, you need to connect to the console of the DGX system. Refer to Connecting to the DGX System for more details.
Boot the Ubuntu ISO image in one of the following ways:
Remotely through the BMC for systems that provide a BMC.
Refer to the Reimaging the System Remotely section in the corresponding DGX user guide listed above for instructions.
Locally from a UEFI-bootable USB flash drive or DVD-ROM.
Refer to Installing the DGX OS Image from a USB Flash Drive or DVD-ROM section in the corresponding DGX user guide listed above for instructions.
Running the Ubuntu Installer#
After booting the ISO image, the Ubuntu installer should start and guide you through the installation process.
Note
The screenshots in the following steps are taken from a DGX A100. Other DGX systems have differences in drive partitioning and networking.
During the boot process of the ISO image, you might see some error messages due to older drivers, etc. They can be safely ignored.
Select your language at the welcome screen, then follow the instructions to select whether to update the installer (if offered) and to choose your keyboard.
At the Network connections screen, configure your network.
The installer tries to automatically retrieve a DHCP address for all network interfaces, so you should be able to continue without any changes. However, you also have the option to manually configure the interface(s).
At the Guided storage configuration screen, configure the partitioning and file systems. All DGX systems are shipped preinstalled with DGX OS. The drives are, therefore, already partitioned and formatted. The DGX OS installer configures a single ext4 partition for the root partition in addition to the EFI boot partition. You have the following options:
Keep the current partition layout and formatting [recommended]
Create a custom partition scheme [advanced]
Use a single disk with the default Ubuntu partition scheme
Creating a new custom partition scheme with a RAID configuration is a more involved process and out of the scope for this document. Refer to the Ubuntu installation guide for more information. When you choose the option to use an entire disk, Ubuntu will only use one of the two redundant boot drives.
Note
The RAID level for the data drive can be changed after the installation of Ubuntu.
The following instructions describe the steps for keeping the current partition layout. It still requires you to re-create and reformat the partitions.
Note
The DGX-1 uses a hardware RAID controller, and the RAID membership can only be configured in the RAID controller BIOS. The default configuration consists of two virtual devices:
The first virtual device (sda) is a single disk and is used as the system drive.
The second virtual device (sdb) consists of the remaining disks and is used for data.
Select Custom storage layout, then click Done.
Identify the system drive.
The system drive on the DGX-2, DGX A100, and DGX H100/H200 is a RAID 1 array and should be easy to identify. The DGX-1 has a hardware RAID controller, and you will see a single drive as sda.
Select the system drive and then click Format.
Set Format to ext4 (do not select “Leave formatted as <filesystem>”)
Set Mount to “/”:
Set the boot flag on the raw devices.
Identify the system drives under AVAILABLE DEVICES (not the RAID array) and select “Use as Boot Device” for the first device. On the DGX-2, DGX A100, and DGX H100/H200, which have two drives, repeat this process for the second drive and select “Use as another Boot Device”.
Complete the configuration.
RAID 0 Array: In most cases, the RAID 0 array for the data drives will have already been created at the factory. If it has not been created, you can either create it in the Storage configurations dialog or use the config_raid_array tool after completing the Ubuntu installation.
Enable drive encryption (Optional): Encryption can only be enabled during the Storage configuration; to change the encryption state afterwards, you need to reinstall the OS. To enable drive encryption, you have to create a virtual group and volume, which is out of the scope of this document. Refer to the Ubuntu documentation for more details.
Swap Partition: The default installation does not define a swap partition. Linux uses any configured swap partition for temporarily storing data when the system memory is full, which incurs a performance hit. Given the large memory of DGX systems, swapping is not recommended.
The “FILE SYSTEM SUMMARY” at the top of the page should display the root partition on the RAID 1 drive and a boot/efi partition (the two drives will only show up as a single entry). On the DGX-1 with the hardware RAID controller, it will show the root partition on sda.
Select Done and accept all changes.
Follow the instructions for the remaining tasks.
Create a default user in the Profile setup dialog and choose any additional SNAP package you want to install in the Featured Server Snaps screen.
Wait for the installation to complete.
Log messages are presented while the installation is running.
Select Reboot Now when the installation is complete to restart the system.
After reboot, you can log in using the username and password for the user you have created above.
When using LVM, Ubuntu’s default partitioning scheme, DGX-2 users may run into https://bugs.launchpad.net/ubuntu/+source/lvm2/+bug/1834250. The “/dev/sda: open failed: No medium found” messages are harmless and can be avoided by updating /etc/lvm/lvm.conf with the following filter: global_filter = [ "r|/dev/sda|" ]
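For illustration, the filter typically goes into the devices section of /etc/lvm/lvm.conf, roughly as follows (verify against your existing configuration before editing):
devices {
    global_filter = [ "r|/dev/sda|" ]
}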
Installing the DGX Software Stack#
This section requires that you have already installed Ubuntu on the DGX and rebooted the system.
Attention
By installing the DGX Software Stack you are confirming that you have read and agree to be bound by the DGX Software License Agreement. You are also confirming that you understand that any pre-release software and materials available that you elect to install in a DGX may not be fully functional, may contain errors or design flaws, and may have reduced or different security, privacy, availability, and reliability standards relative to commercial versions of NVIDIA software and materials, and that you use pre-release versions at your risk.
Installing DGX System Configurations and Tools#
The NVIDIA DGX Software Stack includes system-specific configurations and tools to take advantage of the advanced DGX features. They are provided from NVIDIA repositories in the form of software packages that can be installed on top of a typical Ubuntu installation. All system-specific software components are bundled into meta packages specific to a system:
system-configurations
system-tools
system-tools-extra
For details about the content of these packages, refer to the Release Notes.
The following steps enable the NVIDIA repositories and install the system-specific packages.
Enable the NVIDIA repositories by extracting the repository information.
This step adds the repository URIs and the configuration preferences that control which package versions will be installed to the /etc/apt directory, and adds the GPG keys for the NVIDIA repositories to the /usr/share/keyrings directory.
curl https://repo.download.nvidia.com/baseos/ubuntu/jammy/dgx-repo-files.tgz | sudo tar xzf - -C /
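To sanity-check that the repository information was extracted, you can look for the NVIDIA entries; the exact file names depend on the contents of the tarball:
grep -r "repo.download.nvidia.com" /etc/apt/sources.list.d/
ls /usr/share/keyrings | grep -i nvidia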
Update the internal APT database with the latest version information of all packages.
sudo apt update
Recommended: Upgrade all software packages with the latest versions.
sudo apt upgrade
Install the DGX system tools and configurations.
For DGX-1, install the DGX-1 configurations and DGX-1 system tools:
sudo apt install -y dgx1-system-configurations dgx1-system-tools dgx1-system-tools-extra
For DGX-2, install the DGX-2 configurations and DGX-2 system tools:
sudo apt install -y dgx2-system-configurations dgx2-system-tools dgx2-system-tools-extra
For DGX A100, install DGX A100 configurations and DGX A100 system tools:
sudo apt install -y dgx-a100-system-configurations dgx-a100-system-tools dgx-a100-system-tools-extra nvidia-utils-525-server
For DGX H100/H200, install DGX H100/H200 configurations and DGX H100/H200 system tools:
sudo apt install -y dgx-h100-system-configurations dgx-h100-system-tools dgx-h100-system-tools-extra nvidia-utils-525-server nvfwupd
Install the linux-tools-nvidia package.
sudo apt install -y linux-tools-nvidia
Install the NVIDIA peermem loader package.
sudo apt install -y nvidia-peermem-loader
Recommended: Disable unattended upgrades.
Ubuntu periodically checks for security and other bug fixes and automatically installs updated software packages, typically overnight. Because this may be disruptive, you should instead regularly check for updates and install them manually.
sudo apt purge -y unattended-upgrades
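With unattended upgrades disabled, a typical manual update cycle looks like this:
sudo apt update
apt list --upgradable
sudo apt upgrade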
Recommended: Enable serial-over-lan console output.
Note
If you have boot drive encryption enabled and you install this package, the passphrase prompt and its input will be redirected over the serial console.
sudo apt install -y nvidia-ipmisol
Optional: Modify the logrotate policy to collect more (but size-limited) logging information:
sudo apt install -y nvidia-logrotate
The configuration changes will take effect only after rebooting the system. To minimize extra reboots, you can defer this step until after the drivers have been installed later in this document.
Configuring Data Drives#
The data drives in the DGX systems can be configured as RAID 0 or RAID 5. RAID 0 provides the maximum storage capacity and performance, but does not provide any redundancy.
RAID 0 is often used for data caching. You can use cachefilesd to provide a cache for NFS shares.
Important
You can change the RAID level later, but doing so will destroy the data on those drives.
Except on the DGX-1, the RAID array can be configured during the Ubuntu installation. If you have already configured the RAID array during the Ubuntu installation, you can skip the first step and go to step 2.
Configure the /raid partition.
All DGX systems support RAID 0 and RAID 5 arrays.
To create a RAID 0 array:
sudo /usr/bin/configure_raid_array.py -c -f
To create a RAID 5 array:
sudo /usr/bin/configure_raid_array.py -c -f -5
The command creates the /raid mount point and RAID array, and adds a corresponding entry in /etc/fstab.
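On systems that use software RAID, a quick way to confirm the result is to check the array status and the new mount point, for example:
cat /proc/mdstat
df -h /raid
grep /raid /etc/fstab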
Optional: Install tools for managing the self-encrypting drives (SED) for the data drives on the DGX A100 or DGX H100/H200.
Refer to Managing Self-Encrypting Drives for more information.
Optional: If you wish to use your RAID array for read caching of NFS mounts, you can install cachefilesd and set the cachefs option for an NFS share.
Install cachefilesd and nvidia-conf-cachefilesd. This will update the cachefilesd configuration to use the /raid partition.
sudo apt install -y cachefilesd nvidia-conf-cachefilesd
Enable caching on all NFS shares you want to cache by setting the fsc flag.
Edit /etc/fstab and add the fsc flag to the mount options as shown in this example.
<nfs_server>:<export_path> /mnt nfs rw,noatime,rsize=32768,wsize=32768,nolock,tcp,intr,fsc,nofail 0 0
Mount the NFS share.
If the share is already mounted, use the remount option.
mount <mount-point> -o remount
To validate that caching is enabled, issue the following.
cat /proc/fs/nfsfs/volumes
Look for the text FSC=yes in the output of the command. The NFS share will be mounted with caching enabled upon subsequent reboot cycles.
Installing the GPU Driver#
You have the option to choose between different GPU driver branches for your DGX system. The latest driver release includes new features but might not provide the same support duration as an older release. Consult the Data Center Driver Release Notes for more details and the minimum required driver release for the GPU architecture.
Use the following commands to display a list of available drivers.
Ensure that you have the latest version of the package database.
sudo apt update
Display a list of all available drivers.
sudo apt list nvidia-driver*server
Example Output:
nvidia-driver-450-server/jammy-updates,jammy-security 450.216.04-0ubuntu0.22.04.1 amd64
nvidia-driver-460-server/jammy-updates,jammy-security 525.161.03-0ubuntu0.22.04.1 amd64
nvidia-driver-525-server/jammy-updates,jammy-security 525.161.03-0ubuntu0.22.04.1 amd64
nvidia-driver-510-server/jammy-updates,jammy-security 515.86.01-0ubuntu0.22.04.1 amd64
nvidia-driver-515-server/jammy-updates,jammy-security 515.86.01-0ubuntu0.22.04.1 amd64
nvidia-driver-525-server/jammy-updates,jammy-security 525.60.13-0ubuntu0.22.04.1 amd64
The following steps install the NVIDIA CUDA driver and configure the system. Replace the release version used as an example (525) with the release you want to install. Ensure that the driver release you intend to install is supported by the GPU in the system.
Ensure you have the latest version of the package database.
sudo apt update
Ensure you have the latest kernel version installed.
The driver package has a dependency on the kernel, and updating the database might have updated the version information.
sudo apt install -y linux-nvidia
Install NVIDIA CUDA driver.
For non-NVSwitch systems like the DGX-1:
sudo apt install -y nvidia-driver-525-server linux-modules-nvidia-525-server-nvidia libnvidia-nscq-525 nvidia-modprobe datacenter-gpu-manager nv-persistence-mode
For NVSwitch systems like the DGX-2, DGX A100, and DGX H100/H200, be sure to also install the fabric-manager package:
sudo apt install -y nvidia-driver-525-server linux-modules-nvidia-525-server-nvidia libnvidia-nscq-525 nvidia-modprobe nvidia-fabricmanager-525 datacenter-gpu-manager nv-persistence-mode
Enable the persistenced daemon and other services:
For non-NVSwitch systems, such as the DGX-1:
sudo systemctl enable nvidia-persistenced nvidia-dcgm
For NVSwitch systems like the DGX-2, DGX A100, and DGX H100/H200, be sure to also enable the NVIDIA fabric manager service:
sudo systemctl enable nvidia-fabricmanager nvidia-persistenced nvidia-dcgm
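Optionally, before rebooting, confirm that the services are set to start at boot (include nvidia-fabricmanager on NVSwitch systems):
systemctl is-enabled nvidia-persistenced nvidia-dcgm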
Reboot the system to load the drivers and to update system configurations.
Issue reboot.
sudo reboot
After the system has rebooted, verify that the drivers have been loaded and are handling the NVIDIA devices.
nvidia-smi
The output should show all available GPUs and the Persistence-Mode On:
+---------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03        Driver Version: 535.54.03    CUDA Version: 12.2 |
|----------------------------+----------------------+-----------------------+
| GPU  Name   Persistence-M  | Bus-Id        Disp.A | Volatile Uncorr. ECC  |
| Fan  Temp  Perf  Pwr:Usage/Cap |   Memory-Usage   | GPU-Util  Compute M.  |
|                            |                      |               MIG M.  |
|============================+======================+=======================|
|   0  Tesla V100-SXM2...  On | 00000000:06:00.0 Off |                   0  |
| N/A   35C   P0   42W / 300W |   0MiB / 16160MiB   |      0%      Default  |
|                            |                      |                  N/A  |
+----------------------------+----------------------+-----------------------+
|   1  Tesla V100-SXM2...  On | 00000000:07:00.0 Off |                   0  |
| N/A   35C   P0   44W / 300W |   0MiB / 16160MiB   |      0%      Default  |
|                            |                      |                  N/A  |
+----------------------------+----------------------+-----------------------+
...
+----------------------------+----------------------+-----------------------+
|   7  Tesla V100-SXM2...  On | 00000000:8A:00.0 Off |                   0  |
| N/A   35C   P0   43W / 300W |   0MiB / 16160MiB   |      0%      Default  |
|                            |                      |                  N/A  |
+----------------------------+----------------------+-----------------------+
+---------------------------------------------------------------------------+
| Processes:                                                                |
|  GPU   GI   CI        PID   Type   Process name               GPU Memory  |
|        ID   ID                                                Usage       |
|===========================================================================|
|  No running processes found                                               |
+---------------------------------------------------------------------------+
Installing the Mellanox OpenFabrics Enterprise Distribution (MLNX_OFED)#
DGX systems include high-performance network cards to connect to other systems over InfiniBand or Ethernet. You can choose between the driver included in Ubuntu and the Mellanox OpenFabrics Enterprise Distribution (Mellanox OFED or MOFED).
The following steps describe how to install MOFED and all the required additional software in place of the default OFED Ubuntu driver.
Install the nvidia-manage-ofed package:
sudo apt install -y nvidia-manage-ofed
Remove the inbox OFED components:
sudo /usr/sbin/nvidia-manage-ofed.py -r ofed
Add the Mellanox OFED components:
sudo /usr/sbin/nvidia-manage-ofed.py -i mofed
Note
The command installs the latest version of MLNX_OFED that is currently available in the repositories. To install a version other than the latest, specify it by using the -v option. The following example installs MLNX_OFED version 5.9-0.5.6.0:
sudo /usr/sbin/nvidia-manage-ofed.py -i mofed -v 5.9-0.5.6.0
Reboot the system.
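After the system comes back up, you can optionally confirm which MLNX_OFED version is active; assuming the MOFED utilities are installed, one way is:
ofed_info -s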
Installing Docker and the NVIDIA Container Toolkit#
Containers provide isolated environments with a full filesystem of the required software for specific applications. To use the NVIDIA-provided containers for AI and other frameworks on the DGX and its GPUs, you need to install Docker and the NVIDIA Container Toolkit. The toolkit takes care of providing the software running inside the container with access to the GPUs.
Note that these tools are also required by the Firmware Update Containers for upgrading the system firmware.
Install docker-ce, the NVIDIA Container Toolkit, and optimizations for typical DL workloads.
sudo apt install -y docker-ce nvidia-container-toolkit nv-docker-options
Restart the docker daemon.
sudo systemctl restart docker
To validate the installation, run a container and check that it can access the GPUs. The following instructions assume that the NVIDIA GPU driver has been installed and loaded.
Note
This validation downloads a container from the NGC registry and requires that the system has internet access.
Execute the following command to start a container and run the nvidia-smi tool inside the container:
sudo docker run --gpus=all --rm nvcr.io/nvidia/cuda:12.3.2-base-ubuntu22.04 nvidia-smi
Verify that the output shows all available GPUs and has Persistence-Mode set to On.
Installing the NVIDIA System Management (NVSM) Tool [Recommended]#
The NVIDIA System Management (NVSM) is a software framework for monitoring NVIDIA DGX nodes in a data center. It allows customers to get a quick health report of the system and is typically required by the NVIDIA support team to resolve issues.
The following steps install and configure NVSM.
Install the NVIDIA System Management tool (NVSM):
sudo apt install -y nvsm
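Once installed, you can generate a quick health report of the system, for example:
sudo nvsm show health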
Optional: Modify message-of-the-day (MOTD) to display NVSM health monitoring alerts and release information.
sudo apt install -y nvidia-motd
Additional Software Installed By DGX OS#
The Ubuntu and NVIDIA repositories provide many additional software packages for a variety of applications. The DGX OS installer, for example, installs several additional software packages that are not installed by default to aid system administrators and developers.
The following steps install the additional software packages that get installed by the DGX OS Installer:
Install additional software for system administration tasks:
sudo apt install -y chrpath cifs-utils fping gdisk iperf ipmitool lsscsi net-tools nfs-common quota rasdaemon pm-utils samba-common samba-libs sysstat vlan
Install additional software for development tasks:
sudo apt install -y build-essential automake bison cmake dpatch flex gcc-multilib gdb g++-multilib libelf-dev libltdl-dev linux-tools-generic m4 swig
The NVIDIA CUDA Developer repository provides an easy mechanism to deploy NVIDIA tools and libraries, such as the CUDA toolkit, cuDNN, or NCCL.
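As an illustrative sketch only (the keyring package name and repository URL follow NVIDIA's published CUDA network repository instructions and may change over time), enabling the CUDA repository and installing the toolkit can look like this:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt install -y cuda-toolkit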
Next Steps and Additional Information#
For further installation and configuration options, refer also to these chapters:
Managing and Upgrading Software - installing additional software and changing driver branches
Network Configuration - additional network options and configurations
Data Storage Configuration - RAID configurations and encryption information
Running NGC Containers - running NGC containers on the system