DGX Software Stack for Ubuntu - Installation Guide

This document describes how to install the DGX Software Stack for Ubuntu on DGX Servers.

1. Introduction

NVIDIA® DGX™ systems are shipped preinstalled with DGX™ OS, a customized installation of Ubuntu with additional software from NVIDIA to provide a turnkey solution for running AI and analytics workloads. This offers users a fast on-boarding experience for using DGX systems and keeping them updated with the latest software versions.

The additional software, the NVIDIA DGX Software Stack, provides platform-specific configurations, diagnostic and monitoring tools, and drivers that are required for a stable, tested, and supported OS to run AI, machine learning, and analytics applications on DGX systems.

The DGX OS installer is released in the form of an ISO image to reimage a DGX system, but you also have the option to install a vanilla version of Ubuntu 20.04 and the NVIDIA DGX Software Stack on DGX servers (DGX A100, DGX-2, DGX-1) while still benefiting from the advanced DGX features. This installation method supports more flexibility, such as custom partition schemes or using a specific base Ubuntu image. Cluster deployments also benefit from this installation method by taking advantage of Ubuntu’s standardized automated and non-interactive installation process.

This document explains the steps for installing and configuring Ubuntu and the NVIDIA DGX Software Stack on DGX systems. It also provides instructions and examples for an automated installation process. The intended audience is IT professionals managing clusters of DGX systems and integration partners.

1.1. DGX Software Stack Highlights

  • Support for Ubuntu 20.04 LTS distribution

  • NVIDIA System Management (NVSM)

    Provides active health monitoring and system alerts for NVIDIA DGX nodes in a data center. It also provides simple commands for checking the health of the DGX systems from the command line.

  • Data Center GPU Management (DCGM)

    This software enables node-wide administration of GPUs and can be used for cluster and data-center level management.

  • DGX system-specific support packages

  • NVIDIA GPU driver, CUDA toolkit, and domain-specific libraries

  • Docker Engine and NVIDIA Container Toolkit

  • Caching NFS data using cachefilesd

  • Tools to convert data disks between RAID levels

  • Disk drive encryption and root filesystem encryption

  • Mellanox OpenFabrics Enterprise Distribution for Linux (MOFED) and Mellanox Software Tools (MST) for systems with Mellanox network cards

1.2. Additional Documentation

1.3. Customer Support

NVIDIA Enterprise Support is the support resource for DGX customers and can assist with hardware, software, or NGC application issues. For details on how to obtain support, visit the NVIDIA Enterprise Support website (https://www.nvidia.com/en-us/support/enterprise/).

2. Prerequisites

The following prerequisites are required or recommended, where indicated.

Ubuntu Software Requirements

The DGX Software Stack requires the following software versions:

  • Ubuntu 20.04

  • Linux Kernel 5.4 LTS

Access to Software Repositories

The DGX Software Stack is available from repositories that can be accessed from the internet. If your installation does not allow connection to the internet, see the section Installing the Software on Air-Gapped DGX Systems for information about installing and upgrading software on “air-gapped” systems.

If you are using a proxy server, then follow the instructions in the section Network Configuration for setting up a proxy configuration.
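When APT must reach the NVIDIA and Ubuntu repositories through a proxy, it needs its own configuration in addition to any shell environment variables. A minimal sketch of such a file, assuming a hypothetical proxy at proxy.example.com port 3128 (replace with your server; the full proxy setup, including Docker, is covered in the Network Configuration section):

```
// /etc/apt/apt.conf.d/80proxy -- hypothetical proxy example
Acquire::http::Proxy "http://proxy.example.com:3128/";
Acquire::https::Proxy "http://proxy.example.com:3128/";
```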

3. Installation Considerations

Installing the NVIDIA DGX Software Stack on Ubuntu allows you to select from additional configuration options that would otherwise not be available with the preconfigured DGX OS installer. This includes drive partitioning, filesystem choices, and software selection.

Before you start installing Ubuntu and the NVIDIA DGX Software Stack, you should evaluate the following options. The installation and configuration instructions will be covered in the respective section of this document.

System Drive Mirroring (RAID-1) [recommended]

The DGX A100 and DGX-2 systems embed two system drives for mirroring the OS partitions (RAID-1). This ensures data resiliency if one drive fails. If you want to enable mirroring, you must enable it during the drive configuration of the Ubuntu installation. It cannot be enabled after the installation.

Data Drive RAID-0 or RAID-5

DGX systems are equipped with multiple data drives that can be configured as RAID-0 for performance or RAID-5 for resiliency. RAID-0 provides the maximum storage capacity and performance, but does not provide any redundancy. If a single SSD in the array fails, all data stored on the array is lost.

RAID-0 is recommended for data caching. You can use cachefilesd to provide caching for NFS shares; the cache file system works only with mounts over the network file system (NFS). RAID-5 should be used for persistent data storage.

You can configure RAID and data caching either during the Ubuntu installation or afterward using the configure_raid_array tool. The configure_raid_array tool is recommended for manual installation.

Note: The DGX-1 uses a hardware RAID controller that cannot be configured during the Ubuntu installation. You can still use the configure_raid_array tool or change the configuration in the RAID controller BIOS.

System Drive Encryption [optional]

Root filesystem encryption is a software-based method to protect the content stored in the system partition(s) from unauthorized access by encrypting the data on-the-fly. It requires users to unlock the filesystem on every boot, either manually by entering a passphrase or automatically using a centralized key server.

System drive encryption can only be enabled during the installation of Ubuntu.

Data Drive Encryption [optional]

Data drive encryption is only supported on the DGX A100, which is equipped with self-encrypting drives (SED). It can be enabled after Ubuntu is installed using the nv-encrypt tool, and requires either storing the keys in the TPM or using external key management.

System Drive Partitioning

By default, Ubuntu uses a single partition for the entire filesystem. For deployments that require a more granular partition scheme for security reasons, this can be changed during the Ubuntu installation. The recommended scheme is a single Linux root partition with the ext4 filesystem.

4. Installing Ubuntu

There are several methods for installing Ubuntu as described in the Ubuntu Server Guide.

For convenience, this section provides additional instructions that are specific to DGX for installing Ubuntu following the Basic Installation. If you have a preferred method for installing Ubuntu, then you can skip this section.

Steps that are covered in this section:

  • Connecting to the DGX system

  • Booting from the install media

  • Running the Ubuntu installer (including network and storage configuration steps)

4.1. Connecting to the DGX System

During the initial installation and configuration steps, you need to connect to the console of the DGX system. There are several ways to connect to the DGX system, such as through a virtual KVM (keyboard, video, and mouse) using the BMC, or a direct connection with a local monitor and keyboard. Refer to the user guide for a list of supported connection methods and the specific product instructions.

4.2. Booting from the Installation Media

Boot the Ubuntu ISO image in one of the following ways:

  • Remotely through the BMC for systems that provide a BMC.

    Refer to the Reimaging the System Remotely section in the corresponding DGX user guide listed above for instructions.

  • Locally from a UEFI-bootable USB flash drive or DVD-ROM.

    Refer to Installing the DGX OS Image from a USB Flash Drive or DVD-ROM section in the corresponding DGX user guide listed above for instructions.

4.3. Running the Ubuntu Installer

After booting the ISO image, the Ubuntu installer should start and guide you through the installation process.

Note: The screenshots in the following steps are taken from a DGX A100. Other DGX systems have differences in drive partitioning and networking.
During the boot process of the ISO image, you might see some error messages due to older drivers, etc. They can be safely ignored.



  1. Select your language at the welcome screen, then follow the instructions to select whether to update the installer (if offered) and to choose your keyboard.



  2. At the Network connections screen, configure your network.



    The installer tries to automatically retrieve a DHCP address for all network interfaces, so you should be able to continue without any changes. However, you also have the option to manually configure the interface(s).
  3. At the Guided storage configuration screen, configure the partitioning and file systems. All DGX systems are shipped preinstalled with DGX OS, so the drives are already partitioned and formatted. The DGX OS installer configures a single ext4 partition for the root partition in addition to the EFI boot partition. You have the following options:
    • Keep the current partition layout and formatting [recommended]
    • Create a custom partition scheme [advanced]
    • Use a single disk with the default Ubuntu partition scheme
    Creating a new custom partition scheme with a RAID configuration is a more involved process and out of scope for this document. Refer to the Ubuntu installation guide for more information. When you choose the option to use an entire disk, Ubuntu will use only one of the two redundant boot drives.
    Note: The RAID level for the data drives can be changed after the installation of Ubuntu.
    The following instructions describe the steps for keeping the current partition layout. It still requires you to re-create and reformat the partitions.
    Note: DGX-1 uses a hardware RAID controller, and the RAID membership can only be configured in the RAID controller BIOS. The default configuration consists of two virtual devices:
    • The first virtual device (sda) is a single disk and is used as the system drive.
    • The second virtual device (sdb) consists of the remaining disks and is used for data.
    1. Select Custom storage layout, then click Done.



    2. Identify the system drive.

      The system drive on the DGX-2 and DGX A100 is a RAID 1 array and you should find it easily. The DGX-1 has a hardware RAID controller and you will see a single drive as sda.

    3. Select the system drive and then click Format.



    4. Set Format to ext4 (do not select “Leave formatted as <filesystem>”).



    5. Set Mount to “/”:



    6. Set the boot flag on the raw devices.

      Identify the system drives under AVAILABLE DEVICES (not the RAID array) and select “Use as Boot Device” for the first device. On DGX-2 and DGX A100 that have two drives, repeat this process for the second drive and select “Use as another Boot Device”.





    7. Complete the configuration.
      • RAID 0 Array: In most cases, the RAID 0 array for the data drives will already have been created at the factory. If it has not been created, you can either create it in the Storage configurations dialog or use the configure_raid_array tool after completing the Ubuntu installation.
      • (Optional) Enable drive encryption: Encryption can only be enabled during the storage configuration; it cannot be changed after the installation. To change the encryption state, you must reinstall the OS. To enable drive encryption, you have to create a virtual group and volume, which is out of scope for this document. Refer to the Ubuntu documentation for more details.
      • Swap Partition: The default installation does not define a swap partition. Linux uses a configured swap partition to temporarily store data when system memory is full, incurring a performance hit. With the large memory of DGX systems, swapping is not recommended.

      The “FILE SYSTEM SUMMARY” at the top of the page should display the root partition on the RAID 1 drive and a boot/efi partition (the two drives show up as a single entry). On a DGX-1 with the hardware RAID controller, it shows the root partition on sda.





      Select Done and accept all changes.

  4. Follow the instructions for the remaining tasks.

    Create a default user in the Profile setup dialog and choose any additional SNAP package you want to install in the Featured Server Snaps screen.

  5. Wait for the installation to complete.

    Log messages are presented while the installation is running.

  6. Select Reboot Now when the installation is complete to restart the system.

    After reboot, you can log in using the username and password for the user you have created above.

When using LVM, Ubuntu’s default partitioning scheme, DGX-2 users may run into https://bugs.launchpad.net/ubuntu/+source/lvm2/+bug/1834250. The “/dev/sda: open failed: No medium found” messages are harmless and can be avoided by adding the following filter to /etc/lvm/lvm.conf: global_filter = [ "r|/dev/sda|" ]
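For reference, the filter belongs in the devices section of /etc/lvm/lvm.conf. A minimal sketch (the device name sda is taken from the bug report and may differ on your system):

```
devices {
    # Ignore the virtual media device that triggers "No medium found"
    global_filter = [ "r|/dev/sda|" ]
}
```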

5. Installing the DGX Software Stack

This section requires that you have already installed Ubuntu on the DGX and rebooted the system.





Attention:

By installing the DGX Software Stack you are confirming that you have read and agree to be bound by the DGX Software License Agreement. You are also confirming that you understand that any pre-release software and materials available that you elect to install in a DGX may not be fully functional, may contain errors or design flaws, and may have reduced or different security, privacy, availability, and reliability standards relative to commercial versions of NVIDIA software and materials, and that you use pre-release versions at your risk.

5.1. Installing DGX System Configurations and Tools

The NVIDIA DGX Software Stack includes system-specific configurations and tools to take advantage of the advanced DGX features. They are provided from NVIDIA repositories in the form of software packages that can be installed on top of a typical Ubuntu installation. All system-specific software components are bundled into meta packages specific to a system:

  • system-configurations

  • system-tools

  • system-tools-extra

For details about the content of these packages, refer to the DGX OS Release Notes.

The following steps enable the NVIDIA repositories and install the system-specific packages.

  1. Enable the NVIDIA repositories by extracting the repository information.

    This step adds the repository URIs and the configuration preferences that control which package versions are installed to the /etc/apt directory, and adds the GPG keys for the NVIDIA repositories to the /usr/share/keyrings directory.

    curl https://repo.download.nvidia.com/baseos/ubuntu/focal/dgx-repo-files.tgz | sudo tar xzf - -C /
  2. Update the internal APT database with the latest version information of all packages.
    sudo apt update
  3. Recommended: Upgrade all software packages with the latest versions.
    sudo apt upgrade 
  4. Install the DGX system tools and configurations.
    • For DGX-1, install the DGX-1 configurations and DGX-1 system tools:
       sudo apt install -y dgx1-system-configurations dgx1-system-tools dgx1-system-tools-extra 
    • For DGX-2, install the DGX-2 configurations and DGX-2 system tools:
       sudo apt install -y dgx2-system-configurations dgx2-system-tools dgx2-system-tools-extra
    • For DGX A100, install DGX A100 configurations and DGX A100 system tools:
       sudo apt install -y dgx-a100-system-configurations dgx-a100-system-tools dgx-a100-system-tools-extra 
  5. Disable the ondemand governor to set the governor to performance mode.
    sudo systemctl disable ondemand
  6. Recommended: Disable unattended upgrades.

    Ubuntu periodically checks for security and other bug fixes and automatically installs updated software packages, typically overnight. Because this may be disruptive, disable unattended upgrades and instead regularly check for updates and install them manually.

    sudo apt purge -y unattended-upgrades
  7. Recommended: Enable serial-over-lan console output.
    Note: If you have boot drive encryption enabled, the prompt for entering the passphrase and input will be over the serial console if you install this package.
    sudo apt install -y nvidia-ipmisol 
  8. Optional: Modify the logrotate policy to collect more logging information (but size-limited):
    sudo apt install -y nvidia-logrotate
    The configuration changes will take effect only after rebooting the system. To minimize extra reboots, you can defer this step until after the drivers have been installed later in this document.
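The per-system package selection in step 4 above can be sketched as a small shell helper (hypothetical, for illustration only; the package names are those from the install commands above):

```shell
# Map a DGX platform name to its system meta packages (names taken from the steps above).
dgx_meta_packages() {
  case "$1" in
    dgx-1)    echo "dgx1-system-configurations dgx1-system-tools dgx1-system-tools-extra" ;;
    dgx-2)    echo "dgx2-system-configurations dgx2-system-tools dgx2-system-tools-extra" ;;
    dgx-a100) echo "dgx-a100-system-configurations dgx-a100-system-tools dgx-a100-system-tools-extra" ;;
    *)        echo "unknown platform: $1" >&2; return 1 ;;
  esac
}

# Example usage: sudo apt install -y $(dgx_meta_packages dgx-a100)
dgx_meta_packages dgx-a100
```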

5.2. Configuring Data Drives

The data drives in the DGX systems can be configured as RAID 0 or RAID 5. RAID 0 provides the maximum storage capacity and performance, but does not provide any redundancy.

RAID 0 is often used for data caching. You can use cachefilesd to provide a cache for NFS shares.

Important: You can change the RAID level later, but doing so will destroy the data on those drives.

Except on the DGX-1, the RAID array can be configured during the Ubuntu installation. If you have already configured the RAID array during the Ubuntu installation, skip the first step and go to step 2.

  1. Configure the /raid partition.

    All DGX systems support RAID 0 and RAID 5 arrays.

    • To create a RAID 0 array:
      sudo /usr/bin/configure_raid_array.py -c -f 
    • To create a RAID 5 array:
      sudo /usr/bin/configure_raid_array.py -c -f -5
    The command creates the /raid mount point and RAID array, and adds a corresponding entry in /etc/fstab.
  2. Optional: Install tools for managing the self-encrypting drives (SED) for the data drives on the DGX A100.

    This requires storing the keys in the TPM or using external key servers. Refer to the “Managing Self-Encrypting Drives” section in the DGX A100 User Guide for usage information.

    1. Install the nv-disk-encrypt package.
      sudo apt install -y nv-disk-encrypt
    2. Reboot the system.
      sudo reboot
  3. Optional: If you wish to use your RAID array for read caching of NFS mounts, you can install cachefilesd and set the cachefs option for an NFS share.
    1. Install cachefilesd and nvidia-conf-cachefilesd.

      This will update the cachefilesd configuration to use the /raid partition.

      sudo apt install -y cachefilesd nvidia-conf-cachefilesd
    2. Enable caching on all NFS shares you want to cache by setting the fsc flag.

      Edit /etc/fstab and add the fsc flag to the mount options as shown in this example.

      <nfs_server>:<export_path> /mnt nfs rw,noatime,rsize=32768,wsize=32768,nolock,tcp,intr,fsc,nofail 0 0 
    3. Mount the NFS share.

      If the share is already mounted, use the remount option.

      mount -o remount <mount-point>
    4. To validate that caching is enabled, issue the following.
      cat /proc/fs/nfsfs/volumes
      Look for the text FSC=yes in the output of the command. The NFS will be mounted with caching enabled upon subsequent reboot cycles.
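Editing /etc/fstab by hand amounts to appending fsc to a share's mount options; the transformation can be sketched as a small helper (illustrative only, not part of the DGX tooling):

```shell
# Append the fsc flag to an fstab mount-options string if it is not already present.
add_fsc() {
  case ",$1," in
    *,fsc,*) echo "$1" ;;        # fsc already set, leave the options unchanged
    *)       echo "$1,fsc" ;;    # enable FS-Cache for this mount
  esac
}

add_fsc "rw,noatime,rsize=32768,wsize=32768,nolock,tcp,intr,nofail"
# -> rw,noatime,rsize=32768,wsize=32768,nolock,tcp,intr,nofail,fsc
```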

5.3. Installing NVIDIA CUDA Drivers

You have the option to choose between different NVIDIA CUDA driver releases for your DGX system. The latest driver release includes new features but might not provide the same support duration as an older release.

Consult the Data Center Driver Release Notes for more details and the minimum required driver release for the GPU architecture.

Use the following commands to display a list of available drivers.

  1. Ensure to have the latest version of the package database.
    sudo apt update
  2. Display a list of all available drivers.
    sudo apt list nvidia-driver*server 
    Example Output:
    nvidia-driver-418-server/focal-updates,focal-security 418.226.00-0ubuntu0.20.04.2 amd64 
    nvidia-driver-450-server/focal-updates,focal-security 450.156.00-0ubuntu0.20.04.2 amd64
    . . . 
    The following steps install the NVIDIA CUDA driver and configure the system. Replace the release version used in the examples (470) with the release you want to install. Ensure that the driver release you intend to install supports the GPUs in the system.
  3. Ensure to have the latest version of the package database.
    sudo apt update
  4. Ensure you have the latest kernel version installed.

    The driver package has a dependency on the kernel, and updating the database might have updated the version information.

    sudo apt install -y linux-generic
  5. Install NVIDIA CUDA driver.
    • For non-NVswitch systems like DGX-1:
      sudo apt install -y nvidia-driver-470-server linux-modules-nvidia-470-server-generic libnvidia-nscq-470 nvidia-modprobe datacenter-gpu-manager nv-persistence-mode 
    • For NVswitch systems like DGX-2 and DGX A100, be sure to also install the fabric-manager package:
      sudo apt install -y nvidia-driver-470-server linux-modules-nvidia-470-server-generic libnvidia-nscq-470 nvidia-modprobe nvidia-fabricmanager-470 datacenter-gpu-manager nv-persistence-mode 
  6. Enable the persistenced daemon and other services:
    • For non-NVswitch systems, such as DGX-1:
      sudo systemctl enable nvidia-persistenced nvidia-dcgm
    • For NVswitch systems like DGX-2 and DGX A100, be sure to also enable the NVIDIA fabric manager service:
      sudo systemctl enable nvidia-fabricmanager nvidia-persistenced nvidia-dcgm
  7. Reboot the system to load the drivers and to update system configurations.
    1. Issue reboot.
      sudo reboot 
    2. After the system has rebooted, verify that the drivers have been loaded and are handling the NVIDIA devices.
      nvidia-smi
      The output should show all available GPUs and show the Persistence-Mode ‘On’:
      +-----------------------------------------------------------------------------+
      | NVIDIA-SMI 450.119.04 Driver Version: 450.119.04 CUDA Version: 11.4 | 
      |----------------------------+----------------------+-----------------------+ 
      | GPU Name      Persistence-M| Bus-Id        Disp.A | Volatile Uncorr.  ECC | 
      | Fan Temp Perf Pwr:Usage/Cap|         Memory-Usage | GPU-Util   Compute M. | 
      |                            |                      |                MIG M. | 
      |============================+======================+=======================| 
      | 0 Tesla V100-SXM2...    On | 00000000:06:00.0 Off |                     0 | 
      | N/A   35C  P0   42W / 300W |      0MiB / 16160MiB |   0%          Default | 
      |                            |                      |                   N/A | 
      +----------------------------+----------------------+-----------------------+ 
      | 1 Tesla V100-SXM2...    On | 00000000:07:00.0 Off |                     0 | 
      | N/A   35C  P0   44W / 300W |      0MiB / 16160MiB |   0%          Default | 
      |                            |                      |                   N/A | 
      +----------------------------+----------------------+-----------------------+ 
      ... 
      +----------------------------+----------------------+-----------------------+ 
      | 7 Tesla V100-SXM2...    On | 00000000:8A:00.0 Off |                     0 | 
      | N/A  35C  P0   43W / 300W  |      0MiB / 16160MiB |   0%          Default | 
      |                            |                      |                   N/A | 
      +----------------------------+----------------------+-----------------------+ 
      +---------------------------------------------------------------------------+ 
      | Processes:                                                                | 
      | GPU   GI   CI        PID     Type      Process name            GPU Memory | 
      |       ID   ID                                                  Usage      | 
      |===========================================================================| 
      | No running processes found                                                | 
      +---------------------------------------------------------------------------+ 

5.4. Installing the Mellanox OpenFabrics Enterprise Distribution (MLNX_OFED)

DGX systems include high-performance network cards to connect to other systems over InfiniBand or Ethernet. You can choose between the driver included in Ubuntu and the Mellanox OpenFabrics Enterprise Distribution (Mellanox OFED or MOFED). MOFED provides the drivers and system software required for multi-node GPU applications, allowing systems to transfer data directly between the GPUs of different systems (RDMA) without copying the data to system memory first. This also requires the nv-peer-memory module.

The following steps install MOFED and all the required additional software.

  1. Install the MOFED Driver
    $ sudo apt install -y mlnx-ofed-all nvidia-mlnx-ofed-misc
  2. Unload the nv-peer-mem module.
    $ sudo rmmod nv_peer_mem
  3. Enable and start the openibd service.
    $ sudo systemctl enable --now openibd
  4. Enable nv-peer-mem.
    $ sudo update-rc.d nv_peer_mem defaults

    This service loads the peer-mem module on every boot.
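After rebooting, a quick sanity check can confirm that MOFED is installed and the peer-memory module is loaded. A minimal sketch (ofed_info ships with MOFED; the exact output varies per system):

```shell
# Report the installed MOFED version and whether nv_peer_mem is currently loaded.
mofed_status() {
  if command -v ofed_info >/dev/null 2>&1; then
    ofed_info -s                                   # prints the MOFED version string
  else
    echo "ofed_info not found: MOFED is not installed"
  fi
  if grep -q '^nv_peer_mem' /proc/modules 2>/dev/null; then
    echo "nv_peer_mem loaded"
  else
    echo "nv_peer_mem not loaded"
  fi
}

mofed_status
```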

5.5. Installing Docker and the NVIDIA Container Toolkit

Containers provide isolated environments with a full filesystem of the software required for specific applications. To use the NVIDIA-provided containers for AI and other frameworks on DGX GPUs, you need to install Docker and the NVIDIA Container Toolkit, which gives software running inside a container access to the GPUs.

Note that these tools are also required by the Firmware Update Containers for upgrading the system firmware.

  1. Install docker-ce, the NVIDIA Container Toolkit, and optimizations for typical DL workloads.
    sudo apt install -y docker-ce nvidia-container-toolkit nv-docker-options
  2. Restart the docker daemon.
    sudo systemctl restart docker 

To validate the installation, run a container and check that it can access the GPUs. The following instructions assume that the NVIDIA GPU driver has been installed and loaded.

See the section Running Containers for more information about this command. For a description of nvcr.io, see the NGC Registry Spaces documentation.

Note: This validation downloads a container from the NGC registry and requires that the system has internet access.
  1. Execute the following command to start a container and run the nvidia-smi tool inside it:
    sudo docker run --gpus=all --rm nvcr.io/nvidia/cuda:11.0-base nvidia-smi
    Example Output
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 450.102.04   Driver Version: 450.102.04   CUDA Version: 11.0     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  Tesla V100-SXM2...  On   | 00000000:06:00.0 Off |                    0 |
    | N/A   35C    P0    42W / 300W |      0MiB / 16160MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   1  Tesla V100-SXM2...  On   | 00000000:07:00.0 Off |                    0 |
    | N/A   35C    P0    44W / 300W |      0MiB / 16160MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    ...
    +-------------------------------+----------------------+----------------------+
    |   7  Tesla V100-SXM2...  On   | 00000000:8A:00.0 Off |                    0 |
    | N/A   35C    P0    43W / 300W |      0MiB / 16160MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
  2. Verify that the output shows all available GPUs and has Persistence-Mode set to On.

5.6. Installing the NVIDIA System Management (NVSM) Tool [Recommended]

NVIDIA System Management (NVSM) is a software framework for monitoring NVIDIA DGX nodes in a data center. It gives customers a quick health report of the system and is typically required by the NVIDIA support team to resolve issues.

The following steps install and configure NVSM.

  1. Install the NVIDIA System Management tool (NVSM):
    sudo apt install -y nvsm
  2. Optional: Modify message-of-the-day (MOTD) to display NVSM health monitoring alerts and release information.
    sudo apt install -y nvidia-motd
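Once installed, NVSM can be exercised from the command line. A minimal sketch of a health query, guarded so it only runs where the tool is present (run with root privileges on the DGX):

```shell
# Run a quick NVSM health check if the tool is available.
nvsm_health() {
  if command -v nvsm >/dev/null 2>&1; then
    sudo nvsm show health          # summarizes overall system health
  else
    echo "nvsm not installed"
  fi
}

nvsm_health
```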

6. Installing Additional Software

The Ubuntu and NVIDIA repositories provide many additional software packages for a variety of applications. The DGX OS Installer, for example, installs several packages that aid system administrators and developers but are not installed by default.

The following steps install the additional software packages installed by the DGX OS Installer:

  1. Install additional software for system administration tasks:
     sudo apt install -y chrpath cifs-utils fping gdisk iperf ipmitool lsscsi net-tools nfs-common quota rasdaemon pm-utils samba-common samba-libs sysstat vlan
  2. Install additional software for development tasks:
    sudo apt install -y build-essential automake bison cmake dpatch flex gcc-multilib gdb g++-multilib libelf-dev libltdl-dev linux-tools-generic m4 swig

The NVIDIA CUDA Developer repository provides an easy mechanism to deploy NVIDIA tools and libraries, such as the CUDA toolkit, cuDNN, or NCCL.

Refer also to Installing Additional Software for more information about installing and upgrading software packages.

7. Additional Software and Configuration Options

For further installation and configuration options, refer to the DGX OS 5 User Guide; its chapters also apply to the DGX Software Stack installation method.

A. Third-Party License Notices

This NVIDIA product contains third party software that is being made available to you under their respective open source software licenses. Some of those licenses also require specific legal information to be included in the product. This section provides such information.

A.1. Mellanox (OFED)

MLNX OFED (http://www.mellanox.com/) is provided under the following terms:
Copyright (c) 2006 Mellanox Technologies.
All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

A.2. msecli

The msecli utility is provided under the following terms:
Micron Technology, Inc. Software License Agreement
PLEASE READ THIS LICENSE AGREEMENT ("AGREEMENT") FROM MICRON TECHNOLOGY, INC. ("MTI") CAREFULLY: BY INSTALLING, COPYING OR OTHERWISE USING THIS SOFTWARE AND ANY RELATED PRINTED MATERIALS ("SOFTWARE"), YOU ARE ACCEPTING AND AGREEING TO THE TERMS OF THIS AGREEMENT. IF YOU DO NOT AGREE WITH THE TERMS OF THIS AGREEMENT, DO NOT INSTALL THE SOFTWARE.

LICENSE: MTI hereby grants to you the following rights: You may use and make one (1) backup copy of the Software subject to the terms of this Agreement. You must maintain all copyright notices on all copies of the Software. You agree not to modify, adapt, decompile, reverse engineer, disassemble, or otherwise translate the Software. MTI may make changes to the Software at any time without notice to you. In addition, MTI is under no obligation whatsoever to update, maintain, or provide new versions or other support for the Software.
OWNERSHIP OF MATERIALS: You acknowledge and agree that the Software is proprietary property of MTI (and/or its licensors) and is protected by
United States copyright law and international treaty provisions. Except as expressly provided herein, MTI does not grant any express or implied right
to you under any patents, copyrights, trademarks, or trade secret information. You further acknowledge and agree that all right, title, and
interest in and to the Software, including associated proprietary rights, are and shall remain with MTI (and/or its licensors). This Agreement does not convey to you an interest in or to the Software, but only
a limited right to use and copy the Software in accordance with the terms of this Agreement. The Software is licensed to you and not sold.
DISCLAIMER OF WARRANTY: THE SOFTWARE IS PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND. MTI EXPRESSLY DISCLAIMS ALL WARRANTIES EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO, NONINFRINGEMENT OF THIRD PARTY
RIGHTS, AND ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. MTI DOES NOT WARRANT THAT THE SOFTWARE WILL MEET YOUR
REQUIREMENTS, OR THAT THE OPERATION OF THE SOFTWARE WILL BE UNINTERRUPTED OR ERROR-FREE. FURTHERMORE, MTI DOES NOT MAKE ANY
REPRESENTATIONS REGARDING THE USE OR THE RESULTS OF THE USE OF THE SOFTWARE IN TERMS OF ITS CORRECTNESS, ACCURACY, RELIABILITY, OR
OTHERWISE. THE ENTIRE RISK ARISING OUT OF USE OR PERFORMANCE OF THE SOFTWARE REMAINS WITH YOU. IN NO EVENT SHALL MTI, ITS AFFILIATED
COMPANIES OR THEIR SUPPLIERS BE LIABLE FOR ANY DIRECT, INDIRECT, CONSEQUENTIAL, INCIDENTAL, OR SPECIAL DAMAGES (INCLUDING, WITHOUT
LIMITATION, DAMAGES FOR LOSS OF PROFITS, BUSINESS INTERRUPTION, OR LOSS OF INFORMATION) ARISING OUT OF YOUR USE OF OR INABILITY TO USE THE
SOFTWARE, EVEN IF MTI HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. Because some jurisdictions prohibit the exclusion or limitation of liability for consequential or incidental damages, the above limitation may not apply to you.

TERMINATION OF THIS LICENSE: MTI may terminate this license at any time if you are in breach of any of the terms of this Agreement. Upon termination,
you will immediately destroy all copies of the Software.

GENERAL: This Agreement constitutes the entire agreement between MTI and you
regarding the subject matter hereof and supersedes all previous oral or written communications between the parties. This Agreement shall be governed
by the laws of the State of Idaho without regard to its conflict of laws rules.

CONTACT: If you have any questions about the terms of this Agreement, please contact MTI's legal department at (208) 368-4500.
By proceeding with the installation of the Software, you agree to the terms of this Agreement. You must agree to the terms in order to install and use
the Software.

Notices

Notice

This document is provided for information purposes only and shall not be regarded as a warranty of a certain functionality, condition, or quality of a product. NVIDIA Corporation (“NVIDIA”) makes no representations or warranties, expressed or implied, as to the accuracy or completeness of the information contained in this document and assumes no responsibility for any errors contained herein. NVIDIA shall have no liability for the consequences or use of such information or for any infringement of patents or other rights of third parties that may result from its use. This document is not a commitment to develop, release, or deliver any Material (defined below), code, or functionality.

NVIDIA reserves the right to make corrections, modifications, enhancements, improvements, and any other changes to this document, at any time without notice.

Customer should obtain the latest relevant information before placing orders and should verify that such information is current and complete.

NVIDIA products are sold subject to the NVIDIA standard terms and conditions of sale supplied at the time of order acknowledgement, unless otherwise agreed in an individual sales agreement signed by authorized representatives of NVIDIA and customer (“Terms of Sale”). NVIDIA hereby expressly objects to applying any customer general terms and conditions with regards to the purchase of the NVIDIA product referenced in this document. No contractual obligations are formed either directly or indirectly by this document.

NVIDIA products are not designed, authorized, or warranted to be suitable for use in medical, military, aircraft, space, or life support equipment, nor in applications where failure or malfunction of the NVIDIA product can reasonably be expected to result in personal injury, death, or property or environmental damage. NVIDIA accepts no liability for inclusion and/or use of NVIDIA products in such equipment or applications and therefore such inclusion and/or use is at customer’s own risk.

NVIDIA makes no representation or warranty that products based on this document will be suitable for any specified use. Testing of all parameters of each product is not necessarily performed by NVIDIA. It is customer’s sole responsibility to evaluate and determine the applicability of any information contained in this document, ensure the product is suitable and fit for the application planned by customer, and perform the necessary testing for the application in order to avoid a default of the application or the product. Weaknesses in customer’s product designs may affect the quality and reliability of the NVIDIA product and may result in additional or different conditions and/or requirements beyond those contained in this document. NVIDIA accepts no liability related to any default, damage, costs, or problem which may be based on or attributable to: (i) the use of the NVIDIA product in any manner that is contrary to this document or (ii) customer product designs.

No license, either expressed or implied, is granted under any NVIDIA patent right, copyright, or other NVIDIA intellectual property right under this document. Information published by NVIDIA regarding third-party products or services does not constitute a license from NVIDIA to use such products or services or a warranty or endorsement thereof. Use of such information may require a license from a third party under the patents or other intellectual property rights of the third party, or a license from NVIDIA under the patents or other intellectual property rights of NVIDIA.

Reproduction of information in this document is permissible only if approved in advance by NVIDIA in writing, reproduced without alteration and in full compliance with all applicable export laws and regulations, and accompanied by all associated conditions, limitations, and notices.

THIS DOCUMENT AND ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, “MATERIALS”) ARE BEING PROVIDED “AS IS.” NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE. TO THE EXTENT NOT PROHIBITED BY LAW, IN NO EVENT WILL NVIDIA BE LIABLE FOR ANY DAMAGES, INCLUDING WITHOUT LIMITATION ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, PUNITIVE, OR CONSEQUENTIAL DAMAGES, HOWEVER CAUSED AND REGARDLESS OF THE THEORY OF LIABILITY, ARISING OUT OF ANY USE OF THIS DOCUMENT, EVEN IF NVIDIA HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. Notwithstanding any damages that customer might incur for any reason whatsoever, NVIDIA’s aggregate and cumulative liability towards customer for the products described herein shall be limited in accordance with the Terms of Sale for the product.

VESA DisplayPort

DisplayPort and DisplayPort Compliance Logo, DisplayPort Compliance Logo for Dual-mode Sources, and DisplayPort Compliance Logo for Active Cables are trademarks owned by the Video Electronics Standards Association in the United States and other countries.

HDMI

HDMI, the HDMI logo, and High-Definition Multimedia Interface are trademarks or registered trademarks of HDMI Licensing LLC.

OpenCL

OpenCL is a trademark of Apple Inc. used under license to the Khronos Group Inc.

Trademarks

NVIDIA, the NVIDIA logo, DGX, DGX-1, DGX-2, DGX A100, DGX Station, and DGX Station A100 are trademarks and/or registered trademarks of NVIDIA Corporation in the United States and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.