Upgrading or Installing Additional Software

DGX OS 5 is an optimized version of the Ubuntu 20.04 Linux distribution with access to a large collection of additional software that is available from the Ubuntu and NVIDIA repositories. You can install the additional software using the apt command or through a graphical tool.

Note

The graphical tool is only available for DGX Station and DGX Station A100.

For more information about additional software available from Ubuntu, refer to Install additional applications.

Before you install additional software or upgrade installed software, refer to the Release Notes for the latest release information.

Upgrading the System

Before you install additional software, upgrade the system to the latest versions so that you have access to software releases that were added to the repositories since your last upgrade. Refer to Upgrading DGX OS for more information and for instructions, including how to enable Ubuntu’s Extended Security Maintenance updates.

Important

You will only see the latest software branches after upgrading DGX OS.

Note

When you switch between software branches, such as for the GPU driver or CUDA Toolkit, you must install the packages for the new branch. Depending on the software, installing the new branch either removes the existing branch or leaves it installed alongside the new one.

Changing Your GPU Branch

NVIDIA drivers are released as precompiled and signed kernel modules by Canonical and are available directly from the Ubuntu repository. Signed drivers are required to verify the integrity of driver packages and identity of the vendor.

However, the verification process requires that Canonical build and release the drivers with Ubuntu kernel updates after their release cycle is complete, and this process might sometimes delay new driver branch releases and updates. For more information about the NVIDIA driver release, refer to the NVIDIA Driver Documentation.

Important

The Ubuntu repositories provide the following versions of the signed and precompiled NVIDIA drivers:

The general NVIDIA display drivers
The NVIDIA Data Center GPU drivers

On your DGX system, install only the packages that include the NVIDIA Data Center GPU drivers. The metapackages for the NVIDIA Data Center GPU driver have the -server suffix.

Checking the Currently Installed Driver Branch

Before you install a new NVIDIA driver branch, run the following command to check which driver branch is currently installed:

            $ apt list --installed nvidia-driver*server

Determining the New Available Driver Branches

To see which NVIDIA driver branches are available, run the following command:

            $ apt list nvidia-driver*server
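
As a rough sketch, the output of either apt list command above can be reduced to bare branch numbers with a small shell filter. The function name driver_branches is a hypothetical helper, not something that ships with DGX OS, and the filter assumes package names of the form nvidia-driver-<branch>-server.

```shell
# driver_branches: read `apt list` output on stdin and print the unique
# NVIDIA Data Center driver branch numbers (for example, 470), one per line.
# Hypothetical helper; it is not part of DGX OS.
driver_branches() {
    sed -n 's/^nvidia-driver-\([0-9][0-9]*\)-server\/.*/\1/p' | sort -un
}

# Intended use on a DGX system. Quoting the pattern keeps the shell from
# expanding the * against file names in the current directory.
#   apt list --installed 'nvidia-driver*server' 2>/dev/null | driver_branches
#   apt list 'nvidia-driver*server' 2>/dev/null | driver_branches
```

Comparing the two results shows whether a newer branch than the installed one is available.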

Upgrading Your GPU Branch

Warning

R510 is a transitional package that automatically transitions you to R515 and should not be installed. Instead, use R470 or R515.

To manually upgrade your driver to the latest branch:

Purge the existing driver.

In this example, the R450 driver packages will be removed first. Whether you upgrade or downgrade the NVIDIA GPU driver, the old drivers should be removed.
```
$ sudo apt-get purge "*nvidia*450*"
```

Install the latest kernel.

            $ sudo apt install -y linux-generic

To install the latest NVIDIA GPU driver, for example, R470, complete one of the following tasks:
- On Non-Fabric Manager systems, such as DGX-1, DGX Station V100 (Volta), and DGX Station A100, run the following command:
```
$ sudo apt install -y linux-modules-nvidia-470-server-generic nvidia-driver-470-server libnvidia-nscq-470 nvidia-modprobe nvidia-conf-xconfig nv-docker-gpus
```
- On Fabric Manager systems, such as DGX-2 and DGX A100, run the same command, but append the nvidia-fabricmanager-470 package:
```
$ sudo apt install -y linux-modules-nvidia-470-server-generic nvidia-driver-470-server libnvidia-nscq-470 nvidia-modprobe nvidia-fabricmanager-470
```
Note

The driver versions are only used as examples. Replace them with the version that you want to install.
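
To make the branch substitution less error-prone, the two package lists above could be generated from a branch number. This is a minimal sketch that mirrors the two commands exactly as shown; the function name driver_packages and its "fabric" argument are hypothetical conventions chosen for illustration.

```shell
# driver_packages BRANCH [fabric]: print the package list from the commands
# above for the given driver branch. Pass "fabric" as the second argument for
# Fabric Manager systems such as DGX-2 and DGX A100.
# Hypothetical helper; the function body runs in a subshell so that its
# variables do not leak into the calling shell.
driver_packages() (
    branch=$1
    base="linux-modules-nvidia-${branch}-server-generic nvidia-driver-${branch}-server libnvidia-nscq-${branch} nvidia-modprobe"
    if [ "${2:-}" = "fabric" ]; then
        echo "${base} nvidia-fabricmanager-${branch}"
    else
        echo "${base} nvidia-conf-xconfig nv-docker-gpus"
    fi
)

# Intended use (the unquoted expansion is deliberate so that each package
# becomes its own argument):
#   sudo apt install -y $(driver_packages 470)
#   sudo apt install -y $(driver_packages 470 fabric)
```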

Before you reboot your DGX-2 or DGX A100 system, enable the nvidia-fabricmanager service.

            $ sudo systemctl unmask nvidia-fabricmanager

            $ sudo systemctl enable nvidia-fabricmanager

If you are using a DGX-1, DGX-2, or DGX A100, install the nvidia-peer-memory package:

            $ sudo apt install -y --reinstall nvidia-peer-memory-dkms

Enable the nvidia-peer-memory (nv_peer_mem) service so that it starts at boot:

            $ sudo /usr/sbin/update-rc.d nv_peer_mem defaults

If you are upgrading from a branch older than R515 to branch R515 or newer, or downgrading from branch R515 or newer to a branch older than R515, install the correct DCGM version. Otherwise, you can skip this step.
- If you are upgrading to a branch R515 or newer from a branch older than R515, identify the latest DCGM 3.x version. The following example output is in the format printed by apt-cache policy datacenter-gpu-manager:
```
datacenter-gpu-manager:
  Installed: 1:3.0.4
  Candidate: 1:3.1.3
  Version table:
     1:3.1.3 580
        580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  Packages
     1:3.0.4 580
        580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  Packages
        100 /var/lib/dpkg/status
  *** 1:2.4.7 600
        580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  Packages
        600 https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64 focal-updates/common amd64 Packages
```
  In the output above, the latest DCGM 3.x version is 1:3.1.3. Install that version:
```
$ sudo apt install datacenter-gpu-manager=1:3.1.3
```
- If you are downgrading from R515 or a newer branch to a branch older than R510 (R510 is a transitional package for R515), install DCGM 2.x:
```
$ sudo apt install datacenter-gpu-manager/$(lsb_release -cs)-updates -y --allow-downgrades
```
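
Picking the latest DCGM 3.x version by eye can be scripted. The following is a sketch only: it assumes that the version listing comes from apt-cache policy datacenter-gpu-manager, the function name latest_dcgm3 is hypothetical, and sort -V only approximates Debian version ordering (which is adequate for versions shaped like 1:3.1.3).

```shell
# latest_dcgm3: read `apt-cache policy datacenter-gpu-manager` output on
# stdin and print the highest 3.x version listed in the version table.
# Hypothetical helper; not part of DGX OS or DCGM.
latest_dcgm3() {
    # Version-table entries start with the version, or with "***" followed
    # by the version when that entry is the installed one.
    awk '$1 ~ /^[0-9]+:3\./ { print $1 }
         $1 == "***" && $2 ~ /^[0-9]+:3\./ { print $2 }' | sort -V | tail -n 1
}

# Intended use:
#   ver=$(apt-cache policy datacenter-gpu-manager | latest_dcgm3)
#   sudo apt install "datacenter-gpu-manager=${ver}"
```

If the function prints nothing, no 3.x version is visible in the configured repositories.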
Note

The driver branches R510 and earlier depend on NSCQ v1, while R515 and later have a dependency on NSCQ v2. They require different releases of DCGM that are hosted in different repositories (DGX and CUDA). The DGX repository is configured with a higher priority to prevent APT from upgrading DCGM to an unsupported version when a driver release R510 or older is installed.

The steps above override the version to install DCGM 3.x for drivers R515+. Once the installed version is greater than the prioritized version, the APT preferences will no longer be used. Users will be able to use APT for DCGM 3.x upgrades as part of the usual “apt upgrade” process.
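
The rule for when a DCGM change is needed can be expressed as a small decision function. This sketch assumes numeric branch identifiers such as 470 or 515 and uses a hypothetical function name, dcgm_action. R510 is not special-cased because, as noted earlier, it is a transitional package that should not be installed.

```shell
# dcgm_action OLD NEW: print which DCGM step applies when moving from driver
# branch OLD to driver branch NEW.
# Hypothetical helper; the body runs in a subshell to keep variables local.
dcgm_action() (
    old=$1
    new=$2
    if [ "$old" -lt 515 ] && [ "$new" -ge 515 ]; then
        echo "install DCGM 3.x"      # crossing up to NSCQ v2 drivers
    elif [ "$old" -ge 515 ] && [ "$new" -lt 515 ]; then
        echo "install DCGM 2.x"      # crossing back down to NSCQ v1 drivers
    else
        echo "skip"                  # both branches are on the same side of R515
    fi
)
```

For example, dcgm_action 470 515 reports that DCGM 3.x must be installed, while dcgm_action 450 470 reports that the step can be skipped.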

Installing or Upgrading to a Newer CUDA Toolkit Release

Only DGX Station and DGX Station A100 have a CUDA Toolkit release installed by default. DGX servers are intended to be shared resources that use containers and do not have CUDA Toolkit installed by default. However, you have the option to install a qualified CUDA Toolkit release.

Although DGX OS supports all CUDA Toolkit releases that interoperate with the installed driver, the default CUDA Toolkit release included in a DGX OS release might not be the most recent version. Unless you need features that are available only in a newer CUDA Toolkit version, we recommend that you remain on the default version that is included in the DGX OS release. Refer to the DGX OS Software Release Notes for the default CUDA Toolkit release.

Important

Before you install or upgrade to any CUDA Toolkit release, ensure the release is compatible with the driver that is installed on the system. Refer to CUDA Compatibility for more information and a compatibility matrix.

CUDA Compatibility Matrix and Forward Compatibility

Each CUDA toolkit requires a minimum GPU driver version. This compatibility matrix is documented in CUDA Compatibility: Use the Right Compat Package.

A newer CUDA Toolkit may be used with older GPU drivers if the appropriate forward compatibility package is installed. Refer to CUDA Compatibility for more information.

Checking the Currently Installed CUDA Toolkit Release

This section describes how to determine which CUDA Toolkit release is currently installed.

Important

The CUDA Toolkit is not installed on DGX servers by default, and if you try to run the following command, no installed package will be listed.

To check the currently installed release before you install a new CUDA Toolkit release, run the following command:

            $ apt list --installed cuda-toolkit-*

For example, the following output shows that CUDA Toolkit 11.0 is installed:

            $ apt list --installed cuda-toolkit-*
            Listing... Done
            cuda-toolkit-11-0/unknown,unknown,now 11.0.3-1 amd64 [installed]
            N: There is 1 additional version. Please use the '-a' switch to see it
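
If the installed release is needed in a script, the listing can be reduced to a dotted version number. This is a sketch under the assumption that package names follow the cuda-toolkit-<major>-<minor> pattern shown above; the function name installed_cuda_releases is hypothetical.

```shell
# installed_cuda_releases: read `apt list` output on stdin and print the
# release (for example, 11.0) of every cuda-toolkit package marked installed.
# Hypothetical helper; not part of DGX OS or the CUDA Toolkit.
installed_cuda_releases() {
    sed -n 's/^cuda-toolkit-\([0-9][0-9]*\)-\([0-9][0-9]*\)\/.*\[installed.*/\1.\2/p'
}

# Intended use (the quoted pattern is passed to apt unexpanded):
#   apt list --installed 'cuda-toolkit-*' 2>/dev/null | installed_cuda_releases
```

On a DGX server without the CUDA Toolkit, the function prints nothing.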

Installing or Upgrading the CUDA Toolkit

Complete the following steps to see which CUDA Toolkit releases are available and to install the release that you want:

Update the local package database with the latest information from the repositories.
```
$ sudo apt update
```

Show all available CUDA Toolkit releases.

            $ apt list cuda-toolkit-*

The following output shows that 11.0 is already installed and 11.1 and 11.2 are the possible CUDA Toolkit versions that can be installed:

            Listing... Done
            cuda-toolkit-11-0/unknown,unknown,now 11.0.3-1 amd64 [installed]
            cuda-toolkit-11-1/unknown,unknown 11.1.1-1 amd64
            cuda-toolkit-11-2/unknown,unknown 11.2.1-1 amd64

To install or upgrade the CUDA Toolkit, run the following:
```
$ sudo apt install cuda-toolkit-<version>
```
Replace <version> with the version that you want to install. You only need to specify the first two fields, separated by a hyphen as in the package names listed above, for example, 11-1 or 11-2.
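
Assuming that package names follow the cuda-toolkit-11-1 pattern shown in the listing above, the package name can be derived from a dotted CUDA version. The function name cuda_toolkit_pkg is a hypothetical helper used only for illustration.

```shell
# cuda_toolkit_pkg VERSION: print the cuda-toolkit package name for a dotted
# CUDA version. Only the first two fields are kept, so 11.1.1 and 11.1 both
# map to the same package.
# Hypothetical helper; not part of DGX OS or the CUDA Toolkit.
cuda_toolkit_pkg() {
    printf 'cuda-toolkit-%s\n' "$(printf '%s\n' "$1" | cut -d. -f1,2 | sed 's/\./-/g')"
}

# Intended use:
#   sudo apt install "$(cuda_toolkit_pkg 11.2)"
```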

Installing or Upgrading GPUDirect Storage

NVIDIA Magnum IO GPUDirect Storage (GDS) enables a direct data path for direct memory access (DMA) transfers between GPU memory and storage, which avoids a bounce buffer through the CPU.

GPUDirect Storage requires that the CUDA Toolkit is installed because the toolkit provides the GDS user space components (libcufile and tools). Refer to Installing or Upgrading the CUDA Toolkit for installation instructions.

Configuring IOMMU

For optimal GDS performance on the DGX-1, DGX-2, and DGX Station, it is recommended to disable the IOMMU to avoid incurring a DMAR penalty. The DGX A100 and DGX Station A100 by default set the IOMMU to passthrough mode to avoid incurring a DMAR penalty, so no additional change is needed for those platforms.

Edit the grub bootloader configuration:

            $ sudo vi /etc/default/grub

Add one of the following options to the GRUB_CMDLINE_LINUX_DEFAULT variable:
- DGX A100, DGX Station A100: amd_iommu=off
- DGX-1, DGX-2, DGX Station: intel_iommu=off

If there are already other options, separate them with a space. For example, on a DGX-1, DGX-2, or DGX Station (on DGX A100 and DGX Station A100, use amd_iommu=off instead):

            GRUB_CMDLINE_LINUX_DEFAULT="console=tty0 intel_iommu=off"

Update the grub bootloader configuration:

            $ sudo update-grub

Reboot the system:

            $ sudo reboot

After the system reboots, verify that the change took effect:

            $ cat /proc/cmdline
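
Rather than reading the kernel command line by eye, the check can be scripted. This is a minimal sketch; the function name iommu_disabled is hypothetical, and it only looks for the two options named above as whole, space-separated words.

```shell
# iommu_disabled CMDLINE: succeed (exit status 0) if the given kernel command
# line contains intel_iommu=off or amd_iommu=off as a separate option.
# Hypothetical helper; not part of DGX OS.
iommu_disabled() {
    case " $1 " in
        *" intel_iommu=off "*|*" amd_iommu=off "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Intended use after the reboot:
#   iommu_disabled "$(cat /proc/cmdline)" && echo "IOMMU is disabled"
```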

Installing GPUDirect Storage

Perform the following to install the nvidia-gds package with the correct dependencies:

Set the ${NVIDIA_DRV_VERSION} environment variable to the driver version:

            $ NVIDIA_DRV_VERSION=$(cat /proc/driver/nvidia/version | grep Module | awk '{print $8}' | cut -d '.' -f 1)

Install the nvidia-gds package:

            $ sudo apt install nvidia-gds-<version>  nvidia-dkms-${NVIDIA_DRV_VERSION}-server

Replace <version> with the CUDA Toolkit version number, for example, 12-0.

For additional information, refer to NVIDIA GPUDirect Storage Installation and Troubleshooting Guide.
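
The two steps above can be wrapped in small functions so that the package names are assembled in one place. This is a sketch: the function names driver_major and gds_packages are hypothetical, and driver_major reuses the same grep, awk, and cut pipeline as the command above, so it assumes the same layout of /proc/driver/nvidia/version.

```shell
# driver_major: read the content of /proc/driver/nvidia/version on stdin and
# print the major driver version, using the same pipeline as the step above.
# Hypothetical helper; not part of DGX OS.
driver_major() {
    grep Module | awk '{print $8}' | cut -d '.' -f 1
}

# gds_packages CUDA_VERSION DRIVER_MAJOR: print the two package names that
# the install step expects, for example for CUDA version 12-0 and driver 470.
gds_packages() {
    echo "nvidia-gds-$1 nvidia-dkms-$2-server"
}

# Intended use (the unquoted expansion is deliberate):
#   drv=$(driver_major < /proc/driver/nvidia/version)
#   sudo apt install $(gds_packages 12-0 "$drv")
```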

Installing nvidia_peermem

For CUDA 11.5.1 and later, if you plan to use Weka FS or IBM SpectrumScale, load the nvidia_peermem module:

            $ sudo modprobe nvidia_peermem

This command loads the module that supports peer-direct capabilities. You must run it again after every reboot.

To load the module automatically at every boot, run the following command:

            $ echo "nvidia-peermem" | sudo tee /etc/modules-load.d/nvidia-peermem.conf

Note

If the nvidia_peer_memory module is not loading:

DGX OS 5.1.1 provides nv_peer_mem 1.2 and MLNX_OFED 5.4-3.1.0.0 to resolve an issue discovered in MLNX_OFED 5.4-1.0.3.0. nv_peer_mem 1.2 isn’t compatible with MLNX_OFED <= 5.4-1.0.3.0, and attempting to use nv_peer_mem 1.2 with MLNX_OFED <= 5.4-1.0.3.0 will result in an error such as the one below:

            $ cat /var/lib/dkms/nv_peer_mem/1.2/build/make.log
            DKMS make.log for nv_peer_mem-1.2 for kernel 5.4.0-92-generic (x86_64)
            Wed Jan 5 20:36:09 UTC 2022
            INFO: Building with MLNX_OFED from: /usr/src/ofa_kernel/default

If you must use MLNX_OFED <= 5.4-1.0.3.0 and have encountered this issue, then it is recommended to downgrade to nv_peer_mem 1.1.

            $ sudo apt install --reinstall nvidia-peer-memory-dkms=1.1-0-nvidia2
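
The compatibility rule in this note can be sketched as a version comparison. The function name peer_mem_for_ofed is hypothetical, the MLNX_OFED version string is assumed to be supplied by the caller in the form 5.4-3.1.0.0, and GNU sort -V is used as an approximation of the vendor's version ordering.

```shell
# peer_mem_for_ofed VERSION: print the nv_peer_mem release that the note
# above pairs with the given MLNX_OFED version string.
# Hypothetical helper; the body runs in a subshell to keep variables local.
peer_mem_for_ofed() (
    limit="5.4-1.0.3.0"
    # The smaller of the two versions sorts first under version sort.
    lowest=$(printf '%s\n' "$1" "$limit" | sort -V | head -n 1)
    if [ "$lowest" = "$1" ]; then
        echo "1.1"    # MLNX_OFED <= 5.4-1.0.3.0: nv_peer_mem 1.2 is not compatible
    else
        echo "1.2"    # newer MLNX_OFED: use the nv_peer_mem 1.2 provided by DGX OS 5.1.1
    fi
)
```

For instance, peer_mem_for_ofed 5.4-3.1.0.0 selects 1.2, while peer_mem_for_ofed 5.4-1.0.3.0 selects 1.1.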

Installing nvidia-gds

To use GDS, install the nvidia-gds package with the correct dependencies:

Set the ${NVIDIA_DRV_VERSION} variable:

            $ NVIDIA_DRV_VERSION=$(cat /proc/driver/nvidia/version | grep Module | awk '{print $8}' | cut -d '.' -f 1)

Install the nvidia-gds package:

            $ sudo apt install nvidia-gds-<ver> nvidia-dkms-${NVIDIA_DRV_VERSION}-server

Use the CUDA Toolkit version number in place of <ver>; for example, 11-4.