Managing OS and Software Updates#

DGX OS 7 is an optimized version of the Ubuntu 24.04 Linux distribution that provides access to an extensive collection of additional software available from the Ubuntu and NVIDIA repositories. For more information about additional software available from Ubuntu, refer to Install additional applications.

Before you install additional software or upgrade installed software, refer to the Release Notes for the latest release information. To install additional software, use the apt command or the graphical tool, which is available only on DGX Station A100 systems.

In addition, you can change your GPU branch and upgrade to a different CUDA Toolkit release to maintain or optimize the OS for your DGX systems.

Upgrading the System#

Before installing any additional software, upgrade the system to the latest versions. This ensures that you have access to software releases added to the repositories since your last upgrade. Refer to Upgrading the OS for more information and instructions, including how to enable Ubuntu’s Extended Security Maintenance updates.

Note

  • Before upgrading your system, consult the Release Notes for the upgrade path and supported DGX systems.

  • You will only see the latest software branches after upgrading the DGX OS.

  • When you switch between software branches, such as the GPU driver or CUDA Toolkit, you must install the packages for the new branch. Depending on the software, installing the new branch either removes the existing branch or leaves both branches installed concurrently.

Changing Your GPU Branch#

NVIDIA drivers are part of the CUDA repository. For more information about the NVIDIA driver release, refer to the release notes in NVIDIA Driver Documentation.

The DGX B200 system includes the fifth generation of NVIDIA NVLink® and NVLink Switch technology. With this version of NVLink, DGX OS 7 includes additional packages, such as nvlsm and libnvsdm, that enable full NVLink functionality. When you update the GPU driver, you must update the driver and the corresponding NVLink stack packages in the same transaction. The steps for updating a DGX B200 system are included in the NVIDIA open GPU kernel module driver instructions in Upgrading Your GPU Branch.

Checking the Currently Installed Driver Branch#

Before you install a new NVIDIA driver branch, check the currently installed driver branch by running the following command:

apt list --installed nvidia-driver*-open
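
The output lists packages such as nvidia-driver-570-open. As a sketch on sample data (not live system output), the branch number can be derived from a dpkg -l line with the same field pipeline that the upgrade steps later in this guide use:

```shell
# Sample dpkg -l line for an installed driver package (illustrative data only):
line="ii  nvidia-driver-570-open  570.86.15-0ubuntu1  amd64  NVIDIA driver"

# Squeeze repeated spaces, take the version field, and keep the branch prefix:
GPU_BRANCH=$(echo "$line" | tr -s " " | cut -d' ' -f3 | cut -d'.' -f1)
echo "$GPU_BRANCH"   # prints 570
```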

Determining the New Available Driver Branches#

To see the available NVIDIA driver branches:

  1. Update the local database with the latest information from the Ubuntu repository.

    sudo apt update
    
  2. Show the available NVIDIA open GPU kernel module driver branches.

    apt list nvidia-driver-*-open
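
The listing can be reduced to just the branch numbers. This is a sketch over sample output lines (actual output varies by repository state):

```shell
# Sample apt list output (illustrative only; actual branches vary):
list="nvidia-driver-550-open/noble 550.144.03-0ubuntu1 amd64
nvidia-driver-570-open/noble 570.86.15-0ubuntu1 amd64"

# Print only the branch numbers:
echo "$list" | sed -n 's/^nvidia-driver-\([0-9][0-9]*\)-open.*/\1/p'
```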
    

Upgrading Your GPU Branch#

To manually upgrade your driver to the latest branch:

  1. Install the latest kernel.

    • For x86_64 systems:

      sudo apt install -y linux-generic
      
    • For ARM64 systems:

      sudo apt install -y linux-nvidia-64k
      
  2. Upgrade the NVIDIA GPU driver.

    Note

    Choose the apt install command set below that is appropriate for your environment. Replace the 570 release of the GPU driver with the release family that you want to install.

    For DGX systems, the installed GPU driver release must be 570 or greater.

    • To install the NVIDIA open GPU kernel module drivers of a different release family from the current GPU driver, specify the packages with the -open string, for example, nvidia-driver-570-open:

      Note

      In the following commands, the trailing - character in "*nvidia*${GPU_BRANCH}*-" specifies that the currently installed GPU driver packages will be removed in the same transaction. Because this operation removes packages from the system, perform a dry run first to ensure that the correct packages will be removed.

      • On non-NVSwitch systems, such as DGX Station A100 and DGX Station A800, run the following commands:

        GPU_BRANCH=$(dpkg -l | grep nvidia-driver | tr -s " " | cut -d' ' -f3 | cut -d'.' -f1)
        
        # Specify --dry-run to check the packages to install.
        sudo apt install -y nvidia-driver-570-open libnvidia-nscq-570 nvidia-modprobe "*nvidia*${GPU_BRANCH}*-" --dry-run
        
        # Install the packages.
        sudo apt install -y nvidia-driver-570-open libnvidia-nscq-570 nvidia-modprobe "*nvidia*${GPU_BRANCH}*-"
        
      • On multinode NVLink systems, such as DGX GB200, run the same commands, but append the nvidia-imex-570 package:

        GPU_BRANCH=$(dpkg -l | grep nvidia-driver | tr -s " " | cut -d' ' -f3 | cut -d'.' -f1)
        
        # Specify --dry-run to check the packages to install.
        sudo apt install -y nvidia-driver-570-open libnvidia-nscq-570 nvidia-modprobe nvidia-imex-570 "*nvidia*${GPU_BRANCH}*-" --dry-run
        
        # Install the packages.
        sudo apt install -y nvidia-driver-570-open libnvidia-nscq-570 nvidia-modprobe nvidia-imex-570 "*nvidia*${GPU_BRANCH}*-"
        
      • On NVSwitch systems with fifth-generation NVLink, such as DGX B200, run the same commands, but append the nvidia-fabricmanager-570, nvlsm, and libnvsdm-570 packages:

        GPU_BRANCH=$(dpkg -l | grep nvidia-driver | tr -s " " | cut -d' ' -f3 | cut -d'.' -f1)
        
        # Specify --dry-run to check the packages to install.
        sudo apt install -y doca-ofed --dry-run
        sudo apt install -y nvidia-driver-570-open libnvidia-nscq-570 nvidia-modprobe nvidia-fabricmanager-570 nvlsm libnvsdm-570 "*nvidia*${GPU_BRANCH}*-" --dry-run
        
        # Install the packages.
        sudo apt install -y doca-ofed
        sudo apt install -y nvidia-driver-570-open libnvidia-nscq-570 nvidia-modprobe nvidia-fabricmanager-570 nvlsm libnvsdm-570 "*nvidia*${GPU_BRANCH}*-"
        
      • On NVSwitch systems without fifth-generation NVLink, such as DGX A100, DGX A800, DGX H800, and DGX H100/H200, run the same commands, but append the nvidia-fabricmanager-570 package:

        GPU_BRANCH=$(dpkg -l | grep nvidia-driver | tr -s " " | cut -d' ' -f3 | cut -d'.' -f1)
        
        # Specify --dry-run to check the packages to install.
        sudo apt install -y doca-ofed --dry-run
        sudo apt install -y nvidia-driver-570-open libnvidia-nscq-570 nvidia-modprobe nvidia-fabricmanager-570 "*nvidia*${GPU_BRANCH}*-" --dry-run
        
        # Install the packages.
        sudo apt install -y doca-ofed
        sudo apt install -y nvidia-driver-570-open libnvidia-nscq-570 nvidia-modprobe nvidia-fabricmanager-570 "*nvidia*${GPU_BRANCH}*-"
        
    • To install the NVIDIA open GPU kernel module drivers of the same release family as the current GPU driver, such as the 570 release:

      • On non-NVSwitch systems, such as DGX Station A100 and DGX Station A800, first remove the current driver and then install the package:

        # Remove the current driver.
        sudo apt-get purge "*nvidia*570*"
        
        # Install the packages.
        sudo apt install -y nvidia-driver-570-open libnvidia-nscq-570 nvidia-modprobe
        
      • On multinode NVLink systems, such as DGX GB200, run the same commands, but append the nvidia-imex-570 package:

        # Remove the current driver.
        sudo apt-get purge "*nvidia*570*"
        
        # Install the packages.
        sudo apt install -y nvidia-driver-570-open libnvidia-nscq-570 nvidia-modprobe nvidia-imex-570
        
      • On NVSwitch systems with fifth-generation NVLink, such as DGX B200, run the same commands, but append the nvidia-fabricmanager-570, nvlsm, and libnvsdm-570 packages:

        # Remove the current driver.
        sudo apt-get purge "*nvidia*570*"
        
        # Install the packages.
        sudo apt install -y doca-ofed
        sudo apt install -y nvidia-driver-570-open libnvidia-nscq-570 nvidia-modprobe nvidia-fabricmanager-570 nvlsm libnvsdm-570
        
      • On NVSwitch systems without fifth-generation NVLink, such as DGX A100, DGX A800, DGX H800, and DGX H100/H200, run the same commands, but append the nvidia-fabricmanager-570 package:

        # Remove the current driver.
        sudo apt-get purge "*nvidia*570*"
        
        # Install the packages.
        sudo apt install -y doca-ofed
        sudo apt install -y nvidia-driver-570-open libnvidia-nscq-570 nvidia-modprobe nvidia-fabricmanager-570
        
  3. Reboot the system to ensure that the new drivers are loaded:

    sudo reboot
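
After the reboot, you can confirm that the driver release meets the 570 minimum. The comparison below is a sketch using sort -V on a sample version string; on a live system, the installed version could be obtained with a query such as nvidia-smi --query-gpu=driver_version --format=csv,noheader (an assumption, not a command from this guide):

```shell
# Sample installed version (illustrative); minimum required release for DGX systems:
installed="570.86.15"
minimum="570"

# sort -V orders version strings numerically; if the minimum sorts first,
# the installed version is at least the minimum.
lowest=$(printf '%s\n%s\n' "$minimum" "$installed" | sort -V | head -n1)
if [ "$lowest" = "$minimum" ]; then
  echo "driver $installed meets the $minimum minimum"
fi
```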
    

Installing or Upgrading to a Newer CUDA Toolkit Release#

Only DGX Station and DGX Station A100 have a CUDA Toolkit release installed by default. DGX servers are intended to be shared resources that use containers and do not have CUDA Toolkit installed by default. However, you have the option to install a qualified CUDA Toolkit release.

Although DGX OS supports all CUDA Toolkit releases that interoperate with the installed driver, DGX OS releases include a default CUDA Toolkit release that might not be the most recent version. Unless you need features in a newer CUDA Toolkit version, we recommend that you remain on the default version included in the DGX OS release. Refer to the DGX OS Software Release Notes for the default CUDA Toolkit release.

Important

Before you install or upgrade to any CUDA Toolkit release, ensure the release is compatible with the driver that is installed on the system. Refer to CUDA Compatibility for more information and a compatibility matrix.

CUDA Compatibility Matrix and Forward Compatibility#

Each CUDA toolkit requires a minimum GPU driver version. This compatibility matrix is documented in CUDA Compatibility.

Newer CUDA Toolkits may be used with older GPU drivers if the appropriate forward compatibility package is installed. Refer to: Installing the CUDA Forward Compatibility Package.

Example:

CUDA Toolkit 12.0 requires GPU driver version 525.60.13 or later; however, the installed GPU driver is version 515.43.04. To use CUDA Toolkit 12.0 with the older GPU driver, you must install the cuda-compat-12-0 package:

sudo apt install cuda-compat-12-0

Then set LD_LIBRARY_PATH, either manually:

LD_LIBRARY_PATH=/usr/local/cuda/compat:$LD_LIBRARY_PATH

or automatically via the /etc/ld.so.conf file or by adding a file under /etc/ld.so.conf.d/.
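
For a single shell session, the manual setting can also be written so that it avoids a dangling colon when LD_LIBRARY_PATH was previously unset. This is a sketch of the same step, assuming the default compat directory location:

```shell
# Prepend the compat directory; the ${VAR:+...} expansion appends the old
# value only when LD_LIBRARY_PATH was already set, avoiding a trailing colon.
export LD_LIBRARY_PATH=/usr/local/cuda/compat${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
echo "$LD_LIBRARY_PATH"
```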

Checking the Currently Installed CUDA Toolkit Release#

Before you install a new CUDA Toolkit release, check the currently installed release by running the following command:

apt list --installed cuda-toolkit-*

The following example output shows that CUDA Toolkit 11.0 is installed:

apt list --installed cuda-toolkit-*
Listing... Done
cuda-toolkit-11-0/unknown,unknown,now 11.0.3-1 amd64 [installed]
N: There is 1 additional version. Please use the '-a' switch to see it
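
As a sketch, the toolkit version can be extracted from a line of that output using shell parameter expansion (sample data, not live output):

```shell
# Sample line from `apt list --installed cuda-toolkit-*` (illustrative only):
line="cuda-toolkit-11-0/unknown,unknown,now 11.0.3-1 amd64 [installed]"

pkg=${line%%/*}                    # package name: cuda-toolkit-11-0
version=${pkg#cuda-toolkit-}       # 11-0
echo "$version" | tr '-' '.'       # prints 11.0
```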

Installing or Upgrading the CUDA Toolkit#

To see the available CUDA Toolkit releases and install or upgrade one:

  1. Update the local database with the latest information from the Ubuntu repository.

    sudo apt update
    
  2. Show all available CUDA Toolkit releases.

    apt list cuda-toolkit-*
    

    The following example output shows that CUDA Toolkit versions 11.7, 11.8, and 12.0 are available to install:

    Listing... Done
    cuda-toolkit-11-7/unknown 11.7.1-1 amd64
    cuda-toolkit-11-8/unknown 11.8.0-1 amd64
    cuda-toolkit-12-0/unknown 12.0.0-1 amd64
    
  3. To install or upgrade the CUDA Toolkit, run the following:

    sudo apt install cuda-toolkit-<version>
    

Installing GPUDirect Storage Support#

NVIDIA Magnum IO GPUDirect Storage (GDS) enables a direct data path for direct memory access (DMA) transfers between GPU memory and storage, avoiding a bounce buffer through the CPU.

Note

This section only applies to the following situations:

  • You intend to use GPUDirect Storage on bare metal but do not use Linux PCI P2PDMA.

  • You intend to upgrade or reinstall the DOCA-OFED package and the nvidia-fs module because DGX OS 7 provides updated versions.

Installing GDS Components#

On DGX servers (DGX B200, H100/H200, H800, A100/A800, and GB200):

  • Install the nvidia-gds package.

    sudo apt install nvidia-gds
    

On DGX stations (DGX Station A800 and A100):

  • Install the nvidia-gds package.

    sudo apt update
    sudo apt install doca-repo -y
    sudo apt update
    
    sudo apt install nvidia-peermem-loader nvidia-gds mlnx-nvme-dkms mlnx-nfsrdma-dkms -y
    
    MODULE_VERSION=$(dkms status nvidia | cut -d "," -f1)
    sudo dkms remove -m ${MODULE_VERSION} -k $(uname -r) && sudo dkms install -m ${MODULE_VERSION} -k $(uname -r)
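
The MODULE_VERSION extraction in the last two commands takes the module/version pair from the first comma-separated field of the dkms status output. A sketch on a sample status line (the format is assumed; actual output depends on the DKMS version):

```shell
# Sample `dkms status nvidia` line (illustrative only):
status="nvidia/570.86.15, 6.8.0-50-generic, x86_64: installed"

MODULE_VERSION=$(echo "$status" | cut -d "," -f1)
echo "$MODULE_VERSION"   # prints nvidia/570.86.15
```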
    

Enabling Relaxed Ordering for NVMe Drives#

The Samsung NVMe drives used in NVIDIA DGX systems support relaxed ordering for I/O operations. Relaxed ordering enables the PCIe bus to complete transactions out of order. NVIDIA recommends enabling this setting to improve performance when you use GPUDirect Storage.

  • Run the nvidia-relaxed-ordering-nvme.sh utility.

    sudo /bin/nvidia-relaxed-ordering-nvme.sh enable
    

Configuring NVMe Interrupt Coalescing#

The nvidia-nvme-options package, which is installed on all DGX systems, automatically configures NVMe interrupt coalescing on all Samsung and Kioxia drives at each boot. To disable or manually configure this setting, use the following commands:

To disable the setting:

sudo systemctl stop nvidia-nvme-interrupt-coalescing.service
sudo systemctl disable nvidia-nvme-interrupt-coalescing.service

To configure the setting manually:

sudo /usr/bin/nvidia-nvme-interrupt-coalescing.sh enable
sudo /usr/bin/nvidia-nvme-interrupt-coalescing.sh disable

Next Steps#

Refer to Verifying a Successful GDS Installation in the NVIDIA GPUDirect Storage Installation and Troubleshooting Guide.