Managing OS and Software Updates#
DGX OS 7 is an optimized version of the Ubuntu 24.04 Linux distribution that provides access to an extensive collection of additional software available from the Ubuntu and NVIDIA repositories. For more information about additional software available from Ubuntu, refer to Install additional applications.
Before you install additional software or upgrade installed software, refer to the Release Notes
for the latest release information. To install the additional software, use the apt command or
the graphical tool. The graphical tool is only available for the DGX Station A100 systems.
In addition, you can change your GPU driver branch and upgrade to a different CUDA Toolkit release to maintain or optimize the OS for your DGX systems.
Upgrading the System#
Before installing any additional software, you should upgrade the system to the latest versions. This ensures you can access new software releases added to the repositories since your last upgrade. Refer to Upgrading the OS for more information and instructions, including instructions for enabling Ubuntu’s Extended Security Maintenance updates.
Note
Before upgrading your system, consult the Release Notes for the upgrade path and supported DGX systems.
You will only see the latest software branches after upgrading the DGX OS.
When you switch between software branches, such as the GPU driver or CUDA Toolkit, you must install the packages for the new branch. Depending on the software, installing the new branch either removes the existing branch or leaves it installed alongside the new one.
Changing Your GPU Driver Branch#
NVIDIA drivers are part of the CUDA repository. For more information about the NVIDIA driver release, refer to the release notes in NVIDIA Driver Documentation.
The DGX B200 system includes the fifth generation of NVIDIA NVLink® and the NVLink Switch technology.
With this version of NVLink, Base OS 7 includes additional packages, such as nvlsm and libnvsdm, that enable the full NVLink functionality. When you update the GPU driver, you must update the driver and the corresponding NVLink stack packages at the same time.
The steps for updating the DGX B200 system are included with the steps for the NVIDIA open GPU kernel modules in Upgrading Your GPU Driver Branch.
Checking the Currently Installed Driver Branch#
Before installing a new NVIDIA driver branch, run the following command to check the currently installed driver branch:
apt list --installed nvidia-driver*-open
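As a rough illustration of what "driver branch" means here, the following sketch extracts the branch number from one line of dpkg -l output. It reuses the same tr and cut pipeline that the GPU_BRANCH assignment in Upgrading Your GPU Driver Branch uses. The function name driver_branch_from_dpkg_line and the sample line are assumptions made for this example, and the version string on your system will differ.

```shell
#!/bin/sh
# Extract the driver branch (for example, 570) from one line of `dpkg -l` output.
# Assumes the usual dpkg -l column order: status, package name, version, ...
driver_branch_from_dpkg_line() {
    # Squeeze repeated spaces, take the version column, keep the part before the first dot.
    printf '%s\n' "$1" | tr -s " " | cut -d' ' -f3 | cut -d'.' -f1
}

# Hypothetical sample line, used only for illustration.
sample='ii  nvidia-driver-570-open  570.86.15-0ubuntu1  amd64  NVIDIA driver metapackage'
driver_branch_from_dpkg_line "$sample"
```

On a real system, the line would come from dpkg -l | grep nvidia-driver rather than from a hard-coded sample.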
Determining the New Available Driver Branches#
To see which new NVIDIA driver branches are available:
Update the local database with the latest information from the Ubuntu repository.
sudo apt update
Show the available NVIDIA open GPU kernel module branches.
apt list nvidia-driver-*-open
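When the list is long, it can help to pick out the highest branch automatically. The sketch below is one possible way to do that, assuming the apt list output uses the name/suite version arch layout. The function name latest_open_branch and the sample output are hypothetical.

```shell
#!/bin/sh
# Read `apt list nvidia-driver-*-open` output on stdin and print the highest branch number.
latest_open_branch() {
    # Keep only lines of the form nvidia-driver-<digits>-open/..., extract the digits,
    # then sort numerically and take the largest value.
    sed -n 's/^nvidia-driver-\([0-9][0-9]*\)-open\/.*/\1/p' | sort -n | tail -n 1
}

# Hypothetical sample output, used only for illustration.
sample_list='Listing... Done
nvidia-driver-570-open/unknown 570.86.15-0ubuntu1 amd64
nvidia-driver-580-open/unknown 580.65.06-0ubuntu1 amd64'
printf '%s\n' "$sample_list" | latest_open_branch
```

On a live system, the input would be piped from apt list nvidia-driver-*-open 2>/dev/null instead of the sample text.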
Upgrading Your GPU Driver Branch#
To manually upgrade your driver to the latest branch:
Install the latest kernel.
For x86_64 systems:
sudo apt install -y linux-generic
For ARM64 systems:
sudo apt install -y linux-nvidia-64k
Upgrade the NVIDIA GPU driver.
Note
From the apt install examples below, choose the command set appropriate for your environment. Replace the Release 570 GPU driver with the release family you want to install. For DGX systems, the installed GPU driver release must be 570 or greater. The DGX Spark and DGX GB300 require the Release 580 family of the NVIDIA open GPU kernel modules.
For the Release 580 family, the release branch has been removed from the names of the following packages:
Release 570                  Release 580
nvidia-fabricmanager-570     nvidia-fabricmanager
libnvidia-nscq-570           libnvidia-nscq
libnvsdm-570                 libnvsdm
nvidia-imex-570              nvidia-imex
To install the NVIDIA open GPU kernel modules of a different release family from the current GPU driver, specify the packages with the -open string, for example, nvidia-driver-570-open:
Note
In the following commands, the trailing - character in "*nvidia*${GPU_BRANCH}*-" specifies that the currently installed GPU driver will be removed in the same transaction. Because this operation removes packages from the system, perform a dry run first to confirm that the correct packages will be removed.
For non-NVSwitch systems, such as DGX Station A100, DGX Station A800, and DGX Spark, run the following commands:
DGX Station A100 and A800:
GPU_BRANCH=$(dpkg -l | grep nvidia-driver | tr -s " " | cut -d' ' -f3 | cut -d'.' -f1)
# Specify --dry-run to check the packages to install.
sudo apt install -y nvidia-driver-570-open libnvidia-nscq-570 nvidia-modprobe "*nvidia*${GPU_BRANCH}*-" --dry-run
# Install the packages.
sudo apt install -y nvidia-driver-570-open libnvidia-nscq-570 nvidia-modprobe "*nvidia*${GPU_BRANCH}*-"
DGX Spark:
GPU_BRANCH=$(dpkg -l | grep nvidia-driver | tr -s " " | cut -d' ' -f3 | cut -d'.' -f1)
# Specify --dry-run to check the packages to install.
sudo apt install -y nvidia-driver-580-open libnvidia-nscq nvidia-modprobe "*nvidia*${GPU_BRANCH}*-" --dry-run
# Install the packages.
sudo apt install -y nvidia-driver-580-open libnvidia-nscq nvidia-modprobe "*nvidia*${GPU_BRANCH}*-"
For multinode NVLink systems, such as DGX GB200 and DGX GB300, run the same commands using the Release 580 GPU driver, but append the nvidia-imex package:
GPU_BRANCH=$(dpkg -l | grep nvidia-driver | tr -s " " | cut -d' ' -f3 | cut -d'.' -f1)
# Specify --dry-run to check the packages to install.
sudo apt install -y nvidia-driver-580-open libnvidia-nscq nvidia-modprobe nvidia-imex "*nvidia*${GPU_BRANCH}*-" nvidia-imex*- --allow-change-held-packages --dry-run
# Install the packages.
sudo apt install -y nvidia-driver-580-open libnvidia-nscq nvidia-modprobe nvidia-imex "*nvidia*${GPU_BRANCH}*-" nvidia-imex*- --allow-change-held-packages
For NVSwitch systems with the fifth-generation NVLinks, such as DGX B200, run the same commands using the Release 570 GPU driver, but append the nvidia-fabricmanager-570, nvlsm, and libnvsdm-570 packages:
GPU_BRANCH=$(dpkg -l | grep nvidia-driver | tr -s " " | cut -d' ' -f3 | cut -d'.' -f1)
# Specify --dry-run to check the packages to install.
sudo apt install -y doca-ofed --dry-run
sudo apt install -y nvidia-driver-570-open libnvidia-nscq-570 nvidia-modprobe nvidia-fabricmanager-570 nvlsm libnvsdm-570 "*nvidia*${GPU_BRANCH}*-" libnvsdm*- --allow-change-held-packages --dry-run
# Install the packages.
sudo apt install -y doca-ofed
sudo apt install -y nvidia-driver-570-open libnvidia-nscq-570 nvidia-modprobe nvidia-fabricmanager-570 nvlsm libnvsdm-570 "*nvidia*${GPU_BRANCH}*-" libnvsdm*- --allow-change-held-packages
To install the Release 580 GPU driver:
GPU_BRANCH=$(dpkg -l | grep nvidia-driver | tr -s " " | cut -d' ' -f3 | cut -d'.' -f1)
# Specify --dry-run to check the packages to install.
sudo apt install -y doca-ofed --dry-run
sudo apt install -y nvidia-driver-580-open libnvidia-nscq nvidia-modprobe nvidia-fabricmanager nvlsm libnvsdm "*nvidia*${GPU_BRANCH}*-" libnvsdm*- --allow-change-held-packages --dry-run
# Install the packages.
sudo apt install -y doca-ofed
sudo apt install -y nvidia-driver-580-open libnvidia-nscq nvidia-modprobe nvidia-fabricmanager nvlsm libnvsdm "*nvidia*${GPU_BRANCH}*-" libnvsdm*- --allow-change-held-packages
For NVSwitch systems without the fifth-generation NVLinks, such as DGX A100, DGX A800, DGX H800, and DGX H100/H200, run the same commands using the Release 570 GPU driver, but append the nvidia-fabricmanager-570 package:
GPU_BRANCH=$(dpkg -l | grep nvidia-driver | tr -s " " | cut -d' ' -f3 | cut -d'.' -f1)
# Specify --dry-run to check the packages to install.
sudo apt install -y doca-ofed --dry-run
sudo apt install -y nvidia-driver-570-open libnvidia-nscq-570 nvidia-modprobe nvidia-fabricmanager-570 "*nvidia*${GPU_BRANCH}*-" --dry-run
# Install the packages.
sudo apt install -y doca-ofed
sudo apt install -y nvidia-driver-570-open libnvidia-nscq-570 nvidia-modprobe nvidia-fabricmanager-570 "*nvidia*${GPU_BRANCH}*-"
To install the NVIDIA open GPU kernel modules of the same release family as the current GPU driver, such as Release 570, use the commands for your system type:
For non-NVSwitch systems, such as DGX Station A100, DGX Station A800, and DGX Spark, first remove the current driver and then install the package:
DGX Station A100 and A800:
# Remove the current driver.
sudo apt-get purge "*nvidia*570*"
# Install the packages.
sudo apt install -y nvidia-driver-570-open libnvidia-nscq-570 nvidia-modprobe
DGX Spark:
# Remove the current driver.
sudo apt-get purge "*nvidia*580*"
# Install the packages.
sudo apt install -y nvidia-driver-580-open libnvidia-nscq nvidia-modprobe
For multinode NVLink systems, such as DGX GB200 and DGX GB300, run the same commands using the Release 580 GPU driver, but append the nvidia-imex package:
# Remove the current driver.
sudo apt-get purge "*nvidia*580*"
# Install the packages.
sudo apt install -y nvidia-driver-580-open libnvidia-nscq nvidia-modprobe nvidia-imex
For NVSwitch systems with the fifth-generation NVLinks, such as DGX B200, run the same commands using the Release 570 GPU driver, but append the nvidia-fabricmanager-570, nvlsm, and libnvsdm-570 packages:
# Remove the current driver.
sudo apt-get purge "*nvidia*570*"
# Install the packages.
sudo apt install -y doca-ofed
sudo apt install -y nvidia-driver-570-open libnvidia-nscq-570 nvidia-modprobe nvidia-fabricmanager-570 nvlsm libnvsdm-570
For NVSwitch systems without the fifth-generation NVLinks, such as DGX A100, DGX A800, DGX H800, and DGX H100/H200, run the same commands using the Release 570 GPU driver, but append the nvidia-fabricmanager-570 package:
# Remove the current driver.
sudo apt-get purge "*nvidia*570*"
# Install the packages.
sudo apt install -y doca-ofed
sudo apt install -y nvidia-driver-570-open libnvidia-nscq-570 nvidia-modprobe nvidia-fabricmanager-570
Reboot the system to ensure the new drivers get loaded:
sudo reboot
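The package-naming change between the Release 570 and Release 580 families, described in the note in this section, is easy to get wrong when adapting the commands. The sketch below shows one way to derive the package name from a base name and a release family. The function name branch_pkg is hypothetical. The rule that families later than 580 also drop the suffix is an assumption, because this document confirms the change only for 570 and 580. The rule applies only to the four packages listed in the note; the driver package itself keeps its branch, as in nvidia-driver-580-open.

```shell
#!/bin/sh
# Print the package name for a base name (for example, nvidia-fabricmanager)
# and a release family (for example, 570 or 580).
# Assumption: families below 580 carry a -<branch> suffix and 580 or later do not.
branch_pkg() {
    base=$1
    branch=$2
    if [ "$branch" -ge 580 ]; then
        printf '%s\n' "$base"
    else
        printf '%s-%s\n' "$base" "$branch"
    fi
}

branch_pkg nvidia-fabricmanager 570
branch_pkg libnvidia-nscq 580
```

A helper like this only prints names. It is still worth running the apt install command with --dry-run, as the steps above recommend, before installing anything.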
Installing or Upgrading to a Newer CUDA Toolkit Release#
Only DGX Station and DGX Station A100 have a CUDA Toolkit release installed by default. DGX servers are intended to be shared resources that use containers and do not have CUDA Toolkit installed by default. However, you have the option to install a qualified CUDA Toolkit release.
Although the DGX OS supports all CUDA Toolkit releases that interoperate with the installed driver, the default CUDA Toolkit release included in a DGX OS release might not be the most recent version. Unless you need features that are available only in a newer CUDA Toolkit version, we recommend that you remain on the default version that is included in the DGX OS release. Refer to the DGX OS Software Release Notes for the default CUDA Toolkit release.
Important
Before you install or upgrade to any CUDA Toolkit release, ensure the release is compatible with the driver that is installed on the system. Refer to CUDA Compatibility for more information and a compatibility matrix.
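The compatibility check amounts to comparing the installed driver version with the minimum version that the toolkit requires. The following sketch shows one way to script that comparison, assuming GNU sort with the -V (version sort) option is available, as it is on Ubuntu. The function name version_ge and the installed version shown are hypothetical; the required value 570.86.15 for CUDA Toolkit 12.8 is taken from the example later in this document. Always confirm minimum versions against the CUDA Compatibility matrix.

```shell
#!/bin/sh
# Return success when version $1 is greater than or equal to version $2.
# Relies on GNU sort -V: the smaller of the two versions sorts first.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

required=570.86.15   # minimum driver for CUDA Toolkit 12.8, per this document
installed=550.54.14  # hypothetical installed driver version

if version_ge "$installed" "$required"; then
    echo "driver is new enough"
else
    echo "driver is too old for this toolkit"
fi
```

On a system with a loaded driver, the installed value could be read with nvidia-smi --query-gpu=driver_version --format=csv,noheader instead of being hard-coded.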
CUDA Compatibility Matrix and Forward Compatibility#
Each CUDA toolkit requires a minimum GPU driver version. This compatibility matrix is documented in CUDA Compatibility.
Newer CUDA Toolkits may be used with older GPU drivers if the appropriate forward compatibility package is installed. Refer to: Installing the CUDA Forward Compatibility Package.
Example:
CUDA Toolkit 12.8 requires GPU driver version 570.86.15; however, the installed GPU driver is the Release 550 GPU driver. To use CUDA Toolkit 12.8 with the older GPU driver, you must install the cuda-compat-12-8 forward compatibility package:
sudo apt install cuda-compat-12-8
You can set the LD_LIBRARY_PATH manually:
export LD_LIBRARY_PATH=/usr/local/cuda/compat:$LD_LIBRARY_PATH
Alternatively, you can configure it automatically by modifying the /etc/ld.so.conf file or
by adding a file under the /etc/ld.so.conf.d/ directory.
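If you set the path manually in a shell startup file, the compat directory can end up listed several times, and an unset LD_LIBRARY_PATH leaves a trailing colon, which the dynamic loader treats as the current directory. The sketch below is one possible way to avoid both problems. The function name prepend_once is hypothetical; the directory is the one used in the example above.

```shell
#!/bin/sh
# Prepend a directory to a colon-separated path list unless it is already present.
prepend_once() {
    dir=$1
    list=$2
    case ":$list:" in
        *":$dir:"*) printf '%s\n' "$list" ;;            # already listed: leave unchanged
        *) printf '%s\n' "${dir}${list:+:$list}" ;;     # add a colon only if list is non-empty
    esac
}

compat_dir=/usr/local/cuda/compat
LD_LIBRARY_PATH=$(prepend_once "$compat_dir" "${LD_LIBRARY_PATH:-}")
export LD_LIBRARY_PATH
echo "$LD_LIBRARY_PATH"
```

The system-wide alternative remains a one-line file under /etc/ld.so.conf.d/ that contains the directory, followed by running sudo ldconfig.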
Checking the Currently Installed CUDA Toolkit Release#
Before installing a new CUDA Toolkit release, run the following command to check the currently installed release:
apt list --installed cuda-toolkit-*
The following example output shows that CUDA Toolkit 12.8 is installed:
$ apt list --installed cuda-toolkit-*
Listing... Done
cuda-toolkit-12-8/unknown,now 12.8.1-1 amd64 [installed]
N: There is 1 additional version. Please use the '-a' switch to see it
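To use the installed release in a script, the package name can be converted to a MAJOR.MINOR string. The sketch below assumes the output layout shown in the example above. The function name installed_cuda_releases is hypothetical.

```shell
#!/bin/sh
# Read `apt list --installed cuda-toolkit-*` output on stdin and
# print each installed release as MAJOR.MINOR (for example, 12.8).
installed_cuda_releases() {
    # Match only names of the form cuda-toolkit-<major>-<minor>/...
    sed -n 's/^cuda-toolkit-\([0-9][0-9]*\)-\([0-9][0-9]*\)\/.*/\1.\2/p'
}

# Sample output copied from the example above.
sample='Listing... Done
cuda-toolkit-12-8/unknown,now 12.8.1-1 amd64 [installed]'
printf '%s\n' "$sample" | installed_cuda_releases
```

Lines that do not match the cuda-toolkit-MAJOR-MINOR pattern, such as the Listing... Done header, are ignored.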
Installing or Upgrading the CUDA Toolkit#
To see which CUDA Toolkit releases are available:
Update the local database with the latest information from the Ubuntu repository.
sudo apt update
Show all available CUDA Toolkit releases.
apt list cuda-toolkit-*
The following output shows that 11.8, 12.0, 12.1, and 12.2 are the possible CUDA Toolkit versions that can be installed:
Listing... Done
cuda-toolkit-11-8/unknown 11.8.0-1 amd64
cuda-toolkit-12-0/unknown 12.0.0-1 amd64
cuda-toolkit-12-1/unknown 12.1.0-1 amd64
cuda-toolkit-12-2/unknown 12.2.0-1 amd64
To install or upgrade the CUDA Toolkit, run the following:
sudo apt install cuda-toolkit-<version>
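As the listings in this section show, the version in the package name uses a hyphen instead of a dot, for example cuda-toolkit-12-8 for CUDA Toolkit 12.8. A minimal sketch of that conversion follows. The function name cuda_toolkit_pkg is hypothetical.

```shell
#!/bin/sh
# Convert a MAJOR.MINOR release (for example, 12.8) to its package name.
cuda_toolkit_pkg() {
    printf 'cuda-toolkit-%s\n' "$(printf '%s' "$1" | tr '.' '-')"
}

cuda_toolkit_pkg 12.8
```

The printed name is what replaces cuda-toolkit-<version> in the install command above.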
Installing the Latest DOCA-OFED Package#
The NVIDIA DOCA™ OFED software provides the same functionality as MLNX_OFED, including kernel drivers, user space libraries, and management tools for NVIDIA networking products. For more information about DOCA-OFED, refer to the What IS DOCA-OFED section in MLNX_OFED to DOCA-OFED Transition Guide. For installation information, refer to NVIDIA DOCA Installation Guide for Linux.
To install the latest version of NVIDIA DOCA-OFED software:
Uninstall any older versions of DOCA-OFED (or MLNX_OFED) software from your system before proceeding.
for f in $( dpkg --list | grep -E 'doca|flexio|dpa-gdbserver|dpa-stats|dpa-resource-mgmt|dpaeumgmt' | awk '{print $2}' ); do echo $f ; sudo apt remove --purge $f -y ; done
sudo /usr/sbin/ofed_uninstall.sh --force
sudo apt-get autoremove
Add the NVIDIA DOCA-OFED repository to your system.
For x86_64-based DGX systems:
$ sudo dd status=none of=/etc/apt/sources.list.d/doca.sources << EOF
Types: deb
URIs: https://linux.mellanox.com/public/repo/doca/baseos8-latest/ubuntu24.04/x86_64/
Suites: /
Signed-By: /usr/share/keyrings/GPG-KEY-Mellanox.gpg
EOF
$ sudo apt update
$ sudo apt install doca-ofed -y
For ARM64-based DGX systems:
$ sudo dd status=none of=/etc/apt/sources.list.d/doca.sources << EOF
Types: deb
URIs: https://linux.mellanox.com/public/repo/doca/baseos8-latest/ubuntu24.04/arm64-sbsa/
Suites: /
Signed-By: /usr/share/keyrings/GPG-KEY-Mellanox.gpg
EOF
$ sudo apt update
$ sudo apt install doca-ofed -y
Note
This sequence sets up the repository for the most recent version of NVIDIA DOCA-OFED. If you need a specific version, replace baseos8-latest with the version. For example, to get version 3.0.0-058218 for an x86_64 system, use the https://linux.mellanox.com/public/repo/doca/3.0.0-058218/ubuntu24.04/x86_64/ URI.
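The repository URI differs only in the version segment and the architecture directory, so it can be assembled from those two values. The sketch below shows one way to do that, based on the URIs shown in this section. The function name doca_repo_uri is hypothetical, and the mapping of aarch64 to the arm64-sbsa directory is an assumption based on the ARM64 example above.

```shell
#!/bin/sh
# Build the DOCA-OFED repository URI for a version (or baseos8-latest)
# and a machine type as reported by `uname -m`.
doca_repo_uri() {
    version=$1
    machine=$2
    case "$machine" in
        x86_64) arch_dir=x86_64 ;;
        aarch64|arm64) arch_dir=arm64-sbsa ;;   # assumed mapping for ARM64-based DGX systems
        *) echo "unsupported architecture: $machine" >&2; return 1 ;;
    esac
    printf 'https://linux.mellanox.com/public/repo/doca/%s/ubuntu24.04/%s/\n' "$version" "$arch_dir"
}

doca_repo_uri 3.0.0-058218 x86_64
doca_repo_uri baseos8-latest aarch64
```

Passing "$(uname -m)" as the second argument selects the directory for the current machine. The result is the value for the URIs: field of the doca.sources file.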
Installing GPUDirect Storage Support#
NVIDIA Magnum IO GPUDirect Storage (GDS) enables a direct data path for direct memory access (DMA) transfers between GPU memory and storage. This software avoids a bounce buffer through the CPU.
Note
This section applies only to the following situations:
You intend to use GPUDirect Storage on bare metal but do not use Linux PCI P2PDMA.
You intend to upgrade or reinstall the DOCA-OFED package and the nvidia-fs module because DGX OS 7 provides the updated versions.
Installing GDS Components#
On DGX servers (DGX B200, H100/H200, H800, A100/A800, GB200, and GB300):
Install the nvidia-gds package.
sudo apt install nvidia-gds
On DGX stations (DGX Station A800 and A100):
Install the nvidia-gds package.
sudo apt update
sudo apt install doca-repo -y
sudo apt update
sudo apt install nvidia-peermem-loader nvidia-gds mlnx-nvme-dkms mlnx-nfsrdma-dkms -y
MODULE_VERSION=$(dkms status nvidia | cut -d "," -f1)
sudo dkms remove -m ${MODULE_VERSION} -k $(uname -r) && sudo dkms install -m ${MODULE_VERSION} -k $(uname -r)
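The MODULE_VERSION assignment above takes the first comma-separated field of each dkms status line. The sketch below illustrates what that field looks like and removes duplicates when the module is built for more than one kernel. The sample text assumes the module/version, kernel, arch: state layout, which might differ between dkms versions. The function name dkms_module_ids and the version numbers are hypothetical.

```shell
#!/bin/sh
# Read `dkms status nvidia` output on stdin and print each distinct
# first field (module/version) once.
dkms_module_ids() {
    cut -d ',' -f1 | sort -u
}

# Hypothetical sample: the same module built for two kernels.
sample='nvidia/570.86.15, 6.8.0-1001-nvidia, x86_64: installed
nvidia/570.86.15, 6.8.0-1002-nvidia, x86_64: installed'
printf '%s\n' "$sample" | dkms_module_ids
```

If this prints more than one line on your system, more than one module version is registered, and the value to pass to dkms remove and dkms install should be chosen explicitly.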
Enabling Relaxed Ordering for NVMe Drives#
The Samsung NVMe drives used in the NVIDIA DGX systems support relaxed ordering for I/O operations. Relaxed ordering enables the PCIe bus to complete transactions out of order. NVIDIA recommends enabling this setting when you use GPUDirect Storage to improve performance.
Run the nvidia-relaxed-ordering-nvme.sh utility.
sudo /bin/nvidia-relaxed-ordering-nvme.sh enable
Configuring NVMe Interrupt Coalescing#
The nvidia-nvme-options package, which is installed on all DGX systems, automatically configures
NVMe interrupt coalescing on all Samsung and Kioxia drives at each boot. To disable this setting
or manually configure the setting, issue the following commands:
To disable the setting:
sudo systemctl stop nvidia-nvme-interrupt-coalescing.service
sudo systemctl disable nvidia-nvme-interrupt-coalescing.service
To configure the setting manually:
sudo /usr/bin/nvidia-nvme-interrupt-coalescing.sh enable
sudo /usr/bin/nvidia-nvme-interrupt-coalescing.sh disable
Next Steps#
Refer to Verifying a Successful GDS Installation in the NVIDIA GPUDirect Storage Installation and Troubleshooting Guide.