Managing OS and Software Updates#

DGX OS 7 is an optimized version of the Ubuntu 24.04 Linux distribution that provides access to an extensive collection of additional software available from the Ubuntu and NVIDIA repositories. For more information about additional software available from Ubuntu, refer to Install additional applications.

Before you install additional software or upgrade installed software, refer to the Release Notes for the latest release information. To install the additional software, use the apt command or the graphical tool. The graphical tool is only available for the DGX Station A100 systems.

In addition, you can change your GPU driver branch and upgrade to a different CUDA Toolkit release to maintain or optimize the OS for your DGX systems.

Upgrading the System#

Before installing any additional software, you should upgrade the system to the latest versions. This ensures you can access new software releases added to the repositories since your last upgrade. Refer to Upgrading the OS for more information and instructions, including instructions for enabling Ubuntu’s Extended Security Maintenance updates.

Note

  • Before upgrading your system, consult the Release Notes for the upgrade path and supported DGX systems.

  • You will only see the latest software branches after upgrading the DGX OS.

  • When you switch between software branches, such as the GPU driver or CUDA Toolkit, you must install the packages for the new branch. Depending on the software, installing the new branch either removes the existing branch or leaves it in place so that both branches are installed concurrently.

Changing Your GPU Driver Branch#

NVIDIA drivers are part of the CUDA repository. For more information about the NVIDIA driver release, refer to the release notes in NVIDIA Driver Documentation.

The DGX B300 and DGX B200 systems include the fifth generation of NVIDIA NVLink® and the NVLink Switch technology. To enable the full functionality of this NVLink version, Base OS 7 includes additional packages, such as nvlsm and libnvsdm. When you update the GPU driver, you must update the driver and the corresponding NVLink stack packages at the same time. The steps for updating DGX B300 and DGX B200 systems are included in the NVIDIA open GPU kernel module instructions in Upgrading Your GPU Driver Branch.

Checking the Currently Installed Driver Branch#

Before installing a new NVIDIA driver branch, run the following command to check the currently installed driver branch:

apt list --installed nvidia-driver*-open
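
The branch number can also be extracted from that listing with a small filter. The following is an illustrative sketch only: the helper name installed_driver_branches is hypothetical, and it assumes that apt list prints one "package/suite version arch [status]" line per package.

```shell
# Hypothetical helper: read `apt list` output on stdin and print the
# branch number of each nvidia-driver-<branch>-open package it finds.
installed_driver_branches() {
  sed -n 's|^nvidia-driver-\([0-9][0-9]*\)-open/.*|\1|p' | sort -un
}
```

On a DGX system, the helper could be fed with apt list --installed 'nvidia-driver*-open' 2>/dev/null | installed_driver_branches. Quoting the pattern keeps the shell from expanding it against local file names.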

Determining the New Available Driver Branches#

To see which new NVIDIA driver branches are available:

  1. Update the local package database with the latest information from the configured repositories.

    sudo apt update
    
  2. Show the available NVIDIA open GPU kernel module branches.

    apt list nvidia-driver-*-open
    

    Note

    For current supported driver branches, consult the Supported Drivers and CUDA Toolkit Versions page.
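
To narrow the listing from step 2 to branches newer than the one already installed, a filter along the following lines might help. This is a sketch under assumptions: newer_driver_branches is a hypothetical helper, and the apt list line format is assumed to be "package/suite version arch". It does not indicate whether a branch is supported on a given DGX system.

```shell
# Hypothetical helper: read `apt list` output on stdin and print every
# nvidia-driver-<branch>-open branch that is newer than the branch in $1.
newer_driver_branches() {
  sed -n 's|^nvidia-driver-\([0-9][0-9]*\)-open/.*|\1|p' |
    sort -un |
    awk -v current="$1" '$1 + 0 > current + 0'
}
```

For example, apt list 'nvidia-driver-*-open' 2>/dev/null | newer_driver_branches 570 would print only the branch numbers above 570.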

Upgrading Your GPU Driver Branch#

To manually upgrade your driver to the latest branch:

  1. Install the latest kernel.

    • For x86_64 systems:

      sudo apt install -y linux-generic
      
    • For ARM64 systems:

      sudo apt install -y linux-nvidia-64k-hwe-24.04
      
  2. Install the driver pinning package.

    GPU driver 580 and newer releases have introduced significant packaging changes. When installing one of these versions, a new pinning package must be installed before installing the driver to ensure that installed versions of all related packages are consistent.

    The pinning package is named nvidia-driver-pinning-xxx, where xxx is the driver version. For example, before installing the 580 family driver, install nvidia-driver-pinning-580.

    sudo apt install nvidia-driver-pinning-580
    

    After the driver pinning package is installed, proceed with installing the driver packages.

  3. Upgrade the NVIDIA GPU driver.

    Note

    • From the apt install examples below, choose the command set appropriate for your environment. Replace 580 in the package names with the release family you want to install. For DGX systems, the installed GPU driver release must be 580 or later.

    • The DGX B300, DGX Spark, and DGX GB300 require the Release 580 family of the NVIDIA open GPU kernel modules.

    • For the Release 580 family, the release branch has been removed from the names of the following packages:

      Release 570               Release 580
      ------------------------  --------------------
      nvidia-fabricmanager-570  nvidia-fabricmanager
      libnvidia-nscq-570        libnvidia-nscq
      libnvsdm-570              libnvsdm
      nvidia-imex-570           nvidia-imex

    • To install the NVIDIA open GPU kernel modules of a different release family from the current GPU driver, specify the packages with the -open string, for example, nvidia-driver-580-open:

      Note

      In the following commands, the trailing - character in "*nvidia*${GPU_BRANCH}*-" specifies that the packages of the currently installed GPU driver branch are removed in the same transaction. Because this operation removes packages from the system, perform a dry run first to confirm that the correct packages will be removed.

      GPU_BRANCH=$(dpkg -l | grep -P 'nvidia-driver-(?!pinning-)\d+(-open)?' \
        | tr -s " " | cut -d' ' -f3 | cut -d'.' -f1)
      
      # Specify --dry-run to check the packages to install.
      sudo apt install -y nvidia-driver-580-open libnvidia-nscq nvidia-modprobe \
        "*nvidia*${GPU_BRANCH}*-" -o DPkg::options::="--force-overwrite" --dry-run
      
      # Install the packages.
      sudo apt install -y nvidia-driver-580-open libnvidia-nscq nvidia-modprobe \
        "*nvidia*${GPU_BRANCH}*-" -o DPkg::options::="--force-overwrite"
      
      GPU_BRANCH=$(dpkg -l | grep -P 'nvidia-driver-(?!pinning-)\d+(-open)?' \
        | tr -s " " | cut -d' ' -f3 | cut -d'.' -f1)
      
      # Specify --dry-run to check the packages to install.
      sudo apt install -y nvidia-driver-580-open nvidia-modprobe \
        "*nvidia*${GPU_BRANCH}*-" --dry-run
      
      # Install the packages.
      sudo apt install -y nvidia-driver-580-open nvidia-modprobe \
        "*nvidia*${GPU_BRANCH}*-"
      

      Append the nvidia-imex package:

      GPU_BRANCH=$(dpkg -l | grep -P 'nvidia-driver-(?!pinning-)\d+(-open)?' \
        | tr -s " " | cut -d' ' -f3 | cut -d'.' -f1)
      
      # Specify --dry-run to check the packages to install.
      sudo apt install -y nvidia-driver-580-open libnvidia-nscq nvidia-modprobe \
        nvidia-imex "*nvidia*${GPU_BRANCH}*-" nvidia-imex*- \
        --allow-change-held-packages --dry-run
      
      # Install the packages.
      sudo apt install -y nvidia-driver-580-open libnvidia-nscq nvidia-modprobe \
        nvidia-imex "*nvidia*${GPU_BRANCH}*-" nvidia-imex*- \
        --allow-change-held-packages
      

      Append the nvidia-fabricmanager, nvlsm, and libnvsdm packages:

      GPU_BRANCH=$(dpkg -l | grep -P 'nvidia-driver-(?!pinning-)\d+(-open)?' \
        | tr -s " " | cut -d' ' -f3 | cut -d'.' -f1)
      
      # Specify --dry-run to check the packages to install.
      sudo apt install -y doca-ofed --dry-run
      sudo apt install -y nvidia-driver-580-open libnvidia-nscq nvidia-modprobe \
        nvidia-fabricmanager nvlsm libnvsdm "*nvidia*${GPU_BRANCH}*-" libnvsdm*- \
        --allow-change-held-packages --dry-run
      
      # Install the packages.
      sudo apt install -y doca-ofed
      sudo apt install -y nvidia-driver-580-open libnvidia-nscq nvidia-modprobe \
        nvidia-fabricmanager nvlsm libnvsdm "*nvidia*${GPU_BRANCH}*-" libnvsdm*- \
        --allow-change-held-packages
      

      Append the nvidia-fabricmanager package:

      GPU_BRANCH=$(dpkg -l | grep -P 'nvidia-driver-(?!pinning-)\d+(-open)?' \
        | tr -s " " | cut -d' ' -f3 | cut -d'.' -f1)
      
      # Specify --dry-run to check the packages to install.
      sudo apt install -y doca-ofed --dry-run
      sudo apt install -y nvidia-driver-580-open libnvidia-nscq nvidia-modprobe \
        nvidia-fabricmanager "*nvidia*${GPU_BRANCH}*-" --dry-run
      
      # Install the packages.
      sudo apt install -y doca-ofed
      sudo apt install -y nvidia-driver-580-open libnvidia-nscq nvidia-modprobe \
        nvidia-fabricmanager "*nvidia*${GPU_BRANCH}*-"
      
    • To install the NVIDIA open GPU kernel modules of the same release family as the current GPU driver, such as Release 580, first remove the current driver and then install the packages:

      # Remove the current driver.
      sudo apt-get purge "*nvidia*580*"
      
      # Install the packages.
      sudo apt install -y nvidia-driver-580-open libnvidia-nscq nvidia-modprobe
      

      Append the nvidia-imex package:

      # Remove the current driver.
      sudo apt-get purge "*nvidia*580*"
      
      # Install the packages.
      sudo apt install -y nvidia-driver-580-open libnvidia-nscq nvidia-modprobe \
        nvidia-imex
      

      Append the nvidia-fabricmanager, nvlsm, and libnvsdm packages:

      # Remove the current driver.
      sudo apt-get purge "*nvidia*580*"
      
      # Install the packages.
      sudo apt install -y doca-ofed
      sudo apt install -y nvidia-driver-580-open libnvidia-nscq nvidia-modprobe \
        nvidia-fabricmanager nvlsm libnvsdm
      

      Append the nvidia-fabricmanager package:

      # Remove the current driver.
      sudo apt-get purge "*nvidia*580*"
      
      # Install the packages.
      sudo apt install -y doca-ofed
      sudo apt install -y nvidia-driver-580-open libnvidia-nscq nvidia-modprobe \
        nvidia-fabricmanager
      
  4. Reboot the system to ensure the new drivers get loaded:

    sudo reboot
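
The branch detection and the pinning package naming used in the steps above can be sketched as two small functions. This is an illustration, not part of the documented procedure: the function names are hypothetical, the awk filter is a portable stand-in for the grep -P pipeline shown earlier, and the dpkg -l column layout (status, name, version) is assumed.

```shell
# Read `dpkg -l` output on stdin and print the branch (major version) of the
# installed nvidia-driver-<branch>[-open] package, skipping pinning packages.
detect_gpu_branch() {
  awk '{ name = $2; sub(/:.*/, "", name) }   # drop any :arch suffix
       name ~ /^nvidia-driver-[0-9]+(-open)?$/ {
         split($3, version, ".")            # for example 570.124.06-0ubuntu1
         print version[1]
       }' | sort -u
}

# Build the pinning package name for a driver branch such as 580.
pinning_package() {
  printf 'nvidia-driver-pinning-%s\n' "$1"
}
```

With a single driver branch installed, dpkg -l | detect_gpu_branch should print the same value that the GPU_BRANCH assignment produces, and pinning_package 580 prints nvidia-driver-pinning-580.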
    

Installing or Upgrading to a Newer CUDA Toolkit Release#

Only DGX Station and DGX Station A100 have a CUDA Toolkit release installed by default. DGX servers are intended to be shared resources that use containers and do not have CUDA Toolkit installed by default. However, you have the option to install a qualified CUDA Toolkit release.

Although DGX OS supports all CUDA Toolkit releases that interoperate with the installed driver, a DGX OS release might include a default CUDA Toolkit release that is not the most recent version. Unless you need features that are only available in a newer CUDA Toolkit version, we recommend that you remain on the default version included in the DGX OS release. Refer to the DGX OS Software Release Notes for the default CUDA Toolkit release.

Important

Before you install or upgrade to any CUDA Toolkit release, ensure the release is compatible with the driver that is installed on the system. Refer to CUDA Compatibility for more information and a compatibility matrix.

CUDA Compatibility Matrix and Forward Compatibility#

Each CUDA Toolkit release requires a minimum GPU driver version. The compatibility matrix is documented in CUDA Compatibility.
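
The minimum-version rule amounts to a component-by-component comparison of dotted version strings. The following is a minimal sketch, not an NVIDIA tool: version_lt is a hypothetical helper, and it assumes purely numeric, dot-separated versions such as 570.86.15.

```shell
# Hypothetical helper: succeed (exit 0) when version $1 is strictly older
# than version $2, comparing the dot-separated components numerically.
version_lt() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    na = split(a, x, "."); nb = split(b, y, ".")
    n = (na > nb) ? na : nb
    for (i = 1; i <= n; i++) {
      xi = x[i] + 0; yi = y[i] + 0     # missing components count as 0
      if (xi < yi) exit 0
      if (xi > yi) exit 1
    }
    exit 1                             # equal versions are not "older"
  }'
}
```

For example, version_lt 550.54.15 570.86.15 succeeds, which means a driver at the illustrative version 550.54.15 is below a 570.86.15 minimum. The installed driver version can be read with nvidia-smi --query-gpu=driver_version --format=csv,noheader.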

Newer CUDA Toolkits may be used with older GPU drivers if the appropriate forward compatibility package is installed. Refer to Installing the CUDA Forward Compatibility Package.

Example:

CUDA Toolkit 12.8 requires GPU driver version 570.86.15; however, the installed GPU driver is the Release 550 GPU driver. To use CUDA Toolkit 12.8 with the older GPU driver, you must install the cuda-compat-12-8 forward compatibility package:

sudo apt install cuda-compat-12-8

You can set LD_LIBRARY_PATH manually for the current shell session:

export LD_LIBRARY_PATH=/usr/local/cuda/compat:$LD_LIBRARY_PATH

Alternatively, you can configure it automatically by modifying the /etc/ld.so.conf file or by adding a file under the /etc/ld.so.conf.d/ directory.
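
A minimal sketch of the /etc/ld.so.conf.d/ option follows. The file name cuda-compat.conf and the helper write_compat_conf are assumptions made for illustration; the target directory is a parameter so that the helper can be tried against a scratch directory before touching /etc.

```shell
# Hypothetical helper: write a linker configuration file that adds the CUDA
# forward compatibility directory to the library search path.
# $1 is the target directory, for example /etc/ld.so.conf.d on a real system.
write_compat_conf() {
  conf_dir="$1"
  printf '%s\n' /usr/local/cuda/compat > "${conf_dir}/cuda-compat.conf"
}
```

On a real system, the file must be created as root in /etc/ld.so.conf.d, and sudo ldconfig must be run afterward so that the dynamic linker cache picks up the new directory.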

Checking the Currently Installed CUDA Toolkit Release#

Before installing a new CUDA Toolkit release, run the following command to check the currently installed release:

apt list --installed cuda-toolkit-*

The following example output shows that CUDA Toolkit 12.8 is installed:

$ apt list --installed cuda-toolkit-*
Listing... Done
cuda-toolkit-12-8/unknown,now 12.8.1-1 amd64 [installed]
N: There is 1 additional version. Please use the '-a' switch to see it
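
If the installed release is needed in dotted form, for example to look it up in the compatibility matrix, the listing can be filtered as sketched below. The helper name installed_cuda_versions is hypothetical, and the line format is assumed to match the example output above.

```shell
# Hypothetical helper: read `apt list` output on stdin and print each
# cuda-toolkit-<major>-<minor> package as <major>.<minor>.
installed_cuda_versions() {
  sed -n 's|^cuda-toolkit-\([0-9][0-9]*\)-\([0-9][0-9]*\)/.*|\1.\2|p'
}
```

A possible invocation is apt list --installed 'cuda-toolkit-*' 2>/dev/null | installed_cuda_versions.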

Installing or Upgrading the CUDA Toolkit#

To see which CUDA Toolkit releases are available and to install one:

  1. Update the local package database with the latest information from the configured repositories.

    sudo apt update
    
  2. Show all available CUDA Toolkit releases.

    apt list cuda-toolkit-*
    

    The following output shows that 11.8, 12.0, 12.1, and 12.2 are the possible CUDA Toolkit versions that can be installed:

    Listing... Done
    cuda-toolkit-11-8/unknown 11.8.0-1 amd64
    cuda-toolkit-12-0/unknown 12.0.0-1 amd64
    cuda-toolkit-12-1/unknown 12.1.0-1 amd64
    cuda-toolkit-12-2/unknown 12.2.0-1 amd64
    
  3. To install or upgrade the CUDA Toolkit, run the following command, replacing <version> with a release from the list in major-minor form (for example, 12-2):

    sudo apt install cuda-toolkit-<version>
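
The package names in the listing use a hyphen between the major and minor version. As a small illustration of that naming rule, the hypothetical helper below converts a dotted version into the corresponding package name and rejects anything that is not in MAJOR.MINOR form.

```shell
# Hypothetical helper: map a dotted CUDA version such as 12.2 to the
# package name cuda-toolkit-12-2.
cuda_toolkit_package() {
  case "$1" in
    "" | *[!0-9.]* | *.*.* | .* | *.)
      echo "expected MAJOR.MINOR, got '$1'" >&2
      return 1 ;;
    *.*)
      printf 'cuda-toolkit-%s\n' "$(printf '%s' "$1" | tr . -)" ;;
    *)
      echo "expected MAJOR.MINOR, got '$1'" >&2
      return 1 ;;
  esac
}
```

For example, cuda_toolkit_package 12.2 prints cuda-toolkit-12-2, which can then be passed to sudo apt install.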
    

DOCA-OFED Installation and Configuration#

The NVIDIA DOCA™ OFED software provides the same functionality as MLNX_OFED, including kernel drivers, user space libraries, and management tools for NVIDIA networking products. For more information about DOCA-OFED, refer to the What is DOCA-OFED section in MLNX_OFED to DOCA-OFED Transition Guide. For complete installation information, refer to the NVIDIA DOCA Installation Guide for Linux.

Installing the Latest DOCA-OFED Package#

Install the latest version of NVIDIA DOCA-OFED software by adding the NVIDIA DOCA-OFED repository to your system.

  1. Install the doca-bos8-latest-repo and nvidia-repo-keys packages to automate secure package access.

    sudo apt update
    sudo apt install doca-bos8-latest-repo nvidia-repo-keys -y
    

    Note

    If you encounter a GPG error after running sudo apt update, refer to DOCA Repository GPG Key Error.

  2. Refresh your package list to include the newly added DOCA repository.

    sudo apt update
    
  3. Perform a full upgrade to ensure all dependencies match the new repository.

    sudo apt full-upgrade -y
    
  4. Install the MLNX drivers.

    sudo apt install nvidia-system-mlnx-drivers -y
    

Installing a Specific Version of DOCA-OFED#

To install a specific version of NVIDIA DOCA-OFED software, configure the repository to point to the desired version.

  1. Install the doca-bos8-latest-repo and nvidia-repo-keys packages to automate secure package access.

    sudo apt update
    sudo apt install doca-bos8-latest-repo nvidia-repo-keys -y
    
  2. Configure the DOCA-OFED repository for a specific version.

    Edit the /etc/apt/sources.list.d/doca-bos8-latest.sources file and replace baseos8-latest with your desired version number.

    • For x86_64-based DGX systems:

      For example, to use version 3.0.0-058218 on an x86_64 system:

      URIs: https://linux.mellanox.com/public/repo/doca/3.0.0-058218/ubuntu24.04/x86_64/
      
    • For ARM64-based DGX systems:

      For example, to use version 3.0.0-058218 on an ARM64 system:

      URIs: https://linux.mellanox.com/public/repo/doca/3.0.0-058218/ubuntu24.04/arm64-sbsa/
      
  3. Refresh your package list to include the newly added DOCA repository.

    sudo apt update
    
  4. Perform a full upgrade to ensure all dependencies match the new repository.

    sudo apt full-upgrade -y
    
  5. Install the MLNX drivers.

    sudo apt install nvidia-system-mlnx-drivers -y
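
The two example URIs in step 2 differ only in the version and the architecture directory. The sketch below builds the URI from those two inputs. The helper name doca_repo_uri is hypothetical, and it assumes that ARM64 systems report aarch64 from uname -m; it does not check that the requested version exists in the repository.

```shell
# Hypothetical helper: print the DOCA repository URI for a given version.
# $1 is the DOCA version; $2 is the machine type (defaults to `uname -m`).
doca_repo_uri() {
  doca_version="$1"
  machine="${2:-$(uname -m)}"
  case "$machine" in
    x86_64)  repo_arch="x86_64" ;;
    aarch64) repo_arch="arm64-sbsa" ;;
    *) echo "unsupported architecture: $machine" >&2; return 1 ;;
  esac
  printf 'https://linux.mellanox.com/public/repo/doca/%s/ubuntu24.04/%s/\n' \
    "$doca_version" "$repo_arch"
}
```

For example, doca_repo_uri 3.0.0-058218 prints the URI for the architecture of the system it runs on.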
    

Installing GPUDirect Storage Support#

NVIDIA Magnum IO GPUDirect Storage (GDS) enables a direct data path for direct memory access (DMA) transfers between GPU memory and storage. This direct path avoids a bounce buffer through the CPU.

Note

This section applies only in the following situations:

  • You intend to use GPUDirect Storage on bare metal without Linux PCI P2PDMA.

  • You intend to upgrade or reinstall the DOCA-OFED package and the nvidia-fs module because DGX OS 7 provides updated versions.

Installing GDS Components#

On DGX servers (DGX B300, DGX B200, H100/H200, H800, A100/A800, GB300, and GB200):

  • Install the nvidia-gds package.

    sudo apt install nvidia-gds
    

On DGX stations (DGX Station A800 and A100):

  • Install the nvidia-gds package and its supporting packages, and then rebuild the NVIDIA DKMS module for the running kernel.

    sudo apt update
    sudo apt install doca-repo -y
    sudo apt update
    
    sudo apt install nvidia-peermem-loader nvidia-gds mlnx-nvme-dkms mlnx-nfsrdma-dkms -y
    
    MODULE_VERSION=$(dkms status nvidia | cut -d "," -f1)
    sudo dkms remove -m ${MODULE_VERSION} -k $(uname -r) && sudo dkms install -m ${MODULE_VERSION} -k $(uname -r)
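
The MODULE_VERSION assignment above takes the first comma-separated field of each dkms status line. When the module is built for more than one kernel, that yields repeated values, so a filter that removes duplicates can be useful. The sketch below is illustrative: nvidia_dkms_modules is a hypothetical helper, and it assumes dkms status prints lines of the form "nvidia/<version>, <kernel>, <arch>: installed".

```shell
# Hypothetical helper: read `dkms status` output on stdin and print each
# distinct nvidia/<version> identifier once.
nvidia_dkms_modules() {
  grep '^nvidia/' | cut -d, -f1 | sort -u
}
```

A possible invocation is dkms status nvidia | nvidia_dkms_modules, whose output is in the module/version form that the -m option of the dkms commands above receives.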
    

Enabling Relaxed Ordering for NVMe Drives#

The Samsung NVMe drives used in the NVIDIA DGX systems support relaxed ordering for I/O operations. Relaxed ordering enables the PCIe bus to complete transactions out of order. NVIDIA recommends enabling this setting when you use GPUDirect Storage to improve performance.

  • Run the nvidia-relaxed-ordering-nvme.sh utility.

    sudo /bin/nvidia-relaxed-ordering-nvme.sh enable
    

Note

This procedure applies to DGX A100/A800 and DGX Station A100/A800 systems only.

Configuring NVMe Interrupt Coalescing#

The nvidia-nvme-options package, which is installed on all DGX systems, automatically configures NVMe interrupt coalescing on all Samsung and Kioxia drives at each boot. You can disable this automatic configuration or configure the setting manually.

To disable the setting:

sudo systemctl stop nvidia-nvme-interrupt-coalescing.service
sudo systemctl disable nvidia-nvme-interrupt-coalescing.service

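To configure the setting manually, run the script with either enable or disable:

# Enable NVMe interrupt coalescing.
sudo /usr/bin/nvidia-nvme-interrupt-coalescing.sh enable
# Disable NVMe interrupt coalescing.
sudo /usr/bin/nvidia-nvme-interrupt-coalescing.sh disable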

Next Steps#

Refer to Verifying a Successful GDS Installation in the NVIDIA GPUDirect Storage Installation and Troubleshooting Guide.