Managing and Upgrading Software

DGX OS 6 is an optimized version of the Ubuntu 22.04 Linux distribution with access to a large collection of additional software that is available from the Ubuntu and NVIDIA repositories. You can install the additional software using the apt command or through a graphical tool.

Note

The graphical tool is only available for DGX Station and DGX Station A100.

For more information about additional software available from Ubuntu, refer to Install additional applications.

Before you install additional software or upgrade installed software, refer also to the Release Notes for the latest release information.

Upgrading the System

Before installing any additional software, you should upgrade the system to the latest versions. This ensures that you have access to new software releases that have been added to the repositories since your last upgrade. Refer to Upgrading the OS for more information and instructions, including how to enable Ubuntu’s Extended Security Maintenance updates.
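
For example, a minimal sketch of the standard Ubuntu package upgrade flow; refer to Upgrading the OS for the complete, authoritative DGX OS upgrade procedure:

sudo apt update
sudo apt full-upgrade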

Important

You will only see the latest software branches after upgrading DGX OS.

Note

When you switch between software branches, such as the GPU driver or CUDA Toolkit, you have to install the package(s) for the new branch. Depending on the software, installing the new branch either removes the existing branch or leaves it in place so that both branches are installed concurrently on the system.

Changing Your GPU Branch

NVIDIA drivers are released as precompiled and signed kernel modules by Canonical and are available directly from the Ubuntu repository. Signed drivers are required to verify the integrity of driver packages and identity of the vendor.

However, the verification process requires that Canonical build and release the drivers with Ubuntu kernel updates after their release cycle is complete, and this process might sometimes delay new driver branch releases and updates. For more information about the NVIDIA driver release, refer to the release notes at NVIDIA Driver Documentation.

Important

The Ubuntu repositories provide the following versions of the signed and precompiled NVIDIA drivers:

  • The general NVIDIA display drivers

  • The NVIDIA Data Center GPU drivers

On your DGX system, only install the packages that include the NVIDIA Data Center GPU drivers. The metapackages for the NVIDIA Data Center GPU driver have the -server or -server-open suffix.

Checking the Currently Installed Driver Branch

Before you install a new NVIDIA driver branch, check the currently installed driver branch by running the following command:

apt list --installed nvidia-driver*server

Determining the New Available Driver Branches

These steps help you determine which new driver branches are available.

To see the new available NVIDIA driver branches:

  1. Update the local database with the latest information from the Ubuntu repository.

    sudo apt update
    
  2. Show all available driver branches.

    apt list nvidia-driver-*-server
    
  3. Optional: Show the available NVIDIA Open GPU Kernel module branches.

    apt list nvidia-driver-*-server-open
    

    Caution

    The NVIDIA Open GPU Kernel module drivers are not supported on NVIDIA DGX-1, DGX-2, and DGX Station systems.

Upgrading Your GPU Branch

To manually upgrade your driver to the latest branch:

  1. Install the latest kernel.

    sudo apt install -y linux-nvidia
    
  2. Install the latest NVIDIA GPU driver.

In the following commands, the trailing - character in "*nvidia*${GPU_BRANCH}*-" specifies that the packages from the old driver branch are removed in the same transaction. Because this operation removes packages from the system, it is important to perform a dry run first and ensure that the correct packages will be removed.

The GPU_BRANCH variable is set automatically to the currently installed branch, which is the branch to remove. In the package names, replace 535 with the new branch version that you want to install.

    • On non-Fabric Manager systems, such as NVIDIA DGX-1, DGX Station, and DGX Station A100, run the following command:

      GPU_BRANCH=$(dpkg -l | grep nvidia-driver | tr -s " " | cut -d' ' -f3 | cut -d'.' -f1)
      
      # Specify --dry-run to check the packages to install.
      sudo apt-get install -y linux-modules-nvidia-535-server-nvidia nvidia-driver-535-server libnvidia-nscq-535 nvidia-modprobe "*nvidia*${GPU_BRANCH}*-" --dry-run
      
      # Install the packages.
      sudo apt-get install -y linux-modules-nvidia-535-server-nvidia nvidia-driver-535-server libnvidia-nscq-535 nvidia-modprobe "*nvidia*${GPU_BRANCH}*-"
      
    • On Fabric Manager systems, NVIDIA DGX-2, DGX A100, and DGX H100, run the same command, but append the nvidia-fabricmanager-535 package:

      GPU_BRANCH=$(dpkg -l | grep nvidia-driver | tr -s " " | cut -d' ' -f3 | cut -d'.' -f1)
      
      # Specify --dry-run to check the packages to install.
      sudo apt-get install -y linux-modules-nvidia-535-server-nvidia nvidia-driver-535-server libnvidia-nscq-535 nvidia-modprobe nvidia-fabricmanager-535 "*nvidia*${GPU_BRANCH}*-" --dry-run
      
      # Install the packages.
      sudo apt-get install -y linux-modules-nvidia-535-server-nvidia nvidia-driver-535-server libnvidia-nscq-535 nvidia-modprobe nvidia-fabricmanager-535 "*nvidia*${GPU_BRANCH}*-"
      
    • To install the NVIDIA Open GPU Kernel module drivers, specify the -server-open package name suffix, such as linux-modules-nvidia-535-server-open-nvidia and nvidia-driver-535-server-open. For example,

      # Specify --dry-run to check the packages to install.
      sudo apt-get install -y linux-modules-nvidia-535-server-open-nvidia nvidia-driver-535-server-open libnvidia-nscq-535 nvidia-modprobe nvidia-fabricmanager-535 --dry-run
      
      # Install the packages.
      sudo apt-get install -y linux-modules-nvidia-535-server-open-nvidia nvidia-driver-535-server-open libnvidia-nscq-535 nvidia-modprobe nvidia-fabricmanager-535
      

    Note

    The driver versions are only used as an example. Replace the value with the version that you want to install.

  3. Reboot the system to ensure the new drivers get loaded:

    sudo reboot
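
  4. Optional: Confirm that the new driver branch is loaded after the reboot, for example by querying the driver version with nvidia-smi:

    nvidia-smi --query-gpu=driver_version --format=csv,noheader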
    

Installing or Upgrading to a Newer CUDA Toolkit Release

Only DGX Station and DGX Station A100 have a CUDA Toolkit release installed by default. DGX servers are intended to be shared resources that use containers and do not have CUDA Toolkit installed by default. However, you have the option to install a qualified CUDA Toolkit release.

Although DGX OS supports all CUDA Toolkit releases that interoperate with the installed driver, a DGX OS release might include a default CUDA Toolkit release that is not the most recently released version. Unless you must use a newer CUDA Toolkit version for the features it contains, we recommend that you remain on the default version that is included in the DGX OS release. Refer to the DGX OS Software Release Notes for the default CUDA Toolkit release.

Important

Before you install or upgrade to any CUDA Toolkit release, ensure the release is compatible with the driver that is installed on the system. Refer to CUDA Compatibility for more information and a compatibility matrix.

CUDA Compatibility Matrix and Forward Compatibility

Each CUDA Toolkit release requires a minimum GPU driver version. This compatibility matrix is documented in CUDA Compatibility.

Newer CUDA Toolkit releases can be used with older GPU drivers if the appropriate forward compatibility package is installed. Refer to Installing the Forward Compatibility Package.

Example:

CUDA Toolkit 12.0 requires GPU driver version 525.60.13 or later; however, GPU driver 515.43.04 is installed. To use CUDA Toolkit 12.0 with the older GPU driver, you must install the cuda-compat-12-0 package:

sudo apt install cuda-compat-12-0

Then add the compatibility libraries to the library search path, either manually by setting LD_LIBRARY_PATH:

export LD_LIBRARY_PATH=/usr/local/cuda/compat:$LD_LIBRARY_PATH

or automatically via the /etc/ld.so.conf file or by adding a file under /etc/ld.so.conf.d/.
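
For example, a minimal sketch of the automatic approach; the cuda-compat.conf file name is arbitrary:

# Add the CUDA compatibility directory to the dynamic linker search path, then rebuild the cache.
echo "/usr/local/cuda/compat" | sudo tee /etc/ld.so.conf.d/cuda-compat.conf
sudo ldconfig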

Checking the Currently Installed CUDA Toolkit Release

Before you install a new CUDA Toolkit release, check the currently installed release by running the following command:

apt list --installed cuda-toolkit-*

The following example output shows that CUDA Toolkit 11.0 is installed:

apt list --installed cuda-toolkit-*
Listing... Done
cuda-toolkit-11-0/unknown,unknown,now 11.0.3-1 amd64 [installed]
N: There is 1 additional version. Please use the '-a' switch to see it

Installing or Upgrading the CUDA Toolkit

These steps help you determine which CUDA Toolkit releases are available and install or upgrade to the release that you want.

To see the available CUDA Toolkit releases:

  1. Update the local database with the latest information from the Ubuntu repository.

    sudo apt update
    
  2. Show all available CUDA Toolkit releases.

    apt list cuda-toolkit-*
    

The following output shows that 11.7, 11.8, and 12.0 are the CUDA Toolkit versions that can be installed:

    Listing... Done
    cuda-toolkit-11-7/unknown 11.7.1-1 amd64
    cuda-toolkit-11-8/unknown 11.8.0-1 amd64
    cuda-toolkit-12-0/unknown 12.0.0-1 amd64
    
  3. To install or upgrade the CUDA Toolkit, run the following:

    sudo apt install cuda-toolkit-<version>
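
    For example, to install the CUDA Toolkit 12.0 release shown in the previous listing:

    sudo apt install cuda-toolkit-12-0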
    

Installing the Mellanox OFED Drivers

DGX OS 6 uses the OFED drivers supplied with the Ubuntu 22.04 distribution. Alternatively, you can install the Mellanox OFED (MOFED) drivers. The Mellanox OFED drivers are tested and packaged by NVIDIA.

DGX OS 6 includes the script /usr/sbin/nvidia-manage-ofed.py to assist in managing the OFED stacks.

Run the following command to display a list of OFED-related packages:

sudo nvidia-manage-ofed.py -s

The command output indicates if the packages are part of the Mellanox stack or the Ubuntu stack.

Using the Mellanox OFED Packages

If you are upgrading from OS 5 to OS 6, refer to Upgrading in the DGX OS 5 User Guide before you change drivers.

  1. Ensure that you have the latest nvidia-manage-ofed package by running these commands:

    sudo apt update
    sudo apt upgrade
    
  2. Remove the inbox OFED components:

    sudo /usr/sbin/nvidia-manage-ofed.py -r ofed
    
  3. Add the Mellanox OFED components:

    sudo /usr/sbin/nvidia-manage-ofed.py -i mofed
    

    Note

The command installs the latest version of MLNX_OFED that is currently available in the repositories. To install a version other than the latest, specify the version with the -v option. The following example installs MLNX_OFED version 5.9-0.5.6.0:

    sudo /usr/sbin/nvidia-manage-ofed.py -i mofed -v 5.9-0.5.6.0
    
  4. Reboot the system.

Using the Ubuntu OFED Packages

  1. Remove the Mellanox OFED components:

    sudo /usr/sbin/nvidia-manage-ofed.py -r mofed
    
  2. Ensure that APT is not configured with the Mellanox repository.

    Remove the /etc/apt/sources.list.d/mlnx.list file.
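
    For example:

    sudo rm /etc/apt/sources.list.d/mlnx.list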

  3. Update APT so it no longer has information about the packages from the Mellanox repository:

    sudo apt update
    
  4. Add the Ubuntu OFED components:

    sudo /usr/sbin/nvidia-manage-ofed.py -i ofed
    

Inbox OFED vs Mellanox OFED Use Cases

The following table describes common MOFED utilities and use cases, and how to accomplish them with inbox OFED tools.

One key difference between MOFED and inbox OFED is that the /dev/mst* devices are not used by inbox OFED. Instead, devices are addressed by their PCIe Bus:Device:Function (BDF) identifier, as shown in the example after the table.

Use Case                                       Mellanox OFED Method   Inbox OFED Method
Correlate IB devices to network devices        ibdev2netdev           rdma link show
Provide information about bond / MTU           net-interfaces         View /sys/class/net/bonding_masters, /sys/class/net/bond<num>/, and /proc/net/bonding
Reload OFED drivers                            openibd                modprobe
Manipulate device configuration                mlxconfig              mstconfig
Burn a firmware image                          flint                  mstflint
Collect debug traces                           fwtrace                mstfwtrace
Reset a device                                 mlxfwreset             mstfwreset
Access configuration registers                 mcra                   mstmcra
Manipulate host privileges                     mlxprivhost            mstprivhost
Dump device internal configuration registers   mlxdump                mstregdump
Read device Vital Product Data                 mlxvpd                 mstvpd
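
For example, with the inbox OFED tools you query a device by its PCIe BDF rather than through a /dev/mst device. The BDF 29:00.0 shown here is illustrative; substitute the BDF of your device:

# Query the device configuration by PCIe BDF (no /dev/mst device is needed).
sudo mstconfig -d 29:00.0 query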

Initialize on Alloc Performance Impact

The CONFIG_INIT_ON_ALLOC_DEFAULT_ON Linux kernel configuration option controls whether the kernel fills newly allocated pages and heap objects with zeroes by default. You can override this setting with the init_on_alloc=<0|1> kernel parameter. The DGX OS that is preinstalled on NVIDIA DGX systems sets init_on_alloc=1 because this setting is the default recommended by Ubuntu for kernel hardening.

However, this setting can reduce network interface controller performance on DGX systems because buffer pages are allocated frequently and zeroing each page takes time. The impact is greater with the inbox OFED driver than with the Mellanox OFED (MOFED) driver because the MOFED driver allocates a much larger page cache, which better tolerates the added cost of zeroing pages. NVIDIA recommends that you keep the default setting, init_on_alloc=1, for best security. If your deployment permits less strict security and the network interface controller is underperforming, you can disable the feature as follows:

  1. Edit the /etc/default/grub file and add init_on_alloc=0 to the GRUB_CMDLINE_LINUX_DEFAULT variable.
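
    For example (the console option is illustrative; keep the options that your file already contains and append init_on_alloc=0):

    GRUB_CMDLINE_LINUX_DEFAULT="console=tty0 init_on_alloc=0"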

  2. Update the GRUB configuration and reboot the system.

    sudo update-grub
    sudo reboot
    
  3. Optional: After the system reboots, verify the change took effect.

    cat /proc/cmdline
    

    Example Output

    BOOT_IMAGE=/boot/vmlinuz-… init_on_alloc=0

Upgrading Firmware for Mellanox ConnectX Cards

DGX OS 6 uses the open source mstflint program to upgrade firmware on Mellanox cards.

Checking the Device Type

mstflint uses the PCIe Bus/Device/Function (BDF) identifier to specify devices. To locate the BDF, OPN, and PSID of your device:

  1. Locate all Mellanox Ethernet and InfiniBand devices in the system:

    lspci | grep Mellanox | grep -e Infiniband -e Ethernet
    
  2. The first column of the output is the BDF, for example:

    29:00.0 Infiniband controller: Mellanox Technologies MT2910 Family [ConnectX-7]
    29:00.1 Ethernet controller: Mellanox Technologies MT2910 Family [ConnectX-7]
    
  3. To locate the correct firmware for your device, you need the OPN and PSID of the device.

  4. To find the OPN, use the mstvpd command:

    # mstvpd 29:00.0 | grep PN
    PN:       MCX755206AS-NEAT-N
    
  5. To find the PSID, use the mstflint command:

    # mstflint -d 29:00.0 q | grep PSID
    PSID:                     MT_0000000892
    

Download the New Firmware

  1. Navigate to https://network.nvidia.com/support/firmware/firmware-downloads/

  2. Select the product line, for example, ConnectX-6 InfiniBand.

  3. Select the firmware version.

  4. Select the OPN and PSID that matches your device.

  5. Select “Download” to download the firmware.

  6. Use the “unzip” command to unpack the compressed file and access the .bin file.

Program the Firmware

  1. Use the mstflint command to program the device:

    # mstflint -d 29:00.0 -i fw-ConnectX7-rel-28_36_1010-MCX75310AAS-HEA-N_Ax-UEFI-14.29.14-FlexBoot-3.6.901.signed.bin burn
    
  2. After installing the new firmware, reboot the system.
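
  3. Optional: After the reboot, query the device again to confirm the firmware version that is now running; the FW Version field in the output shows the active firmware:

    # mstflint -d 29:00.0 q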

Installing GPUDirect Storage Support

NVIDIA Magnum IO GPUDirect Storage (GDS) enables a direct data path for direct memory access (DMA) transfers between GPU memory and storage, which avoids a bounce buffer through the CPU.

Note

This section only applies if you intend to use GPUDirect Storage in bare metal.

Prerequisites

  • Determine whether GDS is supported with your Linux kernel and whether you need to install OFED drivers.

    Ubuntu OFED with the Optimized NVIDIA kernel: All kernel modules (nvidia-fs, NVMe, NVMf, NFS) related to GDS are part of the Optimized NVIDIA kernel.

    Ubuntu OFED with the generic kernel: Unsupported.

    Mellanox OFED with the Optimized NVIDIA kernel: All kernel modules related to GDS are part of the Optimized NVIDIA kernel. Install MOFED. Refer to Installing the Mellanox OFED Drivers and then perform the steps in the following section.

    Mellanox OFED with the generic kernel: GDS kernel modules are not present in the Ubuntu generic kernel; these kernel modules are patched in via the MOFED installation. Install MOFED. Refer to Installing the Mellanox OFED Drivers and then perform the steps in the following section.

    For additional help, refer to MLNX_OFED Requirements and Installation in the NVIDIA GPUDirect Storage Installation and Troubleshooting Guide.

  • For systems other than NVIDIA DGX-1, DGX-2, and DGX Station, to use the latest GDS version, 12.2.2-1, that is provided by nvidia-fs-dkms-2.17.5-1, you must install an NVIDIA Open GPU Kernel module driver. Refer to Changing Your GPU Branch for more information about installing the driver.

  • For NVIDIA DGX-1, DGX-2, and DGX Station systems running the generic Linux kernel, the GPUs are not supported by the NVIDIA Open GPU Kernel modules. GDS versions 12.2.2-1 and higher support only the Open GPU Kernel modules.

    For these systems, you must pin the nvidia-fs package to version 2.17.3 or lower and the nvidia-gds package to version 12.2.1-1 or lower.

    1. Create an /etc/apt/preferences.d/nvidia-fs file with contents like the following.

      Package: nvidia-fs
      Pin-Priority: 900
      Pin: version 2.17.3-1
      
      Package: nvidia-fs-dkms
      Pin-Priority: 900
      Pin: version 2.17.3-1
      
      Package: nvidia-gds
      Pin-Priority: 900
      Pin: version 12.2.1-1
      
      Package: nvidia-gds-12-2
      Pin-Priority: 900
      Pin: version 12.2.1-1
      
    2. Verify that the nvidia-fs package preference is correct.

      sudo apt-cache policy nvidia-fs
      

      Example Output

      nvidia-fs:
        Installed: (none)
        Candidate: 2.17.3-1
        Version table:
           2.17.5-1 580
              580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
           2.17.3-1 900
              580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
           ...
      
    3. Verify that the nvidia-gds package preference is correct.

      sudo apt-cache policy nvidia-gds
      

      Example Output

      nvidia-gds:
        Installed: (none)
        Candidate: 12.2.1-1
        Version table:
           12.2.2-1 580
              580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
           12.2.1-1 900
              580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
           ...
      
  • For NVIDIA DGX-1, DGX-2, and DGX Station, disable IOMMU to avoid a DMAR penalty.

    1. Edit the GRUB configuration file.

      sudo vi /etc/default/grub
      
    2. Add intel_iommu=off to the GRUB_CMDLINE_LINUX_DEFAULT variable.

      If the variable already includes other options, enter a space to separate the options. Refer to the following example.

      ...
      GRUB_CMDLINE_LINUX_DEFAULT="console=tty0 intel_iommu=off"
      ...
      
    3. Update the GRUB configuration and reboot the system.

      sudo update-grub
      sudo reboot
      
    4. After the system reboots, verify the change took effect.

      cat /proc/cmdline
      

      Example Output

      BOOT_IMAGE=/boot/vmlinuz-... console=tty0 intel_iommu=off
      

Installing GDS Components for the Optimized NVIDIA Kernel

This procedure applies to both the Ubuntu OFED and Mellanox OFED with the Optimized NVIDIA kernel.

The nvidia-fs kernel module is already part of the Optimized NVIDIA kernel.

  • To use GDS with the Optimized NVIDIA kernel, install the CUDA libcufile packages:

    sudo apt install libcufile-<ver> libcufile-dev-<ver> gds-tools-<ver>
    

    Use the CUDA Toolkit version number in place of <ver>, such as 12-2.
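
    For example, for CUDA Toolkit 12.2, assuming the 12-2 packages are available in the configured CUDA repository:

    sudo apt install libcufile-12-2 libcufile-dev-12-2 gds-tools-12-2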

Installing GDS Components for the Generic Kernel

This procedure installs the required GDS user-space components: libcufile and the GDS tools.

To use GDS with the Ubuntu generic kernel, perform the following steps:

  1. Set the NVIDIA_DRV_VERSION environment variable to the driver version.

    NVIDIA_DRV_VERSION=$(cat /proc/driver/nvidia/version | grep Module | awk '{print $8}' | cut -d '.' -f 1)
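
    You can verify the value, which should be the driver branch number, such as 535:

    echo ${NVIDIA_DRV_VERSION}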
    
  2. Install the nvidia-gds package.

    • For NVIDIA DGX-1, DGX-2, and DGX Station that must use version 12.2.1-1:

      sudo apt install nvidia-gds-12-2=12.2.1-1  nvidia-dkms-${NVIDIA_DRV_VERSION}-server
      
    • For other NVIDIA DGX Systems:

      sudo apt install nvidia-gds-<version>  nvidia-dkms-${NVIDIA_DRV_VERSION}-server
      

      Use the CUDA Toolkit version number in place of <version>, such as 12-2.

Enabling Relaxed Ordering for NVMe Drives

The Samsung NVMe drives used in NVIDIA DGX systems support relaxed ordering for I/O operations. Relaxed ordering enables the PCIe bus to complete transactions out of order. NVIDIA recommends enabling this setting when using GPUDirect Storage to improve performance.

  • Run the nvidia-relaxed-ordering-nvme.sh utility:

    sudo /bin/nvidia-relaxed-ordering-nvme.sh enable
    

Next Steps

Refer to Verifying a Successful GDS Installation in the NVIDIA GPUDirect Storage Installation and Troubleshooting Guide.

Selecting a Different Linux Kernel

DGX OS 6 uses a kernel that is optimized for NVIDIA systems. To use the generic Ubuntu kernel instead, install it by running sudo apt install linux-generic.

Boot the System to the Generic Kernel One Time

  1. To boot from the generic kernel, first select “Advanced options for DGX OS GNU/Linux” in the GRUB menu.

  2. Select the generic kernel.


Boot the System to the Generic Kernel by Default

As shipped, GRUB_DEFAULT is set to 0. This setting boots the first entry of the first menu, which is always the newest installed kernel.

To set a different default kernel, you must specify both the menu entry and the submenu item.

  1. View the /boot/grub/grub.cfg file.

    • Locate the Advanced options for DGX OS GNU/Linux submenu line. Copy the ID that follows the $menuentry_id_option string. In the following example, the ID begins with gnulinux-advanced.

      submenu 'Advanced options for DGX OS GNU/Linux' $menuentry_id_option 'gnulinux-advanced-342551e3-c0b6-46da-90c1-d938ff352025'
      
    • Further down in the file, locate the menu entry line for the kernel that you want to boot. In the following example, the menu entry is DGX OS GNU/Linux, with Linux 5.15.0-84-generic.

      Copy the ID that follows the $menuentry_id_option string. In the following example, the ID begins with gnulinux-5.15.0-84-generic.

      menuentry 'DGX OS GNU/Linux, with Linux 5.15.0-84-generic' --class dgx --class gnu-linux --class gnu --class os --unrestricted $menuentry_id_option 'gnulinux-5.15.0-84-generic-advanced-342551e3-c0b6-46da-90c1-d938ff352025'
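
    As a shortcut, you can print the submenu line and the kernel menu entry lines with a single grep command and copy the IDs from its output (a minimal sketch; the pattern matches the entry names shown above):

      sudo grep -E "submenu |menuentry 'DGX OS GNU/Linux, with Linux" /boot/grub/grub.cfg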
      
  2. Edit the /etc/default/grub file and set the GRUB_DEFAULT value to the ID of the submenu, the greater than character (>), and the ID of the kernel, like the following example.

    Example

    GRUB_DEFAULT='gnulinux-advanced-342551e3-c0b6-46da-90c1-d938ff352025>gnulinux-5.15.0-84-generic-advanced-342551e3-c0b6-46da-90c1-d938ff352025'
    
  3. Update the GRUB configuration.

    sudo update-grub2
    

    Example Output

    Sourcing file `/etc/default/grub'
    Sourcing file `/etc/default/grub.d/hugepage.cfg'
    ...
    Adding boot menu entry for UEFI Firmware Settings ...
    done
    

The system uses the newly selected default kernel on the next boot.