Release Notes Overview

This document provides important release-specific considerations for upgrading the NVIDIA DGX™ Software Stack on your DGX system that runs the Red Hat Enterprise Linux 8 operating system.

Current Versions

The following are the current versions available.

Product              Current Version
DGX Station          EL8-24.07
DGX-2, DGX-1         EL8-24.07
DGX A100             EL8-24.07
DGX A800             EL8-24.07
DGX H100             EL8-24.07
DGX Station A100     EL8-24.07
DGX Station A800     EL8-24.07

Installing the DGX Software Stack on Red Hat Enterprise Linux 8

Caution

A dnf update might cause Linux kernel incompatibility with the currently installed MLNX_OFED networking drivers. To prevent this issue, you can do either of the following tasks:

  • Before upgrading the Linux kernel, run the dnf update --setopt tsflags=test command to see which kernel version the update would install without actually upgrading, and then compare that version with the Linux kernel versions that NVIDIA MLNX_OFED supports.

  • When upgrading the Linux kernel, consider upgrading MLNX_OFED to a version that supports the new kernel.
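
The comparison in the first option can be partly scripted. The following is a minimal sketch, not an NVIDIA-provided tool: kernel_is_supported is a hypothetical helper, and supported-kernels.txt is an assumed, hand-maintained file that lists one supported kernel release per line, copied from the MLNX_OFED release notes for your installed version.

```shell
# Hypothetical helper (not part of the DGX software stack).
# Succeeds only if kernel release $1 appears as a complete line in file $2.
# grep options: -F fixed string, -x whole-line match, -q no output.
kernel_is_supported() {
    grep -Fxq -- "$1" "$2"
}

# Example use, after the transaction test has shown the candidate kernel.
# The release string and the file name are placeholders:
#   sudo dnf update --setopt tsflags=test
#   kernel_is_supported "<candidate kernel release>" supported-kernels.txt \
#       && echo "kernel is listed as supported" \
#       || echo "consider upgrading MLNX_OFED first"
```

A whole-line match is used so that a partial release string does not accidentally match a longer entry in the list.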

Important

Before you install or perform the upgrade, refer to the EL8-23.08 section for information about the latest release.

To install the software on a fresh DGX system, see the DGX Software for Red Hat Enterprise Linux 8 - Installation Guide.

Upgrading the DGX Software Stack and Red Hat Enterprise Linux 8

This section provides information about how to update your DGX system while remaining on the same GPU driver branch and how to update your DGX system while switching to a different GPU driver branch.

Important

Here is some important information you need to know before upgrading:

  • An in-place upgrade from RHEL 7 to RHEL 8 with the DGX software stack installed is not supported.

  • Before you install or perform the upgrade, refer to the section in these release notes that covers the latest RHEL version.

  • Upgrading to a different driver package can result in the server failing to boot. Follow the instructions to first uninstall the current driver.

    Ensure that you are prepared to restore the GRUB_CMDLINE_LINUX setting as directed in the instructions in this section.

Upgrading the Software without Moving to a New Driver Branch

To update your DGX system with the latest RHEL-8 updates, run the following command:

sudo dnf update -y --nobest

Upgrading and Moving to a New Driver Branch on non-NVSwitch Systems

This procedure applies to the DGX-1, DGX A100/A800, DGX Station A100, and DGX Station A800 systems.

Important

Before you install or perform the upgrade, refer to the EL8-23.08 section for information about the latest release.

  1. Preserve the GRUB_CMDLINE_LINUX setting.

    Note the existing GRUB_CMDLINE_LINUX setting in the /etc/default/grub file.

    Example:

    GRUB_CMDLINE_LINUX="crashkernel=auto rd.md.uuid=09a9380c:87edd4b6:8f5d9bbc:45e834c7 rhgb quiet rd.driver.blacklist=nouveau"
    

    The rd.driver.blacklist=nouveau parameter was added when installing the driver and should not be included in the restoration.

  2. Run the following commands to remove the current driver package and install the new driver package.

    sudo dnf remove -y nv-persistence-mode libnvidia-nscq-<current driver version>
    sudo dnf module remove --all -y nvidia-driver
    sudo dnf module reset -y nvidia-driver
    sudo dnf module install -y nvidia-driver:<new driver version>/{default,src}
    sudo dnf install -y nv-persistence-mode libnvidia-nscq-<new driver version>
    sudo dnf update -y --nobest
    
  3. Restore the GRUB_CMDLINE_LINUX setting.

    In the /etc/default/grub file, remove extra instances of GRUB_CMDLINE_LINUX and manually edit the file to restore the original setting (except for the blacklist parameter).

    Example:

    GRUB_CMDLINE_LINUX="crashkernel=auto rd.md.uuid=09a9380c:87edd4b6:8f5d9bbc:45e834c7 rhgb quiet"
    
  4. Reboot the system.

    sudo reboot
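
Steps 1 and 3 of this procedure can be sketched as a small shell helper. This is only an illustration under stated assumptions: restorable_cmdline is a hypothetical function name, and it assumes that the setting occupies a single line that starts with GRUB_CMDLINE_LINUX=. It prints the last such line with the rd.driver.blacklist=nouveau parameter removed, which is the value to restore in step 3.

```shell
# Hypothetical helper: print the GRUB_CMDLINE_LINUX line to restore.
# $1 is the path to the grub defaults file (normally /etc/default/grub).
# If the file holds several assignments, the last one is used, because
# the file is read as shell variable assignments and the last one wins.
restorable_cmdline() {
    grep '^GRUB_CMDLINE_LINUX=' "$1" | tail -n 1 |
        sed -e 's/ *rd\.driver\.blacklist=nouveau//'
}

# Example use before step 2 (the output file name is a placeholder):
#   restorable_cmdline /etc/default/grub > grub-cmdline.saved
```

Saving the output before removing the driver keeps a copy to compare against when editing /etc/default/grub in step 3.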
    

Updating and Moving to a New Driver Branch on NVSwitch Systems

This procedure applies to the DGX-2 or DGX A100/A800 systems.

Important

Before you install or perform the upgrade, refer to the EL8-23.08 section for information about the latest release.

  1. Preserve the GRUB_CMDLINE_LINUX setting.

    Note the existing GRUB_CMDLINE_LINUX setting in the /etc/default/grub file.

    Example:

    GRUB_CMDLINE_LINUX="crashkernel=auto rd.md.uuid=09a9380c:87edd4b6:8f5d9bbc:45e834c7 rhgb quiet rd.driver.blacklist=nouveau"
    

    The rd.driver.blacklist=nouveau parameter was added when installing the driver and should not be included in the restoration.

  2. Run the following commands to remove the current driver package and install the new driver package.

    sudo dnf remove -y nv-persistence-mode nvidia-fm-enable
    sudo dnf module remove --all -y nvidia-driver
    sudo dnf module reset -y nvidia-driver
    sudo dnf module install -y nvidia-driver:<new driver version>/{fm,src}
    sudo dnf install -y nv-persistence-mode nvidia-fm-enable
    sudo dnf update -y --nobest
    
  3. Restore the GRUB_CMDLINE_LINUX setting.

    In the /etc/default/grub file, remove any extra instances of GRUB_CMDLINE_LINUX and manually edit the file to restore the original setting (except for the blacklist parameter).

    Example:

    GRUB_CMDLINE_LINUX="crashkernel=auto rd.md.uuid=09a9380c:87edd4b6:8f5d9bbc:45e834c7 rhgb quiet"
    
  4. Reboot the system.

    sudo reboot
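
Before rebooting, it can help to confirm that step 3 left exactly one GRUB_CMDLINE_LINUX assignment in the file. The following is a minimal sketch; count_cmdline_entries is a hypothetical helper name, and it assumes each assignment starts at the beginning of a line.

```shell
# Hypothetical helper: count GRUB_CMDLINE_LINUX assignments in file $1.
# The trailing "=" in the pattern keeps other variables that merely share
# the prefix (for example GRUB_CMDLINE_LINUX_DEFAULT) out of the count.
count_cmdline_entries() {
    grep -c '^GRUB_CMDLINE_LINUX=' "$1"
}

# Example use:
#   count_cmdline_entries /etc/default/grub
# A result other than 1 suggests that extra instances still need removing.
```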
    

Switching GPU Driver between pre-R515 and post-R515

While R510 and earlier GPU drivers depend on NSCQ v1, R515 and later require NSCQ v2. This section provides the additional instructions needed when you change between driver branches that require different versions of NSCQ.
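
The branch-to-NSCQ rule stated above can be expressed as a short shell function. This is an illustrative sketch only; nscq_major_for_branch is a hypothetical name, and it assumes the branch is given as a number with an optional leading R.

```shell
# Hypothetical helper: print the NSCQ major version that a driver branch
# needs, following the rule above (R510 and earlier: v1, R515 and later: v2).
# ${1#R} strips an optional leading "R" so both "R515" and "515" work.
nscq_major_for_branch() {
    if [ "${1#R}" -ge 515 ]; then
        echo 2
    else
        echo 1
    fi
}

# Example use: the extra steps in this section apply when the current and
# target branches give different results.
#   [ "$(nscq_major_for_branch R510)" != "$(nscq_major_for_branch R515)" ] \
#       && echo "NSCQ version changes; follow the steps in this section"
```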

Switching from pre-R515 to R515+

To switch from an installed R510 or earlier driver branch to R515 or later:

  1. Update to the latest DGX EL8 to get NVSM 22.09.3 or higher:

    $ sudo dnf update -y

  2. Switch to the R515+ driver branch by following the steps in the DGX Software on Red Hat Enterprise Linux 8 Installation Guide.

  3. Exclude installing DCGM from the DGX EL8 repo:

    $ grep -q 'exclude=datacenter-gpu-manager' /etc/yum.repos.d/nvidia.repo ||
        sudo sed -i '/priority=40/a exclude=datacenter-gpu-manager' /etc/yum.repos.d/nvidia.repo
    
  4. Install the latest DCGM 3.x from the CUDA repo:

    $ sudo dnf install datacenter-gpu-manager
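
The exclusion in step 3 is written so that it can be run more than once without adding duplicate lines. The sketch below wraps the same grep and sed logic in a function that takes the repo file as an argument, so it can be tried on a copy first. The function name exclude_dcgm is hypothetical, and the sketch assumes GNU sed, as shipped with RHEL 8.

```shell
# Hypothetical helper: add "exclude=datacenter-gpu-manager" after the
# priority=40 line of repo file $1, unless an exclude line already exists.
# Note: if the file contained several priority=40 lines, the first run
# would add the exclude line after each of them.
exclude_dcgm() {
    grep -q 'exclude=datacenter-gpu-manager' "$1" ||
        sed -i '/priority=40/a exclude=datacenter-gpu-manager' "$1"
}

# Example use on a scratch copy (the copy path is a placeholder):
#   cp /etc/yum.repos.d/nvidia.repo /tmp/nvidia.repo.test
#   exclude_dcgm /tmp/nvidia.repo.test
#   diff /etc/yum.repos.d/nvidia.repo /tmp/nvidia.repo.test
```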

Switching from R515+ to pre-R515

To switch from an installed R515 or later driver branch to R510 or earlier:

  1. Switch to the pre-R515 driver branch using the following steps: https://docs.nvidia.com/dgx/dgx-rhel8-install-guide/changing-driver-branches.html#changing-driver-branches

  2. Allow DCGM to be installed from the DGX EL8 repo:

    $ sudo sed -i '/exclude=datacenter-gpu-manager/d' /etc/yum.repos.d/nvidia.repo
    
  3. Downgrade to the latest DCGM 2.x from the DGX EL8 repo:

    $ sudo dnf downgrade datacenter-gpu-manager
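
The edit in step 2 can be checked on a copy of the repo file before it is applied to the real one. The following is a minimal sketch under the assumption of GNU sed; allow_dcgm is a hypothetical function name that applies the same deletion to whichever file it is given.

```shell
# Hypothetical helper: delete every "exclude=datacenter-gpu-manager" line
# from repo file $1. Running it again on an already clean file is harmless.
allow_dcgm() {
    sed -i '/exclude=datacenter-gpu-manager/d' "$1"
}

# Example use on a scratch copy (the copy path is a placeholder):
#   cp /etc/yum.repos.d/nvidia.repo /tmp/nvidia.repo.test
#   allow_dcgm /tmp/nvidia.repo.test
#   grep -c 'exclude=datacenter-gpu-manager' /tmp/nvidia.repo.test
```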