DGX Software For Red Hat Enterprise Linux 8 Overview

NVIDIA provides an NVIDIA® DGX™ software stack for installation on DGX systems on which the user has installed Red Hat Enterprise Linux. The software stack provides the same features and functionality as the original DGX OS Server and DGX OS Desktop software built on the Ubuntu operating system. See also the DGX Software on Red Hat Enterprise Linux 8 Installation Guide.

Current Versions

The following are the current versions available.

Product             Current Version
DGX Station         EL8-22.08
DGX-2, DGX-1        EL8-22.08
DGX A100            EL8-22.08
DGX Station A100    EL8-22.08

Installing the DGX Software Stack on Red Hat Enterprise Linux 8

Warning: A recently released RHEL kernel upgrade is incompatible with the MOFED driver. Customers should not upgrade their systems until the next MOFED release (ETA mid-to-late September). If InfiniBand (IB) is required, wait for the MOFED update before installing RHEL.
Important: Before you install or perform the upgrade, refer to the EL8-22.08 section for information about the latest release.

To install the software on a fresh DGX system, see the DGX Software for Red Hat Enterprise Linux 8 - Installation Guide.

Upgrading the DGX Software Stack and Red Hat Enterprise Linux 8

This section describes how to update your DGX system, either while remaining on the same GPU driver branch or while switching to a different GPU driver branch.

Important: Note the following before upgrading:
  • An in-place upgrade from RHEL 7 to RHEL 8 with the DGX software stack installed is not supported.
  • Before you install or perform the upgrade, refer to the section in these release notes for the latest RHEL version.
  • Upgrading to a different driver package can result in the server failing to boot. Follow the instructions to first uninstall the current driver.

    Ensure that you are prepared to restore the GRUB_CMDLINE_LINUX setting as directed in the instructions in this section.

Upgrading the Software without Moving to a New Driver Branch

To update your DGX system with the latest RHEL-8 updates, run the following command:

sudo dnf update -y --nobest
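
Before applying updates, it can help to see whether any are pending. The following is a minimal sketch that interprets the exit status of dnf check-update (0 when nothing is pending, 100 when updates are available, any other value on error). The function name describe_check_update is hypothetical and not part of the DGX software stack.

```shell
#!/bin/sh
# Sketch: map the exit status of `dnf check-update` to a short label.
# describe_check_update is a hypothetical helper name.
describe_check_update() {
    case "$1" in
        0)   echo "up-to-date" ;;         # no updates pending
        100) echo "updates-available" ;;  # dnf lists the pending updates
        *)   echo "error" ;;              # dnf reported a failure
    esac
}

# Example use (commented out so that sourcing this file changes nothing):
#   dnf check-update >/dev/null 2>&1
#   describe_check_update "$?"
```

This only reports status; the update itself is still applied with the dnf update command shown above.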

Upgrading the Software and Moving to a New Driver Branch on non-NVSwitch Systems

This procedure applies to the DGX-1, DGX Station, and DGX Station A100 systems.

Important: Before you install or perform the upgrade, refer to the EL8-22.08 section for information about the latest release.
  1. Preserve the GRUB_CMDLINE_LINUX setting.

    Note the existing GRUB_CMDLINE_LINUX setting in the /etc/default/grub file.

    Example:
    GRUB_CMDLINE_LINUX="crashkernel=auto rd.md.uuid=09a9380c:87edd4b6:8f5d9bbc:45e834c7 rhgb quiet rd.driver.blacklist=nouveau"

    The "rd.driver.blacklist=nouveau" parameter was added when installing the driver and should not be included in the restoration.

  2. Run the following commands to remove the current driver package and install the new driver package.
    sudo dnf remove -y nv-persistence-mode libnvidia-nscq-<current driver version>
    sudo dnf module remove --all -y nvidia-driver
    sudo dnf module reset -y nvidia-driver
    sudo dnf module install -y nvidia-driver:<new driver version>/{default,src}
    sudo dnf install -y nv-persistence-mode libnvidia-nscq-<new driver version>
    sudo dnf update -y --nobest
  3. Restore the GRUB_CMDLINE_LINUX setting.

    In the /etc/default/grub file, remove extra instances of GRUB_CMDLINE_LINUX and manually edit the file to restore the original setting (except for the blacklist parameter).

    Example:
    GRUB_CMDLINE_LINUX="crashkernel=auto rd.md.uuid=09a9380c:87edd4b6:8f5d9bbc:45e834c7 rhgb quiet"
  4. Reboot the system.
    sudo reboot
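
Steps 1 and 3 above can be sketched with a few small shell helpers: one to print the current GRUB_CMDLINE_LINUX line, one to derive the line to restore by dropping rd.driver.blacklist=nouveau, and one to count how many GRUB_CMDLINE_LINUX lines the file contains. The helper names are hypothetical and only illustrate the idea. They read the file and print text, and do not edit /etc/default/grub.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative, not part of the DGX software.

# Print every GRUB_CMDLINE_LINUX line in a grub defaults file.
# The trailing "=" keeps GRUB_CMDLINE_LINUX_DEFAULT from matching.
show_grub_cmdline() {
    grep '^GRUB_CMDLINE_LINUX=' "${1:-/etc/default/grub}"
}

# Remove rd.driver.blacklist=nouveau from a saved line and tidy the spacing.
strip_nouveau_blacklist() {
    printf '%s\n' "$1" | sed \
        -e 's/rd\.driver\.blacklist=nouveau//g' \
        -e 's/  */ /g' \
        -e 's/=" /="/' \
        -e 's/ "$/"/'
}

# Count GRUB_CMDLINE_LINUX lines; after step 3 this should print 1.
count_grub_cmdline() {
    grep -c '^GRUB_CMDLINE_LINUX=' "${1:-/etc/default/grub}"
}
```

For example, running show_grub_cmdline before step 2 and passing the saved line to strip_nouveau_blacklist yields the text to restore in step 3. A count other than 1 from count_grub_cmdline before rebooting suggests that extra instances still need to be removed by hand.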

Upgrading the Software and Moving to a New Driver Branch on NVSwitch Systems

This procedure applies to the DGX-2 and DGX A100 systems.

Important: Before you install or perform the upgrade, refer to the EL8-22.08 section for information about the latest release.
  1. Preserve the GRUB_CMDLINE_LINUX setting.

    Note the existing GRUB_CMDLINE_LINUX setting in the /etc/default/grub file.

    Example:
    GRUB_CMDLINE_LINUX="crashkernel=auto rd.md.uuid=09a9380c:87edd4b6:8f5d9bbc:45e834c7 rhgb quiet rd.driver.blacklist=nouveau"

    The rd.driver.blacklist=nouveau parameter was added when installing the driver and should not be included in the restoration.

  2. Run the following commands to remove the current driver package and install the new driver package.
    sudo dnf remove -y nv-persistence-mode nvidia-fm-enable
    sudo dnf module remove --all -y nvidia-driver
    sudo dnf module reset -y nvidia-driver
    sudo dnf module install -y nvidia-driver:<new driver version>/{fm,src}
    sudo dnf install -y nv-persistence-mode nvidia-fm-enable
    sudo dnf update -y --nobest
  3. Restore the GRUB_CMDLINE_LINUX setting.

    In the /etc/default/grub file, remove any extra instances of GRUB_CMDLINE_LINUX and manually edit the file to restore the original setting (except for the blacklist parameter).

    Example:
    GRUB_CMDLINE_LINUX="crashkernel=auto rd.md.uuid=09a9380c:87edd4b6:8f5d9bbc:45e834c7 rhgb quiet"
  4. Reboot the system.
    sudo reboot
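
The two driver-branch procedures differ only in the module profile (default or fm) and in the companion packages (libnvidia-nscq-<version> or nvidia-fm-enable). The following sketch prints, without running, the step 2 command sequence for either system type, which can be useful for reviewing the commands before executing them. The function name and the "nvswitch" / "non-nvswitch" labels are hypothetical, and the brace form {profile,src} is written out as two explicit module specs.

```shell
#!/bin/sh
# Sketch: print (do not run) the driver-branch switch commands.
# print_driver_switch_plan is a hypothetical helper name.
print_driver_switch_plan() {
    system_type=$1   # "nvswitch" (DGX-2, DGX A100) or anything else
    current=$2       # current driver version (non-NVSwitch branch only)
    new=$3           # new driver version

    if [ "$system_type" = "nvswitch" ]; then
        profile=fm
        old_pkgs="nv-persistence-mode nvidia-fm-enable"
        new_pkgs="nv-persistence-mode nvidia-fm-enable"
    else
        profile=default
        old_pkgs="nv-persistence-mode libnvidia-nscq-$current"
        new_pkgs="nv-persistence-mode libnvidia-nscq-$new"
    fi

    echo "sudo dnf remove -y $old_pkgs"
    echo "sudo dnf module remove --all -y nvidia-driver"
    echo "sudo dnf module reset -y nvidia-driver"
    # Equivalent to nvidia-driver:<new>/{<profile>,src} after brace expansion.
    echo "sudo dnf module install -y nvidia-driver:$new/$profile nvidia-driver:$new/src"
    echo "sudo dnf install -y $new_pkgs"
    echo "sudo dnf update -y --nobest"
}
```

Calling print_driver_switch_plan with the system type and the current and new driver versions prints six commands in the same order as step 2. The GRUB_CMDLINE_LINUX preservation and restoration steps still apply around them.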