Base OS - DGX Software for Red Hat Enterprise Linux 8 Release Notes

DGX Software For Red Hat Enterprise Linux 8 Overview

NVIDIA provides an NVIDIA® DGX™ software stack for installation on DGX systems on which the user has installed Red Hat Enterprise Linux. The software stack provides the same features and functionality as the original DGX OS Server and DGX OS Desktop software built on the Ubuntu operating system. See also the DGX Software on Red Hat Enterprise Linux 8 Installation Guide.

The following are the current versions available.

Product               Current Version
DGX Station           EL8-22.08
DGX-2, DGX-1          EL8-22.08
DGX A100              EL8-22.08
DGX A800              EL8-22.08
DGX Station A100      EL8-22.08
DGX Station A800      EL8-22.08

Warning

A recently released RHEL kernel upgrade is incompatible with the MOFED driver. Do not upgrade your system until the next MOFED release is available. If you require InfiniBand (IB), also wait for that MOFED update before installing RHEL.

Important

Before you install or perform the upgrade, refer to the EL8-22.08 section for information about the latest release.

To install the software on a fresh DGX system, see the DGX Software for Red Hat Enterprise Linux 8 - Installation Guide.

This section provides information about how to update your DGX system while remaining on the same GPU driver branch and how to update your DGX system while switching to a different GPU driver branch.

Important

Here is some important information you need to know before upgrading:

  • An in-place upgrade from RHEL 7 to RHEL 8 with the DGX software stack installed is not supported.

  • Before you install or perform the upgrade, refer to the section in these release notes for the latest RHEL version.

  • Upgrading to a different driver package can result in the server failing to boot. Follow the instructions to first uninstall the current driver.

    Ensure that you are prepared to restore the GRUB_CMDLINE_LINUX setting as directed in the instructions in this section.

Upgrading the Software without Moving to a New Driver Branch

To update your DGX system with the latest RHEL-8 updates, run the following command:


sudo dnf update -y --nobest

Upgrading and Moving to a New Driver Branch on non-NVSwitch Systems

This procedure applies to the DGX-1, DGX A100/A800, DGX Station A100, and DGX Station A800 systems.

Important

Before you install or perform the upgrade, refer to the EL8-22.08 section for information about the latest release.

  1. Preserve the GRUB_CMDLINE_LINUX setting.

    Note the existing GRUB_CMDLINE_LINUX setting in the /etc/default/grub file.

Example


GRUB_CMDLINE_LINUX="crashkernel=auto rd.md.uuid=09a9380c:87edd4b6:8f5d9bbc:45e834c7 rhgb quiet rd.driver.blacklist=nouveau"

    The rd.driver.blacklist=nouveau parameter was added when installing the driver and should not be included in the restoration.

  2. Run the following commands to remove the current driver package and install the new driver package.


sudo dnf remove -y nv-persistence-mode libnvidia-nscq-<current driver version>
sudo dnf module remove --all -y nvidia-driver
sudo dnf module reset -y nvidia-driver
sudo dnf module install -y nvidia-driver:<new driver version>/{default,src}
sudo dnf install -y nv-persistence-mode libnvidia-nscq-<new driver version>
sudo dnf update -y --nobest

  3. Restore the GRUB_CMDLINE_LINUX setting.

    In the /etc/default/grub file, remove extra instances of GRUB_CMDLINE_LINUX and manually edit the file to restore the original setting (except for the blacklist parameter).

Example


GRUB_CMDLINE_LINUX="crashkernel=auto rd.md.uuid=09a9380c:87edd4b6:8f5d9bbc:45e834c7 rhgb quiet"

  4. Reboot the system.


    sudo reboot
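
The restored value in step 3 is the saved setting minus the rd.driver.blacklist=nouveau parameter. As a minimal sketch of that text transformation only, the following derives the restored line from a saved one. The helper name strip_nouveau_blacklist is hypothetical and is not part of the DGX software; the sketch does not touch /etc/default/grub, which the procedure above says to edit manually.

```shell
# Sketch only: strip_nouveau_blacklist is a hypothetical helper name.
# It prints the saved line without the rd.driver.blacklist=nouveau parameter.
strip_nouveau_blacklist() {
    # Remove the parameter together with the single space before it, if any.
    printf '%s\n' "$1" | sed -E 's/ ?rd\.driver\.blacklist=nouveau//'
}

# The saved line from step 1 (value copied from the example above).
saved='GRUB_CMDLINE_LINUX="crashkernel=auto rd.md.uuid=09a9380c:87edd4b6:8f5d9bbc:45e834c7 rhgb quiet rd.driver.blacklist=nouveau"'

# Print the line to restore in step 3.
strip_nouveau_blacklist "$saved"
```

Comparing the printed line with the manually edited file before rebooting is one way to catch a typo in the restored setting.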

Updating and Moving to a New Driver Branch on NVSwitch Systems

This procedure applies to the DGX-2 or DGX A100/A800 systems.

  1. Preserve the GRUB_CMDLINE_LINUX setting.

    Note the existing GRUB_CMDLINE_LINUX setting in the /etc/default/grub file.

Example


GRUB_CMDLINE_LINUX="crashkernel=auto rd.md.uuid=09a9380c:87edd4b6:8f5d9bbc:45e834c7 rhgb quiet rd.driver.blacklist=nouveau"

    The rd.driver.blacklist=nouveau parameter was added when installing the driver and should not be included in the restoration.

  2. Run the following commands to remove the current driver package and install the new driver package.


sudo dnf remove -y nv-persistence-mode nvidia-fm-enable
sudo dnf module remove --all -y nvidia-driver
sudo dnf module reset -y nvidia-driver
sudo dnf module install -y nvidia-driver:<new driver version>/{fm,src}
sudo dnf install -y nv-persistence-mode nvidia-fm-enable
sudo dnf update -y --nobest

  3. Restore the GRUB_CMDLINE_LINUX setting.

    In the /etc/default/grub file, remove any extra instances of GRUB_CMDLINE_LINUX and manually edit the file to restore the original setting (except for the blacklist parameter).

Example


GRUB_CMDLINE_LINUX="crashkernel=auto rd.md.uuid=09a9380c:87edd4b6:8f5d9bbc:45e834c7 rhgb quiet"

  4. Reboot the system.


    sudo reboot
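
The restore step asks you to remove extra instances of GRUB_CMDLINE_LINUX. A read-only check such as the following sketch can confirm that exactly one instance remains before you reboot. The helper name count_cmdline_entries and the GRUB_FILE override variable are hypothetical, introduced here only for illustration.

```shell
# Sketch only: count_cmdline_entries is a hypothetical helper name.
# It prints how many lines in the given file assign GRUB_CMDLINE_LINUX.
count_cmdline_entries() {
    # grep -c prints the match count; "|| true" hides its non-zero exit
    # status when the count is 0.
    grep -c '^GRUB_CMDLINE_LINUX=' "$1" || true
}

# GRUB_FILE is an assumed override for trying the check on a copy of the file.
grub_file="${GRUB_FILE:-/etc/default/grub}"

if [ -r "$grub_file" ]; then
    n=$(count_cmdline_entries "$grub_file")
    if [ "$n" -ne 1 ]; then
        echo "Expected one GRUB_CMDLINE_LINUX line in $grub_file, found $n" >&2
    fi
fi
```

The pattern requires an equals sign directly after the name, so a GRUB_CMDLINE_LINUX_DEFAULT line, if present, is not counted.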

Switching GPU Driver between pre-R515 and post-R515

R510 and earlier GPU drivers depend on NSCQ v1, while R515 and later require NSCQ v2. This section provides the additional instructions needed when changing between driver branches that require different versions of NSCQ.
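
The rule above can be expressed as a small check that tells you whether a planned branch change crosses the NSCQ boundary. This is a sketch under the assumption that branches are compared by their numeric value (for example 510, 515, 525). The function names are hypothetical and do not query the installed driver.

```shell
# Sketch only: both function names are hypothetical.
# Print the NSCQ major version that a numeric driver branch depends on.
nscq_major_for_branch() {
    # $1: numeric driver branch, for example 510 or 515
    if [ "$1" -ge 515 ]; then
        echo 2
    else
        echo 1
    fi
}

# Succeed (exit status 0) when the two branches need different NSCQ versions.
needs_nscq_switch() {
    [ "$(nscq_major_for_branch "$1")" != "$(nscq_major_for_branch "$2")" ]
}

if needs_nscq_switch 510 525; then
    echo "NSCQ version changes: follow the additional steps in this section"
fi
```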

Switching from pre-R515 to R515+

To switch from an installed driver branch of R510 or earlier to R515 or later:

  1. Update to the latest DGX EL8 to get NVSM 22.09.3 or higher:

    $ sudo dnf update -y

  2. Switch to the R515+ driver branch by following the steps in the DGX Software on Red Hat Enterprise Linux 8 Installation Guide.

  3. Exclude installing DCGM from the DGX EL8 repo:


    $ grep -q 'exclude=datacenter-gpu-manager' /etc/yum.repos.d/nvidia.repo || sudo sed -i '/priority=40/a exclude=datacenter-gpu-manager' /etc/yum.repos.d/nvidia.repo

  4. Install the latest DCGM 3.x from the CUDA repo:

    $ sudo dnf install datacenter-gpu-manager
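
The exclusion in step 3 uses a check-then-append pattern so that running it more than once does not add a second exclude line. The following sketch demonstrates that behavior on a scratch file rather than on /etc/yum.repos.d/nvidia.repo. The function name exclude_dcgm and the repo section name in the sample file are hypothetical, and the one-line form of the sed append command assumes GNU sed, as shipped with RHEL.

```shell
# Sketch only: exclude_dcgm is a hypothetical wrapper around the step 3 command.
# $1 is the repo file to edit.
exclude_dcgm() {
    # Append the exclude line after priority=40 only if it is not present yet.
    grep -q 'exclude=datacenter-gpu-manager' "$1" ||
        sed -i '/priority=40/a exclude=datacenter-gpu-manager' "$1"
}

# Demonstrate on a scratch file; the section name below is made up.
repo=$(mktemp)
printf '[example-dgx-repo]\nenabled=1\npriority=40\n' > "$repo"

exclude_dcgm "$repo"
exclude_dcgm "$repo"   # the second run changes nothing

cat "$repo"
rm -f "$repo"
```

If a repo file contained several sections that each set priority=40, the sed expression would append the line after every match, so reviewing the file after the first run is worthwhile.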

Switching from R515+ to pre-R515

To switch from an installed driver branch of R515 or later to R510 or earlier:

  1. Switch to the pre-R515 driver branch using the following steps:

    https://docs.nvidia.com/dgx/dgx-rhel8-install-guide/changing-driver-branches.html#changing-driver-branches

  2. Allow DCGM to be installed from the DGX EL8 repo:


    $ sudo sed -i '/exclude=datacenter-gpu-manager/d' /etc/yum.repos.d/nvidia.repo


  3. Downgrade to the latest DCGM 2.x from the DGX EL8 repo:


    $ sudo dnf downgrade datacenter-gpu-manager

© Copyright 2022-2023, NVIDIA. Last updated on Apr 17, 2023.