Release Notes Overview
This document provides important release-specific considerations for upgrading the NVIDIA DGX™ Software Stack on your DGX system that runs the Red Hat Enterprise Linux 8 operating system.
Current Versions
The following are the current versions available.
Product | Current Version
---|---
DGX Station |
DGX-2, DGX-1 |
DGX A100 |
DGX A800 |
DGX H100 |
DGX Station A100 |
DGX Station A800 |
Installing the DGX Software Stack on Red Hat Enterprise Linux 8
Caution
A dnf update might install a Linux kernel that is incompatible with the currently installed MLNX_OFED networking drivers. To prevent this issue, do either of the following:
- Before upgrading the Linux kernel, run dnf update --setopt tsflags=test to see which kernel version the update would install, without actually upgrading, and compare it with the kernel versions that your NVIDIA MLNX_OFED release supports.
- When upgrading the Linux kernel, consider upgrading MLNX_OFED to a version that supports the new kernel.
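As a rough companion to the dry-run check, the sketch below interprets the exit status of `dnf check-update kernel`, which dnf documents as 0 when no update is pending, 100 when updates are available, and 1 on error. The helper name `classify_check_update` is hypothetical and is not part of the DGX software; the commented commands are one possible way to use it, not a prescribed procedure.

```shell
#!/usr/bin/env bash
# Sketch only: classify the exit status of `dnf check-update`.
# dnf documents 0 = no updates, 100 = updates available, 1 = error.
# classify_check_update is a hypothetical helper, not part of the DGX stack.
classify_check_update() {
  case "$1" in
    0)   echo "no-update" ;;
    100) echo "update-available" ;;
    *)   echo "error" ;;
  esac
}

# Possible use on a DGX system (commented out so the sketch has no side effects):
#   dnf check-update kernel; classify_check_update "$?"
#   uname -r                                  # kernel currently running
#   sudo dnf update --setopt tsflags=test     # dry run of the full update
```

If the result is `update-available`, compare the candidate kernel version with the kernels supported by the installed MLNX_OFED release before running the real update.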
Important
Before you install or perform the upgrade, refer to the EL8-23.08 section for information about the latest release.
To install the software on a fresh DGX system, see the DGX Software for Red Hat Enterprise Linux 8 - Installation Guide.
Upgrading the DGX Software Stack and Red Hat Enterprise Linux 8
This section provides information about how to update your DGX system while remaining on the same GPU driver branch and how to update your DGX system while switching to a different GPU driver branch.
Important
Here is some important information you need to know before upgrading:
- An in-place upgrade from RHEL 7 to RHEL 8 with the DGX software stack installed is not supported.
- Before you install or perform the upgrade, refer to the section in these release notes for the latest RHEL version.
- Upgrading to a different driver package can result in the server failing to boot. Follow the instructions to first uninstall the current driver.
- Ensure that you are prepared to restore the GRUB_CMDLINE_LINUX setting as directed in the instructions in this section.
Upgrading the Software without Moving to a New Driver Branch
To update your DGX system with the latest RHEL 8 updates, run the following command:
sudo dnf update -y --nobest
Upgrading and Moving to a New Driver Branch on non-NVSwitch Systems
This procedure applies to the DGX-1, DGX A100/A800, DGX Station A100, and DGX Station A800 systems.
Important
Before you install or perform the upgrade, refer to the EL8-23.08 section for information about the latest release.
1. Preserve the GRUB_CMDLINE_LINUX setting.
   Note down the existing GRUB_CMDLINE_LINUX setting in the /etc/default/grub file.
   Example:
   GRUB_CMDLINE_LINUX="crashkernel=auto rd.md.uuid=09a9380c:87edd4b6:8f5d9bbc:45e834c7 rhgb quiet rd.driver.blacklist=nouveau"
   The rd.driver.blacklist=nouveau parameter was added when installing the driver and should not be included in the restoration.
2. Issue the following commands to remove the current driver package and install the new driver package.
   sudo dnf remove -y nv-persistence-mode libnvidia-nscq-<current driver version>
   sudo dnf module remove --all -y nvidia-driver
   sudo dnf module reset -y nvidia-driver
   sudo dnf module install -y nvidia-driver:<new driver version>/{default,src}
   sudo dnf install -y nv-persistence-mode libnvidia-nscq-<new driver version>
   sudo dnf update -y --nobest
3. Restore the GRUB_CMDLINE_LINUX setting.
   In the /etc/default/grub file, remove any extra instances of GRUB_CMDLINE_LINUX and manually edit the file to restore the original setting (except for the blacklist parameter).
   Example:
   GRUB_CMDLINE_LINUX="crashkernel=auto rd.md.uuid=09a9380c:87edd4b6:8f5d9bbc:45e834c7 rhgb quiet"
4. Reboot the system.
   sudo reboot
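To illustrate which value gets restored into GRUB_CMDLINE_LINUX, the following sketch takes a recorded line and drops the rd.driver.blacklist=nouveau parameter from it. The helper name `strip_nouveau_blacklist` is hypothetical; the sample line is the example from this procedure, and the sketch only prints text without editing /etc/default/grub.

```shell
#!/usr/bin/env bash
# Sketch only: derive the line to restore from a recorded GRUB_CMDLINE_LINUX line.
# strip_nouveau_blacklist is a hypothetical helper, not part of the DGX stack.
strip_nouveau_blacklist() {
  # Remove the parameter together with the single space before it, if any.
  printf '%s\n' "$1" | sed -E 's/ ?rd\.driver\.blacklist=nouveau//'
}

# The line could be recorded beforehand with:
#   grep '^GRUB_CMDLINE_LINUX=' /etc/default/grub
saved='GRUB_CMDLINE_LINUX="crashkernel=auto rd.md.uuid=09a9380c:87edd4b6:8f5d9bbc:45e834c7 rhgb quiet rd.driver.blacklist=nouveau"'

# Print the line without the blacklist parameter.
strip_nouveau_blacklist "$saved"
```

The printed line matches the restoration example in this procedure and can be pasted into /etc/default/grub by hand.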
Updating and Moving to a New Driver Branch on NVSwitch Systems
This procedure applies to the DGX-2 or DGX A100/A800 systems.
Important
Before you install or perform the upgrade, refer to the EL8-23.08 section for information about the latest release.
1. Preserve the GRUB_CMDLINE_LINUX setting.
   Note the existing GRUB_CMDLINE_LINUX setting in the /etc/default/grub file.
   Example:
   GRUB_CMDLINE_LINUX="crashkernel=auto rd.md.uuid=09a9380c:87edd4b6:8f5d9bbc:45e834c7 rhgb quiet rd.driver.blacklist=nouveau"
   The rd.driver.blacklist=nouveau parameter was added when installing the driver and should not be included in the restoration.
2. Issue the following commands to remove the current driver package and install the new driver package.
   sudo dnf remove -y nv-persistence-mode nvidia-fm-enable
   sudo dnf module remove --all -y nvidia-driver
   sudo dnf module reset -y nvidia-driver
   sudo dnf module install -y nvidia-driver:<new driver version>/{fm,src}
   sudo dnf install -y nv-persistence-mode nvidia-fm-enable
   sudo dnf update -y --nobest
3. Restore the GRUB_CMDLINE_LINUX setting.
   In the /etc/default/grub file, remove any extra instances of GRUB_CMDLINE_LINUX and manually edit the file to restore the original setting (except for the blacklist parameter).
   Example:
   GRUB_CMDLINE_LINUX="crashkernel=auto rd.md.uuid=09a9380c:87edd4b6:8f5d9bbc:45e834c7 rhgb quiet"
4. Reboot the system.
   sudo reboot
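Before rebooting, it can help to confirm that only one GRUB_CMDLINE_LINUX assignment remains in /etc/default/grub. The sketch below counts those assignments in a given file. The helper name `count_cmdline_entries` is hypothetical, and the file contents in the demonstration are made up for illustration; the real file is not touched.

```shell
#!/usr/bin/env bash
# Sketch only: count GRUB_CMDLINE_LINUX= assignments in the file given as $1.
# count_cmdline_entries is a hypothetical helper, not part of the DGX stack.
# The trailing "=" keeps GRUB_CMDLINE_LINUX_DEFAULT= lines out of the count.
count_cmdline_entries() {
  grep -c '^GRUB_CMDLINE_LINUX=' "$1"
}

# Demonstration on a temporary file rather than the real /etc/default/grub.
demo=$(mktemp)
cat > "$demo" <<'EOF'
GRUB_TIMEOUT=5
GRUB_CMDLINE_LINUX="crashkernel=auto rhgb quiet"
GRUB_CMDLINE_LINUX="rd.driver.blacklist=nouveau"
EOF
echo "entries found: $(count_cmdline_entries "$demo")"
rm -f "$demo"
```

On a DGX system, `count_cmdline_entries /etc/default/grub` reporting more than 1 would indicate that extra instances still need to be removed.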
Switching GPU Driver between pre-R515 and post-R515
R510 and earlier GPU drivers depend on NSCQ v1, while R515 and later require NSCQ v2. This section provides the additional instructions needed when changing between driver branches that require different versions of NSCQ.
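The branch rule can be sketched as a small decision helper. The function names `nscq_major_for_branch` and `needs_nscq_switch` are hypothetical, and the sketch encodes only the rule stated here (below R515 means NSCQ v1, R515 and later means NSCQ v2); it does not query the installed driver.

```shell
#!/usr/bin/env bash
# Sketch only: map a driver branch to the NSCQ major version it needs.
# Both helpers are hypothetical and encode only the rule stated in the text.
nscq_major_for_branch() {
  local branch="${1#R}"          # accept "R515" or "515"
  if [ "$branch" -ge 515 ]; then
    echo 2
  else
    echo 1
  fi
}

# The extra steps apply only when the NSCQ major version changes.
needs_nscq_switch() {
  [ "$(nscq_major_for_branch "$1")" != "$(nscq_major_for_branch "$2")" ]
}

if needs_nscq_switch R470 R525; then
  echo "R470 -> R525: NSCQ version changes"
fi
```

A move between two branches on the same side of R515 leaves `needs_nscq_switch` false, so the regular driver branch procedure would be sufficient in that case.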
Switching from pre-R515 to R515+
To switch from an installed driver branch R510 or earlier to R515 or later:
1. Update to the latest DGX EL8 to get NVSM 22.09.3 or higher:
   $ sudo dnf update -y
2. Switch to the R515+ driver branch by following the steps in the DGX Software on Red Hat Enterprise Linux 8 Installation Guide.
3. Exclude DCGM from being installed from the DGX EL8 repo:
   $ grep -q 'exclude=datacenter-gpu-manager' /etc/yum.repos.d/nvidia.repo || sudo sed -i '/priority=40/a exclude=datacenter-gpu-manager' /etc/yum.repos.d/nvidia.repo
4. Install the latest DCGM 3.x from the CUDA repo:
   $ sudo dnf install datacenter-gpu-manager
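The grep and sed pair in the exclude step is written so that running it more than once adds the exclude line only once. The sketch below demonstrates that behavior on a scratch file. The wrapper name `ensure_dcgm_exclude` is hypothetical, the repo stanza is made up for illustration, and the one-line form of the sed `a` command assumes GNU sed, which RHEL 8 ships.

```shell
#!/usr/bin/env bash
# Sketch only: idempotently add the DCGM exclude line to the repo file in $1.
# ensure_dcgm_exclude is a hypothetical wrapper around the grep/sed pair.
# Assumes GNU sed (one-line "a" command and -i without a suffix).
ensure_dcgm_exclude() {
  grep -q 'exclude=datacenter-gpu-manager' "$1" ||
    sed -i '/priority=40/a exclude=datacenter-gpu-manager' "$1"
}

# Demonstration on a scratch file; this stanza is invented for the example.
repo=$(mktemp)
printf '[example-repo]\nenabled=1\npriority=40\n' > "$repo"
ensure_dcgm_exclude "$repo"
ensure_dcgm_exclude "$repo"      # second call finds the line and changes nothing
cat "$repo"
rm -f "$repo"
```

Against the real /etc/yum.repos.d/nvidia.repo the sed command needs sudo, as in the step itself.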
Switching from R515+ to pre-R515
To switch from an installed driver branch R515 or later to R510 or earlier:
1. Switch to the pre-R515 driver branch by following the steps at https://docs.nvidia.com/dgx/dgx-rhel8-install-guide/changing-driver-branches.html#changing-driver-branches
2. Allow DCGM to be installed from the DGX EL8 repo:
   $ sudo sed -i '/exclude=datacenter-gpu-manager/d' /etc/yum.repos.d/nvidia.repo
3. Downgrade to the latest DCGM 2.x from the DGX EL8 repo:
   $ sudo dnf downgrade datacenter-gpu-manager
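After the install or downgrade, one way to confirm that the expected DCGM major version (3.x or 2.x) is in place is to extract the major component of the package version. The helper name `dcgm_major` is hypothetical, and the version string in the demonstration is illustrative rather than the result of a real query.

```shell
#!/usr/bin/env bash
# Sketch only: print the major component of a version string such as "3.1.8"
# or an RPM-style "1:3.1.8". dcgm_major is a hypothetical helper.
dcgm_major() {
  local v="${1#*:}"     # drop an optional "epoch:" prefix
  echo "${v%%.*}"       # keep everything before the first dot
}

# On a DGX system the installed version could be read with:
#   rpm -q --queryformat '%{VERSION}\n' datacenter-gpu-manager
dcgm_major "2.4.7"      # illustrative version string, not a real query result
```

A result of 2 would be expected after switching to a pre-R515 branch, and 3 after switching to R515 or later.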