Upgrading

Between releases, NVIDIA and Red Hat provide updates to the OS in the form of updated software packages that contain security mitigations and bug fixes.

Important

Here is some important information you need to know before upgrading:

  • An in-place upgrade from Red Hat Enterprise Linux 8 to Red Hat Enterprise Linux 9 with the DGX software stack installed is not supported.

  • Before you install or perform the upgrade, refer to the Release Notes section for the latest Red Hat Enterprise Linux version, known issues, and workarounds.

    To remain at the same RHEL release and prevent incompatibility between Linux kernel and GPU drivers, pin the RHEL release by using the subscription-manager release --set=<release> command. For example, the subscription-manager release --set=9.3 command ties the system to RHEL 9.3.
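
    For example, a minimal sketch that pins the release and then confirms the lock (9.3 is an illustrative value; use the release that your DGX software stack supports):

    # Pin the system to a specific RHEL minor release (illustrative value)
    sudo subscription-manager release --set=9.3

    # Confirm which release the system is pinned to
    sudo subscription-manager release --show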

Evaluate the available updates at regular intervals and update the system by using the sudo dnf update --nobest command.
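
To see which updates are available before applying them, you can run a check first; for example:

# List available package updates without installing them
sudo dnf check-update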

For a list of the known Common Vulnerabilities and Exposures (CVEs), including those that can be resolved by updating the OS software, refer to the Red Hat Security Updates page.

Note

You are responsible for upgrading the software on the DGX system to install the updates from these sources.

If updates are available, you can obtain the package upgrades by running:

sudo dnf update --nobest

Upgrades to the NVIDIA Graphics Drivers for Linux require a restart to load the updated kernel modules. If you upgrade the NVIDIA Graphics Drivers for Linux without restarting the DGX system, running the nvidia-smi command displays an error message:

nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
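
If you see this error, you can compare the driver version that is loaded in the kernel with the version that is installed on disk; a minimal sketch (the /proc entry exists only while the previous driver is still loaded):

# Driver version currently loaded in the kernel
cat /proc/driver/nvidia/version

# Driver version installed on disk
modinfo -F version nvidia

A reboot loads the updated kernel modules and resolves the mismatch.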

Upgrading the OS and DGX Software

This section provides information for upgrading your DGX system and optionally upgrading to a different GPU branch.

Upgrading the Software without Moving to a New Driver Branch

To apply the latest Red Hat Enterprise Linux updates to your DGX system, run the following command:

sudo dnf update -y --nobest

Updating the Software and Moving to a New Driver Branch on non-NVSwitch Systems

This procedure applies to DGX-1, DGX Station, and DGX Station A100 systems.
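
Before you change branches, you can check which nvidia-driver module stream is currently enabled; for example:

# List the available nvidia-driver streams and show which one is enabled
sudo dnf module list nvidia-driver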

  1. Run the following commands to remove the current driver package and install the new driver package:

sudo dnf remove -y nv-persistence-mode libnvidia-nscq-<current driver version>
sudo dnf module remove --all -y nvidia-driver
sudo dnf module reset -y nvidia-driver
sudo dnf module install -y nvidia-driver:<new driver version>/{default,src}
sudo dnf install -y nv-persistence-mode libnvidia-nscq-<new driver version>
sudo dnf update -y --nobest
  2. For DGX Station A100 only: install the additional required DGX Station A100 packages. These packages must be installed after the nvidia-driver module.

    sudo dnf install nvidia-conf-xconfig nv-docker-gpus
    
  3. Reboot the system.

    sudo reboot
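
After the system restarts, you can confirm that the new driver branch is active; a minimal check:

# The reported driver version should match the new driver branch
nvidia-smi --query-gpu=driver_version --format=csv,noheader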
    

Updating the Software and Moving to a New Driver Branch on NVSwitch Systems

This procedure applies to DGX-2, DGX A100, and DGX A800 systems.

  1. Run the following commands to remove the current driver package and install the new driver package:

sudo dnf remove -y nv-persistence-mode libnvidia-nscq-<current driver version> nvidia-fm-enable
sudo dnf module remove --all -y nvidia-driver
sudo dnf module reset -y nvidia-driver
sudo dnf module install -y nvidia-driver:<new driver version>/{fm,src}
sudo dnf install -y nv-persistence-mode libnvidia-nscq-<new driver version> nvidia-fm-enable
sudo dnf update -y --nobest
  2. Reboot the system.

    sudo reboot
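
After the system restarts, you can confirm that the new driver branch is active and that the fabric manager started on the NVSwitch system; a minimal sketch (the service name shown assumes the standard nvidia-fabricmanager unit):

# The reported driver version should match the new driver branch
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# Confirm that the NVSwitch fabric manager service is running
systemctl status nvidia-fabricmanager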
    

Changing only the NVIDIA Driver Branch

To switch driver branches, you must first remove the existing branch before installing the new branch:

  1. Remove and clear the existing stream:

    sudo dnf module remove --all nvidia-driver
    sudo dnf module reset nvidia-driver
    
  2. Follow the “Install NVIDIA CUDA driver” section to install the new driver branch.

  3. If the nvidia-peer-memory-dkms driver is installed, it must be reinstalled to match the new driver branch:

    sudo dnf reinstall -y nvidia-peer-memory-dkms
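
After the reinstallation, you can verify that the peer memory module was rebuilt against the new driver branch; a hedged sketch (the kernel module name varies by package version, so both common names are checked):

# Show the rebuild status of DKMS modules for the running kernel
dkms status

# Check whether a peer memory module is loaded
lsmod | grep -i -e nv_peer_mem -e nvidia_peermem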
    

Installing or Upgrading to a Newer CUDA Toolkit Release

Important

Before you install or upgrade to any CUDA Toolkit release, ensure the release is compatible with the driver that is installed on the system. Refer to CUDA Compatibility for more information and a compatibility matrix.

Only DGX Station and DGX Station A100 have a CUDA Toolkit release installed by default. DGX servers are intended to be shared resources that use containers and do not have CUDA Toolkit installed by default. However, you have the option to install a qualified CUDA Toolkit release.

Although all CUDA Toolkit releases that interoperate with the installed driver are supported, DGX releases might include a default CUDA Toolkit release that is not the most recently released version. Unless you need a newer CUDA Toolkit version for its features, we recommend that you remain on the default version that is included in the DGX RHEL 9 release. Refer to the Release Notes for the default CUDA Toolkit release.

Checking the Currently Installed CUDA Toolkit Release

Before you install a new CUDA Toolkit release, determine which CUDA Toolkit release is currently installed.

Important

The CUDA Toolkit is not installed on DGX servers by default. If you run the following command on a DGX server, no installed package is listed.

To check the currently installed release, run the following command:

sudo dnf list installed "cuda-toolkit-*"

The following output shows that CUDA Toolkit 12.0 is installed:

Updating Subscription Management repositories.

Installed Packages
cuda-toolkit-12-0.x86_64 12.0.0-1 @CUDA
cuda-toolkit-12-0-config-common.noarch 12.0.107-1 @CUDA
cuda-toolkit-12-config-common.noarch 12.0.107-1 @CUDA
cuda-toolkit-config-common.noarch

Determining the New Available CUDA Toolkit Releases

To determine which new CUDA Toolkit releases are available, run the following command:

sudo dnf search "cuda-toolkit-*"
Updating Subscription Management repositories.
Last metadata expiration check: 1:47:39 ago on Wed 18 Jan 2023 08:10:38 AM PST.
======================================================= Name Matched: cuda-toolkit-* =======================================================
cuda-toolkit-11-7.x86_64 : CUDA Toolkit 11.7 meta-package
cuda-toolkit-11-7-config-common.noarch : Common config package for CUDA Toolkit 11.7.
cuda-toolkit-11-8.x86_64 : CUDA Toolkit 11.8 meta-package
cuda-toolkit-11-8-config-common.noarch : Common config package for CUDA Toolkit 11.8.
cuda-toolkit-11-config-common.noarch : Common config package for CUDA Toolkit 11.
cuda-toolkit-12-0.x86_64 : CUDA Toolkit 12.0 meta-package
cuda-toolkit-12-0-config-common.noarch : Common config package for CUDA Toolkit 12.0.
cuda-toolkit-12-config-common.noarch : Common config package for CUDA Toolkit 12.
cuda-toolkit-config-common.noarch : Common config package for CUDA Toolkit.

The output shows that 11.7, 11.8, and 12.0 are the possible CUDA Toolkit versions that can be installed.

Installing the CUDA Toolkit or Upgrading Your CUDA Toolkit to a Newer Release

You can install or upgrade your CUDA Toolkit to a newer release.

To install or upgrade the CUDA Toolkit, run the following command:

sudo dnf install cuda-toolkit-12-0

Note

Version 12.0 is shown as an example; replace the value with the version that you want to install.
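
After the installation completes, you can confirm the installed toolkit version; a minimal check (the toolkit installs under a versioned directory such as /usr/local/cuda-12.0, which is typically linked from /usr/local/cuda):

# Query the version of the installed CUDA compiler
/usr/local/cuda/bin/nvcc --version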

Installing GPUDirect Storage Support

NVIDIA® Magnum IO GPUDirect® Storage (GDS) enables a direct data path for direct memory access (DMA) transfers between GPU memory and storage, which avoids a bounce buffer through the CPU.

Installing nvidia-gds

To use GDS, perform the following steps:

  1. Populate the ${NVIDIA_DRV_VERSION} variable (see the example after this procedure).

  2. Install nvidia-gds with the correct dependencies:

    sudo dnf install nvidia-gds-<ver> nvidia-dkms-${NVIDIA_DRV_VERSION}-server
    

Use the CUDA Toolkit version number in place of <ver>; for example, 12-0.
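
As an example of step 1, one way to populate the variable is to derive the driver branch from the installed kernel module; the check at the end is optional and assumes the gdscheck.py tool that ships with the CUDA toolkit GDS tools is present at its default location:

# Derive the driver branch (for example, "550" from "550.54.15", an
# illustrative version number) from the installed nvidia kernel module
NVIDIA_DRV_VERSION=$(modinfo -F version nvidia | cut -d . -f 1)
echo "${NVIDIA_DRV_VERSION}"

# Optionally verify GDS support after installation (path may vary by release)
python3 /usr/local/cuda/gds/tools/gdscheck.py -p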