Upgrading#

Between releases, NVIDIA and Red Hat provide OS updates as updated software packages that contain security mitigations and bug fixes.

Caution

If you plan to set up MIG configurations, upgrading the GPU driver to R570 on DGX Station A100 systems is not currently supported. For more information, see DGX Station A100 Fails to Boot After Applying MIG Configurations.

Important

Here is some important information you need to know before upgrading:

An in-place upgrade from Red Hat Enterprise Linux 8 to Red Hat Enterprise Linux 9 with the DGX software stack installed is not supported.
Before you install or perform the upgrade, refer to the Release Notes section for the latest Red Hat Enterprise Linux version, known issues, and workarounds.

To remain at the same RHEL release and prevent incompatibility between the Linux kernel and the GPU drivers, pin the RHEL release by using the subscription-manager release --set=<release> command. For example, the subscription-manager release --set=9.3 command pins the system to RHEL 9.3.
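Before pinning, it can help to confirm which release the system is actually running. The following is a minimal sketch rather than part of the official procedure: the helper name rhel_release_from_file is hypothetical, and it assumes the standard VERSION_ID field in /etc/os-release.

```shell
#!/bin/sh
# Hypothetical helper (not an NVIDIA or Red Hat tool): print VERSION_ID,
# for example 9.3, from an os-release style file.
rhel_release_from_file() {
    # Fall back to the standard location when no file is given.
    sed -n 's/^VERSION_ID="\{0,1\}\([^"]*\)"\{0,1\}$/\1/p' "${1:-/etc/os-release}"
}

# Typical use on a registered system (shown as comments, not executed here):
#   release=$(rhel_release_from_file)
#   sudo subscription-manager release --set="$release"
#   sudo subscription-manager release --show
```

Reading the value from the running system avoids pinning to a release that differs from the one the installed kernel and GPU driver were built against.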

Evaluate the available updates at regular intervals and update the system by using the sudo dnf update --nobest command.
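One way to evaluate updates without applying them is dnf check-update, which exits with status 100 when updates are available, 0 when there are none, and 1 on error. The sketch below maps those statuses to messages; the function name describe_check_update_status is hypothetical and exists only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: translate the exit status of `dnf check-update`
# into a short message. 0 = nothing to update, 100 = updates available,
# anything else is treated as a failed check.
describe_check_update_status() {
    case "$1" in
        0)   echo "up to date" ;;
        100) echo "updates available" ;;
        *)   echo "check failed" ;;
    esac
}

# Typical use (shown as comments, not executed here):
#   dnf check-update --quiet >/dev/null
#   describe_check_update_status "$?"
```

A check like this can run from a scheduled job so that the actual update is applied only during a planned maintenance window.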

For a list of the known Common Vulnerabilities and Exposures (CVEs), including those that can be resolved by updating the OS software, refer to the Red Hat Security Updates page.

Note

You are responsible for upgrading the software on the DGX system to install the updates from these sources.

If updates are available, you can obtain the package upgrades by running:

sudo dnf update --nobest

Upgrades to the NVIDIA Graphics Drivers for Linux require a restart to complete the kernel upgrade. If you upgrade the NVIDIA Graphics Drivers for Linux without restarting the DGX system, the nvidia-smi command displays an error message:

nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
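The mismatch message can also be detected from a script, for example to flag systems that still need a restart after a driver upgrade. This is a minimal sketch; the helper name driver_restart_pending is hypothetical, and it matches only the message text shown above.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit status 0) when the given nvidia-smi
# output contains the driver/library mismatch message.
driver_restart_pending() {
    printf '%s\n' "$1" | grep -q 'Driver/library version mismatch'
}

# Typical use (shown as comments, not executed here):
#   if driver_restart_pending "$(nvidia-smi 2>&1)"; then
#       echo "Restart the system to finish the driver upgrade."
#   fi
```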

Upgrading the OS and DGX Software#

This section describes how to upgrade your DGX system and, optionally, move to a different GPU driver branch.

Upgrading the Software without Moving to a New Driver Branch#

To upgrade your DGX system with the latest Red Hat Enterprise Linux updates, run the following command:

sudo dnf update -y --nobest

Updating the Software and Moving to a New Driver Branch on non-NVSwitch Systems#

This procedure applies to DGX-1, DGX Station, and DGX Station A100 systems.

Run the following commands to remove the current driver package and install the new driver package:

sudo dnf remove -y nv-persistence-mode libnvidia-nscq-<current driver version>

sudo dnf module remove --all -y nvidia-driver

sudo dnf module reset -y nvidia-driver

sudo dnf module install -y nvidia-driver:<new driver version>/{default,src}

sudo dnf install -y nv-persistence-mode libnvidia-nscq-<new driver version>

sudo dnf update -y --nobest

For DGX Station A100 only: install the additional required DGX Station A100 packages. These packages must be installed after the nvidia-driver module.
```
sudo dnf install nvidia-conf-xconfig nv-docker-gpus
```
Reboot the system.
```
sudo reboot
```

Updating the Software and Moving to a New Driver Branch on NVSwitch Systems#

This procedure applies to DGX-2, DGX A100, and DGX A800 systems.

Run the following commands to remove the current driver package and install the new driver package:

sudo dnf remove -y nv-persistence-mode libnvidia-nscq-<current driver version> nvidia-fm-enable

sudo dnf module remove --all -y nvidia-driver

sudo dnf module reset -y nvidia-driver

sudo dnf module install -y nvidia-driver:<new driver version>/{fm,src}

sudo dnf install -y nv-persistence-mode libnvidia-nscq-<new driver version> nvidia-fm-enable

sudo dnf update -y --nobest

Reboot the system.
```
sudo reboot
```
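The non-NVSwitch and NVSwitch procedures differ only in the module profile (default or fm) and in the nvidia-fm-enable package. The dry-run sketch below prints the command sequence for either case so it can be reviewed before anything is executed. The function name print_branch_change_commands and the branch numbers 535 and 550 are placeholders chosen for illustration, not recommended versions.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print, without running, the commands for a
# driver branch change.
#   $1 = current driver branch, $2 = new driver branch,
#   $3 = "nvswitch" or "non-nvswitch"
print_branch_change_commands() {
    cur=$1
    new=$2
    if [ "$3" = "nvswitch" ]; then
        profile=fm
        extra=" nvidia-fm-enable"
    else
        profile=default
        extra=""
    fi
    echo "sudo dnf remove -y nv-persistence-mode libnvidia-nscq-${cur}${extra}"
    echo "sudo dnf module remove --all -y nvidia-driver"
    echo "sudo dnf module reset -y nvidia-driver"
    echo "sudo dnf module install -y nvidia-driver:${new}/{${profile},src}"
    echo "sudo dnf install -y nv-persistence-mode libnvidia-nscq-${new}${extra}"
    echo "sudo dnf update -y --nobest"
}

# Example with placeholder branch numbers:
print_branch_change_commands 535 550 nvswitch
```

The sketch does not cover the additional DGX Station A100 packages or the final reboot; those steps still need to be performed as described in the procedures.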

Changing Only the NVIDIA Driver Branch#

To switch driver branches, you must remove the existing branch before you install the new branch.

Remove and clear the existing stream:

sudo dnf module remove --all nvidia-driver

sudo dnf module reset nvidia-driver

Follow the Install the NVIDIA CUDA driver step in Installing the GPU Driver to install the new driver branch.
If the nvidia-peer-memory-dkms driver is installed, it must be reinstalled to match the new driver branch:
```
sudo dnf reinstall -y nvidia-peer-memory-dkms
```

Installing or Upgrading to a Newer CUDA Toolkit Release#

Important

Before you install or upgrade to any CUDA Toolkit release, ensure the release is compatible with the driver that is installed on the system. Refer to CUDA Compatibility for more information and a compatibility matrix.

The CUDA Toolkit is not installed by default. You can manually install a qualified CUDA Toolkit release.

All CUDA Toolkit releases that interoperate with the installed GPU driver are supported. Refer to the release notes for the current CUDA Toolkit release versions.

Checking the Currently Installed CUDA Toolkit Release#

This section describes how to determine which CUDA Toolkit release is currently installed.

Important

The CUDA Toolkit is not installed on DGX servers by default. If you run the following command on a system without the CUDA Toolkit, no installed packages are listed.

To check the currently installed release before you install a new CUDA Toolkit release, run the following command:

sudo dnf list installed "cuda-toolkit-*"

The following output shows that CUDA Toolkit 12.0 is installed:

Updating Subscription Management repositories.

Installed Packages
cuda-toolkit-12-0.x86_64 12.0.0-1 @CUDA
cuda-toolkit-12-0-config-common.noarch 12.0.107-1 @CUDA
cuda-toolkit-12-config-common.noarch 12.0.107-1 @CUDA
cuda-toolkit-config-common.noarch

Determining the New Available CUDA Toolkit Releases#

To see which CUDA Toolkit releases are available, run the following command:

sudo dnf search "cuda-toolkit-*"
Updating Subscription Management repositories.
Last metadata expiration check: 1:47:39 ago on Wed 18 Jan 2023 08:10:38 AM PST.
======================================================= Name Matched: cuda-toolkit-* =======================================================
cuda-toolkit-11-7.x86_64 : CUDA Toolkit 11.7 meta-package
cuda-toolkit-11-7-config-common.noarch : Common config package for CUDA Toolkit 11.7.
cuda-toolkit-11-8.x86_64 : CUDA Toolkit 11.8 meta-package
cuda-toolkit-11-8-config-common.noarch : Common config package for CUDA Toolkit 11.8.
cuda-toolkit-11-config-common.noarch : Common config package for CUDA Toolkit 11.
cuda-toolkit-12-0.x86_64 : CUDA Toolkit 12.0 meta-package
cuda-toolkit-12-0-config-common.noarch : Common config package for CUDA Toolkit 12.0.
cuda-toolkit-12-config-common.noarch : Common config package for CUDA Toolkit 12.
cuda-toolkit-config-common.noarch : Common config package for CUDA Toolkit.

The output shows that 11.7, 11.8, and 12.0 are the possible CUDA Toolkit versions that can be installed.
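The version list can be derived from the search output by keeping only the meta-package lines and ignoring the config-common packages. The following is a minimal sketch under the assumption that the output has the format shown above; the function name list_cuda_toolkit_versions is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read `dnf search "cuda-toolkit-*"` output on stdin
# and print the versions of the meta-packages, one per line.
list_cuda_toolkit_versions() {
    # Match names such as "cuda-toolkit-11-8.x86_64 : ..." and skip the
    # "-config-common" packages, which have extra text before the dot.
    sed -n 's/^cuda-toolkit-\([0-9][0-9]*\)-\([0-9][0-9]*\)\.[^ ]* .*/\1.\2/p' |
        sort -u -t. -k1,1n -k2,2n
}

# Typical use (shown as comments, not executed here):
#   dnf search "cuda-toolkit-*" 2>/dev/null | list_cuda_toolkit_versions
```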

Installing the CUDA Toolkit or Upgrading Your CUDA Toolkit to a Newer Release#

You can install or upgrade your CUDA Toolkit to a newer release.

To install or upgrade the CUDA Toolkit, run the following command:

sudo dnf install cuda-toolkit-12-0

Note

Version 12.0 is shown as an example; replace the value with the version you want to install.
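Package names use a hyphen where the version number uses a dot (12.0 becomes cuda-toolkit-12-0). A small sketch of that conversion follows; the helper name cuda_toolkit_package is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: turn a CUDA Toolkit version such as 12.0 into the
# matching meta-package name, cuda-toolkit-12-0.
cuda_toolkit_package() {
    printf 'cuda-toolkit-%s\n' "$(printf '%s' "$1" | tr . -)"
}

# Typical use (shown as a comment, not executed here):
#   sudo dnf install "$(cuda_toolkit_package 12.0)"
```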

Upgrading DCGM to a More Recent Major Release#

To upgrade datacenter-gpu-manager (DCGM) to a new major version (for instance, from DCGM 3.x.x to DCGM 4.x.x), you must remove any current installations of the datacenter-gpu-manager and datacenter-gpu-manager-config packages and then install the more recent release of these packages as instructed in the Installation section of the NVIDIA® Data Center GPU Manager (DCGM) User Guide.

Installing GPUDirect Storage Support#

NVIDIA® Magnum IO GPUDirect® Storage (GDS) enables a direct data path for direct memory access (DMA) transfers between GPU memory and storage, which avoids a bounce buffer through the CPU.

Installing nvidia-gds#

To use GDS, perform the following steps:

Populate the ${NVIDIA_DRV_VERSION} variable.

Install nvidia-gds with the correct dependencies:

sudo dnf install nvidia-gds-<ver> nvidia-dkms-${NVIDIA_DRV_VERSION}-server

Use the CUDA Toolkit version number in place of <ver>; for example, 12-0.
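The steps above do not show how to populate NVIDIA_DRV_VERSION. One possible approach, offered as a sketch rather than the documented method, is to read the major version of the loaded driver from /proc/driver/nvidia/version. The helper name nvidia_driver_major_version is hypothetical, and the sketch assumes the driver is loaded and that the file contains the usual "NVRM version:" line.

```shell
#!/bin/sh
# Hypothetical helper: print the major version of the loaded NVIDIA driver
# (for example 535) from a /proc/driver/nvidia/version style file.
nvidia_driver_major_version() {
    # On the "NVRM version:" line, take the first field that looks like a
    # dotted version number and print the part before the first dot.
    awk '/^NVRM version:/ {
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^[0-9]+\.[0-9]+/) {
                split($i, v, ".")
                print v[1]
                exit
            }
        }
    }' "${1:-/proc/driver/nvidia/version}"
}

# Typical use (shown as comments, not executed here):
#   NVIDIA_DRV_VERSION=$(nvidia_driver_major_version)
#   sudo dnf install nvidia-gds-12-0 "nvidia-dkms-${NVIDIA_DRV_VERSION}-server"
```

If the function prints nothing, the driver is probably not loaded, and the variable should be set manually to the installed driver branch instead.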