Upgrading#
Between releases, NVIDIA and Red Hat provide OS updates in the form of updated software packages that contain security mitigations and bug fixes.
Important
Here is some important information you need to know before upgrading:
An in-place upgrade from Red Hat Linux 8 to Red Hat Linux 9 with the DGX software stack installed is not supported.
Before you install or perform the upgrade, refer to the Release Notes section for the latest Red Hat Linux version, known issues, and workarounds.
To remain at the same RHEL release and prevent incompatibility between the Linux kernel and GPU drivers, pin the RHEL release by using the
subscription-manager release --set=<release> command. For example, the subscription-manager release --set=9.7 command ties the system to RHEL 9.7.
You should evaluate the available updates at regular intervals and update the system by using the
sudo dnf update --nobest command.
For a list of the known Common Vulnerabilities and Exposures (CVEs), including those that can be resolved by updating the OS software, refer to Red Hat Security Updates.
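As part of that regular evaluation, the dnf updateinfo plugin can list pending security advisories. The following sketch counts Red Hat Security Advisory (RHSA) lines; the sample text is illustrative and stands in for the live output of sudo dnf updateinfo list security:

```shell
# Sketch: count pending Red Hat Security Advisories. On a live system the
# sample text would come from: sudo dnf updateinfo list security
count_rhsa() {
  grep -c '^RHSA-'
}
# Illustrative sample output (not from a real system):
sample="RHSA-2024:1234 Important/Sec. kernel-core-0:5.14.0-362.el9.x86_64
RHSA-2024:5678 Moderate/Sec.  openssl-1:3.0.7-25.el9.x86_64"
printf '%s\n' "$sample" | count_rhsa
```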
Note
You are responsible for upgrading the software on the DGX system to install the updates from these sources.
Upgrading the OS and DGX Software#
To upgrade your DGX system with the latest Red Hat Linux upgrades, run the following command:
sudo dnf update -y --nobest
Note
This will upgrade installed software without upgrading the NVIDIA GPU Driver.
To ensure the packages provided by DGX software are current, install the current DGX Tools and Configuration Files as instructed in Step 4 of Installing Required Components.
For example, if you are upgrading DGX software on a DGX H100, install DGX H100 Configurations:
sudo dnf group install -y 'DGX H100 Configurations'
Upgrading the NVIDIA Graphics Drivers for Linux#
To upgrade the NVIDIA GPU Driver, follow the instructions in one of the next three sections, as appropriate for your environment.
After you upgrade the NVIDIA GPU Driver, a system restart is required to complete the kernel upgrade.
If you upgrade the NVIDIA GPU Driver without restarting the DGX system, running
the nvidia-smi command may display an error message similar to the following:
nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
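One way to confirm such a mismatch is to compare the version of the loaded kernel module with the version of the module installed on disk. This is a sketch, assuming modinfo is available and the nvidia module is present; /sys/module/nvidia/version reflects the currently loaded module:

```shell
# Sketch: compare the running kernel-module version with the on-disk one.
# /sys/module/nvidia/version reflects the loaded module; modinfo reads the
# module that will load on the next boot.
versions_match() {
  [ -n "$1" ] && [ "$1" = "$2" ]
}
loaded=$(cat /sys/module/nvidia/version 2>/dev/null)
ondisk=$(modinfo -F version nvidia 2>/dev/null)
if versions_match "$loaded" "$ondisk"; then
  echo "driver versions match: $loaded"
else
  echo "mismatch or driver not loaded (loaded='$loaded', on-disk='$ondisk'); reboot the system"
fi
```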
Updating the Software and Moving to a New NVIDIA GPU Driver Branch on non-NVSwitch Systems#
This procedure applies to DGX Station A100 systems.
Uninstall the NVIDIA GPU Driver and associated packages:
sudo dnf remove -y nv-persistence-mode libnvidia-nscq-<current driver version>
sudo dnf module remove --all -y nvidia-driver
sudo dnf module reset -y nvidia-driver
Install the desired branch of the NVIDIA GPU Driver by following the instructions in Installing the GPU Driver.
Reboot the system.
sudo reboot
Updating the Software and Moving to a New NVIDIA GPU Driver Branch on NVSwitch Systems#
This procedure applies to NVSwitch systems, such as DGX B200, DGX B300, DGX H200, DGX H100, DGX A100, and DGX A800.
Uninstall the NVIDIA GPU Driver and associated packages:
sudo dnf remove -y nv-persistence-mode libnvidia-nscq-<current driver version> nvidia-fm-enable
On NVSwitch systems with fifth-generation NVLink, such as DGX B200 and DGX B300, also run:
sudo dnf remove -y nvlsm
sudo dnf module remove --all -y nvidia-driver
sudo dnf module reset -y nvidia-driver
Install the desired branch of the NVIDIA GPU Driver by following the instructions in Installing the GPU Driver.
Reboot the system.
sudo reboot
Changing Only the NVIDIA GPU Driver Branch#
To switch driver branches, you must first remove the existing branch before installing the new branch:
Remove and clear the existing stream:
sudo dnf module remove --all nvidia-driver
sudo dnf module reset nvidia-driver
Follow the Install the NVIDIA CUDA driver step in Installing the GPU Driver to install the new driver branch.
If the nvidia-peer-memory-dkms driver is installed, it must be reinstalled to match the new driver branch:
sudo dnf reinstall -y nvidia-peer-memory-dkms
Installing or Upgrading to a Newer CUDA Toolkit Release#
Important
Before you install or upgrade to any CUDA Toolkit release, ensure the CUDA Toolkit release is compatible with the driver that is installed on the system. Refer to the NVIDIA CUDA Compatibility documentation, which provides additional compatibility information and a compatibility matrix.
The CUDA Toolkit is not installed by default. You can manually install a qualified CUDA Toolkit release.
All CUDA Toolkit releases that interoperate with the installed GPU driver are supported. Refer to the release notes for the current CUDA Toolkit release versions.
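As a quick sanity check before installing a toolkit release, you can compare the installed driver version against the minimum driver version that the release requires. This is a sketch, not the authoritative compatibility matrix; the required_driver value below is an illustrative placeholder, and the real requirement comes from the NVIDIA CUDA Compatibility documentation:

```shell
# Sketch: version-aware greater-or-equal comparison using sort -V.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}
installed_driver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null | head -n1)
required_driver="525.60.13"   # hypothetical minimum; check the compatibility matrix
if [ -n "$installed_driver" ] && version_ge "$installed_driver" "$required_driver"; then
  echo "driver $installed_driver satisfies minimum $required_driver"
fi
```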
Checking the Currently Installed CUDA Toolkit Release#
Before you upgrade, determine the CUDA Toolkit release that you currently have installed.
Important
The CUDA Toolkit is not installed on DGX servers by default; if you run the following command on such a system, no installed packages will be listed.
Before you install a new CUDA Toolkit release, check the currently installed release by running the following command:
sudo dnf list installed "cuda-toolkit-*"
The output from this command indicates that the CUDA Toolkit is installed and shows the installed CUDA Toolkit version. For example:
Updating Subscription Management repositories.
Installed Packages
cuda-toolkit-13-0.x86_64 <version> @CUDA
cuda-toolkit-13-0-config-common.noarch <version> @CUDA
cuda-toolkit-13-config-common.noarch <version> @CUDA
cuda-toolkit-config-common.noarch <version> @CUDA
Determining the New Available CUDA Toolkit Releases#
To determine which new CUDA Toolkit releases are available, run the following command:
sudo dnf search "cuda-toolkit-*"
The output shows CUDA Toolkit versions that can be installed.
...
=========================== Name Matched: cuda-toolkit-* ===========================
cuda-toolkit-11-7.x86_64 : CUDA Toolkit 11.7 meta-package
cuda-toolkit-11-7-config-common.noarch : Common config package for CUDA Toolkit 11.7.
...
cuda-toolkit-12-9.x86_64 : CUDA Toolkit 12.9 meta-package
cuda-toolkit-12-9-config-common.noarch : Common config package for CUDA Toolkit 12.9.
cuda-toolkit-12-config-common.noarch : Common config package for CUDA Toolkit 12.
cuda-toolkit-13.x86_64 : CUDA Toolkit 13 meta-package
cuda-toolkit-13-0.x86_64 : CUDA Toolkit 13.0 meta-package
cuda-toolkit-13-0-config-common.noarch : Common config package for CUDA Toolkit 13.0.
cuda-toolkit-13-1.x86_64 : CUDA Toolkit 13.1 meta-package
cuda-toolkit-13-1-config-common.noarch : Common config package for CUDA Toolkit 13.1.
cuda-toolkit-13-config-common.noarch : Common config package for CUDA Toolkit 13.
cuda-toolkit-config-common.noarch : Common config package for CUDA Toolkit.
Installing the CUDA Toolkit or Upgrading Your CUDA Toolkit to a Newer Release#
You can install or upgrade your CUDA Toolkit to a newer release. The CUDA Toolkit version must be compatible with the NVIDIA GPU Driver version. Compatible CUDA Toolkit and NVIDIA GPU Driver versions are listed in the Release Notes.
To install or upgrade the CUDA Toolkit, run the following command:
sudo dnf install cuda-toolkit-13-0
Note
Version 13.0 is shown as an example; replace the value with the version you want to install.
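After installation, you can verify the toolkit by parsing the release number from nvcc. This is a sketch that assumes the default install prefix /usr/local/cuda-13.0; the fallback line mirrors typical nvcc output and is only used when nvcc is not on the PATH:

```shell
# Sketch: parse the release number out of `nvcc --version` output.
parse_release() {
  grep -o 'release [0-9][0-9.]*' | head -n1 | cut -d' ' -f2
}
export PATH=/usr/local/cuda-13.0/bin:${PATH}
# Fallback string is illustrative, used only when nvcc is unavailable.
out=$(nvcc --version 2>/dev/null || echo "Cuda compilation tools, release 13.0, V13.0.48")
printf '%s\n' "$out" | parse_release
```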
Upgrading DCGM to a More Recent Major Release#
If you want to upgrade datacenter-gpu-manager (DCGM) to a new major version (for instance, DCGM 3.x.x to DCGM 4.x.x), you must remove any current installations of the datacenter-gpu-manager and datacenter-gpu-manager-config packages and then install the more recent release of these packages as instructed in the Installation section of the NVIDIA® Data Center GPU Manager (DCGM) User Guide.
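The decision that triggers the remove-then-install path is a major-version change, which can be sketched as follows; the dcgmi command is assumed to be on the PATH, and the fallback version string is illustrative:

```shell
# Sketch: extract the major version from a DCGM version string and compare
# it with the target major release to decide whether removal is needed.
major_of() {
  printf '%s\n' "$1" | cut -d. -f1
}
current=$(dcgmi --version 2>/dev/null | grep -o '[0-9][0-9.]*' | head -n1)
current=${current:-3.3.6}   # illustrative fallback when DCGM is absent
target_major=4
if [ "$(major_of "$current")" != "$target_major" ]; then
  echo "major upgrade from $current: remove the old packages before installing"
fi
```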
Installing GPUDirect Storage Support#
NVIDIA® Magnum IO GPUDirect® Storage (GDS) enables a direct data path for direct memory access (DMA) transfers between GPU memory and storage, which avoids a bounce buffer through the CPU.
Installing nvidia-gds#
To use GDS, perform the following steps:
Populate the ${NVIDIA_DRV_VERSION} variable.
Install nvidia-gds with the correct dependencies:
sudo dnf install nvidia-gds-<ver> nvidia-dkms-${NVIDIA_DRV_VERSION}-server
Use the CUDA Toolkit version number in place of <ver>; for example, 13-0.
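One way to populate the variable is sketched below, assuming the driver branch is the major component of the version that modinfo reports for the installed nvidia module; the fallback value is illustrative and is used only when the module is not present:

```shell
# Sketch: derive the driver branch (e.g. 535) from the full driver
# version (e.g. 535.129.03).
branch_of() {
  printf '%s\n' "${1%%.*}"
}
full=$(modinfo -F version nvidia 2>/dev/null)
full=${full:-535.129.03}     # illustrative fallback
NVIDIA_DRV_VERSION=$(branch_of "$full")
echo "$NVIDIA_DRV_VERSION"
```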