Upgrading the OS and DGX Software
Between releases, NVIDIA and Red Hat provide OS updates as updated software packages that contain security mitigations and bug fixes.
You should evaluate the available updates at regular intervals and update the system by using the
sudo dnf update --nobest command.
For a list of the known Common Vulnerabilities and Exposures (CVEs), including those that can be resolved by updating the OS software, refer to the Red Hat Security Updates.
You are responsible for upgrading the software on the DGX system to install the updates from these sources.
Before you upgrade, refer to the NVIDIA DGX Software for Red Hat Enterprise Linux 8 Release Notes for information, known issues, and workarounds.
If updates are available, you can obtain the package upgrades by running the following command:
sudo dnf update --nobest
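As a minimal sketch of the evaluation step, the following shell function interprets the exit status of dnf check-update, which returns 100 when updates are available, 0 when none are, and 1 on error. The function name describe_check_update_status is hypothetical and is not part of the DGX software.

```shell
#!/bin/sh
# Hypothetical helper (not part of the DGX software): translate the exit
# status of `dnf check-update` into a short message.
# dnf documents: 0 = no updates, 100 = updates available, 1 = error.
describe_check_update_status() {
    case "$1" in
        0)   echo "up to date" ;;
        100) echo "updates available" ;;
        *)   echo "error" ;;
    esac
}

# Usage on a DGX system (not executed here):
#   sudo dnf check-update; describe_check_update_status "$?"
```

Checking first lets you review the pending packages against the release notes before you run the update itself.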
Upgrading the NVIDIA Graphics Drivers for Linux requires a restart to complete the kernel upgrade.
If you upgrade the NVIDIA Graphics Drivers for Linux without restarting the DGX system, running the
nvidia-smi command displays an error message like the following example.
Failed to initialize NVML: Driver/library version mismatch
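A script can look for the error message above to decide whether a restart is still pending. The following is a sketch only; the function name driver_restart_needed is hypothetical, and it assumes the message text matches the example shown.

```shell
#!/bin/sh
# Hypothetical helper: read nvidia-smi output on stdin and report whether
# it contains the driver/library mismatch error quoted above.
driver_restart_needed() {
    if grep -q "Driver/library version mismatch"; then
        echo "restart required"
    else
        echo "no mismatch reported"
    fi
}

# Usage on a DGX system (not executed here):
#   nvidia-smi 2>&1 | driver_restart_needed
```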
Installing or Upgrading to a Newer CUDA Toolkit Release
Only DGX Station, DGX Station A100, and DGX Station A800 have a CUDA Toolkit release installed by default. DGX servers are intended to be shared resources that use containers and do not have CUDA Toolkit installed by default. However, you have the option to install a qualified CUDA Toolkit release.
Although the DGX OS supports all CUDA Toolkit releases that interoperate with the installed driver, the default CUDA Toolkit release that is included in a DGX OS release might not be the most recent version. Unless you need features that are available only in a newer CUDA Toolkit version, we recommend that you remain on the default version that is included in the DGX OS release. Refer to the NVIDIA DGX Software for Red Hat Enterprise Linux 8 Release Notes for the default CUDA Toolkit release.
Before you install or upgrade to any CUDA Toolkit release, ensure the release is compatible with the driver that is installed on the system. Refer to CUDA Compatibility for more information and a compatibility matrix.
Checking the Currently Installed CUDA Toolkit Release
Before you install a new CUDA Toolkit release, determine which release, if any, is currently installed.
The CUDA Toolkit is not installed on DGX servers by default, so on those systems the following command lists no installed package.
To check the currently installed release, run the following command:
sudo dnf list --installed 'cuda-toolkit-*'
For example, the following output shows that CUDA Toolkit 11.0 is installed:
Listing... Done
cuda-toolkit-11-0/unknown,unknown,now 11.0.3-1 amd64 [installed]
N: There is 1 additional version. Please use the '-a' switch to see it
Determining the New Available CUDA Toolkit Releases
Perform the following steps to determine which new CUDA Toolkit releases are available.
If you have a DGX OS version earlier than 5.3, ensure you have the correct GPG signing keys installed on your system.
Before you continue the upgrade, refer to DGX OS 5 Release Notes for instructions and details.
Update the local database with the latest information from the Red Hat repository.
Show all available CUDA Toolkit releases.
sudo dnf list 'cuda-toolkit-*'
Listing... Done
cuda-toolkit-11-0/unknown,unknown,now 11.0.3-1 amd64 [installed]
cuda-toolkit-11-1/unknown,unknown 11.1.1-1 amd64
cuda-toolkit-11-2/unknown,unknown 11.2.1-1 amd64
The output shows that 11.0, 11.1, and 11.2 are the CUDA Toolkit versions that can be installed.
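To read the release numbers out of a listing like the one above, a small filter can help. This sketch assumes only that package names follow the cuda-toolkit-MAJOR-MINOR pattern shown in the output; the function name cuda_toolkit_releases is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the MAJOR.MINOR release of each
# cuda-toolkit-MAJOR-MINOR package name found on stdin, one per line.
cuda_toolkit_releases() {
    grep -o 'cuda-toolkit-[0-9][0-9]*-[0-9][0-9]*' |
        sed 's/^cuda-toolkit-\([0-9]*\)-\([0-9]*\)$/\1.\2/' |
        sort -u
}

# Usage on a DGX system (not executed here):
#   sudo dnf list 'cuda-toolkit-*' | cuda_toolkit_releases
```

Because the filter matches only the package name, it does not depend on the other columns of the listing.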
Installing the CUDA Toolkit or Upgrading Your CUDA Toolkit to a Newer Release
You can install or upgrade your CUDA Toolkit to a newer release.
To install or upgrade the CUDA Toolkit, run the following command:
sudo dnf install cuda-toolkit-11-2
11.2 is an example. Replace this value with the actual version that you want to install.
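The substitution described above can be scripted. The following sketch converts a release number such as 11.2 into the corresponding package name; the function name cuda_toolkit_package is hypothetical, and the naming pattern is taken from the example command rather than from a package specification.

```shell
#!/bin/sh
# Hypothetical helper: convert a release such as 11.2 into the package
# name used in the example above (cuda-toolkit-11-2).
cuda_toolkit_package() {
    if printf '%s\n' "$1" | grep -Eq '^[0-9]+\.[0-9]+$'; then
        printf 'cuda-toolkit-%s\n' "$(printf '%s' "$1" | tr '.' '-')"
    else
        echo "expected MAJOR.MINOR, got: $1" >&2
        return 1
    fi
}

# Usage on a DGX system (not executed here):
#   sudo dnf install "$(cuda_toolkit_package 11.2)"
```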
Installing GPUDirect Storage Support
NVIDIA Magnum IO GPUDirect Storage (GDS) enables a direct data path for direct memory access (DMA) transfers between GPU memory and storage. This software avoids a bounce buffer through the CPU.
Prerequisites to Installing GPUDirect Storage Support
If the system uses the MLNX_OFED driver, enable GDS support as explained in MLNX_OFED Requirements and Installation.
For systems other than NVIDIA DGX-1, DGX-2, and DGX Station, to use the latest GDS version, 12.2.2-1, that is provided by nvidia-fs-dkms-2.17.5-1, you must install an NVIDIA Open GPU Kernel module driver. Refer to Changing NVIDIA Driver Branches.
For NVIDIA DGX-1, DGX-2, and DGX Station running the generic Linux Kernel, the GPUs in these systems are not supported with the NVIDIA Open GPU Kernel modules. The GDS versions 12.2.2-1 and higher only support the Open GPU Kernel modules.
For these systems, you must lock the nvidia-fs package to version 2.17.3 or lower and the nvidia-gds package to version 12.2.1-1 or lower.
sudo dnf versionlock add nvidia-fs-0:2.17.3-1 nvidia-gds-0:12.2.1-1
Adding versionlock on: nvidia-fs-0:2.17.3-1.*
Adding versionlock on: nvidia-gds-0:12.2.1-1.*
For CUDA 11.5.1 and later, if you plan to use Weka FS or IBM SpectrumScale, run the following command:
sudo modprobe nvidia-peermem
This command loads the module that supports PeerDirect capabilities. You must run it again after each system reboot. To load the module automatically after every reboot, run the following command:
echo "nvidia-peermem" | sudo tee /etc/modules-load.d/nvidia-peermem.conf
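To confirm that the module is loaded, you can inspect the lsmod output. The sketch below assumes the module is listed as nvidia_peermem, because the kernel reports module names with underscores in place of hyphens; the function name peermem_loaded is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read `lsmod` output on stdin and succeed (exit 0)
# only if a module named nvidia_peermem appears in the first column.
peermem_loaded() {
    awk '$1 == "nvidia_peermem" { found = 1 } END { exit !found }'
}

# Usage on a DGX system (not executed here):
#   lsmod | peermem_loaded && echo "nvidia-peermem is loaded"
```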
WARNING: nvidia-peer-memory module not loading issue:
DGX OS 5.1.1 provides nv-peer-mem 1.2 and MLNX_OFED 5.4-188.8.131.52 to resolve an issue discovered in MLNX_OFED 5.4-184.108.40.206. nv-peer-mem 1.2 is not compatible with MLNX_OFED <= 5.4-220.127.116.11.
Attempting to use nv_peer_mem 1.2 with MLNX_OFED <= 5.4-18.104.22.168 will result in the following error message:
root@dgx-02:~# cat /var/lib/dkms/nv_peer_mem/1.2/build.make.log
DKMS make.log for nv-peer-mem-1.2 for kernel 5.4.0-92-generic
…
If you use MLNX_OFED <= 5.4-22.214.171.124 and have encountered this issue, we recommend that you downgrade to nv_peer_mem 1.1:
sudo dnf downgrade nvidia-peer-memory-dkms-21-03-0.el
modprobe nv_peer_mem
To install GDS, perform the following steps.
Install the nvidia-gds package.
sudo dnf install nvidia-gds
Refer to Verifying a Successful GDS Installation in the NVIDIA GPUDirect Storage Installation and Troubleshooting Guide.