Upgrading the OS and DGX Software

Between releases, NVIDIA and Red Hat provide OS updates in the form of updated software packages that contain security mitigations and bug fixes.

You should evaluate the available updates at regular intervals and update the system by using the sudo dnf update --nobest command.

For a list of the known Common Vulnerabilities and Exposures (CVEs), including those that can be resolved by updating the OS software, refer to the Red Hat Security Updates page.

  Important: You are responsible for upgrading the software on the DGX system to install the updates from these sources.

If updates are available, you can obtain the package upgrades by issuing the following command:

$ sudo dnf update --nobest

Upgrades to the NVIDIA Graphics Drivers for Linux require a restart to load the updated kernel modules. If you upgrade the NVIDIA Graphics Drivers for Linux without restarting the DGX system, running the nvidia-smi command displays an error message:

$ nvidia-smi

Failed to initialize NVML: Driver/library version mismatch
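The error indicates that the loaded kernel module and the user-space libraries no longer report the same driver version. A minimal sketch of that comparison, using placeholder version strings (on a real system the kernel side comes from /proc/driver/nvidia/version and the library side from nvidia-smi, once it can initialize):

```shell
#!/usr/bin/env bash
# Sketch: the mismatch occurs when the loaded kernel module and the
# user-space libraries report different driver versions.

versions_match() {
  [ "$1" = "$2" ]
}

kernel_side="535.104.05"    # placeholder value
library_side="545.23.06"    # placeholder value

if versions_match "$kernel_side" "$library_side"; then
  echo "driver and libraries agree"
else
  echo "Driver/library version mismatch: reboot to load the new kernel modules"
fi
```

Restarting the system reloads the kernel modules from the upgraded driver package, which clears the mismatch.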

Installing or Upgrading to a Newer CUDA Toolkit Release

Only DGX Station and DGX Station A100 have a CUDA Toolkit release installed by default. DGX servers are intended to be shared resources that use containers and do not have CUDA Toolkit installed by default. However, you have the option to install a qualified CUDA Toolkit release.

Although the DGX OS supports all CUDA Toolkit releases that interoperate with the installed driver, DGX OS releases might include a default CUDA Toolkit release that is not the most recently released version. Unless you must use a new CUDA Toolkit version that contains new features, we recommend that you remain on the default version that is included in the DGX OS release. Refer to the DGX OS Software Release Notes for the default CUDA Toolkit release.

  Important: Before you install or upgrade to any CUDA Toolkit release, ensure the release is compatible with the driver that is installed on the system. Refer to CUDA Compatibility for more information and a compatibility matrix.
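One way to automate this check is to compare the installed driver version against the toolkit's minimum driver requirement. A sketch follows; the minimum version shown is illustrative only and must be confirmed against the CUDA Compatibility matrix for your exact release:

```shell
#!/usr/bin/env bash
# Sketch: verify the installed driver meets a CUDA Toolkit's minimum
# driver requirement before installing. The minimum shown below is an
# illustrative placeholder, not an authoritative value.

driver_at_least() {
  # True when $1 (installed driver) >= $2 (required minimum).
  # sort -V sorts version strings; the minimum must sort first.
  [ "$(printf '%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

installed="470.57.02"   # e.g. from: nvidia-smi --query-gpu=driver_version --format=csv,noheader
required="450.80.02"    # placeholder minimum; check the compatibility matrix

if driver_at_least "$installed" "$required"; then
  echo "driver $installed satisfies minimum $required"
else
  echo "driver $installed is too old for this toolkit"
fi
```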


Checking the Currently Installed CUDA Toolkit Release

This section explains how to determine the CUDA Toolkit release that is currently installed.

  Important: The CUDA Toolkit is not installed on DGX servers by default, and if you try to run the following command, no installed package will be listed.

To check the currently installed release before you install a new CUDA Toolkit release, run the following command:

$ sudo dnf list --installed "cuda-toolkit-*"

For example, the following output shows that CUDA Toolkit 11.0 is installed:

$ sudo dnf list --installed "cuda-toolkit-*"

Listing... Done

cuda-toolkit-11-0/unknown,unknown,now 11.0.3-1 amd64 [installed]

N: There is 1 additional version. Please use the '-a' switch to see it
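If you need the bare release number in a script, it can be extracted from the package name. A sketch using the sample line above (real package-listing output formatting may differ across tool versions):

```shell
#!/usr/bin/env bash
# Sketch: pull the bare release number (e.g. "11.0") out of a
# cuda-toolkit package name such as "cuda-toolkit-11-0".

toolkit_release() {
  # cuda-toolkit-11-0 -> 11.0
  sed -n 's/^cuda-toolkit-\([0-9]\+\)-\([0-9]\+\).*/\1.\2/p'
}

echo "cuda-toolkit-11-0/unknown,unknown,now 11.0.3-1 amd64 [installed]" | toolkit_release
# prints: 11.0
```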



Determining the New Available CUDA Toolkit Releases

To determine which new CUDA Toolkit releases are available, complete the following steps:

  1. If you have a DGX OS version earlier than 5.3, ensure you have the correct GPG signing keys installed on your system.

    Before you continue the upgrade, refer to DGX OS 5 Release Notes for instructions and details.

  2. Update the local database with the latest information from the Red Hat repository.

$ sudo dnf makecache

  3. Show all available CUDA Toolkit releases.

$ sudo dnf list "cuda-toolkit-*"

Listing... Done

cuda-toolkit-11-0/unknown,unknown,now 11.0.3-1 amd64 [installed]

cuda-toolkit-11-1/unknown,unknown 11.1.1-1 amd64

cuda-toolkit-11-2/unknown,unknown 11.2.1-1 amd64

The output shows that 11.0, 11.1, and 11.2 are the CUDA Toolkit versions that can be installed.
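If you script this selection, the newest release can be picked from the listing with a version-aware sort. A sketch, assuming package names of the form shown above:

```shell
#!/usr/bin/env bash
# Sketch: given package-listing lines like those above, select the
# newest available CUDA Toolkit package name. The sample data mirrors
# the listing in this section.

newest_release() {
  grep -o 'cuda-toolkit-[0-9]\+-[0-9]\+' | sort -V | tail -n1
}

printf '%s\n' \
  "cuda-toolkit-11-0/unknown,unknown,now 11.0.3-1 amd64 [installed]" \
  "cuda-toolkit-11-1/unknown,unknown 11.1.1-1 amd64" \
  "cuda-toolkit-11-2/unknown,unknown 11.2.1-1 amd64" | newest_release
# prints: cuda-toolkit-11-2
```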



Installing the CUDA Toolkit or Upgrading Your CUDA Toolkit to a Newer Release

You can install or upgrade your CUDA Toolkit to a newer release.

To install or upgrade the CUDA Toolkit, run the following command:

$ sudo dnf install cuda-toolkit-11-2

  Important: Here, version 11.2 is an example, and you should replace this value with the actual version that you want to install.
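As a convenience when scripting installs, the package name can be derived from a human-readable release number. A small sketch; the package-name shape is taken from the listings in this section:

```shell
#!/usr/bin/env bash
# Sketch: translate a release such as "11.2" into the package name
# used in this section ("cuda-toolkit-11-2").

package_name() {
  printf 'cuda-toolkit-%s\n' "$(echo "$1" | tr '.' '-')"
}

pkg=$(package_name "11.2")
echo "$pkg"                    # prints: cuda-toolkit-11-2
# sudo dnf install "$pkg"      # the actual install step (requires root)
```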


Installing GPUDirect Storage Support

NVIDIA® Magnum IO GPUDirect® Storage (GDS) enables a direct data path for direct memory access (DMA) transfers between GPU memory and storage, which avoids a bounce buffer through the CPU.



Prerequisites to Installing GPUDirect Storage Support

Using the MLNX_OFED Driver

If you are using the MLNX_OFED driver, be sure to enable GDS support as explained in https://docs.nvidia.com/gpudirect-storage/troubleshooting-guide/index.html#mofed-req-install.

Installing nvidia_peermem

For CUDA 11.5.1 and later, if you plan to use Weka FS or IBM SpectrumScale, you need to run the following command:

$ sudo modprobe nvidia_peermem

This command loads the module that supports PeerDirect capabilities. You must run this command again after every reboot of the system. To load the module automatically after every reboot, run the following command:

echo "nvidia-peermem" | sudo tee /etc/modules-load.d/nvidia-peermem.conf
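To verify the configuration afterward, you can check both the boot-time configuration file and the currently loaded modules. A sketch; note that lsmod reports the module name with underscores:

```shell
#!/usr/bin/env bash
# Sketch: confirm the module is configured to load at boot and is
# currently loaded. The path follows the tee command above.

peermem_configured() {
  # -x: whole-line match; -s: stay quiet if the file does not exist
  grep -qxs 'nvidia-peermem' "$1"
}

if peermem_configured /etc/modules-load.d/nvidia-peermem.conf; then
  echo "boot-time load configured"
fi

if lsmod 2>/dev/null | grep -q '^nvidia_peermem'; then
  echo "module currently loaded"
fi
```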

WARNING: nvidia-peer-memory module not loading issue:

DGX OS 5.1.1 provides nv-peer-mem 1.2 and MLNX_OFED 5.4-3.1.0.0 to resolve an issue discovered in MLNX_OFED 5.4-1.0.3.0. nv-peer-mem 1.2 is not compatible with MLNX_OFED 5.4-1.0.3.0 and earlier.

Attempting to use nv_peer_mem 1.2 with MLNX_OFED 5.4-1.0.3.0 or earlier results in the following error message:

root@dgx-02:~# cat /var/lib/dkms/nv_peer_mem/1.2/build.make.log
DKMS make.log for nv-peer-mem-1.2 for kernel 5.4.0-92-generic
...

If you use MLNX_OFED 5.4-1.0.3.0 or earlier and have encountered this issue, we recommend that you downgrade to nv_peer_mem 1.1:

$ sudo dnf downgrade nvidia-peer-memory-dkms-21-03-0.el
$ sudo modprobe nv_peer_mem



Installing nvidia-gds

To use GDS, install nvidia-gds with the correct dependencies by completing the following steps.

  1. Populate the ${NVIDIA_DRV_VERSION} variable.

  2. Install nvidia-gds with the correct dependencies.

$ sudo dnf install nvidia-gds-<ver> nvidia-dkms-${NVIDIA_DRV_VERSION}-server

Use the CUDA Toolkit version number in place of <ver>; for example, 11-4.
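The steps above leave the population of ${NVIDIA_DRV_VERSION} to you. A minimal sketch of one way to do it, assuming the running driver version is readable from /proc/driver/nvidia/version; the parsing is an assumption, so verify the field layout on your system:

```shell
#!/usr/bin/env bash
# Sketch: derive the driver branch (e.g. "535") expected by the
# nvidia-dkms-<branch>-server package name from the running driver.

drv_branch() {
  # First full version string (e.g. 535.104.05), then its major part.
  grep -o '[0-9]\+\.[0-9.]\+' | head -n1 | cut -d. -f1
}

if [ -r /proc/driver/nvidia/version ]; then
  NVIDIA_DRV_VERSION=$(drv_branch < /proc/driver/nvidia/version)
  echo "NVIDIA_DRV_VERSION=${NVIDIA_DRV_VERSION}"
fi
```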