Upgrading the OS and DGX Software

Between releases, NVIDIA and Red Hat provide OS updates as software packages that contain security mitigations and bug fixes.

You should evaluate the available updates at regular intervals and update the system by using the sudo dnf update --nobest command.

Caution

A dnf update might cause Linux kernel incompatibility with the currently installed MLNX_OFED networking drivers. To prevent this issue, you can do either of the following tasks:

  • Before upgrading the Linux kernel, run the dnf update --setopt tsflags=test command to see which kernel version the update would install without actually installing it, and compare that version with the Linux kernel versions that your NVIDIA MLNX_OFED release supports.

  • When upgrading the Linux kernel, consider upgrading MLNX_OFED to a version that supports the new kernel.
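
As a complement to the test transaction, the pending kernel version can also be read from the list of upgradable packages. The following is a minimal sketch, assuming the usual three-column layout of dnf list --upgrades output (package, version, repository). The helper name pending_kernel_version and the sample line are hypothetical and are not output from a DGX system.

```shell
# Hypothetical helper: read `dnf list --upgrades kernel` output on stdin and
# print the version-release of the pending kernel package, if there is one.
# Assumes lines of the form "<name>.<arch> <version-release> <repository>".
pending_kernel_version() {
    awk '$1 ~ /^kernel\.[^.]+$/ { print $2 }'
}

# Illustrative input only; on a real system, pipe the dnf output instead.
pending_kernel_version <<'EOF'
Available Upgrades
kernel.x86_64    4.18.0-513.5.1.el8_9    rhel-8-for-x86_64-baseos-rpms
EOF
```

On a live system, dnf --quiet list --upgrades kernel | pending_kernel_version would print the candidate version, which can then be compared with the output of uname -r and with the kernels listed as supported for the installed MLNX_OFED release.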

For a list of the known Common Vulnerabilities and Exposures (CVEs), including those that can be resolved by updating the OS software, refer to the Red Hat Security Updates.

Important

You are responsible for upgrading the software on the DGX system to install the updates from these sources.

Before you upgrade, refer to the NVIDIA DGX Software for Red Hat Enterprise Linux 8 Release Notes for information, known issues, and workarounds.

If updates are available, you can obtain the package upgrades by running the following command:

sudo dnf update --nobest
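
To find out whether updates are available before running the upgrade, a script can inspect the exit status of dnf check-update, which dnf documents as 0 when nothing is pending, 100 when updates are available, and 1 on error. The sketch below only maps that status to a message; the helper name describe_check_update is hypothetical.

```shell
# Hypothetical helper: translate the exit status of `dnf check-update`
# into a short message (0 = nothing pending, 100 = updates available,
# anything else = error).
describe_check_update() {
    case "$1" in
        0)   echo "no updates available" ;;
        100) echo "updates available" ;;
        *)   echo "dnf reported an error (status $1)" ;;
    esac
}

# Example call with a literal status; in practice pass "$?" right after
# running `dnf --quiet check-update`.
describe_check_update 100
```

A typical use would be to run dnf --quiet check-update, pass "$?" to the helper, and only schedule the upgrade and any required restart when updates are reported.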

Upgrades to the NVIDIA Graphics Drivers for Linux require a restart to complete the kernel upgrade. If you upgrade the NVIDIA Graphics Drivers for Linux without restarting the DGX system, running the nvidia-smi command displays an error message like the following example.

Failed to initialize NVML: Driver/library version mismatch
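
A post-upgrade script can look for this message to decide whether a restart is still pending. The following is a minimal sketch that matches only the message text shown above; the helper name is_driver_mismatch is hypothetical.

```shell
# Hypothetical helper: succeed if the given text contains the NVML
# driver/library mismatch message.
is_driver_mismatch() {
    case "$1" in
        *"Driver/library version mismatch"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Probe only when nvidia-smi is present on this machine.
if command -v nvidia-smi >/dev/null 2>&1; then
    if is_driver_mismatch "$(nvidia-smi 2>&1)"; then
        echo "Driver upgraded but not yet loaded: restart the system."
    fi
fi
```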

Installing or Upgrading to a Newer CUDA Toolkit Release

Only DGX Station, DGX Station A100, and DGX Station A800 have a CUDA Toolkit release installed by default. DGX servers are intended to be shared resources that use containers and do not have CUDA Toolkit installed by default. However, you have the option to install a qualified CUDA Toolkit release.

Although the DGX OS supports all CUDA Toolkit releases that interoperate with the installed driver, the default CUDA Toolkit release that is included in a DGX OS release might not be the most recent version. Unless you need features that are available only in a newer CUDA Toolkit version, we recommend that you remain on the default version that is included in the DGX OS release. Refer to the NVIDIA DGX Software for Red Hat Enterprise Linux 8 Release Notes for the default CUDA Toolkit release.

Important

Before you install or upgrade to any CUDA Toolkit release, ensure the release is compatible with the driver that is installed on the system. Refer to CUDA Compatibility for more information and a compatibility matrix.
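
The compatibility check can be partly scripted by comparing the installed driver version with the minimum that the target CUDA Toolkit release requires. The sketch below assumes GNU sort (for version sorting) and uses a placeholder minimum version; take the real value from the CUDA Compatibility matrix. The helper name version_ge and the variable names are hypothetical.

```shell
# Hypothetical helper: succeed if version $1 is greater than or equal to $2.
# Relies on GNU `sort -V` (version sort).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Placeholder only: replace with the minimum driver version that the CUDA
# Compatibility matrix lists for the toolkit release you plan to install.
required_driver="460.32.03"

# Query the installed driver version when nvidia-smi is available.
if command -v nvidia-smi >/dev/null 2>&1; then
    installed_driver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
    if version_ge "$installed_driver" "$required_driver"; then
        echo "driver $installed_driver satisfies $required_driver"
    else
        echo "driver $installed_driver is older than $required_driver"
    fi
fi
```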

Checking the Currently Installed CUDA Toolkit Release

This section describes how to determine the CUDA Toolkit release that is currently installed.

Important

The CUDA Toolkit is not installed on DGX servers by default. On these systems, the following command lists no installed package.

To check the currently installed release before you install a new CUDA Toolkit release, run the following command:

sudo dnf list --installed 'cuda-toolkit-*'

For example, the following output shows that CUDA Toolkit 11.0 is installed:

Listing... Done

cuda-toolkit-11-0/unknown,unknown,now 11.0.3-1 amd64 [installed]

N: There is 1 additional version. Please use the '-a' switch to see it

Determining the New Available CUDA Toolkit Releases

Perform the following steps to see which new CUDA Toolkit releases are available.

  1. If you have a DGX OS version earlier than 5.3, ensure you have the correct GPG signing keys installed on your system.

    Before you continue the upgrade, refer to DGX OS 5 Release Notes for instructions and details.

  2. Update the local database with the latest information from the Red Hat repository.

    sudo dnf makecache
    
  3. Show all available CUDA Toolkit releases.

    sudo dnf list 'cuda-toolkit-*'
    

    Example Output

    Listing... Done
    
    cuda-toolkit-11-0/unknown,unknown,now 11.0.3-1 amd64 [installed]
    cuda-toolkit-11-1/unknown,unknown 11.1.1-1 amd64
    cuda-toolkit-11-2/unknown,unknown 11.2.1-1 amd64
    

    The output shows that 11.0, 11.1, and 11.2 are the CUDA Toolkit versions that can be installed.

Installing the CUDA Toolkit or Upgrading Your CUDA Toolkit to a Newer Release

You can install or upgrade your CUDA Toolkit to a newer release.

To install or upgrade the CUDA Toolkit, run the following command:

sudo dnf install cuda-toolkit-11-2

Important

Version 11.2 is an example. Replace this value with the actual version that you want to install.
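
When the version is chosen in a script, the package name can be derived from it instead of being typed by hand. The following sketch assumes only the naming pattern visible in this section (cuda-toolkit-<major>-<minor>); the helper name cuda_toolkit_package is hypothetical.

```shell
# Hypothetical helper: turn a CUDA version such as "11.2" into the package
# name form used in this section ("cuda-toolkit-11-2").
cuda_toolkit_package() {
    printf 'cuda-toolkit-%s\n' "$(printf '%s' "$1" | tr '.' '-')"
}

# Print the package name for the example version.
cuda_toolkit_package 11.2
```

The result can be passed to the install command, for example sudo dnf install "$(cuda_toolkit_package 11.2)".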

Installing GPUDirect Storage Support

NVIDIA Magnum IO GPUDirect Storage (GDS) enables a direct data path for direct memory access (DMA) transfers between GPU memory and storage. This software avoids a bounce buffer through the CPU.

Prerequisites to Installing GPUDirect Storage Support

  • If the system uses the MLNX_OFED driver, enable GDS support as explained in MLNX_OFED Requirements and Installation.

  • For systems other than NVIDIA DGX-1, DGX-2, and DGX Station, to use the latest GDS version, 12.2.2-1, that is provided by nvidia-fs-dkms-2.17.5-1, you must install an NVIDIA Open GPU Kernel module driver. Refer to Changing NVIDIA Driver Branches.

  • The GPUs in NVIDIA DGX-1, DGX-2, and DGX Station systems running the generic Linux kernel are not supported by the NVIDIA Open GPU Kernel modules. GDS versions 12.2.2-1 and higher support only the Open GPU Kernel modules.

    For these systems, you must lock the nvidia-fs package to version 2.17.3 or lower and the nvidia-gds package to version 12.2.1-1 or lower.

    sudo dnf versionlock add nvidia-fs-0:2.17.3-1 nvidia-gds-0:12.2.1-1
    

    Example Output

    Adding versionlock on: nvidia-fs-0:2.17.3-1.*
    Adding versionlock on: nvidia-gds-0:12.2.1-1.*
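
Because the prerequisites above depend on whether the Open GPU Kernel modules are in use, it can help to check which driver flavor is installed. The sketch below rests on an assumption that is not stated in this section: that the open modules report a license of Dual MIT/GPL through modinfo, while the proprietary modules report NVIDIA. The helper name driver_flavor is hypothetical.

```shell
# Hypothetical helper: classify the nvidia kernel module by its license string.
# Assumption: open modules report "Dual MIT/GPL", proprietary modules "NVIDIA".
driver_flavor() {
    case "$1" in
        "Dual MIT/GPL") echo "open" ;;
        "NVIDIA")       echo "proprietary" ;;
        *)              echo "unknown" ;;
    esac
}

# Query the installed module when modinfo is available; an absent module
# yields an empty string and therefore "unknown".
if command -v modinfo >/dev/null 2>&1; then
    driver_flavor "$(modinfo -F license nvidia 2>/dev/null)"
fi
```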
    

Installing nvidia_peermem

For CUDA 11.5.1 and later, if you plan to use Weka FS or IBM SpectrumScale, run the following command:

sudo modprobe nvidia_peermem

This command loads the module that supports PeerDirect capabilities. The module does not stay loaded across reboots, so you must run the command again after each restart of the system. To load the module automatically at every boot, run the following command:

echo "nvidia-peermem" | sudo tee /etc/modules-load.d/nvidia-peermem.conf
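
To confirm that the module is actually loaded, for example after a reboot, the kernel's module table can be inspected. The following is a minimal sketch that reads /proc/modules; the helper name module_loaded and its optional second argument (an alternative table file, useful for testing) are hypothetical.

```shell
# Hypothetical helper: succeed if the named module appears in the module
# table (defaults to /proc/modules). Hyphens are mapped to underscores
# because the kernel lists module names with underscores.
module_loaded() {
    table="${2:-/proc/modules}"
    [ -r "$table" ] || return 1
    name=$(printf '%s' "$1" | tr '-' '_')
    awk -v m="$name" '$1 == m { found = 1 } END { exit !found }' "$table"
}

if module_loaded nvidia-peermem; then
    echo "nvidia_peermem is loaded"
else
    echo "nvidia_peermem is not loaded"
fi
```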

Warning: The nvidia-peer-memory module does not load

DGX OS 5.1.1 provides nv-peer-mem 1.2 and MLNX_OFED 5.4-3.1.0.0 to resolve an issue discovered in MLNX_OFED 5.4-1.0.3.0. nv-peer-mem 1.2 is not compatible with MLNX_OFED <= 5.4-1.0.3.0.

Attempting to use nv_peer_mem 1.2 with MLNX_OFED <= 5.4-1.0.3.0 results in the following error message:

root@dgx-02:~# cat /var/lib/dkms/nv_peer_mem/1.2/build.make.log
DKMS make.log for nv-peer-mem-1.2 for kernel 5.4.0-92-generic
…

If you use MLNX_OFED <= 5.4-1.0.3.0 and have encountered this issue, downgrade to nv_peer_mem 1.1:

sudo dnf downgrade nvidia-peer-memory-dkms-21-03-0.el
sudo modprobe nv_peer_mem
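
Whether a system falls under this warning depends on the installed MLNX_OFED version. The sketch below compares a version string with the 5.4-1.0.3.0 limit named above. It assumes GNU sort, and it assumes that ofed_info -s prints a string of the form MLNX_OFED_LINUX-<version>:, which is not confirmed by this section. The helper name ofed_at_or_below_limit is hypothetical.

```shell
# Hypothetical helper: succeed if the MLNX_OFED version $1 is at or below
# 5.4-1.0.3.0, the limit named in the warning above.
ofed_at_or_below_limit() {
    limit="5.4.1.0.3.0"   # 5.4-1.0.3.0 with the hyphen mapped to a dot
    candidate=$(printf '%s' "$1" | tr '-' '.')
    newest=$(printf '%s\n%s\n' "$candidate" "$limit" | sort -V | tail -n 1)
    [ "$newest" = "$limit" ]
}

# Assumption: `ofed_info -s` prints "MLNX_OFED_LINUX-<version>:".
if command -v ofed_info >/dev/null 2>&1; then
    installed=$(ofed_info -s | sed -e 's/^MLNX_OFED_LINUX-//' -e 's/:$//')
    if ofed_at_or_below_limit "$installed"; then
        echo "MLNX_OFED $installed: nv_peer_mem 1.2 is not compatible"
    else
        echo "MLNX_OFED $installed: newer than the affected releases"
    fi
fi
```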

Installing nvidia-gds

To install GDS, perform the following steps.

  • Install the nvidia-gds package.

    sudo dnf install nvidia-gds
    

Refer to Verifying a Successful GDS Installation in the NVIDIA GPUDirect Storage Installation and Troubleshooting Guide.