Upgrading the OS and DGX Software
NVIDIA and Red Hat provide updates to the OS between releases in the form of updated software packages that contain security mitigations and bug fixes.
You should evaluate the available updates at regular intervals and update the system by using the sudo dnf update --nobest command.
For a list of the known Common Vulnerabilities and Exposures (CVEs), including those that can be resolved by updating the OS software, refer to the Red Hat Security Updates.
Important: You are responsible for upgrading the software on the DGX system to install the updates from these sources.
If updates are available, you can obtain the package upgrades by issuing the following command:
$ sudo dnf update --nobest
Upgrades to the NVIDIA Graphics Drivers for Linux require a restart to complete the kernel module upgrade. If you upgrade the NVIDIA Graphics Drivers for Linux without restarting the DGX system, the nvidia-smi command displays an error message:
$ nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
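This mismatch occurs because the upgraded user-space libraries no longer match the older driver that is still loaded in the kernel. The check can be sketched as follows; the hard-coded version strings are assumed examples standing in for values read from the running system (the loaded version from /proc/driver/nvidia/version, the installed version from the package manager):

```shell
# Hypothetical sketch: a mismatch means the kernel still runs the old driver
# while the user-space libraries have already been upgraded, so a reboot is
# needed. Both version strings below are assumed values for illustration.
loaded="535.104.05"      # assumed: driver version currently loaded in the kernel
installed="545.23.08"    # assumed: version of the freshly installed packages
if [ "$loaded" != "$installed" ]; then
    echo "Driver/library version mismatch: reboot required"
fi
```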
Installing or Upgrading to a Newer CUDA Toolkit Release
Only DGX Station and DGX Station A100 have a CUDA Toolkit release installed by default. DGX servers are intended to be shared resources that use containers and do not have CUDA Toolkit installed by default. However, you have the option to install a qualified CUDA Toolkit release.
Although the DGX OS supports all CUDA Toolkit releases that interoperate with the installed driver, DGX OS releases include a default CUDA Toolkit release that might not be the most recently released version. Unless you need new features that are only available in a newer CUDA Toolkit version, we recommend that you remain on the default version that is included in the DGX OS release. Refer to the DGX OS Software Release Notes for the default CUDA Toolkit release.
Checking the Currently Installed CUDA Toolkit Release
Before you install a new CUDA Toolkit release, determine which CUDA Toolkit release, if any, is currently installed.
Important: The CUDA Toolkit is not installed on DGX servers by default, and if you try to run the following command, no installed package will be listed.
To check the currently installed release, run the following command:
$ sudo dnf list --installed "cuda-toolkit-*"
For example, the following output shows that CUDA Toolkit 11.0 is installed:
$ sudo dnf list --installed "cuda-toolkit-*"
Listing... Done
cuda-toolkit-11-0/unknown,unknown,now 11.0.3-1 amd64 [installed]
N: There is 1 additional version. Please use the '-a' switch to see it
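If a script needs the installed version number, it can be extracted from a listing line such as the one above. A minimal sketch, in which the hard-coded line stands in for the real command output:

```shell
# Hypothetical helper: pull the version number out of a package listing line.
# The hard-coded line below stands in for real command output.
line='cuda-toolkit-11-0/unknown,unknown,now 11.0.3-1 amd64 [installed]'
# Field 2 is the version-release (11.0.3-1); cut drops the release suffix.
ver=$(printf '%s\n' "$line" | awk '{print $2}' | cut -d- -f1)
echo "$ver"    # 11.0.3
```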
Determining the New Available CUDA Toolkit Releases
These steps help you determine which new CUDA Toolkit releases are available:
-
If you have a DGX OS version earlier than 5.3, ensure you have the correct GPG signing keys installed on your system.
Before you continue the upgrade, refer to DGX OS 5 Release Notes for instructions and details.
-
Update the local database with the latest information from the Red Hat repository.
$ sudo dnf makecache
-
Show all available CUDA Toolkit releases.
$ sudo dnf list cuda-toolkit-*
Listing... Done
cuda-toolkit-11-0/unknown,unknown,now 11.0.3-1 amd64 [installed]
cuda-toolkit-11-1/unknown,unknown 11.1.1-1 amd64
cuda-toolkit-11-2/unknown,unknown 11.2.1-1 amd64
The output shows that 11.0, 11.1, and 11.2 are the CUDA Toolkit versions that can be installed.
Installing GPUDirect Storage Support
NVIDIA® Magnum IO GPUDirect® Storage (GDS) enables a direct data path for direct memory access (DMA) transfers between GPU memory and storage, which avoids a bounce buffer through the CPU.
Prerequisites to Installing GPUDirect Storage Support
Using the MLNX_OFED Driver
If you are using the MLNX_OFED driver, be sure to enable GDS support as explained in https://docs.nvidia.com/gpudirect-storage/troubleshooting-guide/index.html#mofed-req-install.
Installing nvidia_peermem
For CUDA 11.5.1 and later, if you plan to use WekaFS or IBM Spectrum Scale, you need to run:
modprobe nvidia_peermem
This loads the module that supports PeerDirect capabilities. The command must be run again after each reboot of the system. To load the module automatically at every boot, run the following command:
echo "nvidia-peermem" | sudo tee /etc/modules-load.d/nvidia-peermem.conf
WARNING: nvidia-peer-memory module not loading issue:
DGX OS 5.1.1 provides nv-peer-mem 1.2 and MLNX_OFED 5.4-3.1.0.0 to resolve an issue discovered in MLNX_OFED 5.4-1.0.3.0. nv-peer-mem 1.2 is not compatible with MLNX_OFED <= 5.4-1.0.3.0.
Attempting to use nv_peer_mem 1.2 with MLNX_OFED <= 5.4-1.0.3.0 will result in the following error message:
root@dgx-02:~# cat /var/lib/dkms/nv_peer_mem/1.2/build.make.log
DKMS make.log for nv-peer-mem-1.2 for kernel 5.4.0-92-generic
...
If you use MLNX_OFED <= 5.4-1.0.3.0 and have encountered this issue, it is recommended that you downgrade to nv_peer_mem 1.1:
$ sudo dnf downgrade nvidia-peer-memory-dkms-21-03-0.el
$ sudo modprobe nv_peer_mem
Installing nvidia-gds
To use GDS, install nvidia-gds with the correct dependencies.
-
Populate the ${NVIDIA_DRV_VERSION} variable.
-
Install nvidia-gds with the correct dependencies.
$ sudo dnf install nvidia-gds-<ver> nvidia-dkms-${NVIDIA_DRV_VERSION}-server
Use the CUDA Toolkit version number in place of <ver>; for example, 11-4.
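The first step above, populating ${NVIDIA_DRV_VERSION}, can be sketched as follows. On a real DGX system you would obtain the driver version string with something like `modinfo -F version nvidia`; the hard-coded value below is an assumed example standing in for that output:

```shell
# Hypothetical sketch: derive NVIDIA_DRV_VERSION (the driver's major version)
# from the full driver version string. The hard-coded value stands in for the
# output of `modinfo -F version nvidia` on a real system.
version_string="535.104.05"              # assumed driver version, for illustration
NVIDIA_DRV_VERSION=${version_string%%.*} # strip everything after the first dot
echo "${NVIDIA_DRV_VERSION}"             # 535
```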