Additional Software#
DGX OS 5 is an optimized version of the Ubuntu 20.04 Linux distribution with access to a large collection of additional software that is available from the Ubuntu and NVIDIA repositories. You can install the additional software using the apt command or through a graphical tool.
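For example, a minimal sketch of installing a package with apt from the configured repositories (<package> is a placeholder for the package name you want):
# Refresh the package index, search for the package, then install it.
sudo apt update
apt-cache search <package>
sudo apt install <package>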
Note
The graphical tool is only available for DGX Station and DGX Station A100.
For more information about additional software available from Ubuntu, refer to Install additional applications.
Before you install additional software or upgrade installed software, refer also to the Release Notes for the latest release information.
Upgrading the System#
Before installing any additional software, you should upgrade the system to the latest versions. This ensures that you have access to new software releases that have been added to the repositories since your last upgrade. Refer to Upgrading for more information, including instructions for enabling Ubuntu’s Extended Security Maintenance updates.
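As a brief sketch of the typical upgrade flow (refer to Upgrading for the authoritative steps):
# Refresh the package index, then upgrade all installed packages.
sudo apt update
sudo apt full-upgrade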
Important
You will only see the latest software branches after upgrading DGX OS.
Note
When you switch between software branches, such as the GPU driver or CUDA toolkit branches, you have to install the package(s) for the new branch. Depending on the software, installing the new branch either removes the existing branch or leaves it in place so that both branches are installed concurrently on the system.
Changing Your GPU Branch#
NVIDIA drivers are released as precompiled and signed kernel modules by Canonical and are available directly from the Ubuntu repository. Signed drivers are required to verify the integrity of driver packages and identity of the vendor.
However, the verification process requires that Canonical build and release the drivers with Ubuntu kernel updates after their release cycle is complete, and this process might sometimes delay new driver branch releases and updates. For more information about the NVIDIA driver release, refer to the release notes at NVIDIA Driver Documentation.
Important
The Ubuntu repositories provide the following versions of the signed and precompiled NVIDIA drivers:
The general NVIDIA display drivers
The NVIDIA Data Center GPU drivers
On your DGX system, only install the packages that include the NVIDIA Data Center GPU drivers.
The metapackages for the NVIDIA Data Center GPU drivers have the -server or -server-open suffix.
Checking the Currently Installed Driver Branch#
Before you install a new NVIDIA driver branch, check the currently installed driver branch by running the following command:
apt list --installed nvidia-driver*server
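For example, a minimal sketch that captures just the branch number of the installed driver in a shell variable; it mirrors the GPU_BRANCH assignment used in the upgrade steps later in this section:
# Extract the major version (branch) of the installed driver package, for example 535.
GPU_BRANCH=$(dpkg -l | grep nvidia-driver | tr -s " " | cut -d' ' -f3 | cut -d'.' -f1)
echo "${GPU_BRANCH}"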
Determining the New Available Driver Branches#
These steps help you determine which new driver branches are available.
To see the newly available NVIDIA driver branches:
Update the local database with the latest information from the Ubuntu repository.
sudo apt update
Show all available driver branches.
apt list nvidia-driver*server
Optional: Show the available NVIDIA Open GPU Kernel module branches.
apt list nvidia-driver-*-server-open
Caution
The NVIDIA Open GPU Kernel module drivers are not supported on NVIDIA DGX-1, DGX-2, and DGX Station systems.
Upgrading Your GPU Branch#
To manually upgrade your driver to the latest branch:
Before running the apt update or apt upgrade command:
If the installed GPU driver is R465 or higher, check whether the nvidia-peermem-loader package is installed:
dpkg -l | grep nvidia-peermem-loader
If it is not installed, run:
apt install nvidia-peermem-loader
If the installed GPU driver is lower than R465, check whether the nvidia-peer-memory and nvidia-peer-memory-dkms packages are installed:
dpkg -l | grep nvidia-peer-memory
If they are not installed, run:
apt install nvidia-peer-memory nvidia-peer-memory-dkms
Install the latest kernel.
sudo apt install -y linux-generic
Install the latest NVIDIA GPU driver.
In the following commands, the trailing - character in "*nvidia*${GPU_BRANCH}*-" specifies that the old driver branch is removed in the same transaction. Because this operation removes packages from the system, it is important to perform a dry run first and ensure that the correct packages will be removed.
The GPU_BRANCH variable captures the branch of the currently installed driver so that its packages are removed; the version in the package names (535 in these examples) is the new branch to install.
On non-Fabric Manager systems, such as DGX-1, DGX Station V100 (Volta), and DGX Station A100, run the following command:
GPU_BRANCH=$(dpkg -l | grep nvidia-driver | tr -s " " | cut -d' ' -f3 | cut -d'.' -f1)
# Specify --dry-run to check the packages to install.
sudo apt-get install -y linux-modules-nvidia-535-server-generic nvidia-driver-535-server libnvidia-nscq-535 nvidia-modprobe nvidia-conf-xconfig nv-docker-gpus "*nvidia*${GPU_BRANCH}*-" --dry-run
# Install the packages.
sudo apt-get install -y linux-modules-nvidia-535-server-generic nvidia-driver-535-server libnvidia-nscq-535 nvidia-modprobe nvidia-conf-xconfig nv-docker-gpus "*nvidia*${GPU_BRANCH}*-"
For DGX Station V100 (Volta) and DGX Station A100 systems, enable and start the monitor display:
sudo systemctl enable nvidia-conf-xconfig
sudo systemctl start nvidia-conf-xconfig
On Fabric Manager systems, such as DGX-2 and DGX A100, run the same command, but append the nvidia-fabricmanager-535 package:
GPU_BRANCH=$(dpkg -l | grep nvidia-driver | tr -s " " | cut -d' ' -f3 | cut -d'.' -f1)
# Specify --dry-run to check the packages to install.
sudo apt-get install -y linux-modules-nvidia-535-server-generic nvidia-driver-535-server libnvidia-nscq-535 nvidia-modprobe nvidia-fabricmanager-535 "*nvidia*${GPU_BRANCH}*-" --dry-run
# Install the packages.
sudo apt-get install -y linux-modules-nvidia-535-server-generic nvidia-driver-535-server libnvidia-nscq-535 nvidia-modprobe nvidia-fabricmanager-535 "*nvidia*${GPU_BRANCH}*-"
Note
The driver versions are only used as an example. Replace the value with the version that you want to install.
The sample commands do not demonstrate installing the Open GPU Kernel module drivers. To use the Open GPU Kernel module drivers, specify the -server-open package name suffix, such as nvidia-driver-535-server-open, instead of nvidia-driver-535-server.
Before you reboot your DGX-2 or DGX A100 system, enable the nvidia-fabricmanager service.
sudo systemctl unmask nvidia-fabricmanager
sudo systemctl enable nvidia-fabricmanager
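As an optional check after the system reboots (not part of the original procedure), you can confirm that the Fabric Manager service is active:
systemctl status nvidia-fabricmanager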
If you are using a DGX-1, DGX-2, or DGX A100 system, run the following commands to install the peer memory package.
Note
If you are using NVIDIA GPU driver releases R465 and higher, these drivers provide their own peer memory functionality, which conflicts with the old nvidia-peer-memory packages. Install the nvidia-peermem-loader package instead of the nvidia-peer-memory package.
sudo apt install -y nvidia-peermem-loader
To install the nvidia-peer-memory package:
sudo apt install -y --reinstall nvidia-peer-memory-dkms
Restart the nvidia-peer-memory service:
sudo /usr/sbin/update-rc.d nv_peer_mem defaults
If you are upgrading from a branch older than R515 to R515 or newer, or downgrading from R515 or newer to a branch older than R515, install the correct DCGM version. Otherwise, you can skip this step.
If you are upgrading to a branch R515 or newer from a branch older than R515, identify the latest DCGM 3.x version:
apt-cache policy datacenter-gpu-manager
Example Output
datacenter-gpu-manager:
  Installed: 1:3.0.4
  Candidate: 1:3.1.3
  Version table:
     1:3.1.3 580
        580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  Packages
     1:3.0.4 580
        580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  Packages
        100 /var/lib/dpkg/status
 *** 1:2.4.7 600
        580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  Packages
        600 https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64  focal-updates/common amd64 Packages
In the example above, the latest DCGM 3.x version is 1:3.1.3. Install that version:
sudo apt install datacenter-gpu-manager=1:3.1.3
If you are downgrading from R515 or a newer branch to a branch older than R510 (note that R510 is a transitional branch to R515), install DCGM version 2:
sudo apt install datacenter-gpu-manager/$(lsb_release -cs)-updates -y --allow-downgrades
Note
The driver branches R510 and earlier depend on NSCQ v1, while R515 and later have a dependency on NSCQ v2. They require different releases of DCGM that are hosted in different repositories (DGX and CUDA). The DGX repository is configured with a higher priority to prevent APT from upgrading DCGM to an unsupported version when a driver release R510 or older is installed.
The steps above override the version to install DCGM 3.x for drivers R515 and later. Once the installed version is greater than the prioritized version, the APT preferences will no longer be used, and users will be able to use APT for DCGM 3.x upgrades as part of the usual “apt upgrade” process.
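After installing the new DCGM version in either direction, you can optionally re-run the policy query shown earlier to confirm which version is now installed:
apt-cache policy datacenter-gpu-manager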
Installing or Upgrading to a Newer CUDA Toolkit Release#
Only DGX Station and DGX Station A100 have a CUDA Toolkit release installed by default. DGX servers are intended to be shared resources that use containers and do not have CUDA Toolkit installed by default. However, you have the option to install a qualified CUDA Toolkit release.
Although DGX OS supports all CUDA Toolkit releases that interoperate with the installed driver, the default CUDA Toolkit release included in a DGX OS release might not be the most recent version. Unless you need features that are only available in a newer CUDA Toolkit version, we recommend that you remain on the default version included in the DGX OS release. Refer to the DGX OS Software Release Notes for the default CUDA Toolkit release.
Important
Before you install or upgrade to any CUDA Toolkit release, ensure the release is compatible with the driver that is installed on the system. Refer to CUDA Compatibility for more information and a compatibility matrix.
CUDA Compatibility Matrix and Forward Compatibility#
Each CUDA toolkit requires a minimum GPU driver version. This compatibility matrix is documented in CUDA Compatibility: Use the Right Compat Package.
A newer CUDA Toolkit may be used with older GPU drivers if the appropriate forward compatibility package is installed. Refer to CUDA Compatibility for more information.
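As a hedged sketch of using forward compatibility, you would install the forward compatibility package that matches the newer CUDA release. The package name below (cuda-compat-12-2) is an assumption based on the CUDA repository naming scheme and depends on the release you need:
# Install the CUDA forward compatibility package for an assumed CUDA 12.2 release.
sudo apt install cuda-compat-12-2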
Checking the Currently Installed CUDA Toolkit Release#
This section describes how to determine which CUDA Toolkit release, if any, is currently installed.
Important
The CUDA Toolkit is not installed on DGX servers by default, and if you try to run the following command, no installed package will be listed.
Before you install a new CUDA Toolkit release, check the currently installed release by running the following command:
apt list --installed cuda-toolkit-*
For example, the following output shows that CUDA Toolkit 11.0 is installed:
apt list --installed cuda-toolkit-*
Listing... Done
cuda-toolkit-11-0/unknown,unknown,now 11.0.3-1 amd64 [installed]
N: There is 1 additional version. Please use the '-a' switch to see it
Installing or Upgrading the CUDA Toolkit#
These steps help you determine which new CUDA Toolkit releases are available.
To see the newly available CUDA Toolkit releases:
Update the local database with the latest information from the Ubuntu repository.
sudo apt update
Show all available CUDA Toolkit releases.
apt list cuda-toolkit-*
The following output shows that 11.0 is already installed and 11.1 and 11.2 are the possible CUDA Toolkit versions that can be installed:
Listing... Done
cuda-toolkit-11-0/unknown,unknown,now 11.0.3-1 amd64 [installed]
cuda-toolkit-11-1/unknown,unknown 11.1.1-1 amd64
cuda-toolkit-11-2/unknown,unknown 11.2.1-1 amd64
To install or upgrade the CUDA Toolkit, run the following:
apt install cuda-toolkit-<version>
Replace <version> with the version that you want to install. Specify only the major and minor version, separated by a hyphen as shown in the package listing, for example, 11-1 or 11-2.
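For example, based on the package listing above, installing CUDA Toolkit 11.2 would look like this:
sudo apt install cuda-toolkit-11-2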
Installing or Upgrading GPUDirect Storage#
NVIDIA Magnum IO GPUDirect Storage (GDS) enables a direct data path for direct memory access (DMA) transfers between GPU memory and storage. This software avoids a bounce buffer through the CPU.
Prerequisites#
For systems other than NVIDIA DGX-1, DGX-2, and DGX Station, to use the latest GDS version, 12.2.2-1, that is provided by nvidia-fs-dkms-2.17.5-1, you must install an NVIDIA Open GPU Kernel module driver. Refer to Upgrading Your GPU Branch for more information about installing the driver.
The GPUs in NVIDIA DGX-1, DGX-2, and DGX Station systems are not supported by the NVIDIA Open GPU Kernel modules, and GDS versions 12.2.2-1 and higher support only the Open GPU Kernel modules.
For these systems, you must pin the nvidia-fs package to version 2.17.3 or lower and the nvidia-gds package to version 12.2.1-1 or lower.
Create an /etc/apt/preferences.d/nvidia-fs file with contents like the following:
Package: nvidia-fs
Pin-Priority: 900
Pin: version 2.17.3-1

Package: nvidia-gds
Pin-Priority: 900
Pin: version 12.2.1-1
Verify that the nvidia-fs package preference is correct.
sudo apt-cache policy nvidia-fs
Example Output
nvidia-fs:
  Installed: (none)
  Candidate: 2.17.3-1
  Version table:
     2.17.5-1 580
        580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
     2.17.3-1 900
        580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
...
Verify that the nvidia-gds package preference is correct.
sudo apt-cache policy nvidia-gds
Example Output
nvidia-gds:
  Installed: (none)
  Candidate: 12.2.1-1
  Version table:
     12.2.2-1 580
        580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  Packages
     12.2.1-1 900
        580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  Packages
...
For NVIDIA DGX-1, DGX-2, and DGX Station, disable IOMMU to avoid a DMAR penalty.
Edit the GRUB configuration file.
sudo vi /etc/default/grub
Add intel_iommu=off to the GRUB_CMDLINE_LINUX_DEFAULT variable.
If the variable already includes other options, enter a space to separate the options. Refer to the following example.
... GRUB_CMDLINE_LINUX_DEFAULT="console=tty0 intel_iommu=off" ...
Regenerate the GRUB configuration and reboot.
sudo update-grub
sudo reboot
After the system reboots, verify the change took effect.
cat /proc/cmdline
Example Output
BOOT_IMAGE=/boot/vmlinuz-... console=tty0 intel_iommu=off
Procedure#
Install the nvidia-gds package with the correct dependencies:
Set the NVIDIA_DRV_VERSION environment variable to the driver version.
NVIDIA_DRV_VERSION=$(cat /proc/driver/nvidia/version | grep Module | awk '{print $8}' | cut -d '.' -f 1)
Install the nvidia-gds package.
For NVIDIA DGX-1, DGX-2, and DGX Station systems, which must use version 12.2.1-1:
sudo apt install nvidia-gds-12-2=12.2.1-1 nvidia-dkms-${NVIDIA_DRV_VERSION}-server
For other NVIDIA DGX systems:
sudo apt install nvidia-gds-<version> nvidia-dkms-${NVIDIA_DRV_VERSION}-server
Use the CUDA Toolkit version number in place of <version>, such as 12-2.
Refer to Verifying a Successful GDS Installation in the NVIDIA GPUDirect Storage Installation and Troubleshooting Guide.
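As an optional quick check (a sketch; the path is an assumption and can vary with the installed CUDA Toolkit version), you can run the gdscheck utility that ships with the GDS tools:
# Print the GDS platform and configuration checks.
/usr/local/cuda/gds/tools/gdscheck.py -p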
Installing nvidia_peermem#
For CUDA 11.5.1 and later, if you plan to use Weka FS or IBM SpectrumScale, you need to run:
modprobe nvidia_peermem
This command loads the module that supports peer-direct capabilities. You must run it after each reboot of the system.
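To confirm that the module is loaded, you can check the loaded kernel modules (no output means the module is not loaded):
lsmod | grep nvidia_peermem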
In order to load the module automatically after every reboot, run the following command:
echo "nvidia-peermem" | sudo tee /etc/modules-load.d/nvidia-peermem.conf
Note
If the nvidia_peer_memory module is not loading:
DGX OS 5.1.1 provides nv_peer_mem 1.2 and MLNX_OFED 5.4-3.1.0.0 to resolve an issue discovered in MLNX_OFED 5.4-1.0.3.0. nv_peer_mem 1.2 is not compatible with MLNX_OFED <= 5.4-1.0.3.0, and attempting to use nv_peer_mem 1.2 with MLNX_OFED <= 5.4-1.0.3.0 will result in an error such as the one below:
cat /var/lib/dkms/nv_peer_mem/1.2/build/make.log
DKMS make.log for nv_peer_mem-1.2 for kernel 5.4.0-92-generic (x86_64)
Wed Jan 5 20:36:09 UTC 2022
INFO: Building with MLNX_OFED from: /usr/src/ofa_kernel/default
If you must use MLNX_OFED <= 5.4-1.0.3.0 and have encountered this issue, downgrade to nv_peer_mem 1.1:
sudo apt install --reinstall nvidia-peer-memory-dkms=1.1-0-nvidia2