Installing the DGX Software
This section requires that you have already installed Red Hat Enterprise Linux or a derived operating system on the DGX™ system. You can skip this section if you already installed the DGX software stack during a kickstart installation.
Important
Before performing the installation, refer to the Release Notes for the latest information and additional instructions depending on the specific release.
Configuring a System Proxy
If your network requires you to use a proxy:
Edit the file /etc/dnf/dnf.conf and make sure the following lines are present in the [main] section, using the parameters that apply to your network:
proxy=http://<Proxy-Server-IP-Address>:<Proxy-Port>
proxy_username=<Proxy-User-Name>
proxy_password=<Proxy-Password>
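For example, a populated [main] section might look like the following sketch. The proxy address, port, and credentials are placeholders for illustration, and the first two settings are typical RHEL 9 defaults; keep whatever settings already exist in your file. If your proxy does not require authentication, omit the proxy_username and proxy_password lines.
[main]
gpgcheck=1
installonly_limit=3
proxy=http://10.0.0.1:3128
proxy_username=dgxuser
proxy_password=examplepassword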
Enabling the DGX Software Repository
Attention
By running these commands you are confirming that you have read and agree to be bound by the DGX Software License Agreement. You are also confirming that you understand that any pre-release software and materials available that you elect to install in a DGX might not be fully functional, might contain errors or design flaws, and might have reduced or different security, privacy, availability, and reliability standards relative to commercial versions of NVIDIA software and materials, and that you use pre-release versions at your own risk.
Install the NVIDIA DGX Package for Red Hat Enterprise Linux.
sudo dnf install -y https://repo.download.nvidia.com/baseos/el/el-files/9/nvidia-repo-setup-22.12-1.el9.x86_64.rpm
Installing Required Components
On Red Hat Enterprise Linux, run the following commands to enable additional repositories required by the DGX software.
sudo subscription-manager repos --enable=rhel-9-for-x86_64-appstream-rpms
sudo subscription-manager repos --enable=rhel-9-for-x86_64-baseos-rpms
sudo subscription-manager repos --enable=codeready-builder-for-rhel-9-x86_64-rpms
Upgrade to the latest software.
sudo dnf update -y --nobest
Install DGX tools and configuration files.
For DGX-1, install DGX-1 Configurations.
sudo dnf group install -y 'DGX-1 Configurations'
For the DGX-2, install DGX-2 Configurations.
sudo dnf group install -y 'DGX-2 Configurations'
For the DGX A100, install DGX A100 Configurations.
sudo dnf group install -y 'DGX A100 Configurations'
For the DGX A800, install DGX A800 Configurations.
sudo dnf group install -y 'DGX A800 Configurations'
For the DGX H100, install DGX H100 Configurations.
sudo dnf group install -y 'DGX H100 Configurations'
For the DGX Station, install DGX Station Configurations.
sudo dnf group install -y 'DGX Station Configurations'
For the DGX Station A100, install DGX Station A100 Configurations.
sudo dnf group install -y 'DGX Station A100 Configurations'
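If you are unsure which configuration groups the repository provides, you can list the available groups and filter for the DGX entries:
dnf group list | grep -i dgx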
The configuration changes take effect only after rebooting the system. To reduce the number of reboots, you can defer rebooting until after you install the drivers.
Configuring Data Drives
The data drives in the DGX systems can be configured as RAID 0 or RAID 5. RAID 0 provides the maximum storage capacity and performance but no redundancy; RAID 5 trades some capacity for protection against a single drive failure.
RAID 0 is often used for data caching. You can use cachefilesd to provide a cache for NFS shares.
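For cachefilesd to cache an NFS share, the share must be mounted with the fsc option. As a minimal sketch, with placeholder server and export names:
sudo mount -t nfs -o fsc <NFS-Server>:/export/data /mnt/data
The nvidia-conf-cachefilesd package installed later in this section points the cache at the /raid partition.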
Important
You can change the RAID level later, but doing so destroys the data on those drives.
Except for the DGX-1, the RAID array can be configured during the operating system installation. If you already configured the RAID array during installation, skip the first step and go to step 2.
Configure the /raid partition. All DGX systems support RAID 0 or RAID 5 arrays.
The following commands create a RAID array, mount it to /raid, and create an appropriate entry in /etc/fstab.
To create a RAID 0 array:
sudo /usr/bin/configure_raid_array.py -c -f
To create a RAID 5 array:
sudo /usr/bin/configure_raid_array.py -c -f -5
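After either command completes, one way to confirm that the array is assembled and mounted, assuming a standard mdadm-based configuration, is to inspect the RAID status and the mount point:
cat /proc/mdstat
df -h /raid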
Note
The RAID array must be configured before installing nvidia-conf-cachefilesd, which places the proper SELinux label on the /raid directory. If you ever need to recreate the RAID array (which wipes out any labeling on /raid) after nvidia-conf-cachefilesd has already been installed, be sure to restore the label manually before restarting cachefilesd:
sudo restorecon /raid
sudo systemctl restart cachefilesd
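To confirm that the SELinux label on /raid was restored, you can display the directory's security context:
ls -Zd /raid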
(Optional) Install tools for managing the self-encrypting drives (SED) for the data drives on DGX A100, DGX A800, or DGX H100 systems.
Refer to Managing Self-Encrypting Drives for more information.
(Optional) If you wish to use your RAID array for caching, install nvidia-conf-cachefilesd. This updates the cachefilesd configuration to use the /raid partition.
sudo dnf install -y nvidia-conf-cachefilesd
Installing the GPU Driver
You can choose among different GPU driver branches for your DGX system. The latest driver release includes new features but might not provide the same support duration as an older release. Refer to the release notes in the NVIDIA Driver Documentation for more details and for the minimum required driver release for your GPU architecture.
Display a list of available drivers.
dnf module list nvidia-driver
Example Output
Name           Stream           Profiles                           Summary
nvidia-driver  latest           default [d], fm, ks, src           Nvidia driver for latest branch
nvidia-driver  latest-dkms [d]  default [d], fm, ks                Nvidia driver for latest-dkms branch
nvidia-driver  open-dkms        default [d], fm, ks, src           Nvidia driver for open-dkms branch
nvidia-driver  515              default [d], fm, ks, src           Nvidia driver for 515 branch
nvidia-driver  515-dkms         default [d], fm, ks                Nvidia driver for 515-dkms branch
nvidia-driver  515-open         default [d], fm, ks, src           Nvidia driver for 515-open branch
nvidia-driver  520              default [d], fm, ks, src           Nvidia driver for 520 branch
nvidia-driver  520-dkms         default [d], fm, ks                Nvidia driver for 520-dkms branch
nvidia-driver  520-open         default [d], fm, ks, src           Nvidia driver for 520-open branch
nvidia-driver  525              default [d], fm, ks, src           Nvidia driver for 525 branch
nvidia-driver  525-dkms         default [d], fm, ks                Nvidia driver for 525-dkms branch
nvidia-driver  525-open         default [d], fm, ks, src           Nvidia driver for 525-open branch
nvidia-driver  530              default [d], fm, ks, src           Nvidia driver for 530 branch
nvidia-driver  530-dkms         default [d], fm, ks                Nvidia driver for 530-dkms branch
nvidia-driver  530-open         default [d], fm, ks, src           Nvidia driver for 530-open branch
nvidia-driver  535 [e]          default [d] [i], fm, ks, src [i]   Nvidia driver for 535 branch
nvidia-driver  535-dkms         default [d], fm, ks                Nvidia driver for 535-dkms branch
nvidia-driver  535-open         default [d], fm, ks, src           Nvidia driver for 535-open branch
Before installing the NVIDIA CUDA driver and configuring the system, note the following:
Replace the release version (535) used as an example in step 2 with the release you want to install.
If the Stream column in the example output does not display the compiled driver version you want (for example, 535) but only the DKMS version (535-dkms), replace the sudo dnf module install --nobest -y nvidia-driver:535... command in step 2 with the following command, regardless of the system:
sudo dnf module install --nobest -y nvidia-driver:535-dkms/fm
Ensure that the driver release you intend to install is supported by the GPU in the system.
Caution
A known issue has been identified with DGX Station A100. For more information, refer to NVIDIA GPU Driver 550 Not Supported on DGX Station A100.
Install the NVIDIA CUDA driver.
For non-NVSwitch systems, such as DGX-1, DGX Station, and DGX Station A100, install the driver using the default and src profiles:
sudo dnf module install --nobest -y nvidia-driver:535/{default,src}
sudo dnf install -y nv-persistence-mode libnvidia-nscq-535
For NVSwitch systems, such as DGX-2, DGX A100, DGX H100, and DGX A800, install the driver using the fabric manager (fm) and source (src) profiles:
sudo dnf module install --nobest -y nvidia-driver:535/{fm,src}
sudo dnf install -y nv-persistence-mode nvidia-fm-enable
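Before rebooting, you can optionally confirm the driver version that was installed; modinfo reports the version of the NVIDIA kernel module staged for the running kernel:
modinfo -F version nvidia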
(DGX Station A100 only) Install additional packages required for DGX Station A100.
These packages must be installed after the nvidia-driver module is installed.
sudo dnf install -y nvidia-conf-xconfig nv-docker-gpus
The configuration changes take effect only after rebooting the system. To reduce the number of reboots, you can defer rebooting until after you install the NVIDIA Container Runtime group.
Install and configure the NVIDIA Container Toolkit with Docker CE or Podman.
Choose one of the following options:
Installing and Running Docker CE
To run an NVIDIA container with Docker CE:
Install the NVIDIA Container Runtime group, which installs the NVIDIA container device plugin along with Docker CE:
sudo dnf group install -y --allowerasing 'NVIDIA Container Runtime'
Reboot the system to load the drivers and to update the system configuration.
sudo reboot
After the system reboots, verify that the drivers are loaded and are handling the NVIDIA devices.
nvidia-smi
The output shows all available GPUs.
Example Output
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10    Driver Version: 535.86.10    CUDA Version: 12.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:06:00.0 Off |                    0 |
| N/A   35C    P0    42W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   35C    P0    44W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
...
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  On   | 00000000:8A:00.0 Off |                    0 |
| N/A   35C    P0    43W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Run the following command to verify the installation:
sudo docker run --gpus=all --rm nvcr.io/nvidia/cuda:12.2.0-base-ubi8 nvidia-smi
The output shows all available GPUs.
For information about nvcr.io, refer to the NGC Private Registry User Guide.
Installing and Running Podman
To run an NVIDIA container with Podman:
Install Podman.
sudo dnf install podman
Install the nvidia-container-toolkit-base package.
sudo dnf clean expire-cache && sudo dnf install -y nvidia-container-toolkit-base
Check the NVIDIA Container Toolkit version.
nvidia-ctk --version
Generate the Container Device Interface (CDI) specification file.
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
The sample command uses sudo to ensure that the file at /etc/cdi/nvidia.yaml is created. You can omit the --output argument to print the generated specification to STDOUT.
Verify that the GPU drivers are loaded and are handling the NVIDIA devices.
nvidia-smi -L
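If you want to additionally confirm that the generated CDI specification is visible to the container toolkit, recent versions of nvidia-ctk can list the device names it defines:
nvidia-ctk cdi list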
Run the following command to verify the installation.
sudo podman run --rm --device nvidia.com/gpu=all ubuntu nvidia-smi -L
Verify your installation by running a sample container with Podman.
podman run --security-opt=label=disable --rm --device nvidia.com/gpu=all ubuntu nvidia-smi
Installing Optional Components
The DGX is fully functional after installing the components as described in Installing Required Components. If you intend to launch NGC containers (which incorporate the CUDA toolkit, NCCL, cuDNN, and TensorRT) on the DGX system, which is the expected use case, then you can skip this section.
If you intend to use your DGX as a development system for running deep learning applications on bare metal, then install the optional components as described in this section.
Install the CUDA Toolkit 12.2 packages (see Installing the NVIDIA CUDA Driver from the Local Repository).
sudo dnf install -y cuda-toolkit-12-2 cuda-compat-12-2 nvidia-cuda-compat-setup
Note
The output of nvidia-smi shows the CUDA version that is natively compatible with the installed driver (for example, “NVIDIA-SMI 535.86.10 Driver Version: 535.86.10 CUDA Version: 12.2” in the prior steps). Install the CUDA toolkit and compatibility packages that match this version.
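As a quick check that the toolkit is usable, you can query the compiler version. This assumes the default installation prefix of /usr/local/cuda-12.2; adjust the path if your layout differs.
export PATH=/usr/local/cuda-12.2/bin:$PATH
nvcc --version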
To install the NVIDIA Collectives Communication Library (NCCL) Runtime, refer to the NCCL: Getting Started documentation.
sudo dnf group install -y 'NVIDIA Collectives Communication Library Runtime'
To install the CUDA Deep Neural Networks (cuDNN) Library Runtime, refer to the NVIDIA cuDNN page.
sudo dnf group install -y 'CUDA Deep Neural Networks Library Runtime'
To install NVIDIA TensorRT, refer to the NVIDIA TensorRT page.
Installing NVIDIA GPUDirect Storage
Prerequisites
For systems other than NVIDIA DGX-1, DGX-2, and DGX Station, to use the latest GDS version, 12.2.2-1, that is provided by nvidia-fs-dkms-2.17.5-1, you must install an NVIDIA Open GPU Kernel module driver. Refer to Installing the GPU Driver for more information about installing the driver.
The GPUs in NVIDIA DGX-1, DGX-2, and DGX Station systems running the generic Linux kernel are not supported by the NVIDIA Open GPU Kernel modules, and GDS versions 12.2.2-1 and higher support only the Open GPU Kernel modules.
For these systems, you must lock the nvidia-fs package to version 2.17.3 or lower and the nvidia-gds package to version 12.2.1-1 or lower.
sudo dnf install python3-dnf-plugin-versionlock
sudo dnf versionlock add nvidia-fs-0:2.17.3-1 nvidia-fs-dkms-0:2.17.3-1 nvidia-gds-0:12.2.1-1
Example Output
Adding versionlock on: nvidia-fs-0:2.17.3-1.*
Adding versionlock on: nvidia-gds-0:12.2.1-1.*
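To confirm that the locks are in place, you can list the active version locks:
sudo dnf versionlock list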
Procedure
To install NVIDIA GPUDirect Storage (GDS), perform the following steps.
Install the kernel headers and development packages for your kernel.
sudo dnf install -y kernel-headers-$(uname -r) kernel-devel-$(uname -r)
Install the GDS package.
sudo dnf install -y nvidia-gds
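As an optional sanity check before the full verification procedure, you can run the gdscheck tool, which reports whether GDS is supported and enabled. The tools path below assumes the default CUDA install location and can vary by release.
python3 /usr/local/cuda/gds/tools/gdscheck.py -p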
Refer to Verifying a Successful GDS Installation in the NVIDIA GPUDirect Storage Installation and Troubleshooting Guide.
Installing the Optional NVIDIA Desktop Theme
The DGX Software Repository also provides optional theme packages and desktop wallpapers that give the user interface an NVIDIA look and feel on the DGX Station desktop. These packages would have been installed as part of the DGX Station Configurations group.
To apply the theme and background images, first open gnome-tweaks.
Under Applications, select one of the NV-Yaru themes. This comes in default, light, and dark variations.
Under Shell, select the NV-Yaru-dark theme.
If this field is grayed out, you might need to reboot the system or restart GDM in order to enable the user-themes extension.
To restart GDM, issue the following.
sudo systemctl restart gdm
Select one of the NVIDIA wallpapers for the background image and lock screen.