Installing the DGX Software

This section requires that you have already installed Red Hat Enterprise Linux or a derived operating system on the DGX system. You can skip this section if you already installed the DGX software stack during a kickstart install.

Important

Before performing the installation, refer to the Release Notes for the latest information and additional instructions depending on the specific release.

Configuring a System Proxy

If your network requires you to use a proxy:

  • Edit the file /etc/dnf/dnf.conf and make sure the following lines are present in the [main] section, using the parameters that apply to your network:

    proxy=http://<Proxy-Server-IP-Address>:<Proxy-Port>
    proxy_username=<Proxy-User-Name>
    proxy_password=<Proxy-Password>
    
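    For example, a filled-in [main] section might look like the following; the gpgcheck and installonly_limit lines are typical RHEL 9 defaults, and the proxy address, port, and credentials are placeholders for your environment:

    [main]
    gpgcheck=1
    installonly_limit=3
    proxy=http://192.0.2.10:3128
    proxy_username=dgxadmin
    proxy_password=examplepassword

    After saving the file, confirm that dnf can reach the repositories through the proxy:

    sudo dnf repolist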

Enabling the DGX Software Repository

Attention

By running these commands you are confirming that you have read and agree to be bound by the DGX Software License Agreement. You are also confirming that you understand that any pre-release software and materials available that you elect to install in a DGX might not be fully functional, might contain errors or design flaws, and might have reduced or different security, privacy, availability, and reliability standards relative to commercial versions of NVIDIA software and materials, and that you use pre-release versions at your own risk.

Install the NVIDIA DGX Package for Red Hat Enterprise Linux.

sudo dnf install -y https://repo.download.nvidia.com/baseos/el/el-files/9/nvidia-repo-setup-22.12-1.el9.x86_64.rpm

Installing Required Components

  1. Upgrade to the latest software.

    sudo dnf update -y --nobest
    
  2. Install DGX tools and configuration files.

    • For DGX-1, install DGX-1 Configurations.

      sudo dnf group install -y 'DGX-1 Configurations'
      
    • For the DGX-2, install DGX-2 Configurations.

      sudo dnf group install -y 'DGX-2 Configurations'
      
    • For the DGX A100, install DGX A100 Configurations.

      sudo dnf group install -y 'DGX A100 Configurations'
      
    • For the DGX A800, install DGX A800 Configurations.

      sudo dnf group install -y 'DGX A800 Configurations'
      
    • For the DGX H100, install DGX H100 Configurations.

      sudo dnf group install -y 'DGX H100 Configurations'
      
    • For the DGX Station, install DGX Station Configurations.

      sudo dnf group install -y 'DGX Station Configurations'
      
    • For the DGX Station A100, install DGX Station A100 Configurations.

      sudo dnf group install -y 'DGX Station A100 Configurations'
      

    The configuration changes take effect only after rebooting the system. To reduce the number of reboots, you can defer rebooting until after you install the drivers.
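
    If you want to see which packages a configuration group pulls in before installing it, you can inspect the group with dnf. The DGX A100 group is used here only as an example; substitute the group for your system:

    dnf group info 'DGX A100 Configurations'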

Configuring Data Drives

The data drives in the DGX systems can be configured as RAID 0 or RAID 5. RAID 0 provides the maximum storage capacity and performance, but does not provide any redundancy.

RAID 0 is often used for data caching. You can use cachefilesd to provide a cache for NFS shares.
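
As a minimal sketch of that use case, once cachefilesd is configured (see the steps below), NFS shares mounted with the fsc option are cached on the RAID array. The server name and export path here are placeholders:

sudo mount -t nfs -o rw,fsc nfs-server.example.com:/export/datasets /mnt/datasets

For persistent mounts, add the same fsc option to the corresponding /etc/fstab entry.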

Important

You can change the RAID level later, but doing so destroys the data on those drives.

Except for the DGX-1, the RAID array can be configured during the operating system installation. If you already configured the RAID array during the installation, skip the first step and go to step 2.

  1. Configure the /raid partition.

    All DGX systems support RAID 0 or RAID 5 arrays.

    The following commands create a RAID array, mount it to /raid and create an appropriate entry in /etc/fstab.

    • To create a RAID 0 array:

      sudo /usr/bin/configure_raid_array.py -c -f
      
    • To create a RAID 5 array:

      sudo /usr/bin/configure_raid_array.py -c -f -5
      

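    To confirm the result, you can check that the array is mounted and listed in /etc/fstab; on systems where the script builds a Linux software (md) RAID array, /proc/mdstat also shows its status:

    df -h /raid
    grep raid /etc/fstab
    cat /proc/mdstat
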
    Note

    The RAID array must be configured before installing nvidia-conf-cachefilesd, which places the proper SELinux label on the /raid directory. If you ever need to recreate the RAID array — which will wipe out any labeling on /raid — after nvidia-conf-cachefilesd has already been installed, be sure to restore the label manually before restarting cachefilesd.

    sudo restorecon /raid
    sudo systemctl restart cachefilesd
    
  2. (Optional) Install tools for managing the self-encrypting drives (SED) for the data drives on DGX A100, DGX A800, or DGX H100 systems.

    Refer to Managing Self-Encrypting Drives for more information.

  3. (Optional) If you wish to use your RAID array for caching, install nvidia-conf-cachefilesd. This will update the cachefilesd configuration to use the /raid partition.

    sudo dnf install -y nvidia-conf-cachefilesd
    
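    As a quick check, verify that the cache directory configured in /etc/cachefilesd.conf now points at the /raid partition (the exact directory name is set by the package) and that the service is running:

    grep ^dir /etc/cachefilesd.conf
    systemctl status cachefilesd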

Installing the GPU Driver

You can choose between different GPU driver branches for your DGX system. The latest driver release includes new features but might not provide the same support duration as older releases. Refer to the release notes in the NVIDIA Driver Documentation for more details and for the minimum required driver release for your GPU architecture.

  1. Display a list of available drivers.

    dnf module list nvidia-driver
    

    Example Output

    Name                                    Stream                                    Profiles                                                   Summary
    nvidia-driver                           latest                                    default [d], fm, ks, src                                   Nvidia driver for latest branch
    nvidia-driver                           latest-dkms [d]                           default [d], fm, ks                                        Nvidia driver for latest-dkms branch
    nvidia-driver                           open-dkms                                 default [d], fm, ks, src                                   Nvidia driver for open-dkms branch
    nvidia-driver                           515                                       default [d], fm, ks, src                                   Nvidia driver for 515 branch
    nvidia-driver                           515-dkms                                  default [d], fm, ks                                        Nvidia driver for 515-dkms branch
    nvidia-driver                           515-open                                  default [d], fm, ks, src                                   Nvidia driver for 515-open branch
    nvidia-driver                           520                                       default [d], fm, ks, src                                   Nvidia driver for 520 branch
    nvidia-driver                           520-dkms                                  default [d], fm, ks                                        Nvidia driver for 520-dkms branch
    nvidia-driver                           520-open                                  default [d], fm, ks, src                                   Nvidia driver for 520-open branch
    nvidia-driver                           525                                       default [d], fm, ks, src                                   Nvidia driver for 525 branch
    nvidia-driver                           525-dkms                                  default [d], fm, ks                                        Nvidia driver for 525-dkms branch
    nvidia-driver                           525-open                                  default [d], fm, ks, src                                   Nvidia driver for 525-open branch
    nvidia-driver                           530                                       default [d], fm, ks, src                                   Nvidia driver for 530 branch
    nvidia-driver                           530-dkms                                  default [d], fm, ks                                        Nvidia driver for 530-dkms branch
    nvidia-driver                           530-open                                  default [d], fm, ks, src                                   Nvidia driver for 530-open branch
    nvidia-driver                           535 [e]                                   default [d] [i], fm, ks, src [i]                           Nvidia driver for 535 branch
    nvidia-driver                           535-dkms                                  default [d], fm, ks                                        Nvidia driver for 535-dkms branch
    nvidia-driver                           535-open                                  default [d], fm, ks, src                                   Nvidia driver for 535-open branch
    

    The following steps install the NVIDIA CUDA driver and configure the system. Replace the release version used as an example (535) with the release you want to install. Ensure that the driver release you intend to install is supported by the GPU in the system.

  2. Install the NVIDIA CUDA driver.

    1. For non-NVSwitch systems, such as DGX-1, DGX Station, and DGX Station A100, install the driver using the default and src profiles:

      sudo dnf module install --nobest -y nvidia-driver:535/{default,src}
      sudo dnf install -y nv-persistence-mode libnvidia-nscq-535
      
    2. For NVSwitch systems, such as DGX-2, DGX A100, and DGX A800, install the driver using the fabric manager (fm) and source (src) profiles:

      sudo dnf module install --nobest -y nvidia-driver:535/{fm,src}
      sudo dnf install -y nv-persistence-mode nvidia-fm-enable
      
    3. For DGX H100, install the DKMS version of the driver using the fabric manager (fm) profile:

      sudo dnf module install --nobest -y nvidia-driver:535-dkms/fm
      sudo dnf install -y nv-persistence-mode nvidia-fm-enable
      
  3. (DGX Station A100 Only) Install additional packages required for DGX Station A100.

    Install these packages after the nvidia-driver module has been installed.

    sudo dnf install -y nvidia-conf-xconfig nv-docker-gpus
    

    The configuration changes take effect only after rebooting the system. To reduce the number of reboots, you can defer rebooting until after you install the NVIDIA Container Runtime group.

  4. Install and configure the NVIDIA Container Toolkit with Docker CE or Podman.

    Choose one of the following options:

Installing and Running Docker CE

To run an NVIDIA container with Docker CE:

  1. Install the NVIDIA container device plugin along with Docker CE by installing the NVIDIA Container Runtime group:

    sudo dnf group install -y --allowerasing 'NVIDIA Container Runtime'
    
  2. Reboot the system to load the drivers and to update system configurations.

    1. Reboot the system.

      sudo reboot
      
    2. After the system reboots, verify that the drivers are loaded and are handling the NVIDIA devices.

      nvidia-smi
      

      The output shows all available GPUs.

      Example Output

      +-----------------------------------------------------------------------------+
      | NVIDIA-SMI 535.86.10   Driver Version: 535.86.10    CUDA Version: 12.2      |
      |-------------------------------+----------------------+----------------------+
      | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
      | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
      |                               |                      |               MIG M. |
      |===============================+======================+======================|
      |   0  Tesla V100-SXM2...  On   | 00000000:06:00.0 Off |                    0 |
      | N/A   35C    P0    42W / 300W |      0MiB / 16160MiB |      0%      Default |
      |                               |                      |                  N/A |
      +-------------------------------+----------------------+----------------------+
      |   1  Tesla V100-SXM2...  On   | 00000000:07:00.0 Off |                    0 |
      | N/A   35C    P0    44W / 300W |      0MiB / 16160MiB |      0%      Default |
      |                               |                      |                  N/A |
      +-------------------------------+----------------------+----------------------+
      ...
      +-------------------------------+----------------------+----------------------+
      |   7  Tesla V100-SXM2...  On   | 00000000:8A:00.0 Off |                    0 |
      | N/A   35C    P0    43W / 300W |      0MiB / 16160MiB |      0%      Default |
      |                               |                      |                  N/A |
      +-------------------------------+----------------------+----------------------+
      
      +-----------------------------------------------------------------------------+
      | Processes:                                                                  |
      |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
      |        ID   ID                                                   Usage      |
      |=============================================================================|
      |  No running processes found                                                 |
      +-----------------------------------------------------------------------------+
      
  3. Run the following command to verify the installation:

    sudo docker run --gpus=all --rm nvcr.io/nvidia/cuda:12.2.0-base-ubi8 nvidia-smi
    

    The output shows all available GPUs.

    For information about nvcr.io, refer to the NGC Private Registry User Guide.
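
    If a container on nvcr.io requires authentication, the login uses the literal username $oauthtoken and your NGC API key as the password; the key is a placeholder that you generate in your NGC account:

    sudo docker login nvcr.io --username '$oauthtoken'

    Enter the NGC API key when prompted for the password.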

Installing and Running Podman

To run an NVIDIA container with Podman:

  1. Install Podman.

    sudo dnf install podman
    
  2. Install the nvidia-container-toolkit-base package.

    sudo dnf clean expire-cache && sudo dnf install -y nvidia-container-toolkit-base
    
  3. Check the NVIDIA Container Toolkit version.

    nvidia-ctk --version
    
  4. Generate the Container Device Interface (CDI) specification file.

    sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
    

    The sample command uses sudo to ensure that the file at /etc/cdi/nvidia.yaml is created. You can omit the --output argument to print the generated specification to STDOUT.
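
    You can also list the device names defined in the generated specification, which is a quick way to confirm that the file was written correctly:

    nvidia-ctk cdi list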

  5. Verify that the GPU drivers are loaded and are handling the NVIDIA devices.

    nvidia-smi -L
    
  6. Run the following command to verify the installation.

    sudo podman run --rm --device nvidia.com/gpu=all ubuntu nvidia-smi -L
    
  7. You can also verify the installation by running a sample container without sudo (rootless Podman).

    podman run --security-opt=label=disable --rm --device nvidia.com/gpu=all ubuntu nvidia-smi
    

Installing Optional Components

The DGX is fully functional after installing the components as described in Installing Required Components. If you intend to launch NGC containers (which incorporate the CUDA toolkit, NCCL, cuDNN, and TensorRT) on the DGX system, which is the expected use case, then you can skip this section.

If you intend to use your DGX as a development system for running deep learning applications on bare metal, install the optional components described in this section.

To install the CUDA Toolkit 12.2 packages, run the following command (see Installing the NVIDIA CUDA Driver from the Local Repository):

sudo dnf install -y cuda-toolkit-12-2 cuda-compat-12-2 nvidia-cuda-compat-setup

Note

The output of nvidia-smi shows the CUDA version that is natively compatible with the installed driver (for example, “NVIDIA-SMI 535.86.10 Driver Version: 535.86.10 CUDA Version: 12.2” in the previous steps). It is recommended that you install the CUDA toolkit and compatible packages that match this version.
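
After installing the toolkit, a quick way to confirm the installed version is to query nvcc; the path below assumes the default installation prefix of /usr/local/cuda-12.2:

/usr/local/cuda-12.2/bin/nvcc --version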

  • To install the NVIDIA Collectives Communication Library (NCCL) Runtime, refer to the NCCL: Getting Started documentation.

    sudo dnf group install -y 'NVIDIA Collectives Communication Library Runtime'
    
  • To install the CUDA Deep Neural Networks (cuDNN) Library Runtime, refer to the NVIDIA cuDNN page.

    sudo dnf group install -y 'CUDA Deep Neural Networks Library Runtime'
    
  • To install NVIDIA TensorRT, refer to the NVIDIA TensorRT page.

Installing NVIDIA GPUDirect Storage

Prerequisites

  • For systems other than the NVIDIA DGX-1, DGX-2, and DGX Station, to use the latest GDS version (12.2.2-1, provided by nvidia-fs-dkms-2.17.5-1), you must install an NVIDIA Open GPU Kernel module driver. Refer to Installing the GPU Driver for more information about installing the driver.

  • For NVIDIA DGX-1, DGX-2, and DGX Station systems running the generic Linux kernel, the GPUs are not supported by the NVIDIA Open GPU Kernel modules, and GDS versions 12.2.2-1 and later support only the Open GPU Kernel modules.

    For these systems, you must lock the nvidia-fs package to version 2.17.3 or lower and the nvidia-gds package to version 12.2.1-1 or lower.

    sudo dnf install python3-dnf-plugin-versionlock
    sudo dnf versionlock add nvidia-fs-0:2.17.3-1 nvidia-fs-dkms-0:2.17.3-1 nvidia-gds-0:12.2.1-1
    

    Example Output

    Adding versionlock on: nvidia-fs-0:2.17.3-1.*
    Adding versionlock on: nvidia-gds-0:12.2.1-1.*
    
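    To confirm that the locks are in place, list them:

    sudo dnf versionlock list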

Procedure

To install NVIDIA GPUDirect Storage (GDS), perform the following steps.

  1. Install the kernel headers and development packages for your kernel.

    sudo dnf install -y kernel-headers-$(uname -r) kernel-devel-$(uname -r)
    
  2. Install the GDS package.

    sudo dnf install -y nvidia-gds
    

Refer to Verifying a Successful GDS Installation in the NVIDIA GPUDirect Storage Installation and Troubleshooting Guide.
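
As an additional quick check, the gdscheck tool that ships with GDS reports whether the nvidia-fs driver is loaded and which file systems support GPUDirect Storage; the path below assumes CUDA 12.2 installed under the default prefix:

/usr/local/cuda-12.2/gds/tools/gdscheck.py -p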

Installing the Optional NVIDIA Desktop Theme

The DGX Software Repository also provides optional theme packages and desktop wallpapers that give the user interface an NVIDIA look and feel on the DGX Station desktop. These packages are installed as part of the DGX Station Configurations group.

  1. To apply the theme and background images, first open gnome-tweaks.

  2. Under Applications, select one of the NV-Yaru themes, which come in default, light, and dark variations.

  3. Under Shell, select the NV-Yaru-dark theme.

    If this field is grayed out, you might need to reboot the system or restart GDM to enable the user-themes extension.

  4. To restart GDM, issue the following.

    sudo systemctl restart gdm
    
  5. Select one of the NVIDIA wallpapers for the background image and lock screen.

    [Image: desktop-theme-wallpaper.jpg]
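
If you prefer to apply the application theme from a terminal instead of gnome-tweaks, the standard GNOME settings key shown below can be used. This is only a sketch: confirm the exact theme name installed under /usr/share/themes, and note that the Shell theme still requires the user-themes extension and a tool such as gnome-tweaks.

gsettings set org.gnome.desktop.interface gtk-theme 'NV-Yaru-dark'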