Installing the DGX Software#

This section requires that you have already installed Red Hat Enterprise Linux or a derived operating system on the DGX™ system. You can skip this section if you already installed NVIDIA BaseOS during a kickstart installation.

Important

Before performing the installation, refer to the Release Notes for the latest information and additional instructions depending on the specific release.

Configuring a System Proxy#

If your network requires you to use a proxy:

  • Edit the file /etc/dnf/dnf.conf and make sure the following lines are present in the [main] section, using the parameters that apply to your network:

    proxy=http://<Proxy-Server-IP-Address>:<Proxy-Port>
    proxy_username=<Proxy-User-Name>
    proxy_password=<Proxy-Password>
    
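To confirm that dnf can reach the repositories through the configured proxy, you can refresh the metadata cache (an optional check; it does not modify the system):

sudo dnf makecache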

Enabling the NVIDIA and DGX Software Repositories and Installing Required Components#

Attention

By running these commands you are confirming that you have read and agree to be bound by the NVIDIA Software License Agreement found on the NVIDIA Enterprise Software page. You are also confirming that you understand that any pre-release software and materials available that you elect to install in a DGX might not be fully functional, might contain errors or design flaws, and might have reduced or different security, privacy, availability, and reliability standards relative to commercial versions of NVIDIA software and materials, and that you use pre-release versions at your own risk.

  1. On Red Hat Enterprise Linux, run the following commands to enable additional repositories required by the DGX software.

    sudo subscription-manager repos --enable=rhel-10-for-x86_64-appstream-rpms
    sudo subscription-manager repos --enable=rhel-10-for-x86_64-baseos-rpms
    sudo subscription-manager repos --enable=codeready-builder-for-rhel-10-x86_64-rpms
    
  2. Install the NVIDIA repository:

    sudo dnf install -y https://repo.download.nvidia.com/baseos/el/el-files/10/nvidia-repositories-25.09-5.el10.x86_64.rpm
    
  3. Install the DGX repository:

    sudo dnf install -y https://repo.download.nvidia.com/baseos/el/el-files/10/dgx-repositories-25.09-2.el10.x86_64.rpm
    
  4. Upgrade to the latest software.

    sudo dnf update -y --nobest
    
    sudo reboot
    
  5. Install kernel-devel and kernel-headers packages:

    sudo dnf install -y kernel-devel-$(uname -r) kernel-headers-$(uname -r)
    
  6. Install NVIDIA System Core, NVIDIA System Utils, and DGX System Utils:

    sudo dnf group install -y "NVIDIA System Core"
    sudo dnf group install -y "NVIDIA System Utils"
    sudo dnf group install -y "DGX System Utils"
    
  7. Verify that the active profile is the proper dgx-<platform>-performance profile. If a different profile is active, see the note after this procedure.

    sudo tuned-adm active
    
  8. Install the nvidia-acs-disable package to allow better GPUDirect performance in bare-metal use cases:

    sudo dnf install -y nvidia-acs-disable
    
  9. On Ampere (A100, A800) systems only, install the nvidia-pci-bridge-power package, which sets the power control to “on” for all PCI bridges upstream of the GPUs so that firmware updates succeed:

    sudo dnf install nvidia-pci-bridge-power
    

The above configuration changes take effect only after rebooting the system. To reduce the number of reboots, you can defer rebooting until this user guide instructs you to reboot when you are installing DOCA and the GPU driver.
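
As an optional sanity check, you can confirm that the NVIDIA and DGX repositories are enabled and that the software groups were installed (a quick sketch; the exact output varies by release):

dnf repolist | grep -iE 'nvidia|dgx'
dnf group list --installed

If tuned-adm active does not report the expected dgx-<platform>-performance profile, you can switch to it with tuned-adm; replace the placeholder below with the profile name for your platform:

sudo tuned-adm profile dgx-<platform>-performance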

Configuring Data Drives#

The data drives in the DGX systems can be configured as RAID 0 or RAID 5. RAID 0 provides the maximum storage capacity and performance, but does not provide any redundancy.

RAID 0 is often used for data caching. When the NVIDIA System Core group is installed, it installs the nvidia-tuned-profiles package, which configures the cachefilesd service to provide a cache for NFS shares.

Important

You can change the RAID level later, but doing so destroys the data on those drives.

The RAID array can be configured during the operating system installation. If you have already configured the RAID array during the installation, you can skip the first step and go to step 2.

  1. Configure the /raid partition.

    All DGX systems support RAID 0 or RAID 5 arrays.

    Run one of the following commands to create a RAID array, mount it to /raid, and create an appropriate entry in /etc/fstab. An optional verification check is shown after this procedure.

    • To create a RAID 0 array:

      sudo /usr/bin/configure_raid_array.py -c -f
      
    • To create a RAID 5 array:

      sudo /usr/bin/configure_raid_array.py -c -f -5
      
    • After creating the RAID array, or any time it is re-created, run the following two commands to restore the SELinux label on /raid and restart the cachefilesd service:

      sudo restorecon /raid
      sudo systemctl restart cachefilesd
      
  2. (Optional) Install tools for managing the self-encrypting drives (SED) for the data drives on DGX A100, DGX A800, or DGX H100/H200/B200 systems.

    Refer to Managing Self-Encrypting Drives for more information.
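
As an optional check after creating the array (step 1), you can confirm that it is assembled and mounted at /raid (a quick sketch; device names vary by system):

cat /proc/mdstat
df -h /raid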

Enabling Relaxed Ordering for NVMe Drives#

The Samsung NVMe drives used in the NVIDIA DGX systems support relaxed ordering for I/O operations. Relaxed ordering enables the PCIe bus to complete transactions out of order. NVIDIA recommends enabling this setting when you use GPUDirect Storage to improve performance.

To enable relaxed ordering for I/O operations, run the nvidia-relaxed-ordering-nvme.sh utility as follows:

sudo /bin/nvidia-relaxed-ordering-nvme.sh enable
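
As an optional check, you can inspect the Device Control register of the NVMe controllers with lspci; when relaxed ordering is enabled, the DevCtl line reports RlxdOrd+ (the class filter below selects NVMe controllers):

sudo lspci -d ::0108 -vvv | grep RlxdOrd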

Configuring NVMe Interrupt Coalescing#

The nvidia-nvme-options package, which is installed on all DGX systems, automatically configures NVMe interrupt coalescing on all Samsung and Kioxia drives at each boot. To disable or manually configure this setting, use the following commands:

To disable the setting:

sudo systemctl stop nvidia-nvme-interrupt-coalescing.service
sudo systemctl disable nvidia-nvme-interrupt-coalescing.service

To configure the setting manually:

sudo /usr/bin/nvidia-nvme-interrupt-coalescing.sh enable
sudo /usr/bin/nvidia-nvme-interrupt-coalescing.sh disable
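
As an optional check, you can read the current interrupt coalescing setting (NVMe feature 0x08) from a drive with the nvme CLI from the nvme-cli package; /dev/nvme0 below is an example device:

sudo nvme get-feature /dev/nvme0 -f 0x08 -H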

Installing the GPU Driver#

Note

If you will be installing DOCA, complete the steps in the Installing NVIDIA DOCA-OFED section now. DOCA must be installed before the GPU driver; otherwise, the nvidia-peermem module will not load.

  1. Depending on the platform type being installed, use one of the following instructions to install the precompiled version of the GPU driver:

    • On NVSwitch systems with fifth-generation NVLink, which include the DGX B200, run the following command to install the NVIDIA driver:

    sudo dnf install -y nvidia-driver-cuda kmod-nvidia-open-dkms nvidia-fabricmanager nvlsm libnvsdm libnvidia-nscq
    
    • On NVSwitch systems without fifth-generation NVLink, which include the DGX H200, DGX H100, DGX A100, and DGX A800, run the following command to install the NVIDIA driver:

    sudo dnf install nvidia-driver-cuda kmod-nvidia-open-dkms nvidia-fabricmanager libnvidia-nscq
    
  2. The configuration changes take effect only after rebooting the system. To reduce the number of reboots, you can defer rebooting until after you install the NVIDIA Container Runtime group.

  3. Install and configure the NVIDIA Container Toolkit with Podman.

  4. Reboot.

    sudo reboot
    
  5. Verify the installation of the nvidia-driver by running:

    nvidia-smi
    

The nvidia-smi command should return without errors, with output showing the Driver Version, CUDA Version, all available GPUs, etc.
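
On NVSwitch systems, you can also confirm that the Fabric Manager service installed by the nvidia-fabricmanager package started after the reboot (an optional check):

systemctl status nvidia-fabricmanager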

Installing and Running Podman#

To run an NVIDIA container with Podman:

  1. Install Podman.

    sudo dnf install -y podman
    
  2. Refresh the repository metadata cache:

    sudo dnf clean expire-cache
    
  3. Generate the Container Device Interface (CDI) specification file.

    sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
    

    Note: This sample command uses sudo to ensure that the file at /etc/cdi/nvidia.yaml is created. You can omit the --output argument to print the generated specification to STDOUT. You can also list the device names in the generated specification, as shown after this procedure.

  4. Verify that the GPU drivers are loaded and are handling the NVIDIA devices.

    nvidia-smi -L
    
  5. Run a sample container as root to verify the installation.

    sudo podman run --rm --device nvidia.com/gpu=all ubuntu nvidia-smi -L
    
  6. Verify that rootless containers can access the GPUs by running a sample container with SELinux label separation disabled.

    podman run --security-opt=label=disable --rm --device nvidia.com/gpu=all ubuntu nvidia-smi
    
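If the sample container does not see the GPUs, you can list the device names described by the generated CDI specification (an optional troubleshooting check):

nvidia-ctk cdi list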

Installing Optional Components#

The DGX is fully functional after installing the components as described in Enabling the NVIDIA and DGX Software Repositories and Installing Required Components. If you intend to launch NGC containers (that incorporate the CUDA toolkit, NCCL, cuDNN, and TensorRT) on the DGX system, which is the expected use case, then you can skip this section.

If you intend to use your DGX as a development system for running deep learning applications on bare metal, then install the optional components as described in this section.

CUDA Toolkit#

Before installing the CUDA Toolkit, ensure that the GPU driver has been installed by doing the steps in the Installing the GPU Driver section above.

Install CUDA Toolkit 13.0 packages:

Note

The output of nvidia-smi shows the version of CUDA that is natively compatible with the installed driver:

NVIDIA-SMI 580.105.08       Driver Version: 580.105.08       CUDA Version: 13.0

It is recommended that you install the CUDA toolkit and compatible packages that match the CUDA version shown in the nvidia-smi output. For instance, if the nvidia-smi output lists “Driver Version” as “580.105.08” and “CUDA Version” as “13.0”, specify “13-0” when installing the cuda-toolkit and cuda-compat packages in the command below. (See the Release Notes for more information.)

sudo dnf install -y cuda-toolkit-13-0 cuda-compat-13-0
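
After installation, you can confirm the toolkit version with nvcc; the default installation path /usr/local/cuda is assumed here:

/usr/local/cuda/bin/nvcc --version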

DCGM#

  • To install and enable the Data Center GPU Manager (DCGM):

sudo dnf install -y datacenter-gpu-manager-4-cuda13
sudo systemctl --now enable nvidia-dcgm
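
As an optional check, you can list the GPUs that DCGM discovers once the service is running:

dcgmi discovery -l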

NCCL#

  • To install the NVIDIA Collective Communications Library (NCCL) runtime, refer to the NCCL Getting Started documentation.

    sudo dnf group install -y 'NVIDIA Collectives Communication Library Runtime'
    
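As an optional check, you can list the NCCL packages that the group installed:

rpm -qa | grep -i nccl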

cuDNN#

  • To install the NVIDIA CUDA Deep Neural Networks (cuDNN) Library Runtime, refer to the NVIDIA cuDNN page.

    sudo dnf group install -y 'NVIDIA CUDA Deep Neural Networks Library Runtime'
    

NVIDIA TensorRT#
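
To install the NVIDIA TensorRT inference runtime, refer to the NVIDIA TensorRT documentation. Following the pattern of the other runtime groups in this section, the installation is expected to be a dnf group install along the following lines; the exact group name here is an assumption and should be confirmed against the DGX repository:

sudo dnf group install -y 'NVIDIA TensorRT'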

GDS#

To install NVIDIA GPUDirect Storage (GDS), perform the following steps.

  1. Ensure that the kernel headers and development packages for your kernel are installed.

    The kernel headers and development packages were installed in the Enabling the NVIDIA and DGX Software Repositories and Installing Required Components section above.

    sudo dnf install -y kernel-headers-$(uname -r) kernel-devel-$(uname -r)
    
  2. Install the GDS package.

    sudo dnf install -y nvidia-gds
    

Refer to Verifying a Successful GDS Installation in the NVIDIA GPUDirect Storage Installation and Troubleshooting Guide.
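
The guide linked above describes the gdscheck utility for verifying the platform configuration. As a quick sketch (the path assumes a default CUDA Toolkit installation):

python3 /usr/local/cuda/gds/tools/gdscheck.py -p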