Installing NVIDIA DOCA-OFED

The NVIDIA DGX™ Software Stack for Red Hat Enterprise Linux does not include the NVIDIA DOCA™ OFED (OpenFabrics Enterprise Distribution) software for Linux. This is to ensure that the DOCA-OFED software, a subset of the full DOCA package, is in sync with the Red Hat distribution kernel. This topic describes how to download, install, and upgrade the DOCA-OFED software on systems that are running Red Hat Enterprise Linux.

DOCA-Host Installation Profiles

The DOCA software package contains several subsets called the DOCA-Host installation profiles, which are fully validated and tested installation packages. The following table lists the available DOCA-Host profiles:

DOCA-Host Profile | Description
doca-ofed | Installs the same drivers and tools as MLNX_OFED by using the DOCA-Host package, without other DOCA functionality.
doca-network | Intended for users who want only the networking functionality of the DOCA-Host package.
doca-all | Intended for users who want the full set of DOCA drivers and libraries, that is, the full DOCA-Host installation.

For more information, refer to NVIDIA DOCA Profiles.

Prerequisites

  1. Before installing a different version of DOCA-OFED software, you must remove the installed DOCA-OFED or MLNX_OFED software on your system.

    • Debian-based Linux

      # Remove the installed DOCA-OFED software.
      for f in $( dpkg --list | grep doca | awk '{print $2}' ); do echo $f ; sudo apt remove --purge $f -y ; done
      
      # Remove the installed MLNX_OFED software.
      sudo /usr/sbin/ofed_uninstall.sh --force
      
      sudo apt-get autoremove
      
    • RPM-based Linux

      # Remove the installed DOCA-OFED software from the host.
      for f in $(rpm -qa | grep -i doca ) ; do sudo dnf -y remove $f; done
      
      # Remove the installed MLNX_OFED software.
      sudo /usr/sbin/ofed_uninstall.sh --force
      
      sudo dnf autoremove
      
      sudo dnf makecache
      
  2. Download and install the NVIDIA RPM GPG key.

    1. Download the NVIDIA RPM-GPG-KEY-Mellanox-SHA256 key.

      sudo wget http://www.mellanox.com/downloads/ofed/RPM-GPG-KEY-Mellanox-SHA256
      
    2. Install the key.

      sudo rpm --import RPM-GPG-KEY-Mellanox-SHA256
      
    3. Verify that the key was successfully imported.

      sudo rpm -q gpg-pubkey --qf '%{NAME}-%{VERSION}-%{RELEASE}\t%{SUMMARY}\n' | grep Mellanox
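
The verification in the last sub-step can be wrapped in a small helper so that a provisioning script stops early when the key is missing. The following is a minimal sketch, not part of the official procedure: the function name mellanox_key_present is hypothetical, and it relies on the same assumption as the grep command above, namely that the key summary contains the string "Mellanox".

```shell
#!/bin/sh
# Hypothetical helper: succeeds if any line on stdin mentions "Mellanox".
# It reads stdin so that it can be fed from the rpm query shown above.
mellanox_key_present() {
    grep -q 'Mellanox'
}

# Usage (not executed here):
#   rpm -q gpg-pubkey --qf '%{NAME}-%{VERSION}-%{RELEASE}\t%{SUMMARY}\n' \
#       | mellanox_key_present || { echo "Mellanox GPG key not imported" >&2; exit 1; }
```

Reading from stdin keeps the helper independent of rpm itself, so it can be exercised without touching the RPM database.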
      

Installing DOCA-OFED on Systems with ConnectX-7 Cards

  1. Prepare for the installation of the DOCA-Host package using the doca-ofed profile by either setting up online installation of DOCA (preferred method to get the latest version) or downloading the DOCA RPM package:

    1. Set up online installation of DOCA as follows:

      For RHEL 9.6:

      # sudo is not needed for echo: the redirection runs in your own shell and /tmp is writable.
      echo "[doca]
      name=DOCA Online Repo
      baseurl=https://linux.mellanox.com/public/repo/doca/2.9.3/rhel9.6/x86_64/
      enabled=1
      gpgcheck=0" > /tmp/doca.repo
      
      sudo mv /tmp/doca.repo /etc/yum.repos.d/doca.repo
      
      sudo chown root:root /etc/yum.repos.d/doca.repo
      

      For RHEL 9.5:

      # sudo is not needed for echo: the redirection runs in your own shell and /tmp is writable.
      echo "[doca]
      name=DOCA Online Repo
      baseurl=https://linux.mellanox.com/public/repo/doca/DGX_latest_DOCA/rhel9.5/x86_64/
      enabled=1
      gpgcheck=0" > /tmp/doca.repo
      
      sudo mv /tmp/doca.repo /etc/yum.repos.d/doca.repo
      
      sudo chown root:root /etc/yum.repos.d/doca.repo
      
    2. Alternatively, download and install the DOCA RPM package:

      • Download the desired DOCA RPM package from one of the following locations:

        1. Open the Installation Files page, then choose and download the DOCA-Host installation file that matches your OS and architecture.

        2. Download the doca-ofed RHEL-Rocky DOCA RPM installation package from the DOCA Downloads page (https://developer.nvidia.com/doca-downloads).

        3. To obtain the doca-ofed RHEL-Rocky DOCA RPM installation package for a previous DOCA release, download it from the NVIDIA DOCA Downloads and Documentation page (https://developer.nvidia.com/doca-archive).

      • Install the downloaded RPM package.

        sudo rpm -Uvh <repo_file>.rpm
        
  2. Clean up temporary repository files and perform an update.

    Note

    To prevent undesired upgrade of the Linux kernel, such as from RHEL 9.5 to 9.6, you should pin the desired RHEL release by setting the --set=<release> option of the subscription-manager release command.

    For example, to stay on the RHEL 9.5 release:

    sudo subscription-manager release --set=9.5
    

    Before changing the --set=<release> setting and running sudo dnf update --nobest, check the Release Notes section for GPU driver and Linux kernel support.

    sudo dnf clean all -y
    
    sudo dnf update --nobest
    
    sudo dnf makecache -y
    
  3. Install the kernel-modules-extra package.

    sudo dnf install -y kernel-modules-extra-$(uname -r)
    
  4. Determine if the kernel version on your host is supported as shown in Supported Host OS per DOCA-Host Installation Profile.

    If the kernel version is not supported, follow the instructions described in DOCA Extra Package.

  5. Run the dnf install command for the doca-ofed profile installation.

    sudo dnf install -y doca-ofed
    
  6. The online repo contains the mlnx-fw-updater tool. Installing doca-ofed also installs the doca-host package, which provides a repo from which mlnx-fw-updater can be installed. Install mlnx-fw-updater as follows:

    sudo dnf install mlnx-fw-updater
    
  7. The mlnxofed-docs documentation can be installed as follows:

    sudo dnf install mlnxofed-docs
    
  8. Re-create an initramfs image.

    sudo dracut -f
    
  9. Reboot the system.

    sudo systemctl reboot
    
  10. Register your new Red Hat Enterprise Linux system to the Customer Portal using Red Hat Subscription-Manager if you have not already done so.

    For more information, refer to How to register and subscribe a RHEL system to the Red Hat Customer Portal using Red Hat Subscription-Manager?.

For more information about the doca-ofed profile installation on the host, refer to Installing Software on Host.
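
The online repository setup in step 1 of the procedure above can also be scripted. The following is a minimal sketch under stated assumptions: the helper names render_doca_repo and doca_baseurl_for are hypothetical, and the two base URLs are copied from this topic rather than discovered at run time. Piping the stanza through sudo tee writes the file directly as root, which avoids the temporary file in /tmp and the separate mv and chown commands.

```shell
#!/bin/sh
# Hypothetical helper: print a DOCA repo stanza for the given base URL.
render_doca_repo() {
    printf '%s\n' \
        '[doca]' \
        'name=DOCA Online Repo' \
        "baseurl=$1" \
        'enabled=1' \
        'gpgcheck=0'
}

# Hypothetical helper: map a RHEL release to the base URL listed in this topic.
# Releases that this topic does not list produce an error instead of a guess.
doca_baseurl_for() {
    case "$1" in
        9.6) echo 'https://linux.mellanox.com/public/repo/doca/2.9.3/rhel9.6/x86_64/' ;;
        9.5) echo 'https://linux.mellanox.com/public/repo/doca/DGX_latest_DOCA/rhel9.5/x86_64/' ;;
        *)   echo "no DOCA repo URL listed for RHEL $1" >&2; return 1 ;;
    esac
}

# Usage (not executed here):
#   render_doca_repo "$(doca_baseurl_for 9.6)" \
#       | sudo tee /etc/yum.repos.d/doca.repo > /dev/null
```

Keeping the URL lookup separate from the rendering makes it straightforward to add another release later without touching the stanza layout.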

Installing DOCA-OFED on Systems with BlueField-3 in NIC Mode

If your system is equipped with the NVIDIA® BlueField®-3 DPU, ensure that the DPU is set to NIC mode (refer to NIC Mode for BlueField-3), and then proceed with the following instructions. For more information, refer to the DOCA Installation Guide for Linux.

  1. Determine the BlueField-3 device ID.

    Follow the instructions described in Determining BlueField Device ID.

  2. Install the DOCA-Host software on the host.

    Install DOCA as outlined in the steps below. For more information about installing the DOCA drivers and tools, refer to Installing Software on Host.

  3. Prepare for the installation of the DOCA-Host package using the doca-ofed profile by either setting up online installation of DOCA (preferred method to get the latest version) or downloading the DOCA RPM package:

    1. Set up online installation of DOCA as follows:

      For RHEL 9.6:

      # sudo is not needed for echo: the redirection runs in your own shell and /tmp is writable.
      echo "[doca]
      name=DOCA Online Repo
      baseurl=https://linux.mellanox.com/public/repo/doca/2.9.3/rhel9.6/x86_64/
      enabled=1
      gpgcheck=0" > /tmp/doca.repo
      
      sudo mv /tmp/doca.repo /etc/yum.repos.d/doca.repo
      
      sudo chown root:root /etc/yum.repos.d/doca.repo
      

      For RHEL 9.5:

      # sudo is not needed for echo: the redirection runs in your own shell and /tmp is writable.
      echo "[doca]
      name=DOCA Online Repo
      baseurl=https://linux.mellanox.com/public/repo/doca/DGX_latest_DOCA/rhel9.5/x86_64/
      enabled=1
      gpgcheck=0" > /tmp/doca.repo
      
      sudo mv /tmp/doca.repo /etc/yum.repos.d/doca.repo
      
      sudo chown root:root /etc/yum.repos.d/doca.repo
      
    2. Alternatively, download and install the DOCA RPM package:

      • Download the desired DOCA RPM package from one of the following locations:

        1. Open the Installation Files page, then choose and download the DOCA-Host installation file that matches your OS and architecture.

        2. Download the doca-ofed RHEL-Rocky DOCA RPM installation package from the DOCA Downloads page (https://developer.nvidia.com/doca-downloads).

        3. To obtain the doca-ofed RHEL-Rocky DOCA RPM installation package for a previous DOCA release, download it from the NVIDIA DOCA Downloads and Documentation page (https://developer.nvidia.com/doca-archive).

      • Install the downloaded RPM package.

        sudo rpm -Uvh <repo_file>.rpm
        
  4. Clean up temporary repository files and perform an update.

    Note

    To prevent undesired upgrade of the Linux kernel, such as from RHEL 9.5 to 9.6, you should pin the desired RHEL release by setting the --set=<release> option of the subscription-manager release command.

    For example, to stay on the RHEL 9.5 release:

    sudo subscription-manager release --set=9.5
    

    Before changing the --set=<release> setting and running sudo dnf update --nobest, check the Release Notes section for GPU driver and Linux kernel support.

    sudo dnf clean all -y
    
    sudo dnf update --nobest
    
    sudo dnf makecache -y
    
  5. Install the kernel-modules-extra package.

    sudo dnf install -y kernel-modules-extra-$(uname -r)
    
  6. Determine if the kernel version on your host is supported as shown in Supported Host OS per DOCA-Host Installation Profile.

    If the kernel version is not supported, follow the instructions described in DOCA Extra Package.

  7. Run the dnf install command for the doca-ofed profile installation.

    sudo dnf install -y doca-ofed
    
  8. Install the RShim driver to manage and flash the BlueField-3 DPU.

    Follow the procedure described in Installing Prerequisites on Host for Target BlueField.

    • Choose the procedure for the RPM-based Linux.

  9. Start RShim:

    sudo systemctl daemon-reload
    
    sudo systemctl enable rshim
    
    sudo systemctl start rshim
    
    sudo systemctl status rshim
    

    Note

    If the output contains “Failed to start rshim driver”, then RShim can be started manually as follows:

    sudo /usr/sbin/rshim
    
    • After a reboot, RShim will need to be started manually again the same way:

    sudo /usr/sbin/rshim
    
  10. To confirm that the NVIDIA BlueField-3 SoC Management Interface is present on the system, run the following command to print the PCI BDF of each BlueField-3 SoC Management Interface device:

    sudo lspci | grep "BlueField-3 SoC Management Interface"
    

    The output should look similar to the following:

    29:00.2 DMA controller: Mellanox Technologies MT43244 BlueField-3 SoC Management Interface (rev 01)
    aa:00.2 DMA controller: Mellanox Technologies MT43244 BlueField-3 SoC Management Interface (rev 01)
    
  11. The online repo contains the mlnx-fw-updater tool. Installing doca-ofed also installs the doca-host package, which provides a repo from which mlnx-fw-updater can be installed. Install mlnx-fw-updater as follows:

    sudo dnf install mlnx-fw-updater
    

    Note

    This will update the firmware on the BlueField cards.

  12. If the BlueField-3 SoC Management Interface is on the system, install the BF-bundle:

    sudo bfb-install --rshim rshim<N> --bfb <image_path.bfb>
    

    Where <N> is the RShim device identifier (/dev/rshimN).

  13. The mlnxofed-docs documentation can be installed as follows:

    sudo dnf install mlnxofed-docs
    
  14. Re-create an initramfs image.

    sudo dracut -f
    
  15. Reboot the system.

    sudo systemctl reboot
    
  16. Register your new Red Hat Enterprise Linux system to the Customer Portal using Red Hat Subscription-Manager if you have not already done so.

    For more information, refer to How to register and subscribe a RHEL system to the Red Hat Customer Portal using Red Hat Subscription-Manager?.
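
Step 10 of the procedure above prints one line per BlueField-3 SoC Management Interface device. When a script needs only the PCI BDF values, for example to count the devices, the lspci output can be filtered further. The following is a minimal sketch: the helper name bf3_mgmt_bdfs is hypothetical, and it assumes the lspci line format shown in the sample output of step 10, where the BDF is the first whitespace-separated field.

```shell
#!/bin/sh
# Hypothetical helper: read lspci output on stdin and print the PCI BDF
# (first field) of every BlueField-3 SoC Management Interface device.
bf3_mgmt_bdfs() {
    awk '/BlueField-3 SoC Management Interface/ { print $1 }'
}

# Usage (not executed here):
#   lspci | bf3_mgmt_bdfs
#   lspci | bf3_mgmt_bdfs | wc -l    # number of management interfaces found
```

For the two sample lines shown in step 10, the helper prints 29:00.2 and aa:00.2, one per line.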

Additional Information

Installing the nvidia-peermem-loader Package

The nvidia-peermem kernel module registers the NVIDIA GPU with the InfiniBand subsystem by using peer-to-peer APIs provided by the NVIDIA GPU driver. This module, originally maintained by Mellanox on GitHub, is now included with the NVIDIA Linux GPU driver. For more information, refer to Using nvidia-peermem in the NVIDIA GPUDirect RDMA documentation.

No service automatically loads the nvidia-peermem module. To load the module automatically at boot, install the NVIDIA peermem loader package (nvidia-peermem-loader).

sudo dnf install nvidia-peermem-loader

This package adds an entry for the nvidia-peermem module to /etc/modules-load.d/nvidia-peermem.conf, so that the module is loaded at boot.
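
After the next boot, one way to confirm that the module was loaded is to look for it in the lsmod output. The following is a minimal sketch: the helper name module_listed is hypothetical, and it assumes that loaded modules are reported with underscores in place of hyphens (nvidia_peermem), which is how the kernel normalizes module names.

```shell
#!/bin/sh
# Hypothetical helper: read lsmod-style output on stdin and succeed if the
# named module appears in the first column. Hyphens in the requested name
# are converted to underscores before the comparison.
module_listed() {
    name="$(printf '%s' "$1" | tr '-' '_')"
    awk -v m="$name" '$1 == m { found = 1 } END { exit !found }'
}

# Usage (not executed here):
#   lsmod | module_listed nvidia-peermem && echo "nvidia-peermem is loaded"
```

Matching on the first column only avoids false positives from the "Used by" column, where a module name can appear as a dependency of another module.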