Installing NVIDIA DOCA-OFED

The NVIDIA DGX™ Software Stack for Red Hat Enterprise Linux does not include the NVIDIA DOCA™ OFED (OpenFabrics Enterprise Distribution) software for Linux. DOCA-OFED, a subset of the full DOCA package, is installed separately so that it stays in sync with the Red Hat distribution kernel. This topic describes how to download, install, and upgrade the DOCA-OFED software on systems that are running Red Hat Enterprise Linux.

DOCA-Host Installation Profiles

The DOCA software package contains several subsets called DOCA-Host installation profiles, which are fully validated and tested installation packages. The following table lists the available DOCA-Host profiles:

| DOCA-Host Profile | Description |
| --- | --- |
| doca-ofed | Installs the same drivers and tools as MLNX_OFED using the DOCA-Host package, but without other DOCA functionality. |
| doca-network | Intended for users who want to use only the networking functionality of the DOCA-Host package. |
| doca-all | Intended for users who want the full DOCA-Host installation, with the full extent of DOCA drivers and libraries. |

For more information, refer to NVIDIA DOCA Profiles.

Prerequisites

  1. Download and install the NVIDIA RPM GPG key.

    1. Download the NVIDIA RPM-GPG-KEY-Mellanox-SHA256 key.

      sudo wget http://www.mellanox.com/downloads/ofed/RPM-GPG-KEY-Mellanox-SHA256
      
    2. Install the key.

      sudo rpm --import RPM-GPG-KEY-Mellanox-SHA256
      
    3. Verify that the key was successfully imported.

      sudo rpm -q gpg-pubkey --qf '%{NAME}-%{VERSION}-%{RELEASE}\t%{SUMMARY}\n' | grep Mellanox
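The verification in the last sub-step can be turned into a pass/fail check for use in provisioning scripts. The following is a minimal sketch, not part of the NVIDIA procedure: `has_mellanox_key` is a hypothetical helper name, and it assumes that the imported key's summary line contains the string "Mellanox", as the `grep` above implies.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not an NVIDIA tool): reads a key listing on stdin and
# succeeds only if at least one line mentions Mellanox.
has_mellanox_key() {
    grep -q 'Mellanox'
}

# Example use on a real system (requires rpm; left commented out here):
#   rpm -q gpg-pubkey --qf '%{NAME}-%{VERSION}-%{RELEASE}\t%{SUMMARY}\n' \
#       | has_mellanox_key && echo "key imported" || echo "key missing"
```

A non-zero exit status from the helper means the key is not in the RPM database, so the import step should be repeated before continuing.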
      

Installing DOCA-OFED on Systems with ConnectX-7 Cards or BlueField-3 in NIC Mode

If your system is equipped with an NVIDIA® BlueField®-3 DPU, ensure that the DPU is set to NIC mode (refer to NIC Mode for BlueField-3) before proceeding with the following instructions.

For more information about installing the DOCA drivers and tools, refer to DOCA Installation Guide for Linux.

  1. Install the DOCA-Host software on the host as outlined in the steps below.

  2. To prepare for installing the DOCA-Host package with the doca-ofed profile, set up an online installation of DOCA. (Alternatively, to install from a downloaded DOCA RPM package, follow the instructions in NVIDIA DOCA LTS Downloads.)

    1. Create the DOCA repository file, move it into place, and set its ownership:

      echo "[doca]
      name=DOCA Online Repo
      baseurl=https://linux.mellanox.com/public/repo/doca/DGX_latest_DOCA/rhel9/x86_64/
      enabled=1
      gpgcheck=0" > /tmp/doca.repo
      
      sudo mv /tmp/doca.repo /etc/yum.repos.d/doca.repo
      
      sudo chown root:root /etc/yum.repos.d/doca.repo
      
  3. Clean the cached repository data and perform an update.

    Note

    To prevent an undesired upgrade to a newer RHEL minor release and its kernel (for example, from RHEL 8.9 to 8.10, from 9.6 to 9.7, or from 10.0 to 10.1), pin the desired RHEL release with the --set=<release> option of the subscription-manager release command.

    For example, to stay on the RHEL 9.6 release:

    sudo subscription-manager release --set=9.6
    

    Check the Release Notes section for GPU driver and Linux kernel support before changing the --set=<release> setting and running sudo dnf update --nobest.

    sudo dnf clean all -y
    
    sudo dnf update --nobest
    
    sudo dnf makecache -y
    
  4. Install the kernel-modules-extra package.

    sudo dnf install -y kernel-modules-extra-$(uname -r)
    
  5. Determine whether the kernel version on your host is supported, as listed in Supported Host OS per DOCA-Host Installation Profile.

    If the kernel version on your host is not supported, follow the instructions in DOCA Extra Package and doca-kernel-support.

  6. Install the doca-ofed profile.

    sudo dnf install -y doca-ofed
    
  7. Install nvidia-mlnx-config.

    sudo dnf install -y nvidia-mlnx-config
    
  8. If your system includes a BlueField-3 DPU, complete the following steps:

    1. Determine the BlueField-3 device ID in one of the following two ways:

      mst start
      mst status -v
      
      /opt/mellanox/doca/tools/doca-info
      

      For more information, refer to DOCA Installation Guide for Linux.

    2. Use the RShim driver to manage and flash the BlueField-3 DPU.

      • Refer to BF-Bundle Installation and Upgrade for more information about the RShim driver. (The RShim driver is currently installed when doca-ofed is installed. With older DOCA releases, it may have been necessary to install the RShim driver separately.)

    3. Start RShim:

      sudo systemctl daemon-reload
      
      sudo systemctl enable rshim
      
      sudo systemctl start rshim
      
      sudo systemctl status rshim
      

      Note

      If the output contains “Failed to start rshim driver,” start RShim manually as follows:

      sudo /usr/sbin/rshim
      
      • After a reboot, RShim must be started manually again with the same command.
      
    4. Confirm that the NVIDIA BlueField-3 SoC Management Interface is present on the system by printing the PCI BDF for the BlueField-3 SoC Management Interface devices:

      sudo lspci | grep "BlueField-3 SoC Management Interface"
      

      The output should look similar to the following:

      29:00.2 DMA controller: Mellanox Technologies MT43244 BlueField-3 SoC Management Interface (rev 01)
      aa:00.2 DMA controller: Mellanox Technologies MT43244 BlueField-3 SoC Management Interface (rev 01)
      
    5. If the BlueField-3 SoC Management Interface is present on the system, install the BF-Bundle:

      sudo bfb-install --rshim rshim<N> --bfb <image_path.bfb>
      

      Where <N> is the RShim device identifier in /dev/rshim<N>.

  9. The online repo contains the mlnx_fw_updater tool, which can update the firmware on ConnectX-7 and BlueField cards. Installing doca-ofed also installs the doca-host package, which provides a repo from which mlnx-fw-updater can be installed. To update the firmware on ConnectX-7 or BlueField-3 cards in your system, install mlnx-fw-updater as follows:

    sudo dnf install mlnx-fw-updater
    
  10. Re-create an initramfs image.

    sudo dracut -f
    
  11. Reboot the system.

    sudo systemctl reboot
    
  12. Optionally, install the mlnxofed-docs documentation:

    sudo dnf install mlnxofed-docs
    
  13. Register your new Red Hat Enterprise Linux system to the Customer Portal using Red Hat Subscription-Manager if you have not already done so.

    For more information, refer to How to register and subscribe a RHEL system to the Red Hat Customer Portal using Red Hat Subscription-Manager?.
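As an alternative to the echo, mv, and chown sequence in step 2, the repository definition can be generated by a small function and written in a single step. This is a sketch under stated assumptions: `render_doca_repo` is a hypothetical helper name, not part of DOCA, and the repository fields are copied unchanged from the procedure above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of DOCA): prints a dnf repository definition
# for the base URL given as the first argument.
render_doca_repo() {
    local baseurl="$1"
    cat <<EOF
[doca]
name=DOCA Online Repo
baseurl=${baseurl}
enabled=1
gpgcheck=0
EOF
}

# Example use (left commented out because it writes to /etc and needs root):
#   render_doca_repo "https://linux.mellanox.com/public/repo/doca/DGX_latest_DOCA/rhel9/x86_64/" \
#       | sudo tee /etc/yum.repos.d/doca.repo > /dev/null
```

Because `tee` runs under sudo, the file is created directly as root, so the separate move and ownership change are not needed in this variant.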

Additional Information

Installing the nvidia-peermem-loader Package

The nvidia-peermem kernel module registers the NVIDIA GPU with the InfiniBand subsystem by using peer-to-peer APIs provided by the NVIDIA GPU driver. This module, originally maintained by Mellanox on GitHub, is now included with the NVIDIA Linux GPU driver. For more information, refer to Using nvidia-peermem in the NVIDIA GPUDirect RDMA documentation.

No service automatically loads the nvidia-peermem module. To load the module automatically at boot, install the NVIDIA peermem loader package (nvidia-peermem-loader).

sudo dnf install nvidia-peermem-loader

This package lists the nvidia-peermem module in /etc/modules-load.d/nvidia-peermem.conf so that the module is loaded at boot.
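To confirm that the loader package is in place, the configuration file can be checked for the module name. The following is a minimal sketch: `module_listed` is a hypothetical helper name, and it assumes the usual modules-load.d layout of one module name per line.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the package): succeeds if <conf_file> is
# readable and contains <module> on a line of its own.
module_listed() {
    local module="$1" conf_file="$2"
    [ -r "$conf_file" ] && grep -qx "[[:space:]]*${module}[[:space:]]*" "$conf_file"
}

# Example use on an installed system (left commented out):
#   module_listed nvidia-peermem /etc/modules-load.d/nvidia-peermem.conf \
#       && echo "nvidia-peermem is configured to load at boot"
```

The check only inspects the configuration file. Whether the module is actually loaded after a reboot can be checked separately, for example with lsmod.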