Installing NVIDIA DOCA-OFED

The NVIDIA DGX™ Software Stack for Red Hat Enterprise Linux does not include the NVIDIA DOCA™ OFED (OpenFabrics Enterprise Distribution) software for Linux. This is to ensure that the DOCA-OFED software, a subset of the full DOCA package, is in sync with the Red Hat distribution kernel. This topic describes how to download, install, and upgrade the DOCA-OFED software on systems that are running Red Hat Enterprise Linux.

DOCA-Host Installation Profiles

The DOCA software package contains several subsets called the DOCA-Host installation profiles, which are fully validated and tested installation packages. The following table lists the available DOCA-Host profiles:

DOCA-Host Profile    Description

doca-ofed            Allows you to install the same drivers and tools of MLNX_OFED using the DOCA-Host package, but without other DOCA functionality.
doca-network         Intended for users who want to use only the networking functionality of the DOCA-Host package.
doca-all             Intended for users who want to use the full extent of DOCA drivers and libraries, the full DOCA-Host installation.

For more information, refer to DOCA Profiles.

Installing DOCA-OFED on Systems with ConnectX-7 Cards or BlueField-3 Cards in NIC Mode

If your system is equipped with the NVIDIA® BlueField®-3 DPU, ensure that the DPU is set in NIC mode. (See NIC Mode for BlueField-3, Identifying Which Mode BlueField is Currently Operating In, and Changing BlueField Mode for more information.)

Follow the instructions below to install DOCA. (For more information concerning installing the DOCA drivers and tools, refer to DOCA Installation Guide for Linux.)

  1. Install the doca-ofed profile:

    sudo dnf install -y doca-ofed
    
  2. Install the kernel-modules-extra package for the running kernel:

    sudo dnf install -y kernel-modules-extra-$(uname -r)
    
  3. If your system includes a BlueField-3 DPU, complete the following substeps:

    1. Determine the BlueField-3 device ID using one of the following methods:

      Method 1: As described in the NVIDIA BlueField-3 Networking Platform User Guide, the device ID of all BlueField-3 DPUs is 41692 (0xA2DC). To list all BlueField-3 devices, run the following command:

      lspci -d :a2dc
      

      The output should look similar to the following:

      0006:03:00.0 Ethernet controller: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller (rev 01)
      0006:03:00.1 Ethernet controller: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller (rev 01)
      0016:03:00.0 Ethernet controller: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller (rev 01)
      0016:03:00.1 Ethernet controller: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller (rev 01)
      

      Method 2: Start the MST service and list the devices it detects:

      sudo mst start
      sudo mst status -v
      

      Method 3: Run the DOCA doca-info tool:

      /opt/mellanox/doca/tools/doca-info
      

      For more information, refer to DOCA Installation Guide for Linux.

    2. Use the RShim driver to manage and flash the BlueField-3 DPU.

      • Refer to Installing Software on BlueField Using BF-Bundle for more information about the RShim driver. (The RShim driver is currently installed when doca-ofed is installed. With older DOCA releases, it may have been necessary to install the RShim driver separately.)

      • Enable and start RShim, then check its status:

      sudo systemctl daemon-reload
      sudo systemctl enable rshim
      sudo systemctl start rshim
      sudo systemctl status rshim
      

      Note

      If the output contains “Failed to start rshim driver,” start RShim manually:

      sudo /usr/sbin/rshim

      • After a reboot, RShim must be started manually again with the same command.
      
    3. Confirm that the NVIDIA BlueField-3 SoC Management Interface is present by listing the PCI BDF (bus:device.function) addresses of the BlueField-3 SoC Management Interface devices:

      sudo lspci | grep "BlueField-3 SoC Management Interface"
      

      The output should look similar to the following:

      29:00.2 DMA controller: Mellanox Technologies MT43244 BlueField-3 SoC Management Interface (rev 01)
      aa:00.2 DMA controller: Mellanox Technologies MT43244 BlueField-3 SoC Management Interface (rev 01)
      
    4. If the BlueField-3 SoC Management Interface is present, install the BF-Bundle:

      sudo bfb-install --rshim rshim<N> --bfb <image_path.bfb>
      

      Where <N> is the RShim device identifier in /dev/rshim<N>.

  4. If desired, install the nvidia-mlnx-config package. See Install nvidia-mlnx-config Package For DOCA Performance Improvement for more information.

  5. The online repo contains the mlnx_fw_updater tool, which can update the firmware on ConnectX-7 and BlueField cards. Installing doca-ofed also installs the doca-host package, which provides the repo from which mlnx-fw-updater can be installed. To update the firmware on ConnectX-7 or BlueField-3 cards in your system, install mlnx-fw-updater as follows:

    sudo dnf install mlnx-fw-updater
    
  6. Re-create the initramfs image:

    sudo dracut -f
    
  7. Reboot the system.

    sudo systemctl reboot
    
  8. The mlnxofed-docs documentation can be installed as follows:

    sudo dnf install mlnxofed-docs
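
The main steps above (1, 2, 5, 6, and 7) can be sketched as a small shell helper that only prints the commands in order, so they can be reviewed before anything is run. This is a minimal sketch, not an NVIDIA-provided tool: the function name doca_install_plan and its two arguments are hypothetical, and the BlueField-3 substeps of step 3 are deliberately left out because they depend on the hardware present.

```shell
# Sketch only: print the install commands from the steps above, in order.
# doca_install_plan is a hypothetical helper name, not part of DOCA.
#   $1: kernel release (defaults to the running kernel, as in step 2)
#   $2: "yes" to include the optional mlnx-fw-updater package (step 5)
doca_install_plan() {
    local kernel="${1:-$(uname -r)}"
    local with_fw="${2:-no}"

    echo "sudo dnf install -y doca-ofed"                       # step 1
    echo "sudo dnf install -y kernel-modules-extra-${kernel}"  # step 2
    if [ "$with_fw" = "yes" ]; then
        echo "sudo dnf install mlnx-fw-updater"                # step 5
    fi
    echo "sudo dracut -f"                                      # step 6
    echo "sudo systemctl reboot"                               # step 7
}
```

Calling doca_install_plan with no arguments prints the plan for the running kernel. Nothing is executed, so the output can be checked against the steps above and then run by hand.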
    

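For the BlueField-3 substeps of step 3, the check for the SoC Management Interface can be turned into a filter over lspci output. The sketch below assumes the output has the same shape as the sample shown in that step (PCI address in the first field). The helper names bf3_mgmt_bdfs and bf3_mgmt_present are hypothetical.

```shell
# Sketch only: helpers that read `lspci` output on standard input.
# Both function names are hypothetical and not part of DOCA.

# Print the PCI address (first field) of every BlueField-3
# SoC Management Interface line.
bf3_mgmt_bdfs() {
    awk '/BlueField-3 SoC Management Interface/ { print $1 }'
}

# Succeed (exit status 0) only if at least one such device is listed.
bf3_mgmt_present() {
    [ -n "$(bf3_mgmt_bdfs)" ]
}

# Possible usage on a real system:
#   sudo lspci | bf3_mgmt_bdfs
#   sudo lspci | bf3_mgmt_present && echo "SoC Management Interface found"
```

If bf3_mgmt_present fails, the BF-Bundle installation in the last substep does not apply to the system.
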
Additional Information

Install nvidia-mlnx-config Package For DOCA Performance Improvement

The nvidia-mlnx-config package, which is included in the nvidia-driver-local-repo, can be installed to improve performance on systems where DOCA is installed. The package does the following:

  • Runs the setpci command to set MaxReadReq (MRRS) to the optimum performance setting.

  • On Ampere platforms (DGX A100, DGX A800), sets the MAX_ACC_OUT_READ PCI parameter to the value that the firmware needs to configure the optimum performance setting. On other platform types, MAX_ACC_OUT_READ does not need to be set, because the firmware configures the optimum performance setting without it being modified.

Install the nvidia-mlnx-config package as follows:

sudo dnf install nvidia-mlnx-config

A reboot is required for the new settings to take effect.
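
After the reboot, one way to see the MaxReadReq value currently in effect on a device is to read it back from verbose lspci output. This is a minimal sketch that assumes the device control line has the usual form "MaxPayload N bytes, MaxReadReq N bytes". The helper name max_read_req is hypothetical, and the expected value is not stated here because it depends on the platform.

```shell
# Sketch only: extract the MaxReadReq value (in bytes) from
# `lspci -vv` output supplied on standard input.
# max_read_req is a hypothetical helper name.
max_read_req() {
    sed -n 's/.*MaxReadReq \([0-9][0-9]*\) bytes.*/\1/p'
}

# Possible usage on a real system, where <BDF> is the PCI address
# of the network device to inspect:
#   sudo lspci -vv -s <BDF> | max_read_req
```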