Installing NVIDIA DOCA-OFED#
The NVIDIA DGX™ Software Stack for Red Hat Enterprise Linux does not include the NVIDIA DOCA™ OFED (OpenFabrics Enterprise Distribution) software for Linux. This is to ensure that the DOCA-OFED software, a subset of the full DOCA package, is in sync with the Red Hat distribution kernel. This topic describes how to download, install, and upgrade the DOCA-OFED software on systems that are running Red Hat Enterprise Linux.
DOCA-Host Installation Profiles#
The DOCA software package contains several subsets called the DOCA-Host installation profiles, which are fully validated and tested installation packages. The following table lists the available DOCA-Host profiles:
| DOCA-Host Profile | Description |
|---|---|
| doca-ofed | Allows you to install the same drivers and tools as MLNX_OFED using the DOCA-Host package, but without other DOCA functionality. |
| doca-network | Intended for users who want to use only the networking functionality of the DOCA-Host package. |
| doca-all | Intended for users who want the full DOCA-Host installation, with all DOCA drivers and libraries. |
For more information, refer to DOCA Profiles.
Installing DOCA-OFED on Systems with ConnectX-7 Cards or BlueField-3 Cards in NIC Mode#
If your system is equipped with the NVIDIA® BlueField®-3 DPU, ensure that the DPU is set in NIC mode. (See NIC Mode for BlueField-3, Identifying Which Mode BlueField is Currently Operating In, and Changing BlueField Mode for more information.)
Follow the instructions below to install DOCA. (For more information concerning installing the DOCA drivers and tools, refer to DOCA Installation Guide for Linux.)
Install DOCA:
sudo dnf install -y doca-ofed
Install the kernel-modules-extra package that matches the running kernel:
sudo dnf install -y kernel-modules-extra-$(uname -r)
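The package name in the command above is built from the running kernel release, so it only resolves when the repository carries a build for that exact kernel. The following is a minimal sketch of a pre-check, not part of the official procedure. The helper names `kmod_extra_pkg` and `kmod_extra_installed` are hypothetical, and the check assumes the standard `rpm` query tool is available.

```shell
#!/bin/sh
# Hypothetical helper: build the kernel-modules-extra package name
# for a given kernel release (defaults to the running kernel).
kmod_extra_pkg() {
    release="${1:-$(uname -r)}"
    printf 'kernel-modules-extra-%s\n' "$release"
}

# Hypothetical helper: succeed only when the matching package is installed.
# rpm -q exits non-zero when the package is not installed.
kmod_extra_installed() {
    rpm -q "$(kmod_extra_pkg "$1")" >/dev/null 2>&1
}
```

For example, `kmod_extra_installed || sudo dnf install -y "$(kmod_extra_pkg)"` would run the installation only when the package is missing.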
If your system includes a BlueField-3 DPU, complete the following steps:
Determine the BlueField-3 device ID using one of the following methods:
Method 1: As described in the NVIDIA BlueField-3 Networking Platform User Guide, the device ID of all [BlueField] DPUs is 41692 [0xA2DC]. To see all BlueField devices, run the following command:
lspci -d :a2dc
The output should look similar to the following:
0006:03:00.0 Ethernet controller: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller (rev 01)
0006:03:00.1 Ethernet controller: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller (rev 01)
0016:03:00.0 Ethernet controller: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller (rev 01)
0016:03:00.1 Ethernet controller: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller (rev 01)
Method 2: Run mst start and mst status -v:
mst start
mst status -v
Method 3: Run the DOCA doca-info tool:
/opt/mellanox/doca/tools/doca-info
For more information, refer to DOCA Installation Guide for Linux.
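Method 1 depends on the decimal device ID 41692 being the same value as hexadecimal a2dc, and on reading the PCI address from the first column of the lspci output. The following sketch shows both steps. The helper names `bf_device_hex` and `pci_addresses` are hypothetical and are introduced only for illustration.

```shell
#!/bin/sh
# Convert a decimal device ID to the hex form that lspci -d expects.
# printf '%x' prints lowercase hex without a 0x prefix.
bf_device_hex() {
    printf '%x\n' "$1"
}

# Hypothetical helper: read lspci output on stdin and print only the
# PCI address (the first whitespace-separated field) of each line.
pci_addresses() {
    awk 'NF { print $1 }'
}
```

On a system with BlueField devices, `lspci -d ":$(bf_device_hex 41692)" | pci_addresses` would then list one PCI address per device function.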
If your system includes a BlueField-3 DPU, use the RShim driver to manage and flash the BlueField-3 DPU.
Refer to Installing Software on BlueField Using BF-Bundle for more information about the RShim driver. (The RShim driver is currently installed when doca-ofed is installed. With older DOCA releases, it may have been necessary to install the RShim driver separately.)
Start RShim:
sudo systemctl daemon-reload
sudo systemctl enable rshim
sudo systemctl start rshim
sudo systemctl status rshim
Note
If the output contains “Failed to start rshim driver,” then RShim can be started manually as follows:
sudo /usr/sbin/rshim
After a reboot, RShim will need to be started manually again the same way:
sudo /usr/sbin/rshim
If your system includes a BlueField-3 DPU, confirm that the NVIDIA BlueField-3 SoC Management Interface is on the system by printing the PCI BDF for the BlueField-3 SoC Management Interface devices:
sudo lspci | grep "BlueField-3 SoC Management Interface"
The output should look similar to the following:
29:00.2 DMA controller: Mellanox Technologies MT43244 BlueField-3 SoC Management Interface (rev 01)
aa:00.2 DMA controller: Mellanox Technologies MT43244 BlueField-3 SoC Management Interface (rev 01)
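Because the next step applies only when the management interface is present, a script may want a check that fails cleanly when no such device is listed. The following is a minimal sketch under that assumption. The helper name `bf3_mgmt_bdfs` is hypothetical, and it matches on the same description string used in the grep command above.

```shell
#!/bin/sh
# Hypothetical helper: read lspci output on stdin, print the PCI BDF of
# every BlueField-3 SoC Management Interface device, and return non-zero
# when no such device is found.
bf3_mgmt_bdfs() {
    awk '/BlueField-3 SoC Management Interface/ { print $1; found = 1 }
         END { exit !found }'
}
```

A possible use is `sudo lspci | bf3_mgmt_bdfs || echo "No SoC Management Interface found"`, which prints the BDFs when devices exist and the message otherwise.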
If your system includes a BlueField-3 DPU and the BlueField-3 SoC Management Interface is on the system, install the BF-bundle:
sudo bfb-install --rshim rshim<N> --bfb <image_path.bfb>
Where <N> is the RShim device identifier in /dev/rshim<N>.
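To find the values of the device identifier that exist on a given system, the RShim entries under /dev can be enumerated. The following sketch assumes that each device appears as an entry named rshim followed by a number, as the description above implies. The helper name `list_rshim_devices` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the name (rshim0, rshim1, ...) of every RShim
# device entry found under a directory, /dev by default.
list_rshim_devices() {
    dev_dir="${1:-/dev}"
    for path in "$dev_dir"/rshim[0-9]*; do
        # Skip the unexpanded pattern when nothing matches.
        [ -e "$path" ] || continue
        basename "$path"
    done
}
```

Each printed name can be passed as the value of the --rshim option in the bfb-install command.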
If desired, install the nvidia-mlnx-config package. See Install nvidia-mlnx-config Package For DOCA Performance Improvement for more information.
The online repo contains the mlnx_fw_updater tool that can be used to update the firmware on ConnectX-7 and BlueField cards. Installing doca-ofed also installs the doca-host package, which provides a repo from which mlnx-fw-updater can be installed. If you want to update firmware on ConnectX-7 or BlueField-3 cards in your system, install mlnx-fw-updater as follows:
sudo dnf install mlnx-fw-updater
Re-create an initramfs image.
sudo dracut -f
Reboot the system.
sudo systemctl reboot
The mlnxofed-docs documentation can be installed as follows:
sudo dnf install mlnxofed-docs
Additional Information
MFT download instructions: Updating Firmware for a Single Network Interface Card (NIC)
Changing BlueField-3 BMC default password: Changing Default Password
Install nvidia-mlnx-config Package For DOCA Performance Improvement#
The nvidia-mlnx-config package that is included in the nvidia-driver-local-repo
can be installed to provide better performance on systems where DOCA is installed.
This package does the following two things:
The setpci command is run to set MaxReadReq (MRRS) to an optimum performance setting.
On Ampere platforms (DGX A100, DGX A800), the MAX_ACC_OUT_READ PCI parameter is set to the correct value for the firmware to be able to configure the optimum performance setting. It isn’t necessary to set the MAX_ACC_OUT_READ PCI parameter on other platform types, since the firmware configures the optimum performance setting without MAX_ACC_OUT_READ being modified.
Install the nvidia-mlnx-config package as follows:
sudo dnf install nvidia-mlnx-config
A reboot is required to incorporate these new settings.
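After the reboot, one way to inspect the MRRS value currently in effect is to read it from the verbose lspci output, where the device control section reports it as "MaxReadReq <n> bytes". The following is a minimal sketch. The helper name `max_read_req_bytes` is hypothetical, and the value that the package considers optimal is not stated here, so the sketch only reports what is configured.

```shell
#!/bin/sh
# Hypothetical helper: read `lspci -vv` output for one device on stdin and
# print the MaxReadReq value in bytes. Prints nothing when the field is
# absent, for example when lspci is run without sufficient privileges.
max_read_req_bytes() {
    sed -n 's/.*MaxReadReq \([0-9][0-9]*\) bytes.*/\1/p'
}
```

For example, `sudo lspci -s <BDF> -vv | max_read_req_bytes`, where <BDF> is the PCI address of the network device, prints the configured size in bytes.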