Installing NVIDIA DOCA-OFED#
The NVIDIA DGX™ Software Stack for Red Hat Enterprise Linux does not include the NVIDIA DOCA™ OFED (OpenFabrics Enterprise Distribution) software for Linux. This is to ensure that the DOCA-OFED software, a subset of the full DOCA package, is in sync with the Red Hat distribution kernel. This topic describes how to download, install, and upgrade the DOCA-OFED software on systems that are running Red Hat Enterprise Linux.
DOCA-Host Installation Profiles#
The DOCA software package contains several subsets called the DOCA-Host installation profiles, which are fully validated and tested installation packages. The following table lists the available DOCA-Host profiles:
| DOCA-Host Profile | Description |
|---|---|
| doca-ofed | Installs the same drivers and tools as MLNX_OFED using the DOCA-Host package, but without other DOCA functionality. |
| doca-network | Intended for users who want to use only the networking functionality of the DOCA-Host package. |
| doca-all | Intended for users who want the full DOCA-Host installation, with all DOCA drivers and libraries. |
For more information, refer to NVIDIA DOCA Profiles.
Prerequisites#
Download and install the NVIDIA RPM GPG key.
Download the NVIDIA RPM-GPG-KEY-Mellanox-SHA256 key.
sudo wget http://www.mellanox.com/downloads/ofed/RPM-GPG-KEY-Mellanox-SHA256
Install the key.
sudo rpm --import RPM-GPG-KEY-Mellanox-SHA256
Verify that the key was successfully imported.
sudo rpm -q gpg-pubkey --qf '%{NAME}-%{VERSION}-%{RELEASE}\t%{SUMMARY}\n' | grep Mellanox
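The import and verification steps above can be combined so that the key is imported only when it is missing. The following is a minimal sketch, not part of the NVIDIA procedure: the function names `has_mellanox_key` and `import_mellanox_key_if_missing` are hypothetical, and the check only filters the text printed by the `rpm -q gpg-pubkey` query shown above.

```shell
#!/bin/sh
# Hypothetical helper: reads the output of the `rpm -q gpg-pubkey --qf ...`
# query on stdin and succeeds if any imported key mentions Mellanox.
has_mellanox_key() {
    grep -q 'Mellanox'
}

# Hypothetical helper: imports the key file given as the first argument
# only when no Mellanox key is present. Requires rpm and sudo on the host.
import_mellanox_key_if_missing() {
    key_file=$1
    if rpm -q gpg-pubkey --qf '%{NAME}-%{VERSION}-%{RELEASE}\t%{SUMMARY}\n' | has_mellanox_key; then
        echo "Mellanox key already imported"
    else
        sudo rpm --import "$key_file"
    fi
}
```

On a RHEL host, a call such as `import_mellanox_key_if_missing RPM-GPG-KEY-Mellanox-SHA256` would then be safe to repeat.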
Installing DOCA-OFED on Systems with ConnectX-7 Cards or BlueField-3 in NIC Mode#
If your system is equipped with the NVIDIA® BlueField®-3 DPU, ensure that the DPU is set to NIC mode (refer to NIC Mode for BlueField-3) and then proceed with the following instructions.
For more information about installing the DOCA drivers and tools, refer to DOCA Installation Guide for Linux.
Install the DOCA-Host software on the host as outlined in the steps below.
To prepare for the installation of the DOCA-Host package using the doca-ofed profile, set up an online installation of DOCA. (Alternatively, to install by downloading a DOCA RPM package, follow instructions found here: NVIDIA DOCA LTS Downloads.)
Set up online installation of DOCA as follows:
sudo echo "[doca]
name=DOCA Online Repo
baseurl=https://linux.mellanox.com/public/repo/doca/DGX_latest_DOCA/rhel9/x86_64/
enabled=1
gpgcheck=0" > /tmp/doca.repo
sudo mv /tmp/doca.repo /etc/yum.repos.d/doca.repo
sudo chown root:root /etc/yum.repos.d/doca.repo
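A repository file needs each key on its own line, so it can help to write the file with a here-document and check it before moving it into place. The sketch below is one way to do that; the function names `write_doca_repo` and `check_doca_repo` are hypothetical, and the file contents are copied from the command above.

```shell
#!/bin/sh
# Hypothetical helper: writes the DOCA repository definition to the path
# given as the first argument. The contents mirror the echo command above.
write_doca_repo() {
    cat > "$1" <<'EOF'
[doca]
name=DOCA Online Repo
baseurl=https://linux.mellanox.com/public/repo/doca/DGX_latest_DOCA/rhel9/x86_64/
enabled=1
gpgcheck=0
EOF
}

# Hypothetical helper: succeeds only if the file starts with the [doca]
# section header and contains exactly one baseurl entry.
check_doca_repo() {
    [ "$(head -n 1 "$1")" = "[doca]" ] &&
        [ "$(grep -c '^baseurl=' "$1")" -eq 1 ]
}
```

For example, `write_doca_repo /tmp/doca.repo && check_doca_repo /tmp/doca.repo` could precede the `mv` and `chown` commands. As an alternative to those two commands, `sudo install -o root -g root -m 0644 /tmp/doca.repo /etc/yum.repos.d/doca.repo` copies the file and sets its ownership in one step; the 0644 mode is an assumption, not something the procedure specifies.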
Clean up temporary repository files and perform an update.
Note
To prevent undesired upgrade of the Linux kernel, such as from RHEL 8.9 to RHEL 8.10, RHEL 9.6 to RHEL 9.7, or RHEL 10.0 to RHEL 10.1, you should pin the desired RHEL release by setting the --set=<release> option of the subscription-manager release command.
For example, to stay on the RHEL 9.6 release:
subscription-manager release --set=9.6
You should check the Release Notes section for GPU driver and Linux kernel support before changing the --set=<release> setting and performing sudo dnf update --nobest.
sudo dnf clean all -y
sudo dnf update --nobest
sudo dnf makecache -y
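Because the update can move the system to a newer minor release when no release is pinned, it may be worth confirming the installed release before and after running it. The following sketch reads `VERSION_ID` from `/etc/os-release`; the function names `rhel_version_id` and `on_expected_release` are hypothetical, and 9.6 is only the example release used in the note above.

```shell
#!/bin/sh
# Hypothetical helper: prints VERSION_ID from an os-release file
# (default /etc/os-release). The file is sourced in a subshell so that
# its variables do not leak into the calling shell.
rhel_version_id() {
    (
        . "${1:-/etc/os-release}"
        printf '%s\n' "$VERSION_ID"
    )
}

# Hypothetical helper: succeeds if the installed release equals the
# expected one (first argument), and warns on stderr otherwise.
on_expected_release() {
    expected=$1
    actual=$(rhel_version_id "${2:-/etc/os-release}")
    if [ "$actual" = "$expected" ]; then
        echo "OK: running release $actual"
    else
        echo "WARNING: expected $expected but found $actual" >&2
        return 1
    fi
}
```

Running `on_expected_release 9.6` after `sudo dnf update --nobest` returns a non-zero status if the system is no longer on the pinned release.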
Install the kernel-modules-extra package.
sudo dnf install -y kernel-modules-extra-$(uname -r)
Determine if the kernel version on your host is supported as shown in Supported Host OS per DOCA-Host Installation Profile.
If the kernel version on your host is not supported, follow the instructions described in DOCA Extra Package and doca-kernel-support.
Run the dnf install command below to install the doca-ofed profile.
sudo dnf install -y doca-ofed
Install nvidia-mlnx-config.
sudo dnf install -y nvidia-mlnx-config
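After the install commands above, a quick query can confirm that the expected packages are present. This is a minimal sketch: the function name `missing_packages` and the `QUERY` override are hypothetical, and the check relies only on the exit status of `rpm -q`.

```shell
#!/bin/sh
# Hypothetical helper: prints the name of every package that the query
# command reports as not installed. QUERY defaults to `rpm -q` and can be
# overridden so that the logic can be exercised on a machine without rpm.
missing_packages() {
    for pkg in "$@"; do
        ${QUERY:-rpm -q} "$pkg" >/dev/null 2>&1 || printf '%s\n' "$pkg"
    done
}
```

For example, `missing_packages doca-ofed nvidia-mlnx-config "kernel-modules-extra-$(uname -r)"` prints nothing when all three packages are installed.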
Do the following steps if your system includes a BlueField-3 DPU:
If your system includes a BlueField-3 DPU, determine the BlueField-3 device ID in one of the following two ways:
Using mst:
mst start
mst status -v
Using doca-info:
/opt/mellanox/doca/tools/doca-info
For more information, refer to DOCA Installation Guide for Linux.
If your system includes a BlueField-3 DPU, use the RShim driver to manage and flash the BlueField-3 DPU.
Refer to BF-Bundle Installation and Upgrade for more information about the RShim driver. (The RShim driver is currently installed when doca-ofed is installed. With older DOCA releases, it may have been necessary to install the RShim driver separately.)
If your system includes a BlueField-3 DPU, start RShim:
sudo systemctl daemon-reload
sudo systemctl enable rshim
sudo systemctl start rshim
sudo systemctl status rshim
Note
If the output contains “Failed to start rshim driver,” then RShim can be started manually as follows:
sudo /usr/sbin/rshim
After a reboot, RShim will need to be started manually again the same way:
sudo /usr/sbin/rshim
If your system includes a BlueField-3 DPU, confirm that the NVIDIA BlueField-3 SoC Management Interface is on the system by printing the PCI BDF for the BlueField-3 SoC Management Interface devices:
sudo lspci | grep "BlueField-3 SoC Management Interface"
The output should look similar to the following:
29:00.2 DMA controller: Mellanox Technologies MT43244 BlueField-3 SoC Management Interface (rev 01)
aa:00.2 DMA controller: Mellanox Technologies MT43244 BlueField-3 SoC Management Interface (rev 01)
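When only the PCI BDF values are needed, for example in a script, the first field of each matching line can be extracted. The sketch below assumes the `lspci` line layout shown in the sample output above; the function name `bf3_mgmt_bdfs` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads `lspci` output on stdin and prints the PCI BDF
# (the first whitespace-separated field) of every line that describes a
# BlueField-3 SoC Management Interface device.
bf3_mgmt_bdfs() {
    awk '/BlueField-3 SoC Management Interface/ { print $1 }'
}
```

A typical call would be `sudo lspci | bf3_mgmt_bdfs`. An empty result means that no management interface was found.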
If your system includes a BlueField-3 DPU and the BlueField-3 SoC Management Interface is on the system, install the BF-bundle:
sudo bfb-install --rshim rshim<N> --bfb <image_path.bfb>
Where <N> is the RShim device identifier in /dev/rshim<N>.
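To find the value of `<N>` on a system with more than one BlueField-3 DPU, the RShim entries under `/dev` can be listed. The following sketch only prints the `bfb-install` commands rather than running them; the function names `list_rshim_devices` and `print_bfb_install_commands` are hypothetical, and the image path is whatever BFB file you downloaded.

```shell
#!/bin/sh
# Hypothetical helper: prints the name (rshim0, rshim1, ...) of every
# RShim entry under the given directory (default /dev).
list_rshim_devices() {
    dev_dir=${1:-/dev}
    for path in "$dev_dir"/rshim*; do
        # When nothing matches, the glob stays literal; skip it.
        [ -e "$path" ] || continue
        basename "$path"
    done
}

# Hypothetical helper: prints, without running it, one bfb-install
# command per RShim device for the image given as the first argument.
print_bfb_install_commands() {
    image=$1
    list_rshim_devices "${2:-/dev}" | while read -r dev; do
        printf 'sudo bfb-install --rshim %s --bfb %s\n' "$dev" "$image"
    done
}
```

Printing the commands first gives a chance to review which devices will be flashed before running any of them.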
The online repo contains the mlnx_fw_updater tool, which can be used to update the firmware on ConnectX-7 and BlueField cards. The installation of doca-ofed installs the doca-host package, which provides a repo so that mlnx-fw-updater can be installed. If you want to update firmware on ConnectX-7 or BlueField-3 cards in your system, install mlnx-fw-updater as follows:
sudo dnf install mlnx-fw-updater
Re-create an initramfs image.
sudo dracut -f
Reboot the system.
sudo systemctl reboot
The mlnxofed-docs documentation can be installed as follows:
sudo dnf install mlnxofed-docs
Register your new Red Hat Enterprise Linux system to the Customer Portal using Red Hat Subscription-Manager if you have not already done so.
For more information, refer to How to register and subscribe a RHEL system to the Red Hat Customer Portal using Red Hat Subscription-Manager?.
Additional Information
MFT download instructions: Updating Firmware for a Single Network Interface Card (NIC)
Changing BlueField-3 BMC default password: Changing Default Password
Installing the nvidia-peermem-loader Package#
The nvidia-peermem kernel module registers the NVIDIA GPU with the InfiniBand subsystem by using
peer-to-peer APIs provided by the NVIDIA GPU driver. This module, originally maintained by Mellanox
on GitHub, is now included with the NVIDIA Linux GPU driver. For more information,
refer to Using nvidia-peermem
in the NVIDIA GPUDirect RDMA documentation.
No service automatically loads the nvidia-peermem module. To load the module
automatically at boot, install the NVIDIA peermem loader package (nvidia-peermem-loader).
sudo dnf install nvidia-peermem-loader
This package adds an entry for the nvidia-peermem module in /etc/modules-load.d/nvidia-peermem.conf so that the module is loaded at boot.
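After the next reboot, you can confirm that the module was loaded. The sketch below checks the kernel's module list; the function name `module_loaded` is hypothetical. It relies on the fact that the kernel reports module names with underscores, so `nvidia-peermem` appears as `nvidia_peermem`.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the named module appears in the module
# list (default /proc/modules). Hyphens in the argument are converted to
# underscores because that is how the kernel reports module names.
module_loaded() {
    name=$(printf '%s' "$1" | tr '-' '_')
    awk -v m="$name" '$1 == m { found = 1 } END { exit !found }' "${2:-/proc/modules}"
}
```

For example, `module_loaded nvidia-peermem && echo "nvidia-peermem is loaded"` prints the message only when the module is present.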