Installing NVIDIA MLNX_OFED
The DGX software stack for Red Hat Enterprise Linux does not include the NVIDIA MLNX_OFED (OpenFabrics Enterprise Distribution) for Linux. This is to ensure that the MLNX_OFED driver is in sync with the Red Hat distribution kernel. This section describes how to download, install, and upgrade MLNX_OFED on systems that are running Red Hat Enterprise Linux.
Prerequisites
NVIDIA validates each release of NVIDIA DGX Software for Red Hat Enterprise Linux with a specific MLNX_OFED version. Refer to the Release Notes for the recommended MLNX_OFED version to install.
Installing and Configuring MLNX_OFED
This section describes how to install MLNX_OFED on systems that do not yet have it installed. It is imperative that a validated MLNX_OFED version is used for the RHEL version that the DGX system is running.
Important
Running the dnf update command at any point while installing the drivers can update the system to the latest Red Hat Enterprise Linux version.
Determine which version of Red Hat Enterprise Linux is installed on the DGX system.
cat /etc/redhat-release
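When scripting this check, it can help to extract only the version number from the release string so that it can be compared with the Release Notes. The following is a minimal sketch: the function name rhel_version_from is hypothetical, and the sample string only illustrates the format that /etc/redhat-release typically contains.

```shell
# Hypothetical helper: print the version number found in a redhat-release style string.
rhel_version_from() {
    printf '%s\n' "$1" | sed -n 's/.*release \([0-9][0-9]*\(\.[0-9][0-9]*\)*\).*/\1/p'
}

# Illustrative input; on a real system pass "$(cat /etc/redhat-release)" instead.
rhel_version_from "Red Hat Enterprise Linux release 9.2 (Plow)"
```

The function prints nothing when the string contains no recognizable release number, so an empty result can be treated as "unknown" by the calling script.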
After referring to the release notes, download the MLNX_OFED software bundle.
Go to the Linux InfiniBand Drivers page, and scroll down to the MLNX_OFED Download Center matrix.
At the MLNX_OFED Download Center matrix, choose the MLNX_OFED version, OS distribution, OS distribution version, and architecture to show the software package and documentation. For example:
Version: 23.10-1.1.9.0-LTS
OS Distribution: RHEL/CentOS/Rocky
OS Distribution Version: RHEL/Rocky 9.2
Architecture: x86_64
Click the supported ISO or tgz package.
The Mellanox OFED (MLNX_OFED) Software: End-User Agreement page appears.
Accept the End User License Agreement by clicking I Have Read the Above End User License Agreement.
The selected software package starts to download.
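To confirm that the downloaded file matches the selections made in the matrix, a script can build the expected file name and compare it with the file on disk. The sketch below assumes that bundle names follow the pattern MLNX_OFED_LINUX-<version>-rhel<OS version>-<architecture>.<extension>. This pattern and the function name mlnx_ofed_bundle_name are assumptions, not something this guide states, so verify them against the file you actually downloaded.

```shell
# Hypothetical helper: build the bundle file name that the matrix selections are expected to map to.
# The naming pattern is an assumption; always confirm it against the downloaded file.
mlnx_ofed_bundle_name() {
    # $1 = MLNX_OFED version, $2 = RHEL version, $3 = architecture, $4 = extension (iso or tgz)
    printf 'MLNX_OFED_LINUX-%s-rhel%s-%s.%s\n' "$1" "$2" "$3" "$4"
}

# Illustrative values taken from the example selections above.
mlnx_ofed_bundle_name "23.10-1.1.9.0" "9.2" "x86_64" "tgz"
```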
After downloading the correct MLNX_OFED software package, proceed with the installation steps.
If you encounter issues when installing on RHEL 9.2 with the MOFED mlnxofedinstall script, refer to the known issue “MOFED mlnxofedinstall reports ‘Current operation system is not supported’ using RHEL 9.2”.
Go to the MLNX_OFED Software Releases site and select the MLNX_OFED software version that you downloaded.
Click the User Manual link and then navigate to Installation > Installing MLNX_OFED.
Follow the installation instructions.
Note
The system might report that additional software needs to be installed before performing the installation. If such a message appears, install the software and then retry installing the MLNX_OFED driver.
If you intend to use NVIDIA GPUDirect Storage (GDS), enable the driver’s GDS support according to the instructions at MLNX_OFED Requirements and Installation.
Install nvidia-mlnx-config.
sudo dnf install -y nvidia-mlnx-config
Install kernel headers and development packages for your kernel.
These are needed for the ensuing DKMS compilation.
sudo dnf install -y kernel-headers-$(uname -r) kernel-devel-$(uname -r)
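After installing these packages, a quick check that the kernel build directory exists can catch a mismatch between the running kernel and the installed kernel-devel package before DKMS tries to compile. This is a minimal sketch: the function name kernel_build_dir_exists is hypothetical, and it assumes the build tree is reachable at /lib/modules/<kernel version>/build, the same path layout that the update procedure in this document passes to the installer.

```shell
# Hypothetical helper: succeed if the kernel build directory for version $1 exists.
# $2 is an optional modules root, defaulting to /lib/modules.
kernel_build_dir_exists() {
    [ -d "${2:-/lib/modules}/$1/build" ]
}

running_kernel="$(uname -r)"
if kernel_build_dir_exists "$running_kernel"; then
    echo "Kernel build directory found for $running_kernel"
else
    echo "No kernel build directory for $running_kernel; kernel-devel for that kernel may be missing"
fi
```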
After installing the MLNX_OFED drivers, install the NVIDIA peer memory module.
sudo dnf install -y nvidia-peer-memory-dkms
Note
nvidia-peer-memory-dkms version 1.2 or later requires MOFED version 5.4-3.0.3.0 or later. Using a MOFED version that does not meet this requirement results in a build failure of the nv_peer_mem DKMS module. For more information, see: https://github.com/Mellanox/nv_peer_memory/issues/94#
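The version requirement in this note can be checked in a script before installing the peer memory module. The following is a sketch under stated assumptions: the function name mofed_at_least is hypothetical, it relies on the version ordering provided by GNU sort -V, and the version strings are passed in by hand rather than read from the system.

```shell
# Hypothetical helper: succeed if version $1 is at least version $2.
mofed_at_least() {
    # Normalize "5.4-3.0.3.0" to "5.4.3.0.3.0" so every field is dot-separated.
    mofed_have=$(printf '%s' "$1" | tr '-' '.')
    mofed_need=$(printf '%s' "$2" | tr '-' '.')
    # sort -V orders version strings field by field; the smaller one comes first.
    mofed_lowest=$(printf '%s\n%s\n' "$mofed_have" "$mofed_need" | sort -V | head -n 1)
    [ "$mofed_lowest" = "$mofed_need" ]
}

# Illustrative check with the example version used earlier in this section.
if mofed_at_least "23.10-1.1.9.0" "5.4-3.0.3.0"; then
    echo "MOFED version is new enough for nvidia-peer-memory-dkms 1.2 or later"
else
    echo "MOFED version is too old for nvidia-peer-memory-dkms 1.2 or later"
fi
```

On a system where MLNX_OFED is already installed, the installed version is typically reported by ofed_info -s. Extracting the bare version number from that output is left out of this sketch.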
Note
Although in-box drivers might be available, using them is not recommended because they provide lower performance than the official MLNX_OFED drivers and do not support the GPUDirect RDMA feature. For more information on configuring the in-box drivers, see the following Red Hat Enterprise Linux documentation: Configuring InfiniBand and RDMA Networks.
Updating NVIDIA MLNX_OFED
This section describes how to update MOFED on systems that already have it installed. The Mellanox InfiniBand drivers in RPM packages are precompiled for a specific kernel version, so it is imperative that you use the correct MOFED version for the RHEL version that the DGX system has been updated to. You do not need to uninstall the current MOFED first, because the mlnxofedinstall script automatically uninstalls any previously installed versions.
Upgrade the Red Hat Enterprise Linux release and kernel version.
sudo dnf update --nobest
Determine which version of Red Hat Enterprise Linux is installed on the DGX system.
cat /etc/redhat-release
After referring to the release notes, download the MLNX_OFED software bundle.
Go to the Linux InfiniBand Drivers page, and scroll down to the MLNX_OFED Download Center matrix.
At the MLNX_OFED Download Center matrix, choose the MLNX_OFED version, OS distribution, OS distribution version, and architecture to show the software package and documentation. For example:
Version: 23.10-1.1.9.0-LTS
OS Distribution: RHEL/CentOS/Rocky
OS Distribution Version: RHEL/Rocky 9.2
Architecture: x86_64
Click the supported ISO or tgz package.
The Mellanox OFED (MLNX_OFED) Software: End-User Agreement page appears.
Accept the End User License Agreement by clicking I Have Read the Above End User License Agreement.
The selected software package starts to download.
Mount the downloaded ISO on the system.
The following example shows the ISO being mounted on the /mnt directory.
sudo mount MLNX_OFED_LINUX-<version>.iso /mnt
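Before continuing, a script can confirm that the mount succeeded by checking for the installer script in the mount directory. This is a minimal sketch: the function name has_mlnxofedinstall is hypothetical, and it assumes the ISO was mounted on /mnt as in the example above.

```shell
# Hypothetical helper: succeed if the installer script is present and executable under directory $1.
has_mlnxofedinstall() {
    [ -x "$1/mlnxofedinstall" ]
}

if has_mlnxofedinstall /mnt; then
    echo "Installer found at /mnt/mlnxofedinstall"
else
    echo "Installer not found under /mnt; check that the ISO is mounted"
fi
```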
Prepare to install the driver.
Remove nvidia-mlnx-config and nvidia-peer-memory-dkms.
sudo dnf remove -y nvidia-mlnx-config nvidia-peer-memory-dkms
The mlnxofedinstall step removes packages before installing new ones, which causes nvidia-mlnx-config and nvidia-peer-memory-dkms to be removed as well because they depend on some of those packages. Removing these components ahead of time avoids issues. They are reinstalled as a final step.
Specify the new kernel version to use when installing the driver.
NEXTKERNEL=$(sudo grubby --default-kernel | sed 's/.*vmlinuz\-//g')
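The sed expression in this command strips everything up to and including "vmlinuz-" from the default kernel image path, leaving only the kernel version string. The sketch below shows that transformation on a sample path. The function name kernel_version_from_path is hypothetical, and the path is only an illustration of what grubby --default-kernel might print.

```shell
# Hypothetical helper: strip everything up to and including "vmlinuz-" from a kernel image path.
kernel_version_from_path() {
    printf '%s\n' "$1" | sed 's/.*vmlinuz-//'
}

# Illustrative path; the kernel version shown is only an example.
kernel_version_from_path "/boot/vmlinuz-5.14.0-284.11.1.el9_2.x86_64"
```

The result has the same form as the output of uname -r, which is the form the installer's -k flag and the /lib/modules/<kernel version>/build path expect.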
Install the driver with the -k and -s flags to specify the new kernel version and kernel source path.
sudo /mnt/mlnxofedinstall -k ${NEXTKERNEL} -s /lib/modules/${NEXTKERNEL}/build --force
Note
The system might report that additional software needs to be installed before performing the installation. If such a message appears, install the software and then repeat this step.
Reboot.
sudo reboot
Reinstall nvidia-mlnx-config and nvidia-peer-memory-dkms.
sudo dnf install -y nvidia-mlnx-config nvidia-peer-memory-dkms