Using the NVIDIA Mellanox InfiniBand Drivers

The DGX software stack for Red Hat-derived operating systems does not include the Mellanox OpenFabrics Enterprise Distribution (MLNX_OFED) for Linux. This is to ensure that the MLNX_OFED driver is in sync with the Red Hat distribution kernel. This section describes how to download, install, and upgrade MLNX_OFED on systems that are running Red Hat Enterprise Linux.

Determining the MLNX_OFED Version to Install

NVIDIA validates each release of NVIDIA EL7 software with a specific MLNX_OFED version. Consult the NVIDIA EL7 release notes for the recommended MLNX_OFED version to install for a particular version of NVIDIA EL7 software.

The following table provides a quick reference for tested versions.

Red Hat Enterprise Linux Version    MLNX_OFED LTS Version
7.9                                 4.9-2.2.6.0
7.8                                 4.9-0.1.7.0
7.7                                 4.7-3.2.9.0
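
For scripted installs, the table above can be expressed as a small lookup. The following is a minimal sketch, not part of the NVIDIA tooling: the function name mlnx_ofed_for_rhel is hypothetical, and the values are copied from the table, so the release notes remain the authoritative source.

```shell
# Look up the MLNX_OFED LTS version tested with a given RHEL 7 minor release.
# The values mirror the quick-reference table; the function name is hypothetical.
mlnx_ofed_for_rhel() {
    case "$1" in
        7.9) echo "4.9-2.2.6.0" ;;
        7.8) echo "4.9-0.1.7.0" ;;
        7.7) echo "4.7-3.2.9.0" ;;
        *)   echo "no tested MLNX_OFED version listed for RHEL $1" >&2
             return 1 ;;
    esac
}
```

For example, mlnx_ofed_for_rhel 7.8 prints 4.9-0.1.7.0, and an unlisted release returns a non-zero status so that a calling script can stop.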

Installing the NVIDIA Mellanox InfiniBand Drivers

This section describes how to install MLNX_OFED on systems that do not yet have it installed. Use an MLNX_OFED version that has been validated for the Red Hat Enterprise Linux version that the DGX system is running. Note that the yum update command that is run before installing the NVIDIA driver updates the system to the latest Red Hat Enterprise Linux version, so confirm the installed version before choosing an MLNX_OFED bundle.
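
Because yum update can change the installed release, a script may want to read the release number rather than assume it. The following is a minimal sketch, assuming the release file contains a string of the form "Red Hat Enterprise Linux Server release 7.9 (Maipo)"; the function name rhel_version is hypothetical.

```shell
# Extract the "major.minor" release number from a redhat-release string
# supplied on standard input. Prints nothing if no release number is found.
rhel_version() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}
```

On a DGX system, this would typically be invoked as rhel_version < /etc/redhat-release.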

  1. Determine which version of Red Hat Enterprise Linux is installed on the DGX system.
    cat /etc/redhat-release
  2. Determine the appropriate MLNX_OFED software bundle to install.

    Refer to Determining the MLNX_OFED Version to Install.

  3. Download the MLNX_OFED software bundle.
    1. Visit the Linux InfiniBand Drivers page, scroll down to the Download wizard, and then click the LTS Download tab.



      NVIDIA EL7 software is tested only with LTS versions of MLNX_OFED.

    2. At the MLNX_OFED Download Center matrix, choose
      • The version to install (you may need to select Archive Versions),
      • RHEL/CentOS (under OS Distribution), and
      • The relevant OS Distribution Version and Architecture.




    3. Click the desired ISO/tgz package.

      To obtain the download link, accept the End User License Agreement.

  4. After downloading the correct MLNX_OFED software bundle, proceed with the installation steps.
    1. Re-visit the MLNX_OFED Software Releases site and select the MLNX_OFED software version you intend to use.
    2. Use the side menu to navigate to Installation->Installing MLNX_OFED, and follow the instructions.
  5. Install nvidia-mlnx-config.
    sudo yum install -y nvidia-mlnx-config
  6. Install kernel headers and development packages for your kernel.

    These are needed for the ensuing DKMS compilation.

    sudo yum install -y kernel-headers-$(uname -r) kernel-devel-$(uname -r)
  7. After installing the MLNX_OFED drivers, install the NVIDIA peer memory module.
    sudo yum install -y nvidia-peer-memory-dkms
  8. Load the nv_peer_mem module in one of the following ways:
    • Manually, by issuing sudo systemctl start nv_peer_mem, or
    • Automatically on every system boot, by setting up the system as follows.
      1. Create the file /etc/modules-load.d/nv-peer-mem.conf containing the single line nv_peer_mem.
      2. Issue sudo dracut --force /boot/initramfs-$(uname -r).img $(uname -r)
      3. Reboot the system.
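
The automatic-load option in step 8 can be scripted. The following is a minimal sketch that creates only the configuration file; the function name and its root-directory argument are hypothetical conveniences for dry runs. On a real system the root would be / and the command would need root privileges, and the dracut and reboot steps would still follow.

```shell
# Write the modules-load.d entry for nv_peer_mem under the given root directory.
# The root argument defaults to "/" and exists only to allow a dry run
# against a scratch directory.
write_nv_peer_mem_conf() {
    root="${1:-/}"
    conf_dir="${root%/}/etc/modules-load.d"
    mkdir -p "$conf_dir" &&
        printf '%s\n' "nv_peer_mem" > "$conf_dir/nv-peer-mem.conf"
}
```

Running write_nv_peer_mem_conf /tmp/scratch and inspecting /tmp/scratch/etc/modules-load.d/nv-peer-mem.conf is one way to confirm the file contents before applying the change to the live system.
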
Note: While in-box drivers may be available, using them is not recommended because they provide lower performance than the official MLNX_OFED drivers and do not support the GPUDirect™ RDMA feature. For more information on configuring the in-box drivers, see the Red Hat Enterprise Linux documentation.

Updating the NVIDIA Mellanox InfiniBand Drivers

This section describes how to update MLNX_OFED on systems that already have it installed. The Mellanox InfiniBand drivers in RPM packages are precompiled for a specific kernel version, so use an MLNX_OFED version that has been validated for the Red Hat Enterprise Linux version that the DGX system has been updated to. There is no need to uninstall the current MLNX_OFED first, because the mlnxofedinstall script automatically uninstalls any previously installed versions.
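
Before updating, it can help to record which MLNX_OFED version is currently installed. The following is a minimal sketch, assuming that the ofed_info -s command shipped with MLNX_OFED prints a banner of the form MLNX_OFED_LINUX-<version>: (that format is an assumption here, and the function name is hypothetical).

```shell
# Reduce a banner such as "MLNX_OFED_LINUX-4.9-2.2.6.0:" (read from standard
# input) to the bare version string by stripping the prefix and trailing colon.
installed_ofed_version() {
    sed -e 's/^MLNX_OFED_LINUX-//' -e 's/:$//'
}
```

On a system with MLNX_OFED installed, this would typically be invoked as ofed_info -s | installed_ofed_version, and the result compared with the version validated for the target Red Hat Enterprise Linux release.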

Note: The MLNX_OFED drivers support the Red Hat Enterprise Linux weak-modules script. This means that kernel updates within the same Red Hat Enterprise Linux version (for example, 7.9) do not require an update to the MLNX_OFED driver.
  1. Upgrade the Red Hat Enterprise Linux release and kernel version.
    sudo yum update
  2. Determine which version of Red Hat Enterprise Linux is installed on the DGX system.
    cat /etc/redhat-release
  3. Determine the appropriate MLNX_OFED software bundle to install.

    Refer to Determining the MLNX_OFED Version to Install.

  4. Download the MLNX_OFED software bundle.
    1. Visit the Linux InfiniBand Drivers page, scroll down to the Download wizard, and then click the LTS Download tab.



      NVIDIA EL7 software is tested only with LTS versions of MLNX_OFED.

    2. At the MLNX_OFED Download Center matrix, choose
      • The version to install (you may need to select Archive Versions),
      • RHEL/CentOS (under OS Distribution), and
      • The relevant OS Distribution Version and Architecture.




    3. Click the desired ISO/tgz package.

      To obtain the download link, accept the End User License Agreement.

  5. Mount the downloaded ISO somewhere on the system.

    The following example shows the ISO being mounted on the /mnt directory.

    sudo mount MLNX_OFED_LINUX-<version>.iso /mnt
  6. Prepare to install the driver.
    1. Remove nvidia-mlnx-config and nvidia-peer-memory-dkms.
      sudo yum remove -y nvidia-mlnx-config nvidia-peer-memory-dkms
      The mlnxofedinstall script removes existing packages before installing new ones. Because nvidia-mlnx-config and nvidia-peer-memory-dkms depend on some of those packages, they would otherwise be removed as a side effect. Removing them ahead of time avoids issues; they are reinstalled after the driver installation.
    2. Specify the new kernel version to use when installing the driver.
      NEXTKERNEL=$(sudo grubby --default-kernel | sed 's/.*vmlinuz\-//g')
  7. Install the driver with the -k and -s flags to specify the new kernel version and kernel source path.
    sudo /mnt/mlnxofedinstall -k ${NEXTKERNEL} -s /lib/modules/${NEXTKERNEL}/build --force
    
    Note: The system may report that additional software needs to be installed before performing the installation. If such a message appears, install the software and then retry installing the MLNX_OFED driver.
  8. Reboot.
    sudo reboot
  9. Reinstall nvidia-mlnx-config and nvidia-peer-memory-dkms.
    sudo yum install -y nvidia-mlnx-config nvidia-peer-memory-dkms
  10. Load the nv_peer_mem module in one of the following ways:
    • Manually, by issuing sudo systemctl start nv_peer_mem, or
    • Automatically on every system boot, by setting up the system as follows.
      1. Create the file /etc/modules-load.d/nv-peer-mem.conf containing the single line nv_peer_mem.
      2. Issue sudo dracut --force /boot/initramfs-$(uname -r).img $(uname -r)
      3. Reboot the system.
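
The NEXTKERNEL assignment in step 6 works by stripping everything up to and including "vmlinuz-" from the default kernel path that grubby reports. The following is a minimal sketch of that transformation applied to a path passed as an argument, so it can be tried without grubby; both function names are hypothetical, and the sed expression is an equivalent of the one in step 6 rather than a copy.

```shell
# Reduce a kernel image path such as /boot/vmlinuz-<version> to <version>.
kernel_version_from_path() {
    printf '%s\n' "$1" | sed 's/.*vmlinuz-//'
}

# Print the kernel source path that step 7 passes to the -s flag
# for a given kernel version.
kernel_build_dir() {
    printf '/lib/modules/%s/build\n' "$1"
}
```

For example, kernel_version_from_path /boot/vmlinuz-3.10.0-1160.el7.x86_64 prints 3.10.0-1160.el7.x86_64. Checking that the directory printed by kernel_build_dir exists before running mlnxofedinstall is one way to confirm that the matching kernel-devel package is installed.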