C. Installing Mellanox InfiniBand Drivers

Unlike the DGX OS shipped with the NVIDIA DGX server, the DGX software stack for Red Hat-derived operating systems does not include the Mellanox OpenFabrics Enterprise Distribution (MLNX_OFED) for Linux. This avoids an installation in which the MLNX_OFED kernel modules are out of sync with the Red Hat distribution kernel, which can cause system instability.

To use InfiniBand on the DGX server, do the following.

  1. Either download the latest MLNX_OFED driver from the Mellanox site and install it, or use the in-box drivers.
    Note: The in-box drivers provide a much lower level of performance than the official Mellanox drivers.
    If you install MLNX_OFED, be sure that the package supports the version of Red Hat Enterprise Linux installed on the server.
  2. After installing the MLNX_OFED drivers, install the NVIDIA peer memory module.
    sudo yum install nvidia-peer-memory-dkms
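
After completing the steps above, it can be useful to confirm which driver stack is in place. The following is a minimal sketch, not an official NVIDIA or Mellanox procedure. The function names module_listed and report_ib_stack are hypothetical, and the kernel module name nv_peer_mem is an assumption about what the nvidia-peer-memory-dkms package builds. The ofed_info utility is present only when MLNX_OFED is installed.

```shell
#!/bin/sh
# Sketch: post-install sanity checks. Function names are hypothetical.

# Succeeds if module "$1" appears in lsmod-style output read from stdin.
# The first line of lsmod output is a header, so it is skipped.
module_listed() {
    awk -v m="$1" 'NR > 1 && $1 == m { found = 1 } END { exit !found }'
}

# Prints the OS release, the running kernel, the MLNX_OFED version
# (if installed), and whether the peer memory module is loaded.
report_ib_stack() {
    if [ -r /etc/redhat-release ]; then
        cat /etc/redhat-release
    fi
    uname -r
    if command -v ofed_info >/dev/null 2>&1; then
        ofed_info -s
    else
        echo "ofed_info not found: MLNX_OFED does not appear to be installed"
    fi
    # Assumption: the peer memory module is named nv_peer_mem.
    if lsmod | module_listed nv_peer_mem; then
        echo "nv_peer_mem: loaded"
    else
        echo "nv_peer_mem: not loaded"
    fi
}
```

Sourcing this file and running report_ib_stack prints the Red Hat release and the running kernel next to the MLNX_OFED version, which makes it easier to compare them against the MLNX_OFED release notes for the installed operating system.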