NVIDIA MLNX_OFED Documentation v24.01-0.3.3.1

Installation Related Issues

Issue: Driver installation fails.

Cause: The install script may fail for the following reasons:

  • An unsupported installation option was used

  • The previous installation could not be uninstalled because of dependencies still in use

  • The operating system is not supported

  • The kernel is not supported. You can run mlnx_add_kernel_support.sh to generate an MLNX_OFED package with drivers for the kernel

  • Packages required for installing the driver are missing

  • Kernel backport support is missing for an unsupported kernel

Solution:

  • Use only supported installation options. The full list of installation options can be displayed on screen by running: mlnxofedinstall --h

  • If the preliminary uninstallation failed because of dependencies still in use, run 'rpm -qa' to display a list of all installed RPMs, then uninstall the relevant ones manually with 'rpm -e'.

  • Use a supported operating system and kernel

  • If the installation failed due to missing prerequisites, manually install the missing packages listed on screen by the installation script.

Issue: After driver installation, the openibd service fails to start and the driver logs this message: Unknown symbol

Cause: The driver was installed on top of an existing inbox driver.

Solution:

  1. Uninstall the MLNX_OFED driver.

    ofed_uninstall.sh

  2. Reboot the server.

  3. Search for any remaining installed driver files.

    If any are found, move them from their current directory to the /tmp directory.

  4. Re-install the MLNX_OFED driver.

  5. Restart the openibd service.
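
The search for remaining driver files in step 3 could be sketched as follows. This is only an illustration under stated assumptions: it treats "remaining installed driver" as leftover kernel module files, the helper name find_leftover_modules is hypothetical, and the file name patterns are guesses that should be adjusted to the modules actually present on the system.

```shell
#!/bin/sh
# Sketch: list kernel module files that may be left over from an inbox or
# earlier driver installation. The name patterns are assumptions made for
# illustration, not a list taken from the MLNX_OFED documentation.
find_leftover_modules() {
    # $1 = directory to search (defaults to the running kernel's module tree)
    moddir="${1:-/lib/modules/$(uname -r)}"
    find "$moddir" -type f \
        \( -name 'mlx*.ko*' -o -name 'ib_*.ko*' -o -name 'rdma_*.ko*' \) \
        2>/dev/null | sort
}
```

Running find_leftover_modules with no argument after the reboot prints candidate files. Review the list before moving anything to /tmp, since the patterns can also match modules that belong to the running kernel.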

Warning

This section is relevant for RedHat and SLES distributions only.

Overview

The MLNX_OFED package for RedHat comes with RPMs that support KMP (weak-modules): when a new errata kernel is installed, compatibility links are created under the weak-updates directory for the new kernel. These links allow the existing MLNX_OFED kernel modules to be used without recompilation. At times, however, the ABI of the new kernel may not be compatible with the MLNX_OFED modules, which prevents them from loading. In this case, the MLNX_OFED modules must be rebuilt against the new kernel.

Detecting ABI Incompatibility with MLNX_OFED Modules

When MLNX_OFED modules are not compatible with a new kernel (from a new OS release or an errata kernel), no links are created under the weak-updates directory for the new kernel, and the driver fails to load. To check whether the needed module links exist under the weak-updates directory, reload the MLNX_OFED modules. If one or more modules are missing, the driver reload fails with an error message.

Example:

    # /etc/init.d/openibd restart
    Unloading HCA driver: [ OK ]
    Loading HCA driver and Access Layer: [ OK ]
    Module rdma_cm belong to kernel which is not a part of MLNX[FAILED]kipping...
    Loading rdma_ucm [FAILED]
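
The weak-updates links can also be inspected directly, without reloading the driver. The sketch below assumes the conventional location /lib/modules/<kernel version>/weak-updates, which this document does not spell out. Both helper names are hypothetical.

```shell
#!/bin/sh
# Sketch: list the compatibility symlinks under weak-updates for a kernel.
# Assumes the conventional layout <modules root>/<kernel version>/weak-updates.
list_weak_update_links() {
    # $1 = kernel version (defaults to the running kernel)
    # $2 = modules root (defaults to /lib/modules)
    kver="${1:-$(uname -r)}"
    root="${2:-/lib/modules}"
    find "$root/$kver/weak-updates" -type l 2>/dev/null | sort
}

# Succeeds if a link for the named module (for example rdma_cm) exists.
has_weak_update_link() {
    # $1 = module name, $2 = kernel version, $3 = modules root
    list_weak_update_links "$2" "$3" | grep -q "/$1\.ko"
}
```

For example, has_weak_update_link rdma_cm checks the running kernel. A non-zero exit status suggests the module was not linked for that kernel, so it may need to be rebuilt.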


Resolving ABI Incompatibility with MLNX_OFED Modules

To fix ABI incompatibility with MLNX_OFED modules, recompile the modules against the new kernel using the mlnx_add_kernel_support.sh script, available in the MLNX_OFED installation image.
There are two ways to recompile the MLNX_OFED modules:

  1. Local recompilation and installation on one server.
    Run the mlnxofedinstall command to recompile the kernel modules and reinstall the whole MLNX_OFED package on the server. Mount the MLNX_OFED ISO image or extract the TGZ file, then run:

    # cd <MLNX_OFED dir>
    # ./mlnxofedinstall --skip-distro-check --add-kernel-support --kmp --force

    Notes:

    - The --kmp flag will enable rebuilding RPMs with KMP (weak-updates) support for the new kernel. Therefore, in the next OS/kernel update, the same modules can be used with the new kernel (assuming that the ABI compatibility was not broken again).

    - The command above will rebuild only the kernel RPMs (using mlnx_add_kernel_support.sh), save the resulting MLNX_OFED package under /tmp, and start installing it automatically. This package can be used for installation on other servers using the regular mlnxofedinstall command or yum.
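
To reuse the rebuilt package on other servers, it first has to be located under /tmp. A minimal sketch follows. It assumes the archive name follows the MLNX_OFED_LINUX-...-ext.tgz pattern seen in the example output elsewhere in this section. Other releases may name the file differently, and the helper name is hypothetical.

```shell
#!/bin/sh
# Sketch: print the most recently modified rebuilt MLNX_OFED archive.
# The '-ext.tgz' suffix is taken from this document's example output and is
# an assumption for other releases.
latest_rebuilt_package() {
    # $1 = directory to search (defaults to /tmp)
    dir="${1:-/tmp}"
    ls -t "$dir"/MLNX_OFED_LINUX-*-ext.tgz 2>/dev/null | head -n 1
}
```

The function prints nothing when no matching archive exists, so a caller can test for an empty result before copying the file.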

  2. Preparing a new image on one server and deploying it on the cluster.

    1. Use the mlnx_add_kernel_support.sh script directly to rebuild only the kernel RPMs (without running any installation) on one server. Mount the MLNX_OFED ISO image or extract the TGZ file, then run:

      # cd <MLNX_OFED dir>
      # ./mlnx_add_kernel_support.sh -m $PWD --kmp -y

      Note: This command will save the resulting MLNX_OFED package under /tmp.

      Example:

      # cd /tmp/MLNX_OFED_LINUX-3.3-1.0.0.0-DB-rhel7.0-x86_64
      # ./mlnx_add_kernel_support.sh -m $PWD --kmp -y
      Note: This program will create MLNX_OFED_LINUX TGZ for rhel7.1 under /tmp directory.
      See log file /tmp/mlnx_ofed_iso.23852.log

      Building OFED RPMS . Please wait...
      Creating metadata-rpms for 3.10.0-229.14.1.el7.x86_64 ...
      WARNING: Please note that this MLNX_OFED repository contains an unsigned rpms,
      WARNING: therefore, you should set 'gpgcheck=0' in the repo conf file.
      Created /tmp/MLNX_OFED_LINUX-3.3-1.0.0.0-rhel7.1-x86_64-ext.tgz

    2. Install the newly created MLNX_OFED package on the cluster:

      Option 1: Copy the package to the servers and install it using the mlnxofedinstall script.

      Option 2: Deploy the MLNX_OFED package using YUM (for YUM installation instructions, refer to the Installing MLNX_OFED Using YUM section):

      i. Extract the resulting MLNX_OFED image and copy it to a shared NFS location.

      ii. Create a YUM repository configuration.

      iii. Install the new MLNX_OFED kernel RPMs on the servers:

      # yum update

      Example:

      ...
      ...
      ========================================================================================================================
      Package               Arch    Version                                                           Repository  Size
      ========================================================================================================================
      Updating:
      epel-release          noarch  7-7                                                               epel        14 k
      kmod-iser             x86_64  1.8.0-OFED.3.3.1.0.0.1.gf583963.201606210906.rhel7u1              mlnx_ofed   35 k
      kmod-isert            x86_64  1.0-OFED.3.3.1.0.0.1.gf583963.201606210906.rhel7u1                mlnx_ofed   32 k
      kmod-kernel-mft-mlnx  x86_64  4.4.0-1.201606210906.rhel7u1                                      mlnx_ofed   10 k
      kmod-knem-mlnx        x86_64  1.1.2.90mlnx1-OFED.3.3.0.0.1.0.3.1.ga04469b.201606210906.rhel7u1  mlnx_ofed   22 k
      kmod-mlnx-ofa_kernel  x86_64  3.3-OFED.3.3.1.0.0.1.gf583963.201606210906.rhel7u1                mlnx_ofed   1.4 M
      kmod-srp              x86_64  1.6.0-OFED.3.3.1.0.0.1.gf583963.201606210906.rhel7u1              mlnx_ofed   39 k

      Transaction Summary
      ========================================================================================================================
      Upgrade 7 Packages
      ...
      ...

      Note: The MLNX_OFED user-space packages will not change; only the kernel RPMs will be updated. However, "yum update" can also update other inbox packages (not related to OFED). To install the MLNX_OFED kernel RPMs only, run:

      # yum install mlnx-ofed-kernel-only

      Note: mlnx-ofed-kernel-only is a metadata RPM that requires the MLNX_OFED kernel RPMs only.
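
The repository configuration in step ii could look like the sketch below. The repository ID mlnx_ofed matches the Repository column in the example output, and gpgcheck=0 follows the warning printed by mlnx_add_kernel_support.sh about unsigned RPMs. The helper name, the default file name, and the path passed in are assumptions. The Installing MLNX_OFED Using YUM section remains the authoritative reference.

```shell
#!/bin/sh
# Sketch: write a minimal YUM repository file for the rebuilt package.
# gpgcheck=0 is used because the rebuilt RPMs are unsigned, per the warning
# in the example output. The file name and paths are hypothetical.
write_mlnx_repo() {
    # $1 = absolute path of the directory holding the repository RPMs
    # $2 = repo file to write (defaults to /etc/yum.repos.d/mlnx_ofed.repo)
    repo_path="$1"
    repo_file="${2:-/etc/yum.repos.d/mlnx_ofed.repo}"
    cat > "$repo_file" <<EOF
[mlnx_ofed]
name=MLNX_OFED Repository
baseurl=file://$repo_path
enabled=1
gpgcheck=0
EOF
}
```

A call such as write_mlnx_repo /path/on/nfs/to/RPMS (placeholder path) would be run once per server, before the yum commands above.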

    3. Verify that the driver can be reloaded:

      # /etc/init.d/openibd restart
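
When the verification has to be repeated on many servers, the restart output can be scanned for failures instead of being read by eye. The sketch below relies only on the [FAILED] marker that appears in this document's example output. The helper name is hypothetical, and the script's exit status should still be checked separately.

```shell
#!/bin/sh
# Sketch: read openibd restart output on stdin, print every line that
# contains the [FAILED] marker, and return non-zero if any was found.
check_reload_output() {
    # '|| true' keeps the assignment from failing when grep finds no match
    failed=$(grep -F '[FAILED]' || true)
    if [ -n "$failed" ]; then
        printf '%s\n' "$failed"
        return 1
    fi
    return 0
}
```

A possible invocation is /etc/init.d/openibd restart 2>&1 | check_reload_output, which prints only the failing lines and reports failure through its exit status.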

© Copyright 2023, NVIDIA. Last updated on Jun 17, 2024.