Known Issues
The following tables list the general limitations and known issues of the current release.
Internal Ref. Number | Issue |
4087492 | Description: When using hardware LAG, the hardware's hash may differ from the kernel's software bond hash. This discrepancy can cause packets from the same stream to be sent out through different ports. |
Keywords: LAG | |
Workaround: N/A | |
Discovered in Release: 24.07-0.6.1.0 | |
3990230 | Description: DOCA-HOST includes drivers for both the Ethernet and InfiniBand protocols. Installing the kernel part of DOCA-HOST (as with MLNX_OFED before it) replaces any InfiniBand drivers previously installed on the host server, such as ib_core and its dependencies, with the DOCA-HOST drivers. |
Keywords: DOCA-HOST, kernel, inbox drivers | |
Workaround: To keep the non-DOCA drivers, use the relevant Inbox drivers: https://docs.nvidia.com/networking/software/adapter-software/index.html#linux-inbox-drivers-upstream-releases | |
Discovered in Release: 24.07-0.6.1.0 | |
4040187 | Description: An mlx5_core probe error may occur on RHEL 9.4 after unloading the mlx5_core module. |
Keywords: mlx5_core | |
Workaround: Upgrade to kernel 5.14.0-427.20.1.el9_4 or newer to resolve the kernel panics. | |
Discovered in Release: 24.07-0.6.1.0 | |
4022803 | Description: fwctl subsystem is supported on default kernels only. |
Keywords: fwctl, kernel | |
Workaround: N/A | |
Discovered in Release: 24.07-0.6.1.0 | |
4001184 | Description: When using QEMU versions older than 8.2, interrupts may be lost, resulting in firmware command timeouts and undefined behavior in the driver. |
Keywords: QEMU | |
Workaround: To avoid this issue, make sure:
| |
Discovered in Release: 24.07-0.6.1.0 | |
3856101 | Description: In Debian 12, using dhcpcd instead of dhclient to configure the network interface (using NetworkManager) results in an incorrect network interface configuration. |
Keywords: dhcpcd, dhclient, Debian 12, NetworkManager | |
Workaround: Use dhclient to configure the network interface. | |
Discovered in Release: 24.04-0.6.6.0 | |
3964215 | Description: Driver might try to access privileged registers resulting in an error with syndrome. |
Keywords: Unbind and bind the function or restart the driver. | |
Workaround: N/A | |
Discovered in Release: 24.04-0.6.6.0 | |
3640907 | Description: When using a kernel version lower than v5.5, application termination on PCIe Gen5 servers could lead to kernel problems, such as IOMMU call traces, because of a lack of support in the AMD IOMMU kernel component. |
Keywords: PCIe Gen5, IOMMU, Call Trace | |
Workaround: To resolve the issue either:
or
| |
Discovered in Release: 24.04-0.6.6.0 | |
3004304 | Description: Setting NVMe |
Keywords: NVMe, Call Trace, | |
Workaround: N/A | |
Discovered in Release: 24.01-0.3.3.1 | |
3735400 | Description: The |
Keywords: NVMF connect, PI, Adaptive Routing | |
Workaround: Disable the AR at the opensm, or, alternatively, disable the PI at the nvme_rdma with a new module parameter. | |
Discovered in Release: 24.01-0.3.3.1 | |
3774149 | Description: In some cases, there could be a race condition between RDMA_WRITE and shared memory write, leading to the MPI receiving invalid data with large messages or collective operations between ranks on the same node. |
Keywords: Race condition, RDMA_WRITE, shared memory write | |
Workaround: Set | |
Discovered in Release: 24.01-0.3.3.1 | |
3565433 | Description: An error may occur when creating a DCI due to oversized WQEs. This is caused by a loose enforcement of the allowed max quantity of SGEs. |
Keywords: DCI, SGEs | |
Workaround: N/A | |
Discovered in Release: 24.01-0.3.3.1 | |
3732632 | Description: Geneve offload does not operate together with FLEX_PARSER. |
Keywords: Geneve offload, FLEX_PARSER | |
Workaround: Make sure that the firmware is appropriately configured by verifying that the | |
Discovered in Release: 24.01-0.3.3.1 | |
3644590 | Description: When working in switchdev mode, the number of XFRM IN rules that can be added is limited to 2047. |
Keywords: switchdev mode, XFRM IN rules | |
Workaround: N/A | |
Discovered in Release: 24.01-0.3.3.1 | |
3563584 | Description: In case of a steering loop, the packet would loop indefinitely, causing a device hang. |
Keywords: Steering loop | |
Workaround: Enable firmware infinite loop protection. | |
Discovered in Release: 24.01-0.3.3.1 |
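Several entries in the table above depend on the running kernel version: 4040187 asks for kernel 5.14.0-427.20.1.el9_4 or newer, and 3640907 affects kernels lower than v5.5. The following is a minimal Python sketch of such a check. The helper names `numeric_prefix` and `kernel_at_least` are hypothetical, and the comparison looks only at the leading numeric fields of the release string, so it is an approximation and not a substitute for the distribution's own package version comparison.

```python
import re


def numeric_prefix(release):
    """Return the leading numeric fields of a kernel release string.

    "5.14.0-427.20.1.el9_4.x86_64" -> (5, 14, 0, 427, 20, 1)
    Parsing stops at the first field that does not start with a digit.
    """
    match = re.match(r"\d+(?:[.-]\d+)*", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return tuple(int(part) for part in re.split(r"[.-]", match.group(0)))


def _pad(fields, width):
    """Right-pad a tuple of ints with zeros so both sides have equal length."""
    return fields + (0,) * (width - len(fields))


def kernel_at_least(release, minimum):
    """Return True if `release` is numerically equal to or newer than `minimum`."""
    current = numeric_prefix(release)
    required = numeric_prefix(minimum)
    width = max(len(current), len(required))
    return _pad(current, width) >= _pad(required, width)
```

To use it, pass the output of `uname -r` as `release`, for example `kernel_at_least(release, "5.14.0-427.20.1")` for issue 4040187 or `kernel_at_least(release, "5.5")` for issue 3640907.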
Internal Ref. Number | Issue |
3678715 | Description: When attempting to restart drivers using openIbd service while the nvme_rdma module is loaded, the process may fail. This behavior is intentional, as unloading nvme_rdma during the driver restart can lead to connectivity issues in other applications within the setup. |
Keywords: openIbd service, nvme_rdma module | |
Workaround: Manually unload the nvme_rdma module before performing the driver restart. This can be achieved using the | |
Discovered in Release: 23.10-1.1.9.0 | |
3676223 | Description: When using kernel version 4.12 or above, it is advised to run the following command to avoid VF probing: echo 0 > /sys/bus/pci/devices/0000\:08\:00.0/sriov_drivers_autoprobe |
Keywords: VF probing | |
Workaround: N/A | |
Discovered in Release: 23.10-1.1.9.0 | |
3682658 | Description: While using the RDMA-CM user application and the AF_IB parameter, the kernel uses only the first byte of the private data to set the CMA version. In such a scenario, any user data written to this byte will be overwritten. |
Keywords: RDMA-CM user application, AF_IB, private data | |
Workaround: Do not use AF_IB for application's private data. | |
Discovered in Release: 23.10-0.5.5.0 | |
3640082 | Description: A potential null pointer dereference might occur due to a missing update in the PCI subsystem code when creating the maximum number of VFs. All kernel versions lacking the following fix are impacted: "PCI: Avoid enabling PCI atomics on VFs." |
Keywords: Maximal VF number | |
Workaround: N/A | |
Discovered in Release: 23.10-0.5.5.0 | |
3653417 | Description: When offloading IPsec policy rules while in legacy mode there are two options:
2. Changing the steering mode to firmware steering will return unsupported. |
Keywords: IPsec, legacy mode | |
Workaround: Perform a devlink reload after changing the steering mode. | |
Discovered in Release: 23.10-0.5.5.0 | |
3612274 | Description: Currently, only one of IPsec offload or TC offload is allowed on a specific interface. Offloading a TC rule to an interface will fail if an IPsec rule is already offloaded on it, and vice versa. |
Keywords: IPsec offload, TC offload | |
Workaround: N/A | |
Discovered in Release: 23.10-0.5.5.0 | |
3596126 | Description: OVS mirroring of both egress and ingress together with modified TTL is not supported by ConnectX-5 cards, and may cause packet checksum issues and errors in the dmesg output. |
Keywords: OVS mirroring, ConnectX-5 | |
Workaround: N/A | |
Discovered in Release: 23.10-0.5.5.0 | |
3538463 | Description: A kernel ABI problem in SLES 15 SP4 may lead to issues during driver start. This impacts kernels from version 5.14.21-150400.24.11.1 up to version 5.14.21-150400.24.63.1 (July 2022 to May 2023), inclusive. For more information, see https://www.suse.com/support/kb/doc/?id=000021137. |
Keywords: Kernel ABI, SLES 15 SP4, driver start | |
Workaround: Upgrade to a kernel version newer than 5.14.21-150400.24.63.1 (May 2023). | |
Discovered in Release: 23.10-0.5.5.0 | |
3637252 | Description: When running over RHEL 7.6 with excessive RDMA/RoCE workload, kernel warnings may be triggered. |
Keywords: RHEL 7.6, RDMA, RoCE | |
Workaround: N/A | |
Discovered in Release: 23.10-0.5.5.0 |
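The `echo 0 > .../sriov_drivers_autoprobe` command in entry 3676223 above can also be issued from a script. The following Python sketch writes the same value to the same sysfs attribute. The function name `disable_vf_autoprobe` and the `sysfs_root` parameter are hypothetical and exist only to make the helper easy to try against a scratch directory. On a real system it needs root privileges and the PCI address of the physical function.

```python
from pathlib import Path


def disable_vf_autoprobe(pci_address, sysfs_root="/sys"):
    """Write 0 to sriov_drivers_autoprobe for the given PCI device.

    pci_address is the full address as it appears under
    /sys/bus/pci/devices, for example "0000:08:00.0".
    Returns the path of the attribute that was written.
    """
    attr = (
        Path(sysfs_root)
        / "bus/pci/devices"
        / pci_address
        / "sriov_drivers_autoprobe"
    )
    # Refuse to create a stray file if the device or attribute is absent.
    if not attr.exists():
        raise FileNotFoundError(f"{attr} does not exist")
    attr.write_text("0\n")
    return attr
```

For the device in the entry above, the call would be `disable_vf_autoprobe("0000:08:00.0")`. No shell escaping of the colons is needed, because the path is not passed through a shell.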
Internal Ref. Number | Issue |
3046655 | Description: A package manager upgrade with zypper (on an SLES system) may prompt a question about vendor change from "Mellanox Technologies" to "OpenFabrics". |
Keywords: Installation, SLES | |
Workaround: Either accept the prompted change, or add the /etc/zypp/vendors.d/mlnx_ofed file with the following content: [main] vendors = Mellanox,OpenFabrics | |
Discovered in Release: 23.07-0.5.0.0 | |
3392477 | Description: The ConnectX-7 firmware embedded in this MLNX_OFED version cannot be burnt using the MLNX_OFED installer script. |
Keywords: ConnectX-7, MLNX_OFED installer script | |
Workaround: Download and install the dedicated firmware from https://network.nvidia.com/support/firmware/connectx7ib/ | |
Discovered in Release: 23.07-0.5.0.0 | |
3532756 | Description: The kernel may crash when restarting the driver while IPsec rules are configured. |
Keywords: IPsec | |
Workaround: Flush the IPsec configuration before reloading the driver by running ip xfrm state flush, followed by ip xfrm policy flush. | |
Discovered in Release: 23.07-0.5.0.0 | |
3472979 | Description: When a large number of virtual functions are present, the output of the |
Keywords: virtual functions, ip link show | |
Workaround: N/A | |
Discovered in Release: 23.07-0.5.0.0 | |
3413938 | Description: When using the mlnx-sf script, creating and deleting an SF with the same ID number in a stressful manner may cause the setup to hang due to a race between the create and delete commands. |
Keywords: Hang; mlnx-sf | |
Workaround: N/A | |
Discovered in Release: 23.07-0.5.0.0 | |
3461572 | Description: Configuring Multiport Eswitch LAG mode can be performed only via devlink from this release onwards. The compat sysfs should not be used to configure mpesw LAG. |
Keywords: Multiport Eswitch, compat sysfs, mpesw LAG | |
Workaround: N/A | |
Discovered in Release: 23.07-0.5.0.0 | |
3464337 | Description: Simultaneously adding or removing TC rules while operating on kernel version 6.3 could potentially result in stability issues. |
Keywords: ASAP, rules, TC | |
Workaround: Make sure the following fix is part of the kernel: https://lore.kernel.org/netdev/20230504181616.2834983-3-vladbu@nvidia.com/T/ | |
Discovered in Release: 23.07-0.5.0.0 | |
3469484 | Description: Mirror and connection tracking (CT) offload actions are not supported simultaneously if the kernel version does not support hardware miss to TC actions. Thus, when performing a CT offload test, the actual number of offloaded connections may be lower than expected. |
Keywords: ASAP, CT offload | |
Workaround: The offending commit is "net/sched: act_ct: offload UDP NEW connections". To fix this issue, make sure the kernel tree includes https://www.spinics.net/lists/stable-commits/msg303536.html | |
Discovered in Release: 23.07-0.5.0.0 | |
3473331 | Description: When performing a CT offload test, the actual number of offloaded connections may be lower than expected. |
Keywords: ASAP, CT offload | |
Workaround: The fix is external to the driver. The offending commit is "net/sched: act_ct: offload UDP NEW connections". To fix this issue, make sure the kernel tree includes https://www.spinics.net/lists/stable-commits/msg303536.html | |
Discovered in Release: 23.07-0.5.0.0 |
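The workaround for entry 3532756 above is to run two flush commands before the driver is reloaded. The following Python sketch shows one way to script that sequence. The function name `flush_ipsec` and the injectable `run` parameter are hypothetical; only the two `ip xfrm` commands come from the entry itself. The driver restart is deliberately left out, since the entry does not name the restart command.

```python
import subprocess

# The two commands from the workaround, in the order given there.
FLUSH_COMMANDS = (
    ("ip", "xfrm", "state", "flush"),
    ("ip", "xfrm", "policy", "flush"),
)


def flush_ipsec(run=subprocess.run):
    """Flush IPsec state and policy entries before a driver reload.

    `run` defaults to subprocess.run. With check=True a failing command
    raises CalledProcessError, so the caller does not proceed to the
    driver reload with IPsec rules still configured.
    """
    for command in FLUSH_COMMANDS:
        run(list(command), check=True)
```

Calling `flush_ipsec()` requires root privileges and the `ip` utility from iproute2. Passing a different `run` callable allows a dry run that only records the commands.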
Internal Ref. Number | Issue |
3360710 | Description: Configuring PFC in parallel with the buffer size and prio2buffer commands may lead to misalignment between firmware and software regarding receive buffer ownership. |
Keywords: NetDev, PFC, Buffer Size, prio2buffer | |
Workaround: First, configure PFC on all ports, and then perform other needed QoS (i.e., buffer_size or prio2buffer) configurations accordingly. | |
Discovered in Release: 23.04-0.5.3.3 | |
3413879 | Description: OpenSM may not be started automatically if chkconfig was not installed before OpenSM is installed. Note, however, that chkconfig will fail to install if the directory (rather than symbolic link to directory) /etc/init.d already exists (e.g., from a previous installation of MLNX_OFED). |
Keywords: Installation, OpenSM, chkconfig | |
Workaround: Install chkconfig before installing MLNX_OFED. If installing it fails, make sure /etc/init.d does not exist at the time of installing it. | |
Discovered in Release: 23.04-0.5.3.3 | |
3424596 | Description: On SLES 15.4, installing MLNX_OFED using a package repository (with zypper) may trigger an error message about missing dependency for 'librte_eal.so.20.0()(64bit)' . This is because the inbox package libdpdk-20_0 is being uninstalled as it is incompatible with the MLNX_OFED rdma-core packages. |
Keywords: Installation, SLES 15.4 | |
Workaround: Uninstall the relevant packages: 'zypper uninstall libdpdk-20_0' before installing MLNX_OFED. This will also remove the inbox openvswitch package. | |
Discovered in Release: 23.04-0.5.3.3 | |
3433416 | Description: On systems that were installed with MLNX_OFED 5.9 or older and include a CUDA package (ucx-cuda / hcoll-cuda), an upgrade to MLNX_OFED 23.04 using the package manager ("yum") method will fail. This is because MLNX_OFED up to 5.9 is built with CUDA 11. MLNX_OFED 23.04 is built with CUDA 12 and those CUDA versions are incompatible. |
Keywords: Installation, CUDA, yum | |
Workaround: Remove the CUDA packages included with OFED (ucx-cuda, hcoll-cuda) before upgrading. This allows MLNX_OFED to be upgraded regardless of the CUDA version installed. To install these packages later, CUDA 12 must be installed on the system. | |
Discovered in Release: 23.04-0.5.3.3 | |
3420831 | Description: mlx-steering-dump is not supported on systems in which Python3 is not the default. |
Keywords: mlx-steering-dump, Python3 | |
Workaround: N/A | |
Discovered in Release: 23.04-0.5.3.3 | |
3351989 | Description: If the underlying persistent device name exceeds 15 characters in length, the operating system will not be able to perform renaming (i.e., the device name will remain "eth |
Keywords: Persistent Interface Names | |
Workaround: Add the --copy-ifnames-udev flag to the OFED installation command. Note that this flag is only applicable if the persistent name provided by the kernel, without the 'np | |
Discovered in Release: 23.04-0.5.3.3 |
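Entry 3351989 above hinges on the 15-character limit for interface names. The following Python sketch checks candidate names against that limit before an installation. The names `MAX_IFNAME_LEN` and `fits_ifname_limit` are hypothetical. The value 15 is taken from the entry and matches the Linux `IFNAMSIZ` of 16 bytes, which includes the terminating NUL.

```python
# Longest interface name the kernel accepts, per the entry above
# (Linux IFNAMSIZ is 16 bytes including the terminating NUL).
MAX_IFNAME_LEN = 15


def fits_ifname_limit(name):
    """Return True if `name` is short enough to be used as an interface name.

    The length is measured in bytes, because the kernel limit applies to
    the encoded name and not to the number of characters.
    """
    return 0 < len(name.encode("utf-8")) <= MAX_IFNAME_LEN
```

A name for which this returns False is one the operating system will not be able to rename to, as described in the entry.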