DOCA Documentation v2.8.0

Known Issues

The following table lists the known issues and limitations for this release of DOCA SDK.

Reference

Description

4032924

Description: When upgrading to DOCA 2.8.0 on RPM-based OSes, a conflict between strongswan-bf or libreswan and strongSwan may occur.

Workaround: Before upgrading, remove strongswan-bf, strongswan-swanctl, and libreswan:

yum remove strongswan-bf strongswan-swanctl libreswan

Keyword: strongSwan; upgrade

Reported in version: 2.8.0

4035553

Description: oper_sample_period does not always reflect the correct sample period. In some cases, it will reflect the admin_sample_period instead.

Workaround: N/A

Keyword: Core

Reported in version: 2.8.0

4023257

Description: If RDMA samples are compiled with memory sanitizer enabled, "read memory leak" errors are printed when running the samples with the RDMA CM flag and when running the client before the server.

Workaround: Start the RDMA server before the RDMA client.

Keyword: DOCA RDMA; samples

Reported in version: 2.8.0

4021752

4021748

Description: In all RDMA samples, if an error occurs in any of the following functions:

  • Exporting RDMA/MMAP/Sync event

  • Connecting RDMA

  • Writing or reading the descriptors

an error is printed, but the sample continues running and might:

  1. Fail later, or be in busy-wait state indefinitely; and/or

  2. Result in access to an unknown address, causing an address sanitizer violation.

Workaround for 1: Either:

  • Check the error logs to verify that no error occurred in the relevant functions. If one did, stop the sample.

  • Fix the issue locally.

Workaround for 2: The address sanitizer violation can be ignored if an error occurred in one of the relevant functions.

Keyword: DOCA RDMA; samples

Reported in version: 2.8.0

3961940

Description: OVS-DOCA connection tracking with E2E enabled is not supported.

Workaround: N/A

Keyword: OVS-DPDK; connection tracking; E2E

Reported in version: 2.8.0

3989851

Description: When a DOCA Flow pipe has multiple actions and an action whose index is not 0 is a shared endecap action, a crash occurs when attempting to create an entry.

Workaround: N/A

Keyword: DOCA Flow

Reported in version: 2.8.0

3988904

Description: Failure to create a control entry with shared endecap action.

Workaround: N/A

Keyword: DOCA Flow

Reported in version: 2.8.0

3886674

Description: Installing doca-all and other DOCA metapackages does not install the mlnx-nvme driver.

Workaround: mlnx-nvme is only needed for NVMe-over-RDMA remote storage support. If you wish to install it, add the mlnx-nvme package to the install command.

  • On Ubuntu:

    apt install doca-all mlnx-nvme-modules

  • On RHEL:

    dnf install doca-all-kmod-mlnx-nvme

Keyword: NVMe; DOCA profile

Reported in version: 2.7.0

3885930

Description: When installing DOCA-Host on a system using NVMe storage (typically local NVMe disk), and the script doca-kernel-support is used to rebuild and install kernel modules, unloading the mlx5 drivers is only possible after also unmounting the NVMe storage, which would typically necessitate a reboot.

Workaround: N/A

Keyword: NVMe; doca-kernel-support; DOCA for host

Reported in version: 2.7.0

3837255

Description: When shutting down the Arm cores from the host OS, the message -E- Failed to send Register MRSI is expected and can be ignored.

Workaround: Wait 2 more minutes before rebooting the host. Before proceeding with host OS reboot, it is recommended to query the operational state of the BlueField Arm cores from the BlueField BMC to verify that shutdown state has been reached. Run the following command:

ipmitool -C 17 -I lanplus -H <bmc_ip> -U root -P <password> raw 0x32 0xA3

Expected output is "06".
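
The check above can be scripted. The following is a minimal sketch, not an official procedure: the function names, the BMC_IP and BMC_PASSWORD variables, the 5-second interval, and the 24-attempt limit are all assumptions made for illustration. Only the ipmitool command line and the expected value 06 come from the text above.

```shell
# Succeeds when the raw BMC reply equals 06, the value given above as the
# expected output. Surrounding whitespace in the reply is ignored.
arm_is_shut_down() {
    reply=$(printf '%s' "$1" | tr -d '[:space:]')
    [ "$reply" = "06" ]
}

# Hypothetical polling loop. BMC_IP and BMC_PASSWORD are placeholders, and
# the interval and attempt limit (about 2 minutes in total) are arbitrary.
wait_for_arm_shutdown() {
    attempts=0
    while [ "$attempts" -lt 24 ]; do
        out=$(ipmitool -C 17 -I lanplus -H "$BMC_IP" -U root -P "$BMC_PASSWORD" raw 0x32 0xA3 2>/dev/null)
        if arm_is_shut_down "$out"; then
            return 0
        fi
        attempts=$((attempts + 1))
        sleep 5
    done
    return 1
}
```

With both variables set, `wait_for_arm_shutdown && echo "Arm cores are shut down"` could gate the host reboot.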

Keyword: Host OS; reboot; error

Reported in version: 2.7.0

3844705

Description: In OpenEuler 20.03, the Linux Kernel version 4.19.90 is affected by an issue that impacts the discard/trim functionality for the BlueField eMMC device which may cause degraded performance of the BlueField eMMC over time.

Workaround: Upgrade to Linux Kernel version 5.10 or later.

Keyword: eMMC discard; trim functionality

Reported in version: 2.7.0

3877725

Description: During BFB installation in NIC mode on BlueField-3, excessive output fills the RShim log, so the Linux installation progress log does not appear in it.

echo "DISPLAY_LEVEL 2" > /dev/rshim0/misc
cat /dev/rshim0/misc

Workaround: Monitor the BlueField-3 Arm UART console to check whether the BFB installation has completed in NIC mode.

[13:58:39] INFO: Installation finished
...
[14:01:53] INFO: Rebooting...
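
If the UART console output is being captured to a file, the completion messages above can be detected automatically. This is only a sketch under that assumption: the function name and the capture file are hypothetical, and only the two message strings are taken from the log excerpt above.

```shell
# Reads captured console text on stdin and succeeds once it contains both
# completion messages shown above. Timestamps are not matched because they
# differ from run to run.
bfb_install_finished() {
    log=$(cat)
    printf '%s\n' "$log" | grep -q 'INFO: Installation finished' &&
        printf '%s\n' "$log" | grep -q 'INFO: Rebooting'
}
```

For example, `bfb_install_finished < uart-capture.log && echo "BFB installation completed"`, where uart-capture.log stands for whatever file the console logger writes.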

Keyword: NIC mode; BFB install

Reported in version: 2.7.0

3855702

Description: Trying to jump from a steering level in the hardware to a lower level using software steering is not supported on rdma-core lower than 48.x.

Workaround: N/A

Keyword: RDMA; SWS

Reported in version: 2.7.0

3855485

Description: When enabling the PCI_SWITCH_EMULATION_ENABLE NVconfig, the mlx devices, and potentially the RShim devices, disappear. In addition, the kernel log (viewed using dmesg) shows the following messages:

pci 0000:29:00.0: BAR 0: no space for [mem size 0x02000000 64bit pref]
pci 0000:29:00.0: BAR 2: no space for [mem size 0x00800000 64bit pref]
...

Workaround: N/A

Keyword: NVconfig; RShim; dmesg

Reported in version: 2.7.0

3831230

Description: In OpenEuler 20.03, the Linux Kernel version 4.19.90 is affected by an issue that impacts the discard/trim functionality for BlueField eMMC device which may cause degraded performance of BlueField eMMC over time.

Workaround: Upgrade to Linux Kernel version 5.10 or later.

Keyword: eMMC discard; trim functionality

Reported in version: 2.7.0

3743879

Description: mlxfwreset may time out on servers where the RShim driver is running and INTx is not supported. The following error message is printed: BF reset flow encountered a failure due to a reset state error of negotiation timeout.

Workaround: Set PCIE_HAS_VFIO=0 and PCIE_HAS_UIO=0 in /etc/rshim.conf and restart the RShim driver. Then re-run the mlxfwreset command.

If host Linux kernel lockdown is enabled, then manually unbind the RShim driver before mlxfwreset and bind it back after mlxfwreset:

echo "DROP_MODE 1" > /dev/rshim0/misc
mlxfwreset <arguments>
echo "DROP_MODE 0" > /dev/rshim0/misc

Keyword: Timeout; mlxfwreset; INTx

Reported in version: 2.7.0

3665070

Description: Virtio-net controller fails to load if DPA_AUTHENTICATION is enabled.

Workaround: N/A

Keyword: Virtio-net; DPA

Reported in version: 2.5.0

3678069

Description: If BlueField has both NVMe and mmcblk devices and is configured to boot from mmcblk, users must create a bf.cfg file containing device=/dev/mmcblk0, then install the *.bfb as normal.
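
A minimal sketch of this step follows. The function name is hypothetical and the default output path is an assumption; only the device=/dev/mmcblk0 setting comes from the description above. The bfb-install invocation in the comment assumes the BlueField is reachable as rshim0 and leaves the image name as a placeholder.

```shell
# Write a bf.cfg that selects the eMMC device as the installation target.
# The output path defaults to ./bf.cfg when no argument is given.
write_bf_cfg() {
    cfg="${1:-bf.cfg}"
    printf 'device=/dev/mmcblk0\n' > "$cfg"
}

# Example use from the host (placeholders, adjust to the actual setup):
#   write_bf_cfg bf.cfg
#   bfb-install --rshim rshim0 --bfb <image>.bfb --config bf.cfg
```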

Workaround: N/A

Keyword: NVMe

Reported in version: 2.5.0

3680538

Description: When using strongSwan or OVS-IPsec as explained in the NVIDIA BlueField DPU BSP, the IPsec Rx data path is not offloaded to hardware and is processed in software running on the Arm cores. As a result, bandwidth is substantially reduced.

Workaround: N/A

Keyword: IPsec

Reported in version: 2.5.0

N/A

Description: Execution unit partitions are not yet implemented and will be added in a future release.

Workaround: N/A

Keyword: EU tool

Reported in version: 2.5.0

3666160

Description: Installing a BFB using bfb-install when the mlxconfig parameter PF_TOTAL_SF is greater than 1700 triggers an immediate server reboot.

Workaround: Change PF_TOTAL_SF to 0, perform a graceful shutdown, power cycle, then install the BFB.

Keyword: SF; PF_TOTAL_SF; BFB installation

Reported in version: 2.2.1

3594836

Description: When enabling Flex IO SDK tracer at high rates, a slow-down in processing may occur and/or some traces may be lost.

Workaround: Keep tracing limited to ~1M traces per second to avoid a significant processing slow-down. Use tracer for debug purposes and consider disabling it by default.

Keyword: Tracer FlexIO

Reported in version: 2.2.1

3592080

Description: When using UEK8 on the host in DPU mode, creating a VF on the host consumes about 100MB of memory on BlueField.

Workaround: N/A

Keyword: UEK; VF

Reported in version: 2.2.1

3546202

Description: After rebooting a BlueField-3 DPU running Rocky Linux 8.6 BFB, the kernel log shows the following error:

[    3.787135] mlxbf_gige MLNXBF17:00: Error getting PHY irq. Use polling instead

This message indicates that the Ethernet driver will function normally in all aspects, except that PHY polling is enabled.

Workaround: N/A

Keyword: Linux; PHY; kernel

Reported in version: 2.2.0

3566042

Description: Virtio hotplug is not supported in GPU-HOST mode on the NVIDIA Converged Accelerator.

Workaround: N/A

Keyword: Virtio; Converged Accelerator

Reported in version: 2.2.0

3546474

Description: PXE boot over ConnectX interface might not work due to an invalid MAC address in the UEFI boot entry.

Workaround: On BlueField, create /etc/bf.cfg file with the relevant PXE boot entries, then run the command bfcfg.

Keyword: PXE; boot; MAC

Reported in version: 2.2.0

3561723

Description: Running mlxfwreset sync 1 on NVIDIA Converged Accelerators may be reported as supported although it is not. Executing the reset will fail.

Workaround: N/A

Keyword: mlxfwreset

Reported in version: 2.2.0

3306489

Description: When performing longevity tests (e.g., mlxfwreset, DPU reboot, burning of new BFBs), a host running an Intel CPU may observe errors related to "CPU 0: Machine Check Exception".

Workaround: Add intel_idle.max_cstate=1 entry to the kernel command line.
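
How the parameter is added depends on the bootloader and distribution, so it is not shown here. As a hedged sketch, the helper below (a hypothetical name) only checks, after a reboot, that the parameter is present on the running kernel's command line.

```shell
# Succeeds if the command-line string in $1 contains the exact parameter $2
# as a whole, space-delimited word.
cmdline_has_param() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example use on the host:
#   cmdline_has_param "$(cat /proc/cmdline)" intel_idle.max_cstate=1 \
#       && echo "parameter is active"
```

Matching whole words avoids a false positive on a different value such as intel_idle.max_cstate=10.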

Keyword: Longevity; mlxfwreset; DPU reboot

Reported in version: 2.2.0

3538486

Description: When removing LAG configuration from BlueField, a kernel warning for uverbs_destroy_ufile_hw is observed if virtio-net-controller is still running.

Workaround: Stop virtio-net-controller service before cleaning up bond configuration.

Keyword: Virtio-net; LAG

Reported in version: 2.2.0

3534219

Description: On BlueField-3 devices, after downgrading the firmware from the DOCA 2.2.0 version to 32.37.1306 (or lower), the host crashes when executing a partial Arm reset (e.g., Arm reboot; BFB push; mlxfwreset).

Workaround: Before downgrading the firmware:

  1. Run:

    echo 0 > /sys/bus/platform/drivers/mlxbf-bootctl/large_icm

  2. Reboot Arm.

Keyword: BlueField-3; downgrade

Reported in version: 2.2.0

3462630

Description: When trying to perform a PXE installation when UEFI Secure Boot is enabled, the following error messages may be observed:

error: shim_lock protocol not found.
error: you need to load the kernel first.

Workaround: Download a Grub EFI binary from the Ubuntu website. For further information on Ubuntu UEFI Secure Boot PXE Boot, please visit Ubuntu's official website.

Keyword: PXE; UEFI Secure Boot

Reported in version: 2.0.2

3448841

Description: While running CentOS 8.2, switchdev Ethernet BlueField runs in "shared" RDMA net namespace mode instead of "exclusive".

Workaround: Use ib_core module parameter netns_mode=0. For example:

echo "options ib_core netns_mode=0" >> /etc/modprobe.d/mlnx-bf.conf
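
Because the command above appends with >>, running it more than once leaves duplicate lines in the file. The following sketch shows one way to make the append idempotent; the helper name is hypothetical.

```shell
# Append a line to a file only if an identical line is not already present.
# A missing file is treated as empty and is created by the append.
append_once() {
    line="$1"
    file="$2"
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Example use:
#   append_once "options ib_core netns_mode=0" /etc/modprobe.d/mlnx-bf.conf
```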

Keyword: RDMA; isolation; Net NS

Reported in version: 2.0.2

2706803

Description: When an NVMe controller, SoC management controller, and DMA controller are configured, the maximum number of VFs is limited to 124.

Workaround: N/A

Keyword: VF; limitation

Reported in version: 2.0.2

3273435

Description: Changing the mode of operation between NIC and DPU modes results in different capabilities for the host driver which might cause unexpected behavior.

Workaround: Reload the host driver or reboot the host.

Keyword: Modes of operation; driver

Reported in version: 2.0.2

3264749

Description: In Rocky and CentOS 8.2 inbox-kernel BFBs, RegEx requires the following extra huge page configuration for it to function properly:

sudo hugeadm --pool-pages-min DEFAULT:2048M
sudo systemctl start mlx-regex.service
systemctl status mlx-regex.service

If these commands executed successfully, active (running) appears in the last line of the output.

Workaround: N/A

Keyword: RegEx; hugepages

Reported in version: 1.5.1

3240153

Description: DOCA kernel support only works on a non-default kernel.

Workaround: N/A

Keyword: Kernel

Reported in version: 1.5.0

3217627

Description: The doca_devinfo_rep_list_create API returns success on the host instead of Operation not supported.

Workaround: N/A

Keyword: DOCA core; InfiniBand

Reported in version: 1.5.0

© Copyright 2024, NVIDIA. Last updated on Aug 21, 2024.