NVIDIA BlueField Platform Software Troubleshooting Guide

MLNX_DPDK

This page offers troubleshooting information for DPDK users and customers. Before debugging, it is advisable to be familiar with the following commands and tools:

ibdev2netdev -v – Part of the OFED package, this command displays all associations between network devices and RDMA adapter ports, providing detailed information for troubleshooting and configuration.

lspci – A Linux command that provides detailed information about each PCIe bus and the devices connected to it. This tool is essential for identifying PCIe devices and troubleshooting hardware configurations.

ethtool – A Linux command used to query or modify network driver and hardware settings. It is commonly used to manage Ethernet device configurations, such as speed, duplex mode, and auto-negotiation, and to gather detailed diagnostic information.

ip – A Linux command used to assign addresses to network interfaces and configure various network parameters. It is the modern replacement for the deprecated ifconfig command and provides enhanced functionality for managing routing, tunnels, and more.

devlink – An API and CLI tool for exposing and managing device-wide information and resources that are not specific to any particular device class, such as chip-wide or switch-ASIC-wide configurations and attributes.

meson -Denable_drivers=*/mlx5,mempool/bucket,bus/auxiliary,mempool/ring,mempool/stack <build_dir> && ninja -C <build_dir>

This command builds the Data Plane Development Kit (DPDK) with support for the mlx5 driver, along with the other specified drivers and components (e.g., memory pools and the auxiliary bus).

  • meson – configures the build environment and specifies the components to include

  • ninja – executes the build process in the specified <build_dir> directory

echo -n $vf0_pci > /sys/bus/pci/drivers/mlx5_core/unbind
echo -n $vf1_pci > /sys/bus/pci/drivers/mlx5_core/unbind
devlink dev eswitch set pci/${pci_addr} mode switchdev
echo $vf_num > /sys/bus/pci/devices/${pci_addr}/sriov_numvfs

Or:

echo $vf_num > /sys/bus/pci/devices/$pci/mlx5_num_vfs

echo -n $vf0_pci > /sys/bus/pci/drivers/mlx5_core/bind
echo -n $vf1_pci > /sys/bus/pci/drivers/mlx5_core/bind

This sequence sets switchdev mode with 2 VFs.

Note

The mlx5_num_vfs parameter is always available, regardless of whether the OS has loaded the virtualization module (e.g., when adding intel_iommu support to the GRUB file). In contrast, the sriov_numvfs parameter is applicable only if intel_iommu has been added to the GRUB file. If the sriov_numvfs file is not visible, verify that intel_iommu has been correctly included in the GRUB configuration.
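
For illustration, a hedged end-to-end sketch of the sequence above, assuming a PF at PCIe address 0000:08:00.0 whose two VFs sit at 0000:08:00.2 and 0000:08:00.3 (all addresses are hypothetical):

pci_addr=0000:08:00.0
vf0_pci=0000:08:00.2
vf1_pci=0000:08:00.3
# Unbind the VFs so the e-switch mode can be changed
echo -n $vf0_pci > /sys/bus/pci/drivers/mlx5_core/unbind
echo -n $vf1_pci > /sys/bus/pci/drivers/mlx5_core/unbind
# Switch the PF e-switch to switchdev mode, expose 2 VFs, and rebind them
devlink dev eswitch set pci/${pci_addr} mode switchdev
echo 2 > /sys/bus/pci/devices/${pci_addr}/sriov_numvfs
echo -n $vf0_pci > /sys/bus/pci/drivers/mlx5_core/bind
echo -n $vf1_pci > /sys/bus/pci/drivers/mlx5_core/bind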

Compilation Debug Flags

-Dbuildtype=debug – sets the build type to debug, which enables debugging information and disables optimizations

-Dc_args='-DRTE_ENABLE_ASSERT -DRTE_LIBRTE_MLX5_DEBUG' – activates assertion checks and enables debug messages for mlx5


Steering Dump Tool

The steering dump tool allows the application to dump its specific data and triggers the hardware to dump the associated hardware data. For additional details, refer to mlx_steering_dump.

dpdk-proc-info Application

The dpdk-proc-info application functions as a secondary process within the DPDK environment and provides the following capabilities:

  • Retrieve port statistics

  • Reset port statistics

  • Print DPDK memory information

  • Display debug information for ports

For more information, refer to the dpdk-proc-info Application Guide.

Reproducing an Issue with testpmd Application

The testpmd application can be used to replicate issues in simplified scenarios. For detailed instructions, refer to the Testpmd Application User Guide.

DPDK Unit Test Tool

The DPDK Unit Test Tool facilitates DPDK unit testing using the testpmd and Scapy applications. For additional details, refer to the DPDK Unit Test Tool.

Memory Errors

No Free Hugepages Reported

If the following memory error is encountered during the startup of testpmd:

EAL: No free 2048 kB hugepages reported on node 0
EAL: FATAL: Cannot get hugepage information.
EAL: Cannot get hugepage information.

This error indicates that no hugepages have been allocated. To configure hugepages, run:

sysctl vm.nr_hugepages=<#huge_pages>
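
For example, a hedged allocation of 2048 pages of the default 2 MB size (the count is illustrative; size the pool for your application), followed by a check of the result:

sudo sysctl vm.nr_hugepages=2048    # allocate 2048 x 2 MB hugepages
grep -i hugepages /proc/meminfo     # confirm HugePages_Total and HugePages_Free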


Insufficient Memory for MLX5 Hash List Creation

If the following memory error is encountered during the startup of testpmd:

mlx5_common: mlx5_common_utils.c:420: mlx5_hlist_create(): No memory for hash list mlx5_0_flow_groups creation

This error suggests insufficient free hugepages. Verify the availability of free hugepages by running:

grep -i hugepages /proc/meminfo
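
Illustrative output on a host with 2048 free 2 MB pages (actual numbers will differ); HugePages_Free must be nonzero for the application to start:

HugePages_Total:    2048
HugePages_Free:     2048
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB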

Compatibility Issues Between Firmware and Driver

Symptom: The mlx5 driver is not loading.

Resolution: This issue may be caused by a compatibility mismatch between the firmware and the driver. When this occurs, the driver fails to load and an error message is logged in the dmesg output.

To resolve this issue, verify that the firmware version is compatible with the driver version in use. Refer to the MLNX_OFED Documentation for detailed information on supported firmware and driver versions.
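
As a quick check, the loaded driver and firmware versions can be read with ethtool (the interface name is hypothetical):

ethtool -i ens1f0 | grep -E 'driver|firmware-version'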

Restarting the Driver After Removing a Physical Port

Symptom: Operational issues occur due to a physical port being removed from an OVS-DPDK bridge while offload is enabled.

Resolution: When offload is enabled, removing a physical port from an OVS-DPDK bridge requires restarting the OVS service. Failing to restart the service can cause incorrect datapath rule configurations.

To resolve this issue:

  1. Reattach the physical port to the bridge according to the desired topology.

  2. Restart the Open vSwitch (OVS) service to restore proper functionality.
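
A hedged sketch of this recovery sequence, assuming a bridge named br0-ovs, a DPDK port named p0 at 0000:08:00.0, and a Debian-style service name (all illustrative):

ovs-vsctl add-port br0-ovs p0 -- set Interface p0 type=dpdk options:dpdk-devargs=0000:08:00.0
systemctl restart openvswitch-switch    # service name varies by distribution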

Dec_ttl Feature Not Working

dec_ttl is only supported on NVIDIA® ConnectX®-6 adapters and higher.

Deadlock When Moving to switchdev Mode

Symptom: A deadlock occurs when transitioning to switchdev mode while deleting a namespace.

Resolution: To prevent this issue, unload the mlx5_ib module before switching to switchdev mode.
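
A minimal sketch of that order of operations, assuming a PF at 0000:08:00.0 (hypothetical):

modprobe -r mlx5_ib                                      # unload the RDMA module first
devlink dev eswitch set pci/0000:08:00.0 mode switchdev  # then change the e-switch mode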

Unusable System After Unloading the mlx5_core Driver

Symptom: System becomes unresponsive after unloading the mlx5_core driver while running from a network boot and using a ConnectX adapter to connect to network storage.

Resolution: Unloading the mlx5_core driver in this scenario (e.g., by running /etc/init.d/openibd restart) causes system instability. There is no workaround; avoid unloading the driver in this configuration.

Incompatibility Between RHEL 7.6alt and CentOS 7.6alt Kernels

Symptom: Attempting to install MLNX_OFED on a system with the CentOS 7.6alt kernel results in some kernel modules built for RHEL 7.6alt failing to load.

Resolution: The kernel used in CentOS 7.6alt (for non-x86 architectures) differs from the kernel in RHEL 7.6alt. Consequently, MLNX_OFED kernel modules compiled for the RHEL 7.6alt kernel may not load on a CentOS 7.6alt system. To resolve this issue, rebuild the kernel modules specifically for the CentOS 7.6alt kernel.

Cannot Add VF 0 Representor

Symptom: The following error is encountered when attempting to add VF 0 representor:

mlx5_pci port query failed: Input/output error

Resolution: Ensure that the VF configuration is fully completed before starting the DPDK application.

EAL Initialization Failure

EAL initialization failure is a common error that may occur when running various DPDK-related applications.

The error typically appears as follows:

[DOCA][ERR][NUTILS]: EAL initialization failed

This error may result from several issues, including:

  • The application requires huge pages, but none have been allocated

  • The application requires root privileges to run but was executed without elevated privileges

To address this issue, apply whichever of the following solutions corresponds to the cause listed above:

  • Run the following commands on the host or BlueField, depending on where the application is being executed:

    $ echo '2048' | sudo tee -a /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
    $ sudo mkdir /mnt/huge
    $ sudo mount -t hugetlbfs -o pagesize=2M nodev /mnt/huge

  • Execute the application using sudo or as root user:

    sudo <run_command>

DPDK EAL Limitation with More than 128 Cores

Symptom: When running with 190 cores, DPDK detects only 128 cores, and the following messages appear:

dpdk/bin/dpdk-test-compress-perf -a 0000:11:00.0,class=compress -l 0,190 -- ...
EAL: Detected 128 lcore(s)
EAL: Detected 4 NUMA nodes
EAL: invalid core list syntax

Resolution: To resolve this issue, compile DPDK with the parameter -Dmax_lcores=256. This allows DPDK to recognize additional cores beyond the default limit of 128.
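
A hedged build sketch, run from the DPDK source tree with an illustrative build directory named build:

meson setup -Dmax_lcores=256 build
ninja -C build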

Packets Dropped When Transferring to UDP Port 4789

Symptom: While running testpmd, the following flow rules are created:

flow create 1 priority 1 transfer ingress group 0 pattern eth / vlan / ipv4 / udp dst is 4789 / vxlan / end actions jump group 820 / end
flow create 1 priority 0 transfer ingress group 820 pattern eth / vlan / ipv4 / udp dst spec 4789 dst mask 0xfff0 / vxlan / eth / vlan / ipv4 / end actions modify_field op set dst_type udp_port_dst src_type value src_value 0x168c9e8d4df8f width 2 / port_id id 0 / end

When sending a packet that matches these rules, the packet is dropped and is not received on the port 0 peer.

Resolution: UDP port 4791 is reserved for RDMA over Converged Ethernet (RoCE) traffic. By default, RoCE is enabled on all mlx5 devices, and traffic to UDP port 4791 is treated as RoCE traffic. Note that the second rule above matches destination port 4789 with mask 0xfff0, a range that also covers port 4791, so matching packets conflict with RoCE handling.

To forward traffic to this port for Ethernet use (without RDMA), disable RoCE using the following command, where 0 disables RoCE and 1 re-enables it:

echo <0|1> > /sys/devices/{pci-bus-address}/roce_enable

Refer to the MLNX_OFED Documentation for more details on disabling RoCE.

Ping Loss with 254 vDPA Devices

Symptom: Intermittent ping loss or errors are observed on some interfaces when the following is performed:

  1. Create 127 Virtual Functions (VFs) on each port.

  2. Configure VF link aggregation (LAG).

  3. Start the vDPA example with 254 ports.

  4. Launch 16 virtual machines (VMs), each with up to 16 interfaces.

  5. Send pings from each vDPA interface to an external host.

Resolution: This issue can be resolved by adding the runtime configuration parameter event_core=x, where x specifies the CPU core number to be used for the timer thread. By default, the event_core parameter uses the EAL main lcore.

Note

The event_core parameter can be shared among different mlx5 vDPA devices. However, allocating this core to additional tasks may negatively impact the performance and latency of the mlx5 vDPA devices.
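
A hedged invocation sketch using the vDPA example application and pinning the timer thread to core 8 (the binary name, PCIe address, and core number are illustrative):

dpdk-vdpa -a 0000:08:00.2,class=vdpa,event_core=8 -- --interactive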


No TX Traffic with testpmd and HWS dv_flow=2

Symptom: Packets are not forwarded in forward mode when running testpmd with the following command:

dpdk-testpmd -n 4 -a 08:00.0,representor=0,dv_flow_en=2 -- -i --forward-mode=txonly -a

Resolution: To resolve this issue, ensure that hardware steering (HWS) queues are properly configured for all ports, including the representor ports (representor=0-1) and the PFs. This configuration is required to establish the default flows necessary for traffic forwarding.
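
A hedged testpmd configuration sketch for two ports (queue counts and sizes are illustrative):

port stop all
flow configure 0 queues_number 4 queues_size 64
flow configure 1 queues_number 4 queues_size 64
port start all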

OVS-DPDK - Duplicate Packets When Co-running with SPDK

Symptom: Duplicate packets are observed when running OVS-DPDK with offload alongside SPDK, both of which attach to the same PF.

Resolution: To address this issue, set the parameter dv_esw_en=0 on the OVS-DPDK side. This disables E-Switch using Direct Rules, which is enabled by default if supported.
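
For example, assuming OVS is configured through other_config (the PCIe address is hypothetical), the devarg can be passed as:

ovs-vsctl set Open_vSwitch . other_config:dpdk-extra="-a 0000:08:00.0,dv_esw_en=0"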

Live Migration VM Stuck in PMD MAC Swap Mode

Symptom: Running the following scenario results in an endless loop during live migration:

  1. Open 64 VFs.

  2. Configure OVS-DPDK [VF].

  3. Start a VM with 16 devices.

  4. Initiate traffic between VMs in PMD MAC swap mode.

  5. Attempt live migration on the hypervisor with the following command:

 virsh migrate --live --unsafe --persistent --verbose qa-r-vrt-123-009-CentOS-8.2  qemu+ssh://qa-r-vrt-125/system tcp://22.22.22.125

The migration process stalls with a repeating status:

Migration: [ 75 %]  // repeats in a loop

Resolution: To address this known issue, use the auto-converge parameter to handle heavy vCPU or traffic loads during live migration, as migration may not complete otherwise. Run the following command:

virsh migrate ... --auto-converge --auto-converge-initial 60 --auto-converge-increment 20


Failure to Create Rule with Raw Encapsulation in Switchdev Mode

Symptom: An error occurs when attempting to create a rule with encapsulation on port 0. The following messages are displayed:

mlx5_net: [mlx5dr_action_create_reformat_root]: Failed to create dv_create_flow reformat
mlx5_net: [mlx5dr_action_create_reformat]: Failed to create root reformat action
Template table #0 destroyed
port_flow_complain(): Caught PMD error type 1 (cause unspecified): fail to create rte table: Operation not supported

Resolution: To resolve this issue:

  1. Disable encapsulation with the following command:

    echo none > /sys/class/net/<ethx>/compat/devlink/encap

  2. If tunnels must be offloaded in virtual functions (VFs) or scalable-functions (SFs), ensure encapsulation is disabled in the forwarding database (FDB) domain. This is necessary because the NIC does not support encapsulation in both domains simultaneously.

Unable to Create Flow When Having L3 VXLAN With External Process

Symptom: After detaching and reattaching ports, an error is encountered when attempting to create a flow that matches on L3 VXLAN.

Example scenario:

  1. Detach ports:

    port stop all
    device detach 0000:08:00.0
    device detach 0000:08:00.1

  2. Re-attach ports:

    mlx5 port attach 0000:08:00.0 socket=/var/run/external_ipc_socket
    mlx5 port attach 0000:08:00.1 socket=/var/run/external_ipc_socket
    port start all

  3. Create a rule:

    testpmd> flow create 1 priority 2 ingress group 0 pattern eth dst is 00:16:3e:23:7c:0c has_vlan spec 1 has_vlan mask 1 / vlan vid is 1357 ... / ipv6 src is ::9247 ... / udp dst is 4790 / vxlan-gpe / end actions queue index 65000 / end

  4. Error printed:

    port_flow_complain(): Caught PMD error type 13 (specific pattern item): cause: 0x7ffd041c6be0, L3 VXLAN is not enabled by device parameter and/or not configured in firmware: Operation not supported

Resolution: The error occurs because the l3_vxlan_en parameter is not set when attaching the device. This parameter is required to enable L3 VXLAN and VXLAN-GPE flow creation.

To address this issue:

  1. Enable the l3_vxlan_en parameter – ensure that the l3_vxlan_en parameter is set to a nonzero value when attaching the device. This enables the creation of L3 VXLAN and VXLAN-GPE flows (see the sketch after this list).

  2. Configure firmware support – verify that the firmware is configured to support L3 VXLAN or VXLAN-GPE traffic. By default, this parameter is disabled and must be explicitly enabled.

  3. Verify configurations – confirm that both the device parameters and firmware configurations are in place to handle L3 VXLAN traffic successfully.
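
A minimal sketch of step 1 when probing the device at startup (the PCIe address is hypothetical):

dpdk-testpmd -n 4 -a 0000:08:00.0,l3_vxlan_en=1 -- -i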

Buffer Split - Failure to Configure max-pkt-len

Symptom: When running testpmd and attempting to change the max-pkt-len configuration, an error occurs.

  • Command:

    dpdk-testpmd -n 4 -a 0000:00:07.0,dv_flow_en=1,... -a 0000:00:08.0,dv_flow_en=1,... -- --mbcache=512 -i --nb-cores=15 --txd=8192 --rxd=8192 --burst=64 --mbuf-size=177,430,417 --enable-scatter --tx-offloads=0x8000 --mask-event=intr_lsc
    port stop all
    port config all max-pkt-len 9216
    port start all

  • Error output:

    Configuring Port 0 (socket 0):
    mlx5_net: port 0 too many SGEs (33) needed to handle requested maximum packet size 9216, the maximum supported are 32
    mlx5_net: port 0 unable to allocate queue index 0
    Fail to configure port 0 rx queues

Resolution: To resolve this issue:

  1. Enable buffer split – start testpmd with the --rx-offloads=0x102000 parameter to enable buffer split.

  2. Configure receive packet size – use the set rxpkts (x[,y]*) command to configure the receive packet size, where:

    • x[,y]* is a comma-separated list of values.

    • A zero value indicates using the memory pool data buffer size.

    Example:

    set rxpkts 49,430,417

Unable to Start testpmd with Shared Library

Symptom: An error occurs when attempting to run testpmd from the download folder:

  • Command:

    /tmp/dpdk/build-meson/app/dpdk-testpmd -n 4 -w ...

  • Error output:

    EAL: Error, directory path /tmp is world-writable and insecure
    EAL: FATAL: Cannot init plugins
    EAL: Cannot init plugins
    EAL: Error - exiting with code: 1
    Cause: Cannot init EAL: Invalid argument

Resolution: The error occurs because the directory where DPDK is located (e.g., /tmp) has improper permissions, making it world-writable and insecure. To resolve this issue:

  1. Rebuild DPDK in a directory with stricter permissions.

  2. Ensure that the directory is not world-writable to meet the security requirements of DPDK.
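
A hedged sketch, assuming the build tree is moved to /opt/dpdk (the path is illustrative):

sudo mkdir -p /opt/dpdk && sudo chmod 755 /opt/dpdk    # owner-writable only, not world-writable
# rebuild DPDK under /opt/dpdk, then launch testpmd from there
/opt/dpdk/build-meson/app/dpdk-testpmd -n 4 -a 0000:08:00.0 -- -i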

Cross-Port Action: Failure to Create Actions Template

Symptom: When attempting to configure two ports, where port 1 is set as the host port for port 0, an error occurs during the creation of the action template on port 0.

Reproduction steps:

  1. Run testpmd:

    /download/dpdk/install/bin/dpdk-testpmd -n 4 -a 0000:08:00.0,dv_flow_en=2,dv_xmeta_en=0 -a 0000:08:00.1,dv_flow_en=2,dv_xmeta_en=0 --iova-mode="va" -- --mbcache=512 -i  --nb-cores=7  --rxq=8 --txq=8 --txd=2048 --rxd=2048

  2. Stop all ports:

    port stop all

  3. Configure flows

    flow configure 0 queues_number 16 queues_size 256 counters_number 0 host_port 1 flags 2
    flow configure 1 queues_number 16 queues_size 256 counters_number 8799

  4. Start all ports:

    port start all

  5. Error output:

    flow queue 1 indirect_action 5 create postpone false action_id 3 ingress action count / end
    flow push 1 queue 5
    flow pull 1 queue 5
    flow actions_template 0 create actions_template_id 8 template shared_indirect 1 3 / end mask count / end
    Actions template #8 destroyed
    port_flow_complain(): Caught PMD error type 16 (specific action): cause: 0x7ffd5610a210, counters pool not initialized: Invalid argument

Resolution: To resolve this issue, configure the ports in the correct order, with the host port configured first.

Use the following updated sequence:

  1. Stop all ports:

    port stop all

  2. Configure port 1 (host port):

    flow configure 1 queues_number 16 queues_size 256 counters_number 8799

  3. Configure port 0:

    flow configure 0 queues_number 16 queues_size 256 counters_number 0 host_port 1 flags 2

  4. Start all ports:

    port start all

RSS Hashing Issue with GRE Traffic

Symptom: GRE traffic is not being RSS hashed to multiple cores when received on a ConnectX-6 interface. However, the same traffic is correctly hashed on an Intel i40e interface.

Resolution: To enable RSS hashing for GRE traffic over inner headers, create a flow rule that specifically matches tunneled packets. Flow rule example:

  • Pattern:

    ETH / IPV4 / GRE / END

  • Actions:

    RSS(level=2, types=…) / END

If you attempt to use RSS level 2 in a flow rule without including a tunnel header item, you will encounter the following error:

inner RSS is not supported for non-tunnel flows
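
Put together, a hedged testpmd sketch of such a rule on port 0, hashing tunneled GRE traffic over its inner headers (the port number and RSS types are illustrative):

flow create 0 ingress pattern eth / ipv4 / gre / end actions rss level 2 types ipv4 udp tcp end / end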


Unable to Probe SF/VF Device with crypto-perf-test App on Arm

Symptom: An error occurs when attempting to probe an SF on startup using the crypto-perf-test application on an Arm host with the following command:

dpdk-test-crypto-perf -c 0x7ff -a auxiliary:mlx5_core.sf.1,class=crypto,algo=1 --  --ptest verify --aead-op decrypt --optype aead --aead-algo aes-gcm ....

Error output:

No crypto devices type mlx5_pci available
USER1: Failed to initialise requested crypto device type

When attempting to use the representor keyword to probe SF or VF, the following error is received:

mlx5_common: Key "representor" is unknown for the provided classes.

Resolution: On Arm platforms:

  • Only VF/SF representors are available, not the actual PCIe devices

  • Probing with the representor keyword or using real PCIe device types is not supported

Ensure configurations are aligned with the platform's limitations to avoid these issues.

Failed to Create ESP Matcher on VFs in Template Mode

Symptom: An error occurs when attempting to create a pattern template on a VF using testpmd with the following commands:

dpdk-testpmd -n 4 -a 0000:08:00.2,...,dv_flow_en=2,dv_xmeta_en=0 -a 0000:08:00.4,...,dv_flow_en=2,... -- --mbcache=512 -i --nb-cores=7 ...
port stop all
flow configure 0 queues_number 7 queues_size 256 meters_number 0 counters_number 0 quotas_number 256
flow configure 1 queues_number 14 queues_size 256 meters_number 0 counters_number 7031 quotas_number 64
port start all
flow pattern_template 0 create ingress pattern_template_id 0 relaxed no template eth / ipv4 / esp / end
flow actions_template 0 create actions_template_id 0 template jump group 88 / end mask jump group 88 / end
flow template_table 0 create table_id 0 group 72 priority 0 ingress rules_number 64 pattern_template 0 actions_template 0

Error output:

mlx5_net: [mlx5dr_definer_conv_items_to_hl]: Failed processing item type: 23
mlx5_net: [mlx5dr_definer_calc_layout]: Failed to convert items to header layout
mlx5_net: [mlx5dr_definer_matcher_init]: Failed to calculate matcher definer layout
mlx5_net: [mlx5dr_matcher_bind_mt]: Failed to set matcher templates with match definers
mlx5_net: [mlx5dr_matcher_create]: Failed to initialise matcher: 95

Resolution: Currently, IPsec offload can only be supported on a single path: a PF, a VF, or the e-switch. However:

  • DPDK does not yet support IPsec offload on VFs

  • IPsec is limited to configurations involving PFs or e-switches

As a result, configuring IPsec on VFs is not supported at this time.

Failure to Receive Hairpin Traffic (HWS) Between Two Physical Ports

Symptom: Traffic is not received on port 1 when running testpmd in hairpin mode with the following setup:

  • Port 0 – configured to create simple rules and send traffic.

  • Port 1 – monitored to measure received traffic. However, no traffic is observed.

Setup:

  • Devices – two BlueField-2 devices (DUT and TG) connected back-to-back via symmetrical interfaces (p0 and p1)

  • Firmware – configured for HWS

Steps:

  1. Run DUT in hairpin mode:

    dpdk-testpmd -c 0xff -n 4 -a 0000:03:00.0,dv_flow_en=2 -a 0000:03:00.1,dv_flow_en=2 ... --forward-mode=rxonly -i --hairpinq 1 --hairpin-mode=0x12

  2. Configure DUT:

    port stop all
    flow configure 0 queues_number 4 queues_size 64
    port start all
    flow pattern_template 0 create pattern_template_id 0 ingress template eth / end
    flow actions_template 0 create actions_template_id 0 template jump group 1 / end mask jump group 1 / end
    flow template_table 0 create table_id 0 group 0 priority 0 ingress rules_number 64 pattern_template 0 actions_template 0
    flow queue 0 create 0 template_table 0 pattern_template 0 actions_template 0 postpone no pattern eth / end actions jump group 1 / end
    flow pull 0 queue 0
    flow pattern_template 0 create pattern_template_id 2 ingress template eth / end
    flow actions_template 0 create actions_template_id 2 template queue / end mask queue / end
    flow template_table 0 create table_id 2 group 1 priority 0 ingress rules_number 64 pattern_template 2 actions_template 2
    flow queue 0 create 0 template_table 2 pattern_template 0 actions_template 0 postpone no pattern eth / end actions queue index 4 / end
    flow pull 0 queue 0
    start

  3. Send traffic from TG on port 0:

    /tmp/dpdk-hws/app/dpdk-testpmd -c 0xff -n 4 -a 0000:03:00.0,dv_flow_en=1 --socket-mem=2048 -- --port-numa-config=0 --socket-num=0 --burst=64 --txd=1024 --rxd=1024 --mbcache=512 --rxq=4 --txq=4 --nb-cores=1 --forward-mode=txonly --no-lsc-interrupt -i

  4. Measure traffic rate on TG interface p1:

    mlnx_perf -i p1

    • Expected result: 2.5 Gb/s

    • Actual result: None

Resolution: The absence of traffic on port 1 is likely due to improper HWS configuration on port 1. Without correct HWS settings, default SQ miss rules are not inserted, preventing traffic reception.

Ensure that both ports (Port 0 and Port 1) are correctly configured with HWS queues. Update the configuration as follows:

port stop all
flow configure 0 queues_number 4 queues_size 64
flow configure 1 queues_number 4 queues_size 64
port start all

This ensures proper HWS queue setup and insertion of default SQ miss rules, allowing traffic to flow as expected.

DPDK-OVS Memory Allocation Error

Symptom : During initialization, the following error is encountered:

dpdk|ERR|EAL: eal_memalloc_alloc_seg_bulk(): couldn't find suitable memseg_list

Resolution: Recent updates to OVS and DPDK have changed the default behavior of memory configuration arguments:

  1. The EAL argument --socket-mem is no longer configured by default during start-up. If dpdk-socket-mem and dpdk-alloc-mem are not explicitly set, DPDK will use its default settings.

  2. The EAL argument --socket-limit no longer defaults to the value of --socket-mem.

To avoid this error, update your memory configuration as follows:

  1. Explicitly set the dpdk-socket-mem parameter to specify the desired memory allocation.

  2. Set other_config:dpdk-socket-limit to the same value as other_config:dpdk-socket-mem to maintain the previous behavior and ensure memory-limiting consistency.

Example configuration:

other_config:dpdk-socket-mem=2048,0
other_config:dpdk-socket-limit=2048,0
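
When OVS is managed through ovs-vsctl, a hedged equivalent of the example above is:

ovs-vsctl set Open_vSwitch . other_config:dpdk-socket-mem="2048,0"
ovs-vsctl set Open_vSwitch . other_config:dpdk-socket-limit="2048,0"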


High Latency in VDPA Ping Traffic Between VMs with 240 SFs

Symptom: Ping traffic between virtual machines (VMs) exhibits high latency when configured with 240 SFs.

Resolution: To reduce latency in this scenario, configure the system with event-mode=2.

Info

event-mode=2, also known as Event Forward Mode, allows DPDK applications to offload DMA operations to a DMA adapter while preserving the correct order of ingress packets. This configuration is designed to optimize performance and improve latency in environments with high numbers of SFs.
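
A hedged invocation sketch passing the mode as an mlx5 vDPA devarg (the binary name and PCIe address are illustrative):

dpdk-vdpa -a 0000:08:00.2,class=vdpa,event_mode=2 -- --interactive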


testpmd Startup Failure in ConnectX-6 Dx KVM Setup

Symptom: When running testpmd with the parameter tx_pp=500, the application fails to start and exits with the following error:

dpdk-testpmd -n 4  -w 0000:00:07.0,l3_vxlan_en=1,tx_pp=500,dv_flow_en=1  -w ...

Error output:

EAL: Probe PCI driver: mlx5_pci (15b3:101e) device: 0000:00:07.0 (socket 0)
mlx5_net: WQE rate mode is required for packet pacing
mlx5_net: probe of PCI device 0000:00:07.0 aborted after encountering an error: No such device

Resolution: The issue occurs because the REAL_TIME_CLOCK_ENABLE parameter is not configured. This parameter is required for packet pacing functionality on NVIDIA ConnectX adapters.

Ensure that the REAL_TIME_CLOCK_ENABLE parameter in mlxconfig is set to 1. This activates the real-time timestamp format on NVIDIA ConnectX adapters, providing timestamps relative to the Unix epoch, which are necessary for packet pacing.

Configuration command:

mlxconfig -d <device_id> set REAL_TIME_CLOCK_ENABLE=1

Verify the parameter is enabled and re-run the testpmd application.
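
A hedged verification sketch (the MST device path is illustrative); note that mlxconfig changes take effect only after a firmware reset or host reboot:

mlxconfig -d /dev/mst/mt4125_pciconf0 query REAL_TIME_CLOCK_ENABLE
mlxfwreset -d /dev/mst/mt4125_pciconf0 reset    # or reboot the host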

TCP Hardware Hairpin Connection Forwarding Packets to Host

Symptom: After passing through the Connection Tracking (CT) check, certain TCP packets exhibit the following behavior:

  • They do not have the RTE_FLOW_CONNTRACK_PKT_STATE_VALID or RTE_FLOW_CONNTRACK_PKT_STATE_CHANGED flags set.

  • They do not have the RTE_FLOW_CONNTRACK_PKT_STATE_DISABLED flag set either.

  • Despite this, these packets are being forwarded to the host.

Note

All CT objects are created with liberal mode set to 1.

Resolution: To resolve this issue, adjust the configuration of the conntrack object by following one of these approaches:

  1. Set max_win to 0 for both the original and reply directions

  2. Configure max_win with the appropriate values for both directions, as recommended in the release notes and header files

OVS-DPDK LAG - Configuration Mismatch for dv_xmeta_en

Symptom: When adding vDPA ports to OVS from port1 and port2, an error is encountered in the OVS log.

Setup used:

dpdk-extra="-w 0000:86:00.0,representor=pf0vf[0-15],dv_xmeta_en=1,dv_flow_en=1,dv_esw_en=1 -w 0000:86:00.0,representor=pf1vf[0-15],dv_xmeta_en=1,dv_flow_en=1,dv_esw_en=1"
dpdk-init="true"
dpdk-socket-mem="8192,8192"
hw-offload="true"

Error printed in OVS log:

Jul 04 12:24:24 qa-r-vrt-123 ovs-vsctl[23091]: ovs|00001|vsctl|INFO|Called as ovs-vsctl add-port br0-ovs vdpa16 -- set Interface vdpa16 type=dpdkvdpa options:vdpa-socket-path=/tmp/sock16 options:vdpa-accelerator-devargs=0000:86:08.2 options:dpdk-devargs=0000:86:00.0,representor=pf1vf[0]
Jul 04 12:24:24 qa-r-vrt-123 ovs-vswitchd[22545]: ovs|00570|dpdk|ERR|mlx5_net: "dv_xmeta_en" configuration mismatch for shared mlx5_bond_0 context

Info

The issue does not occur if dv_xmeta_en=1 is removed during initialization.

Resolution: The problem arises because DPDK does not probe the same PCIe address with different devargs settings. Consequently:

  • The representors of PF1 are ignored

  • OVS treats these representors as new ports and probes them independently, resulting in a configuration mismatch

To resolve the issue, configure both PF representors using a single devargs entry (e.g., representor=pf[0-1]vf[0-15]). This configuration ensures that both PF representors are correctly identified and probed without causing a mismatch in the dv_xmeta_en configuration.
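
Applied to the setup above, a hedged corrected dpdk-extra entry would be:

dpdk-extra="-w 0000:86:00.0,representor=pf[0-1]vf[0-15],dv_xmeta_en=1,dv_flow_en=1,dv_esw_en=1"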

Issue Binding Hairpin Queues After Enabling Port Loopback

Symptom: When running testpmd in hairpin mode, the system fails to bind hairpin queues after enabling loopback mode on a port.

Configuration steps:

testpmd> port stop 1
testpmd> port config 1 loopback 1
testpmd> port start 1

Resolution: To resolve this issue, ensure that all ports are stopped before applying the loopback configuration. Use the following steps:

  1. Stop all ports:

    port stop all

  2. Configure loopback mode on the desired port:

    port config 1 loopback 1

  3. Restart all ports:

    port start all

By stopping all ports before applying the loopback configuration, the system can correctly bind hairpin queues.

Memory Allocation Failure

Symptom: When running testpmd with iova-mode=pa, initialization fails with the following errors:

EAL: Selected IOVA mode 'PA'
EAL: No available 1048576 kB hugepages reported
Fail to start port 0: Cannot allocate memory
Fail to start port 1: Cannot allocate memory

Resolution: The memory fragmentation issue can be resolved by using iova-mode=va instead. This mode utilizes virtual addressing, which handles memory fragmentation more effectively.
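
For example, a hedged testpmd invocation selecting VA mode (the PCIe address is illustrative):

dpdk-testpmd --iova-mode=va -n 4 -a 0000:08:00.0 -- -i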

VDPA Rx Packet Truncation with --enable-scatter

Symptom: Packets are truncated when configuring the maximum packet length with the following testpmd settings:

dpdk-testpmd -n 4 -w 0000:04:00.0,representor=[0,1],dv_xmeta_en=0,txq_inline=290,rx_vec_en=1,l3_vxlan_en=1,dv_flow_en=1,dv_esw_en=1 -- .. --enable-scatter ..

After changing the MTU and maximum packet length:

port stop all
port config all max-pkt-len 8192
port start all
port config mtu 0 8192
port config mtu 1 8192
port config mtu 2 8192

Resolution: The packets are truncated to the default size of 1518 bytes. This issue is typically caused by the guest VM's MTU settings rather than the host configuration.

To resolve this issue:

  1. Set the MTU value using the host_mtu parameter when launching the VM:

    -device virtio-net-pci,netdev=netdev0,mac=52:54:00:00:00:01,mrg_rxbuf=on,host_mtu=9000

    This sets the MTU for the virtio-net device to 9000 bytes.

  2. Inside the guest VM, verify the MTU value using the ifconfig command to ensure it matches the specified value (e.g., 9000 bytes).

  3. If a different MTU value is required, adjust both the --max-pkt-len parameter in the testpmd command on the host and the host_mtu parameter in the QEMU command for the guest to match the desired MTU.

Ring Memory Issue

Symptom: The following memory error is encountered:

RING: Cannot reserve memory
[13:00:57:290147][DOCA][ERR][UFLTR::Core:156]: DPI init failed

Resolution: This is a common memory issue when running the application on the host, often caused by insufficient memory (i.e., not enough huge pages allocated per worker thread).

Possible solutions:

  • Recommended – increase the amount of allocated huge pages. Refer to the relevant documentation for instructions on allocating huge pages.

  • Alternative solution – limit the number of cores used by the application. Use one of the following options:

    • -c <core-mask> – set the hexadecimal bitmask of the cores to run on

    • -l <core-list> – list of cores to run on

      Example command:

      ./doca_<app_name> -a 3b:00.3 -a 3b:00.4 -l 0-64 -- -l 60

DOCA Apps Using DPDK in Parallel Issue

Symptom: When running two DOCA applications in parallel that use DPDK, the first application runs successfully, but the second fails with the following error:

Failed to start URL Filter with output:
EAL: Detected 16 lcore(s)
EAL: Detected 1 NUMA nodes
EAL: RTE Version: 'MLNX_DPDK 20.11.4.0.3'
EAL: Detected shared linkage of DPDK
EAL: Cannot create lock on '/var/run/dpdk/rte/config'. Is another primary process running?
EAL: FATAL: Cannot init config
EAL: Cannot init config
[15:01:57:246339][DOCA][ERR][NUTILS]: EAL initialization failed

Resolution: The error occurs because the second application attempts to use /var/run/dpdk/rte/config, which is already in use by the first application.

To resolve this issue, run the second application with the DPDK EAL option --file-prefix <name>. This option specifies a unique file prefix for the second application to avoid conflicts.

The following example command starts the second application after starting the first application (without the --file-prefix option):

./doca_<app_name> --file-prefix second -a 0000:01:00.6,sft_en=1 -a 0000:01:00.7,sft_en=1 -v -c 0xff -- -l 60


Failure to Set Huge Pages

Symptom: A permission error is raised when trying to configure huge pages from an unprivileged user account.

The error:

$ sudo echo 600 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
-bash: /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages: Permission denied

Resolution: Using sudo with echo does not work as expected here because the output redirection (>) is performed by the unprivileged shell, not by the elevated echo command. To configure huge pages correctly, use the following command instead:

$ echo '600' | sudo tee -a /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages


© Copyright 2025, NVIDIA. Last updated on Jul 17, 2025.