NVIDIA BlueField Platform Software Troubleshooting Guide


MLNX_DPDK

This page offers troubleshooting information for DPDK users and customers. It is advisable to review the following commands and tools:

Command

Description

ibdev2netdev -v

Part of the OFED package. This command displays all associations between network devices and Remote Direct Memory Access (RDMA) adapter ports.

lspci

A Linux command that provides information about each PCI bus on your system.

ethtool

A Linux command used to query or control network driver and hardware settings.

ip, devlink

  • The ip command is used to assign addresses to network interfaces and configure network parameters on Linux systems. It replaces the outdated and deprecated ifconfig command on modern Linux distributions.

  • devlink is an API for exposing device information and resources not directly related to any specific device class, such as chip-wide or switch-ASIC-wide configurations.

meson -Denable_drivers=*/mlx5,mempool/bucket,bus/auxiliary,mempool/ring,mempool/stack <build_dir> && ninja -C <build_dir>

Builds DPDK for MLX5

echo -n $vf0_pci > /sys/bus/pci/drivers/mlx5_core/unbind

echo -n $vf1_pci > /sys/bus/pci/drivers/mlx5_core/unbind

devlink dev eswitch set pci/${pci_addr} mode switchdev

echo $vf_num > /sys/bus/pci/devices/${pci_addr}/sriov_numvfs

# Or

echo $vf_num > /sys/bus/pci/devices/${pci_addr}/mlx5_num_vfs

echo -n $vf0_pci > /sys/bus/pci/drivers/mlx5_core/bind

echo -n $vf1_pci > /sys/bus/pci/drivers/mlx5_core/bind

Sets switchdev mode with 2 VFs.

Note

The mlx5_num_vfs parameter is always present, regardless of whether the OS has loaded virtualization support (for example, intel_iommu enabled in the GRUB configuration). In contrast, the sriov_numvfs parameter is available only if intel_iommu has been added to the GRUB configuration. If you do not see the sriov_numvfs file, verify that intel_iommu was correctly added to the GRUB configuration.
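For reference, enabling the IOMMU typically means adding intel_iommu=on to the kernel command line and regenerating the bootloader configuration. The following is an illustrative sketch; the file location and update command vary by distribution:

# /etc/default/grub (illustrative)
GRUB_CMDLINE_LINUX="... intel_iommu=on"

# Regenerate the GRUB configuration, then reboot:
sudo update-grub                                  # Debian/Ubuntu
sudo grub2-mkconfig -o /boot/grub2/grub.cfg       # RHEL/CentOS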

Compilation Debug Flags

Debug Flag

Description

-Dbuildtype=debug

Sets the build type as debug, which enables debugging information and disables optimizations.

-Dc_args='-DRTE_ENABLE_ASSERT -DRTE_LIBRTE_MLX5_DEBUG'

Activates assertion checks and enables debug messages for MLX5.


Steering Dump Tool

This tool triggers the application to dump its specific data, as well as the hardware to dump the associated hardware data. For additional details, refer to mlx_steering_dump.

dpdk-proc-info Application

This application operates as a DPDK secondary process and can:

  • Retrieve port statistics

  • Reset port statistics

  • Print DPDK memory information

  • Display debug information for ports

For more information, refer to DPDK proc-info Guide.

Reproducing an Issue with testpmd Application

Use the testpmd application to test simplified scenarios. For guidance, refer to the testpmd User Guide.

DPDK Unit Test Tool

This tool performs DPDK unit testing utilizing the testpmd and Scapy applications. For more information, refer to DPDK Unit Test Tool.

Memory Errors

Issue: I encountered this memory error during the startup of testpmd. What does it indicate?

EAL: No free 2048 kB hugepages reported on node 0
EAL: FATAL: Cannot get hugepage information.
EAL: Cannot get hugepage information.

Resolution : No hugepages have been allocated. To configure hugepages, run:


sysctl vm.nr_hugepages=<#huge_pages>

Issue: I encountered this memory error during the startup of testpmd. What does it indicate?


mlx5_common: mlx5_common_utils.c:420: mlx5_hlist_create(): No memory for hash list mlx5_0_flow_groups creation

Resolution : If free hugepages are missing, check their availability by running:


grep -i hugepages /proc/meminfo


Compatibility Issues Between Firmware and Driver

Issue: The mlx5 driver is not loading. What might be the problem?

Resolution : The issue may be due to a compatibility mismatch between the firmware and the driver. When this occurs, the driver will fail to load, and an error message will appear in the dmesg output. To address this, verify that the firmware version matches the driver version. Refer to the NVIDIA MLNX_OFED Documentation for details on supported firmware and driver versions.
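For example, you can compare the loaded driver and firmware versions (the interface name is illustrative) and inspect the kernel log for the mismatch message:

ethtool -i ens1f0        # shows the driver version and firmware-version fields
dmesg | grep -i mlx5     # look for the compatibility error reported during probe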

Restarting the Driver After Removing a Physical Port

Issue: I removed a physical port from an OVS-DPDK bridge while offload was enabled, and now I am encountering issues. What should I do?

Resolution : When offload is enabled, removing a physical port from an OVS-DPDK bridge requires restarting the OVS service. Failure to do so can lead to incorrect datapath rule configurations. To resolve this, restart the openvswitch service after reattaching the physical port to a bridge per your desired topology.
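As an illustrative sketch (the bridge name, port name, PCI address, and service name are placeholders that depend on your setup), reattaching the port and restarting OVS could look like:

ovs-vsctl add-port br0-ovs p0 -- set Interface p0 type=dpdk options:dpdk-devargs=0000:03:00.0
systemctl restart openvswitch-switch    # on some distributions the service is named openvswitch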

Limitations of the dec_ttl Feature

Issue: The dec_ttl feature in OVS-DPDK is not working. What could be the problem?

Resolution : The dec_ttl feature is only supported on ConnectX-6 adapters and is not supported on ConnectX-5 adapters. There is no workaround for this limitation.

Deadlock When Moving to switchdev Mode

Issue: I am experiencing a deadlock when moving to switchdev mode while deleting a namespace. How can I resolve this?

Resolution : To avoid the deadlock, unload the mlx5_ib module before moving to switchdev mode.
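A minimal sketch of that order, assuming an illustrative PCI address:

modprobe -r mlx5_ib
devlink dev eswitch set pci/0000:03:00.0 mode switchdev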

Unusable System After Unloading the mlx5_core Driver

Issue: I am running my system from a network boot and using an NVIDIA ConnectX card to connect to network storage. When I unload the mlx5_core driver, the system becomes unresponsive. What should I do?

Resolution : Unloading the mlx5_core driver (e.g., by running /etc/init.d/openibd restart) while the system is running from a network boot and connected to network storage via an NVIDIA ConnectX card causes system issues. To avoid this, do not unload the mlx5_core driver under these circumstances, as there is no available workaround.

Incompatibility Between RHEL 7.6alt and CentOS 7.6alt Kernels

Issue: I am trying to install MLNX_OFED on a system with the CentOS 7.6alt kernel, but some kernel modules built for RHEL 7.6alt do not load. How can I fix this?

Resolution : The kernel used in CentOS 7.6alt (for non-x86 architectures) differs from that in RHEL 7.6alt. As a result, MLNX_OFED kernel modules compiled for the RHEL 7.6alt kernel may not load on a CentOS 7.6alt system. To resolve this issue, build the kernel modules specifically for the CentOS 7.6alt kernel, as in the example below.
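As a hedged example, assuming installation from the MLNX_OFED installer package, the modules can be rebuilt against the running kernel with:

./mlnxofedinstall --add-kernel-support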

Cannot Add VF 0 Representor

Issue: I am encountering an error when trying to add VF 0 representor: mlx5_pci port query failed: Input/output error. What should I do?

Resolution : To resolve this issue, ensure that the VF configuration is completed before starting the DPDK application.
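A minimal sketch of the required ordering (the netdev name, PCI address, and representor list are illustrative):

echo 2 > /sys/class/net/<pf_netdev>/device/sriov_numvfs
# wait for the VFs and their representors to appear, then start the application:
dpdk-testpmd -n 4 -a <pf_pci>,representor=[0,1] -- -i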

EAL Initialization Failure

EAL initialization failure is a common error that may appear while running various DPDK-related applications.

The error appears like this:


[DOCA][ERR][NUTILS]: EAL initialization failed

There may be many causes for this error. Some of them are as follows:

  • The application requires huge pages and none were allocated

  • The application requires root privileges to run and it was run without elevated privileges

The following solutions are respective to the possible causes listed above:

  • Allocate huge pages. For example, run (on the host or the DPU, depending on where you are running the application):

    $ echo '2048' | sudo tee -a /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
    $ sudo mkdir /mnt/huge
    $ sudo mount -t hugetlbfs -o pagesize=2M nodev /mnt/huge

  • Run the application using sudo (or as root):


    sudo <run_command>

DPDK EAL Limitation with More than 128 Cores

Issue: I am running with 190 cores, but DPDK detects only 128:

dpdk/bin/dpdk-test-compress-perf -a 0000:11:00.0,class=compress -l 0,190 -- ...
EAL: Detected 128 lcore(s)
EAL: Detected 4 NUMA nodes
EAL: invalid core list syntax

Resolution : To address this issue, compile DPDK with the parameter -Dmax_lcores=256. This will enable DPDK to recognize the additional cores.
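For example, following the build command style shown earlier on this page:

meson -Dmax_lcores=256 <build_dir> && ninja -C <build_dir>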

Packets Dropped When Transferring to UDP Port 4789

Issue: I am running testpmd and created the following rules:

flow create 1 priority 1 transfer ingress group 0 pattern eth / vlan / ipv4 / udp dst is 4789 / vxlan / end actions jump group 820 / end
flow create 1 priority 0 transfer ingress group 820 pattern eth / vlan / ipv4 / udp dst spec 4789 dst mask 0xfff0 / vxlan / eth / vlan / ipv4 / end actions modify_field op set dst_type udp_port_dst src_type value src_value 0x168c9e8d4df8f width 2 / port_id id 0 / end

I am sending a packet that matches these rules, but the packet is dropped and not received on port 0 peer. What could be the issue?

Resolution : UDP port 4791 is reserved for RoCE (RDMA over Converged Ethernet) traffic. By default, RoCE is enabled on all mlx5 devices, and traffic to UDP port 4791 is treated as RoCE traffic. To forward traffic to this port for Ethernet use (without RDMA), you need to disable RoCE. You can do this by running:


echo <0|1> > /sys/devices/{pci-bus-address}/roce_enable

Refer to the MLNX_OFED Documentation for more details on disabling RoCE.

Ping Loss with 254 vDPA Devices

Issue: I am experiencing intermittent ping loss or errors with some interfaces. How can I reproduce this issue?

Reproduction Steps:

  • Create 127 VFs on each port.

  • Configure VF link aggregation (lag).

  • Start the vDPA example with 254 ports.

  • Launch 16 VMs, each with up to 16 interfaces.

  • Send pings from each vDPA interface to an external host.

Resolution : Adding the runtime configuration event_core=x resolves this issue. The event_core parameter specifies the CPU core number for the timer thread, with the default being the EAL main lcore.

Note

The event_core can be shared among different mlx5 vDPA devices. However, using this core for additional tasks may impact the performance and latency of the mlx5 vDPA devices.
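An illustrative sketch of passing the parameter as an mlx5 vDPA device argument (the application name and PCI address are placeholders):

<vdpa_application> -a 0000:86:00.0,class=vdpa,event_core=2 ...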


No TX Traffic with testpmd and HWS dv_flow=2

Issue: I am observing that packets are not being forwarded in forward mode. I started testpmd with the following command:


dpdk-testpmd -n 4 -a 08:00.0,representor=0,dv_flow_en=2 -- -i --forward-mode=txonly -a

Resolution : To address this issue, ensure that HWS queues are configured for all the ports (representor=0-1) and the PFs. This configuration is necessary to set the default flows that will be utilized.
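A minimal sketch of configuring HWS queues on both ports before starting them (queue counts are illustrative):

port stop all
flow configure 0 queues_number 4 queues_size 64
flow configure 1 queues_number 4 queues_size 64
port start all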

OVS-DPDK: Duplicate Packets When Co-running with SPDK

Issue: Duplicate packets are observed when running OVS-DPDK with offload alongside SPDK, which also attaches to the same PF. What can be done to fix this?

Resolution : To resolve this issue, set dv_esw_en=0 on the OVS-DPDK side. This disables E-Switch using Direct Rules, which is enabled by default if supported.
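As an illustrative sketch (the OVS interface name and PCI address are placeholders), the parameter can be appended to the port's devargs, after which OVS should be restarted so the device is re-probed:

ovs-vsctl set Interface pf0 options:dpdk-devargs="0000:08:00.0,dv_esw_en=0"
systemctl restart openvswitch-switch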

Live Migration VM Stuck in PMD MAC Swap Mode

Issue: Running the following scenario leads to an endless loop:

Scenario:

  1. Open 64 VFs.

  2. Configure OVS-DPDK [VF].

  3. Start a VM with 16 devices.

  4. Initiate traffic between VMs in PMD MAC swap mode.

On Hyper-V, execute:


 virsh migrate --live --unsafe --persistent --verbose qa-r-vrt-123-009-CentOS-8.2  qemu+ssh://qa-r-vrt-125/system tcp://22.22.22.125

Output:


Migration: [ 75 %]  //repeat in a loop

Resolution : This is a known issue. To address it, add the auto-converge parameter for heavy vCPU/traffic loads, as live migration may not be complete otherwise. Run:


virsh migrate ... --auto-converge --auto-converge-initial 60 --auto-converge-increment 20


Failure to Create Rule with Raw Encapsulation in Switchdev Mode

Issue: I encountered an error while trying to create a rule with encapsulation on port 0:

mlx5_net: [mlx5dr_action_create_reformat_root]: Failed to create dv_create_flow reformat
mlx5_net: [mlx5dr_action_create_reformat]: Failed to create root reformat action
Template table #0 destroyed
port_flow_complain(): Caught PMD error type 1 (cause unspecified): fail to create rte table: Operation not supported

Resolution : To resolve this issue, disable encapsulation with the following command:


echo none > /sys/class/net/<ethx>/compat/devlink/encap

If you need to offload tunnels in VFs/SFs, disable encapsulation in the FDB domain, as the NIC does not support it in both domains simultaneously.

Unable to Create Flow When Having L3 VXLAN With External Process

Issue: After detaching and reattaching ports, I encounter the following error when attempting to match on L3 VXLAN:

Detach Ports:

port stop all
device detach 0000:08:00.0
device detach 0000:08:00.1

Re-attach Ports:

mlx5 port attach 0000:08:00.0 socket=/var/run/external_ipc_socket
mlx5 port attach 0000:08:00.1 socket=/var/run/external_ipc_socket
port start all

Create a Rule:


testpmd> flow create 1 priority 2 ingress group 0 pattern eth dst is 00:16:3e:23:7c:0c has_vlan spec 1 has_vlan mask 1 / vlan vid is 1357 ... / ipv6 src is ::9247 ... / udp dst is 4790 / vxlan-gpe / end actions queue index 65000 / end

Got Error:


port_flow_complain(): Caught PMD error type 13 (specific pattern item): cause: 0x7ffd041c6be0, L3 VXLAN is not enabled by device parameter and/or not configured in firmware: Operation not supported

Resolution : The error occurs because the l3_vxlan_en parameter is not set when attaching the device. This parameter must be specified to enable L3 VXLAN and VXLAN-GPE flow creation.

To fix this issue:

  1. Ensure that the l3_vxlan_en parameter is set to a nonzero value when attaching the device. This will allow L3 VXLAN and VXLAN-GPE flow creation.

  2. Configure the firmware to support L3 VXLAN or VXLAN-GPE, as this is a prerequisite for handling this type of traffic. By default, this parameter is disabled.

Make sure these configurations are in place to support L3 VXLAN operations.
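A minimal sketch of probing the devices with the parameter enabled (other devargs omitted); when re-attaching at runtime, the same devargs string must also carry l3_vxlan_en=1, though the exact attach syntax may differ:

dpdk-testpmd -n 4 -a 0000:08:00.0,l3_vxlan_en=1 -a 0000:08:00.1,l3_vxlan_en=1 -- -i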

Buffer Split - Failure to Configure max-pkt-len

Issue: When running testpmd and trying to change the max-pkt-len configuration, I encounter the following error:

dpdk-testpmd -n 4 -a 0000:00:07.0,dv_flow_en=1,... -a 0000:00:08.0,dv_flow_en=1,... -- --mbcache=512 -i --nb-cores=15 --txd=8192 --rxd=8192 --burst=64 --mbuf-size=177,430,417 --enable-scatter --tx-offloads=0x8000 --mask-event=intr_lsc
port stop all
port config all max-pkt-len 9216
port start all

Error:

Configuring Port 0 (socket 0):
mlx5_net: port 0 too many SGEs (33) needed to handle requested maximum packet size 9216, the maximum supported are 32
mlx5_net: port 0 unable to allocate queue index 0
Fail to configure port 0 rx queues

Resolution : To resolve this issue, start testpmd with --rx-offloads=0x102000 to enable buffer split. Then, configure the receive packet size using the set rxpkts (x[,y]*) command, where x[,y]* is a CSV list of values and a zero value indicates using the memory pool data buffer size. For example, use:


set rxpkts 49,430,417


Unable to Start testpmd with Shared Library

Issue: When running testpmd from the download folder, I encounter the following error:


/tmp/dpdk/build-meson/app/dpdk-testpmd -n 4 -w ...

Error:

EAL: Error, directory path /tmp is world-writable and insecure
EAL: FATAL: Cannot init plugins
EAL: Cannot init plugins
EAL: Error - exiting with code: 1
Cause: Cannot init EAL: Invalid argument

Resolution : The issue is due to improper permissions for the directory where DPDK is located. Rebuilding DPDK in a directory with stricter permissions should resolve this problem.
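For example (illustrative path), extract and build DPDK under a directory that is not world-writable, such as a home directory, instead of /tmp:

mkdir -p ~/dpdk && cd ~/dpdk
# extract the DPDK sources and build here instead of under /tmp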

Cross-Port Action: Failure to Create Actions Template

Issue: While trying to configure two ports where port 1 is the host port for port 0, I encounter an error when creating the action template on port 0.

Testpmd reproduction:


/download/dpdk/install/bin/dpdk-testpmd -n 4 -a 0000:08:00.0,dv_flow_en=2,dv_xmeta_en=0 -a 0000:08:00.1,dv_flow_en=2,dv_xmeta_en=0 --iova-mode="va" -- --mbcache=512 -i  --nb-cores=7  --rxq=8 --txq=8 --txd=2048 --rxd=2048

port stop all
flow configure 0 queues_number 16 queues_size 256 counters_number 0 host_port 1 flags 2
flow configure 1 queues_number 16 queues_size 256 counters_number 8799
port start all

Error:

flow queue 1 indirect_action 5 create postpone false action_id 3 ingress action count / end
flow push 1 queue 5
flow pull 1 queue 5
flow actions_template 0 create actions_template_id 8 template shared_indirect 1 3 / end mask count / end
Actions template #8 destroyed
port_flow_complain(): Caught PMD error type 16 (specific action): cause: 0x7ffd5610a210, counters pool not initialized: Invalid argument

Resolution : Configure the ports in the correct order, with the host port configured first. Update the configuration as follows:

port stop all
flow configure 1 queues_number 16 queues_size 256 counters_number 8799
flow configure 0 queues_number 16 queues_size 256 counters_number 0 host_port 1 flags 2
port start all


RSS Hashing Issue with GRE Traffic

Issue: GRE traffic is not being RSS hashed to multiple cores when received on a CX6 interface, although it works correctly with the same traffic on an Intel i40e interface.

Resolution : To enable RSS hashing for GRE traffic over inner headers, you need to create a flow rule that specifically matches tunneled packets. For example:

Pattern: ETH / IPV4 / GRE / END

Actions: RSS(level=2, types=…) / END

If you attempt to use RSS level 2 in a flow rule without including a tunnel header item, you will encounter an error: “inner RSS is not supported for non-tunnel flows.”
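A hedged testpmd sketch of such a rule (the port number, RSS types, and exact syntax may vary with your DPDK version):

flow create 0 ingress pattern eth / ipv4 / gre / end actions rss level 2 types ipv4-tcp ipv4-udp end / end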

Unable to Probe SF/VF Device with crypto-perf-test App on Arm

Issue: I encounter this error when attempting to probe an SF on startup on an Arm host, using the following command:


dpdk-test-crypto-perf -c 0x7ff -a auxiliary:mlx5_core.sf.1,class=crypto,algo=1 --  --ptest verify --aead-op decrypt --optype aead --aead-algo aes-gcm ....

Error:

No crypto devices type mlx5_pci available
USER1: Failed to initialise requested crypto device type

Additionally, when attempting to use the representor keyword to probe SF or VF, I receive:


mlx5_common: Key "representor" is unknown for the provided classes.

Resolution : On Arm platforms, only VF/SF representors are available, not the actual PCI devices. As a result, probing with the representor keyword or using real PCI device types is not supported on Arm.

Failed to Create ESP Matcher on VFs in Template Mode

Issue: When attempting to create a pattern template on a VF using testpmd, I encountered the following error:

dpdk-testpmd -n 4 -a 0000:08:00.2,...,dv_flow_en=2,dv_xmeta_en=0 -a 0000:08:00.4,...,dv_flow_en=2,... -- --mbcache=512 -i --nb-cores=7 ...
port stop all
flow configure 0 queues_number 7 queues_size 256 meters_number 0 counters_number 0 quotas_number 256
flow configure 1 queues_number 14 queues_size 256 meters_number 0 counters_number 7031 quotas_number 64
port start all
flow pattern_template 0 create ingress pattern_template_id 0 relaxed no template eth / ipv4 / esp / end
flow actions_template 0 create actions_template_id 0 template jump group 88 / end mask jump group 88 / end
flow template_table 0 create table_id 0 group 72 priority 0 ingress rules_number 64 pattern_template 0 actions_template 0
mlx5_net: [mlx5dr_definer_conv_items_to_hl]: Failed processing item type: 23
mlx5_net: [mlx5dr_definer_calc_layout]: Failed to convert items to header layout
mlx5_net: [mlx5dr_definer_matcher_init]: Failed to calculate matcher definer layout
mlx5_net: [mlx5dr_matcher_bind_mt]: Failed to set matcher templates with match definers
mlx5_net: [mlx5dr_matcher_create]: Failed to initialise matcher: 95

Resolution : Currently, IPsec offload can only be supported on a single path—either on PF, VF, or E-Switch. DPDK does not yet support IPsec on VFs; it is limited to PFs and E-Switch configurations. Consequently, configuring IPsec on VFs is not supported at this time.

Failure to Receive Hairpin Traffic (HWS) Between Two Physical Ports

Issue: I am running testpmd in hairpin mode with the following setup: Port 0 is configured to create simple rules, and I am sending traffic on Port 0 while measuring the traffic on Port 1. However, I am not seeing any traffic on Port 1.

Setup:

  • Two BlueField-2 devices (DUT and TG) connected back-to-back via symmetrical interfaces: p0 and p1.

  • Firmware is configured for HWS.

Steps:

  1. Run DUT in Hairpin Mode:


    dpdk-testpmd -c 0xff -n 4 -a 0000:03:00.0,dv_flow_en=2 -a 0000:03:00.1,dv_flow_en=2 ... --forward-mode=rxonly -i --hairpinq 1 --hairpin-mode=0x12

  2. Configure DUT:

    port stop all
    flow configure 0 queues_number 4 queues_size 64
    port start all
    flow pattern_template 0 create pattern_template_id 0 ingress template eth / end
    flow actions_template 0 create actions_template_id 0 template jump group 1 / end mask jump group 1 / end
    flow template_table 0 create table_id 0 group 0 priority 0 ingress rules_number 64 pattern_template 0 actions_template 0
    flow queue 0 create 0 template_table 0 pattern_template 0 actions_template 0 postpone no pattern eth / end actions jump group 1 / end
    flow pull 0 queue 0
    flow pattern_template 0 create pattern_template_id 2 ingress template eth / end
    flow actions_template 0 create actions_template_id 2 template queue / end mask queue / end
    flow template_table 0 create table_id 2 group 1 priority 0 ingress rules_number 64 pattern_template 2 actions_template 2
    flow queue 0 create 0 template_table 2 pattern_template 0 actions_template 0 postpone no pattern eth / end actions queue index 4 / end
    flow pull 0 queue 0
    start

  3. Send Traffic from TG on Port 0:


    /tmp/dpdk-hws/app/dpdk-testpmd -c 0xff -n 4 -a 0000:03:00.0,dv_flow_en=1 --socket-mem=2048 -- --port-numa-config=0 --socket-num=0 --burst=64 --txd=1024 --rxd=1024 --mbcache=512 --rxq=4 --txq=4 --nb-cores=1 --forward-mode=txonly --no-lsc-interrupt -i

  4. Measure Traffic Rate on TG Interface p1:


    mlnx_perf -i p1

Expected Result: 2.5 Gbps

Actual Result: None

Resolution : Ensure that both ports are configured with HWS queues:

port stop all
flow configure 0 queues_number 4 queues_size 64
flow configure 1 queues_number 4 queues_size 64
port start all

Without proper HWS configuration on Port 1, default SQ miss rules are not inserted, which likely causes the absence of traffic on Port 1.

DPDK-OVS Memory Allocation Error

Issue: I encountered this error during initialization:


dpdk|ERR|EAL: eal_memalloc_alloc_seg_bulk(): couldn't find suitable memseg_list

Resolution : In recent updates to OVS and DPDK:

  • The EAL argument --socket-mem is no longer configured by default on start-up. If dpdk-socket-mem and dpdk-alloc-mem are not explicitly specified, DPDK will revert to its default settings.

  • The EAL argument --socket-limit no longer defaults to the value of --socket-mem. To maintain the previous memory-limiting behavior, you should set other_config:dpdk-socket-limit to the same value as other_config:dpdk-socket-mem.

Ensure that you provide the dpdk-socket-limit parameter, which can be set to match the dpdk-socket-mem value to avoid such errors.
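For example (the values are illustrative and should match your memory requirements):

ovs-vsctl set Open_vSwitch . other_config:dpdk-socket-mem="2048,2048"
ovs-vsctl set Open_vSwitch . other_config:dpdk-socket-limit="2048,2048"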

High Latency in VDPA Ping Traffic Between VMs with 240 SFs

Issue: Ping traffic between VMs exhibits high latency when configured with 240 SFs.

Resolution : To achieve better latency results in this scenario, configure the system with event-mode=2.

DPDK's event-mode=2, also known as Event Forward mode, allows DPDK applications to offload DMA operations to a DMA adapter while maintaining the correct order of ingress packets.
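An illustrative sketch of passing the parameter as an mlx5 vDPA device argument (the application name and device are placeholders, and the parameter may be spelled event_mode depending on the DPDK version):

<vdpa_application> -a auxiliary:mlx5_core.sf.1,class=vdpa,event_mode=2 ...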

testpmd Startup Failure in ConnectX-6 Dx KVM Setup

Issue: When running testpmd with the parameter tx_pp=500, the application exits with the following error:


dpdk-testpmd -n 4  -w 0000:00:07.0,l3_vxlan_en=1,tx_pp=500,dv_flow_en=1  -w ...

Result:

EAL: Probe PCI driver: mlx5_pci (15b3:101e) device: 0000:00:07.0 (socket 0)
mlx5_net: WQE rate mode is required for packet pacing
mlx5_net: probe of PCI device 0000:00:07.0 aborted after encountering an error: No such device

Resolution : Ensure that the REAL_TIME_CLOCK_ENABLE parameter in mlxconfig is set to 1.

The REAL_TIME_CLOCK_ENABLE parameter activates the real-time timestamp format on Mellanox ConnectX network adapters, which provides timestamps relative to the Unix epoch and is required for packet pacing functionality.

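A hedged example of setting and verifying the parameter with mlxconfig (the device identifier is a placeholder; a firmware reset or power cycle is required for the change to take effect):

mlxconfig -d <mst_device> set REAL_TIME_CLOCK_ENABLE=1
mlxconfig -d <mst_device> query REAL_TIME_CLOCK_ENABLE
mlxfwreset -d <mst_device> reset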

TCP Hardware Hairpin Connection Forwarding Packets to Host

Issue: Some TCP packets, after passing through the Connection Tracking (CT) check, do not have either the RTE_FLOW_CONNTRACK_PKT_STATE_VALID or RTE_FLOW_CONNTRACK_PKT_STATE_CHANGED flags set. They also lack the RTE_FLOW_CONNTRACK_PKT_STATE_DISABLED flag, yet I observe that these packets are being forwarded to the host. How can I determine why these packets are being forwarded to the host or why the CT check is failing? Note that all CT objects are created with liberal mode set to 1.

Resolution : To address this issue, adjust the configuration of the conntrack object in one of the following ways:

  1. Set max_win to 0 for both the original and reply directions.

  2. Configure max_win with the appropriate values for both directions, as recommended in the release notes and header files.

OVS-DPDK LAG - Configuration Mismatch for "dv_xmeta_en"

Issue: When configuring OVS with vDPA ports, the following setup was used:

dpdk-extra="-w 0000:86:00.0,representor=pf0vf[0-15],dv_xmeta_en=1,dv_flow_en=1,dv_esw_en=1 -w 0000:86:00.0,representor=pf1vf[0-15],dv_xmeta_en=1,dv_flow_en=1,dv_esw_en=1"
dpdk-init="true"
dpdk-socket-mem="8192,8192"
hw-offload="true"

When adding vDPA ports to OVS from port1 and port2, the following error was encountered in the OVS log:

Jul 04 12:24:24 qa-r-vrt-123 ovs-vsctl[23091]: ovs|00001|vsctl|INFO|Called as ovs-vsctl add-port br0-ovs vdpa16 -- set Interface vdpa16 type=dpdkvdpa options:vdpa-socket-path=/tmp/sock16 options:vdpa-accelerator-devargs=0000:86:08.2 options:dpdk-devargs=0000:86:00.0,representor=pf1vf[0]
Jul 04 12:24:24 qa-r-vrt-123 ovs-vswitchd[22545]: ovs|00570|dpdk|ERR|mlx5_net: "dv_xmeta_en" configuration mismatch for shared mlx5_bond_0 context

The issue does not occur if dv_xmeta_en=1 is removed during initialization.

Resolution : The problem arises because DPDK does not probe the same PCI address with different devargs settings. Consequently, the representors of PF1 are ignored, leading OVS to treat them as new ports and probe them accordingly.

To resolve this, configure both PF representors using a single devargs entry, like so: representor=pf[0-1]vf[0-15]. This way, both PF representors will be correctly identified and probed without causing a configuration mismatch.
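For example, the dpdk-extra line from the setup above would become:

dpdk-extra="-w 0000:86:00.0,representor=pf[0-1]vf[0-15],dv_xmeta_en=1,dv_flow_en=1,dv_esw_en=1"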

Issue Binding Hairpin Queues After Enabling Port Loopback

Issue: When running testpmd in hairpin mode, the system fails to bind hairpin queues after configuring the port for loopback mode. The configuration steps were as follows:

testpmd> port stop 1
testpmd> port config 1 loopback 1
testpmd> port start 1

Resolution : To resolve this issue, ensure that all ports are stopped before applying the loopback configuration. Follow these steps:

port stop all
port config 1 loopback 1
port start all


Memory Allocation Failure

Issue: When running testpmd with iova-mode=pa, initialization fails with the following errors:

EAL: Selected IOVA mode 'PA'
EAL: No available 1048576 kB hugepages reported
Fail to start port 0: Cannot allocate memory
Fail to start port 1: Cannot allocate memory

Resolution : The memory fragmentation issue can be resolved by using iova-mode=va instead. This mode utilizes virtual addressing, which can handle memory fragmentation more effectively.
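For example (the PCI address is illustrative):

dpdk-testpmd -n 4 -a <pci_addr> --iova-mode=va -- -i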

VDPA Rx Packet Truncation with --enable-scatter

Issue: Packets are being truncated when configuring the maximum packet length with the following testpmd settings:


dpdk-testpmd -n 4 -w 0000:04:00.0,representor=[0,1],dv_xmeta_en=0,txq_inline=290,rx_vec_en=1,l3_vxlan_en=1,dv_flow_en=1,dv_esw_en=1 -- .. --enable-scatter ..

After changing the MTU and maximum packet length:

port stop all
port config all max-pkt-len 8192
port start all
port config mtu 0 8192
port config mtu 1 8192
port config mtu 2 8192

Resolution : The issue is that packets are truncated to the default size of 1518 bytes. This problem typically stems from the guest VM's MTU settings rather than the host configuration.

To address this, update the MTU settings for the virtio-net device in the guest VM:

  1. Set the MTU value using the host_mtu parameter when launching the VM:


    -device virtio-net-pci,netdev=netdev0,mac=52:54:00:00:00:01,mrg_rxbuf=on,host_mtu=9000

    This sets the MTU for the virtio-net device to 9000 bytes.

  2. Inside the guest VM, verify the MTU value with the ifconfig command to ensure it matches the specified value (9000 in this case).

  3. If a different MTU value is required, adjust both the --max-pkt-len parameter in the testpmd command on the host and the host_mtu parameter in the QEMU command for the guest accordingly.

Ring Memory Issue

Issue : The following memory error is encountered:

RING: Cannot reserve memory
[13:00:57:290147][DOCA][ERR][UFLTR::Core:156]: DPI init failed

Resolution : This is a common memory issue when running an application on the host.

The most common cause for this error is lack of memory (i.e., not enough huge pages per worker thread).

Possible solutions:

  • Recommended: Increase the amount of allocated huge pages. Instructions for allocating huge pages can be found here.

  • Alternatively, one can also limit the number of cores used by the application:

    • -c <core-mask> – Set the hexadecimal bitmask of the cores to run on.

    • -l <core-list> – Set the list of cores to run on.

  • For example:


    ./doca_<app_name> -a 3b:00.3 -a 3b:00.4 -l 0-64 -- -l 60

DOCA Apps Using DPDK in Parallel Issue

Issue : When running two DOCA apps in parallel that use DPDK, the first app runs but the second one fails.

The following error is received:

Failed to start URL Filter with output:
EAL: Detected 16 lcore(s)
EAL: Detected 1 NUMA nodes
EAL: RTE Version: 'MLNX_DPDK 20.11.4.0.3'
EAL: Detected shared linkage of DPDK
EAL: Cannot create lock on '/var/run/dpdk/rte/config'. Is another primary process running?
EAL: FATAL: Cannot init config
EAL: Cannot init config
[15:01:57:246339][DOCA][ERR][NUTILS]: EAL initialization failed

Resolution : The cause of the error is that the second application is using /var/run/dpdk/rte/config when the first application is already using it.

To run two applications in parallel, the second application must be run with DPDK EAL option --file-prefix <name>.

In this example, after running the first application (without the EAL option), run the second application with the EAL option:


./doca_<app_name> --file-prefix second -a 0000:01:00.6,sft_en=1 -a 0000:01:00.7,sft_en=1 -v -c 0xff -- -l 60


Failure to Set Huge Pages

Issue : When trying to configure the huge pages from an unprivileged user account, a permission error is raised.

Configuring the huge pages results in the following error:

$ sudo echo 600 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
-bash: /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages: Permission denied

Resolution: Using sudo with echo works differently than users usually expect: sudo elevates only the echo command, while the output redirection is still performed by the unprivileged shell, which is why permission is denied. The command should be as follows:


$ echo '600' | sudo tee -a /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages


© Copyright 2024, NVIDIA. Last updated on Nov 12, 2024.