NVIDIA DOCA Troubleshooting Guide
This guide provides troubleshooting information for common issues and misconfigurations encountered when using DOCA for NVIDIA® BlueField® DPU.
RShim Troubleshooting and How-Tos
Another backend already attached
Several generations of BlueField DPUs are equipped with a USB interface in which RShim can be routed, via USB cable, to an external host running Linux and the RShim driver.
In this case, typically following a system reboot, the RShim over USB prevails and the DPU host reports RShim status as "another backend already attached". This is correct behavior, since there can only be one RShim backend active at any given time. However, this means that the DPU host does not own RShim access.
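A quick way to confirm which side currently owns the RShim backend is to inspect the RShim device on each candidate (DPU host, BMC, or external USB host). This is a minimal sketch only; the RShim index (rshim0 here) and the exact log wording depend on your setup:
      # On the side that owns the backend, the device directory exists and reports the backend type:
      ls /dev/rshim0/
      cat /dev/rshim0/misc | grep DEV_NAME
      # On the DPU host that lost ownership, the rshim service log reports the conflict:
      sudo journalctl -u rshim | grep -i backend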
To reclaim RShim ownership safely:
- Stop the RShim driver on the remote Linux host. Run:
      systemctl stop rshim
      systemctl disable rshim
- Restart RShim on the DPU host. Run:
      systemctl enable rshim
      systemctl start rshim
The "another backend already attached" scenario can also be attributed to the RShim backend being owned by the BMC in DPUs with integrated BMC. This is elaborated on further down on this page.
RShim driver not loading
Verify whether your DPU features an integrated BMC or not. Run:
            
            sudo lspci -s $(sudo lspci -d 15b3: | head -1 | awk '{print $1}') -vvv | grep "Product Name"
    
Example output for DPU with integrated BMC:
            
            Product Name: BlueField-2 DPU 25GbE Dual-Port SFP56, integrated BMC, Crypto and Secure Boot Enabled, 16GB on-board DDR, 1GbE OOB management, Tall Bracket, FHHL
    
If your DPU has an integrated BMC, refer to RShim driver not loading on host with integrated BMC.
If your DPU does not have an integrated BMC, refer to RShim driver not loading on host on DPU without integrated BMC.
RShim driver not loading on DPU with integrated BMC
RShim driver not loading on host
- Access the BMC via the RJ45 management port of the DPU. 
- Disable RShim on the BMC. Run:
      systemctl stop rshim
      systemctl disable rshim
- Enable RShim on the host. Run:
      systemctl enable rshim
      systemctl start rshim
- Restart the RShim service. Run:
      sudo systemctl restart rshim
  If the RShim service does not launch automatically, check its status. Run:
      sudo systemctl status rshim
  This command is expected to display "active (running)".
- Display the current setting. Run:
      # cat /dev/rshim<N>/misc | grep DEV_NAME
      DEV_NAME pcie-0000:04:00.2
  This output indicates that the RShim service is ready to use.
RShim driver not loading on BMC
- Verify that the RShim service is not running on the host. Run:
      systemctl status rshim
  If the output is active, the host currently owns the RShim.
- Disable RShim on the host. Run:
      systemctl stop rshim
      systemctl disable rshim
- Enable RShim on the BMC. Run:
      systemctl enable rshim
      systemctl start rshim
- Display the current setting. Run:
      # cat /dev/rshim<N>/misc | grep DEV_NAME
      DEV_NAME usb-1.0
  This output indicates that the RShim service is ready to use.
RShim driver not loading on host on DPU without integrated BMC
- Download the suitable DEB/RPM package for the RShim driver (the management interface to the DPU from the host).
- Reinstall the RShim package on the host.
  For Ubuntu/Debian, run:
      sudo dpkg --force-all -i rshim-<version>.deb
  For RHEL/CentOS, run:
      sudo rpm -Uhv rshim-<version>.rpm
- Restart the RShim service. Run:
      sudo systemctl restart rshim
  If the RShim service does not launch automatically, check its status. Run:
      sudo systemctl status rshim
  This command is expected to display "active (running)".
- Display the current setting. Run:
      # cat /dev/rshim<N>/misc | grep DEV_NAME
      DEV_NAME pcie-0000:04:00.2
  This output indicates that the RShim service is ready to use.
Change ownership of RShim from NIC BMC to host
- Verify that your card has an integrated BMC. Run the following on the host:
      sudo lspci -s $(sudo lspci -d 15b3: | head -1 | awk '{print $1}') -vvv | grep "Product Name"
      Product Name: BlueField-2 DPU 25GbE Dual-Port SFP56, integrated BMC, Crypto and Secure Boot Enabled, 16GB on-board DDR, 1GbE OOB management, Tall Bracket, FHHL
  The product name is supposed to show "integrated BMC".
- Access the BMC via the RJ45 management port of the DPU. 
- Disable RShim on the BMC. Run:
      systemctl stop rshim
      systemctl disable rshim
- Enable RShim on the host. Run:
      systemctl enable rshim
      systemctl start rshim
- Restart the RShim service. Run:
      sudo systemctl restart rshim
  If the RShim service does not launch automatically, check its status. Run:
      sudo systemctl status rshim
  This command is expected to display "active (running)".
- Display the current setting. Run:
      # cat /dev/rshim<N>/misc | grep DEV_NAME
      DEV_NAME pcie-0000:04:00.2
  This output indicates that the RShim service is ready to use.
Connectivity Troubleshooting
Connection (ssh, screen console) to the DPU is lost
The UART cable in the Accessories Kit (OPN: MBF20-DKIT) can be used to connect to the DPU console and identify the stage at which BlueField is hanging.
Follow this procedure:
- Connect the UART cable to a USB socket, and find it in your USB devices. Run:
      sudo lsusb
      Bus 002 Device 003: ID 0403:6001 Future Technology Devices International, Ltd FT232 Serial (UART) IC
  Note: For more information on the UART connectivity, please refer to the DPU's hardware user guide under Supported Interfaces > Interfaces Detailed Description > NC-SI Management Interface.
  Info: It is good practice to connect the other end of the NC-SI cable to a different host than the one on which the BlueField DPU is installed.
- Install the minicom application.
  For CentOS/RHEL, run:
      sudo yum install minicom -y
  For Ubuntu/Debian, run:
      sudo apt-get install minicom
- Open the minicom application. Run:
      sudo minicom -s -c on
- Go to "Serial port setup". 
- Enter "F" to change "Hardware Flow control" to NO. 
- Enter "A" and change to /dev/ttyUSB0 and press Enter. 
- Press ESC. 
- Select "Save setup as dfl".
- Exit minicom by pressing Ctrl + a + z. 
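Alternatively, once the serial settings are known, minicom can be pointed at the UART device directly instead of going through the interactive setup. A minimal sketch, assuming the adapter enumerated as /dev/ttyUSB0 and the console runs at 115200 baud:
      sudo minicom -D /dev/ttyUSB0 -b 115200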
Driver not loading in host server
What this looks like in dmesg:
            
            [275604.216789] mlx5_core 0000:af:00.1: 63.008 Gb/s available PCIe bandwidth, limited by 8 GT/s x8 link at 0000:ae:00.0 (capable of 126.024 Gb/s with 16 GT/s x8 link)
[275624.187596] mlx5_core 0000:af:00.1: wait_fw_init:316:(pid 943): Waiting for FW initialization, timeout abort in 100s
[275644.152994] mlx5_core 0000:af:00.1: wait_fw_init:316:(pid 943): Waiting for FW initialization, timeout abort in 79s
[275664.118404] mlx5_core 0000:af:00.1: wait_fw_init:316:(pid 943): Waiting for FW initialization, timeout abort in 59s
[275684.083806] mlx5_core 0000:af:00.1: wait_fw_init:316:(pid 943): Waiting for FW initialization, timeout abort in 39s
[275704.049211] mlx5_core 0000:af:00.1: wait_fw_init:316:(pid 943): Waiting for FW initialization, timeout abort in 19s
[275723.954752] mlx5_core 0000:af:00.1: mlx5_function_setup:1237:(pid 943): Firmware over 120000 MS in pre-initializing state, aborting
[275723.968261] mlx5_core 0000:af:00.1: init_one:1813:(pid 943): mlx5_load_one failed with error code -16
[275723.978578] mlx5_core: probe of 0000:af:00.1 failed with error -16
    
The driver on the host server is dependent on the Arm side. If the driver on Arm is up, then the driver on the host server will also be up.
Please verify that:
- The driver is loaded in the BlueField DPU 
- The Arm is booted into OS 
- The Arm is not in UEFI Boot Menu 
- The Arm is not hung 
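The items above can be checked directly from the BlueField Arm console. This is a minimal sketch, assuming console access through RShim (index 0) from the host and valid Arm OS credentials:
      # Open the Arm console over RShim and log in:
      sudo screen /dev/rshim0/console 115200
      # On the Arm side, confirm the OS is up and the driver is loaded:
      uptime
      lsmod | grep mlx5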
Then:
- Perform graceful shutdown. 
- Power cycle the host server. 
- If the problem persists, reset nvconfig (sudo mlxconfig -d /dev/mst/<device> -y reset) and power cycle the host server.
  Note: If your DPU is VPI capable, please be aware that this configuration will reset the link type on the network ports to IB. To change the network port's link type to Ethernet, run:
      sudo mlxconfig -d <device> s LINK_TYPE_P1=2 LINK_TYPE_P2=2
- If the problem still persists, install the latest BFB image and then restart the driver on the host server. Refer to the DOCA installation documentation for more information. 
No connectivity between network interfaces of source host to destination device
Verify that the bridge is configured properly on the Arm side.
The following is an example for default configuration:
            
            $ sudo ovs-vsctl show
f6740bfb-0312-4cd8-88c0-a9680430924f
    Bridge ovsbr1                   
        Port pf0sf0                 
            Interface pf0sf0        
        Port p0                     
            Interface p0            
        Port pf0hpf                 
            Interface pf0hpf        
        Port ovsbr1                 
            Interface ovsbr1        
                type: internal      
    Bridge ovsbr2                   
        Port p1                     
            Interface p1            
        Port pf1sf0                 
            Interface pf1sf0        
        Port pf1hpf                 
            Interface pf1hpf        
        Port ovsbr2                 
            Interface ovsbr2       
                type: internal      
    ovs_version: "2.14.1"
    
If no bridge configuration exists, refer to "Virtual Switch on DPU".
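If the bridges are missing, they can also be recreated manually. This is only a minimal sketch of the default layout shown above (bridge and port names may differ on your system); refer to "Virtual Switch on DPU" for the full procedure:
      sudo ovs-vsctl add-br ovsbr1
      sudo ovs-vsctl add-port ovsbr1 p0
      sudo ovs-vsctl add-port ovsbr1 pf0hpf
      sudo ovs-vsctl add-port ovsbr1 pf0sf0
      sudo ovs-vsctl add-br ovsbr2
      sudo ovs-vsctl add-port ovsbr2 p1
      sudo ovs-vsctl add-port ovsbr2 pf1hpf
      sudo ovs-vsctl add-port ovsbr2 pf1sf0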
Uplink in Arm down while uplink in host server up
Please check that the cables are connected properly into the network ports of the DPU and the peer device.
Performance Degradation
Degradation in performance indicates that Open vSwitch (OVS) may not be offloaded.
Verify offload state. Run:
            
            # ovs-vsctl get Open_vSwitch . other_config:hw-offload
    
- If hw-offload = true – Fast Pass is configured (desired result) 
- If hw-offload = false – Slow Pass is configured 
If hw-offload = false :
- For RHEL/CentOS, run:
      # ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
      # systemctl restart openvswitch
      # systemctl enable openvswitch
- For Ubuntu/Debian, run:
      # ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
      # /etc/init.d/openvswitch-switch restart
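After enabling hw-offload and restarting OVS, it can help to confirm that flows are actually offloaded while traffic is running. A minimal sketch (availability of the type=offloaded filter depends on the installed OVS version):
      # List the datapath flows that were offloaded to hardware:
      ovs-appctl dpctl/dump-flows type=offloaded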
SR-IOV Troubleshooting
Unable to create VFs
- Please make sure that SR-IOV is enabled in BIOS. 
- Verify that SRIOV_EN is true and that NUM_OF_VFS is greater than 1. Run:
      # mlxconfig -d /dev/mst/mt41686_pciconf0 -e q | grep -i "SRIOV_EN\|num_of_vf"
      Configurations:            Default      Current      Next Boot
      * NUM_OF_VFS               16           16           16
      * SRIOV_EN                 True(1)      True(1)      True(1)
- Verify that GRUB_CMDLINE_LINUX="iommu=pt intel_iommu=on pci=assign-busses". 
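Once SR-IOV is enabled in the BIOS and in the firmware configuration, VFs are typically created through sysfs. A minimal sketch, assuming the PF netdev is named ens1f0 (substitute your own interface name):
      # Create 4 VFs on the physical function:
      echo 4 | sudo tee /sys/class/net/ens1f0/device/sriov_numvfs
      # Verify that they appear as PCI functions:
      lspci -d 15b3: | grep -i "virtual function"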
No traffic between VF to external host
- Please verify creation of representors for VFs inside the BlueField DPU. Run:
      # /opt/mellanox/iproute2/sbin/rdma link | grep -i up
      ...
      link mlx5_0/2 state ACTIVE physical_state LINK_UP netdev pf0vf0
      ...
- Make sure the representors of the VFs are added to the bridge. Run:
      # ovs-vsctl add-port <bridge_name> pf0vf0
- Verify the VF configuration. Run:
      $ ovs-vsctl show
      bb993992-7930-4dd2-bc14-73514854b024
          Bridge ovsbr1
              Port pf0vf0
                  Interface pf0vf0
                      type: internal
              Port pf0hpf
                  Interface pf0hpf
              Port pf0sf0
                  Interface pf0sf0
              Port p0
                  Interface p0
          Bridge ovsbr2
              Port ovsbr2
                  Interface ovsbr2
                      type: internal
              Port pf1sf0
                  Interface pf1sf0
              Port p1
                  Interface p1
              Port pf1hpf
                  Interface pf1hpf
          ovs_version: "2.14.1"
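If the representor is present and attached to the bridge but traffic still does not pass, it can help to confirm whether packets from the VF actually reach the representor on the Arm side. A minimal sketch, assuming the representor is pf0vf0:
      sudo tcpdump -i pf0vf0 -nn -c 10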
eSwitch Troubleshooting
Unable to configure legacy mode
To set devlink to "Legacy" mode in BlueField, run:
            
            # devlink dev eswitch set pci/0000:03:00.0 mode legacy
# devlink dev eswitch set pci/0000:03:00.1 mode legacy
    
Please verify that:
- No virtual functions are open. To verify if VFs are configured, run:
      # /opt/mellanox/iproute2/sbin/rdma link | grep -i up
      link mlx5_0/2 state ACTIVE physical_state LINK_UP netdev pf0vf0
      link mlx5_1/2 state ACTIVE physical_state LINK_UP netdev pf1vf0
  If any VFs are configured, destroy them by running:
      # echo 0 > /sys/class/infiniband/mlx5_0/device/mlx5_num_vfs
      # echo 0 > /sys/class/infiniband/mlx5_1/device/mlx5_num_vfs
- If any SFs are configured, delete them by running:
      /sbin/mlnx-sf -a delete --sfindex <SF-Index>
  Note: You may retrieve the <SF-Index> of the currently installed SFs by running:
      # mlnx-sf -a show
      SF Index: pci/0000:03:00.0/229408
        Parent PCI dev: 0000:03:00.0
        Representor netdev: en3f0pf0sf0
        Function HWADDR: 02:61:f6:21:32:8c
        Auxiliary device: mlx5_core.sf.2
          netdev: enp3s0f0s0
          RDMA dev: mlx5_2
      SF Index: pci/0000:03:00.1/294944
        Parent PCI dev: 0000:03:00.1
        Representor netdev: en3f1pf1sf0
        Function HWADDR: 02:30:13:6a:2d:2c
        Auxiliary device: mlx5_core.sf.3
          netdev: enp3s0f1s0
          RDMA dev: mlx5_3
  Pay attention to the SF Index values. For example:
      /sbin/mlnx-sf -a delete --sfindex pci/0000:03:00.0/229408
      /sbin/mlnx-sf -a delete --sfindex pci/0000:03:00.1/294944
If the error "Error: mlx5_core: Can't change mode when flows are configured" is encountered while trying to configure legacy mode, please make sure that
- Any configured SFs are deleted (see above for commands). 
- Shut down the links of all interfaces, delete any ip xfrm rules, delete any configured OVS flows, and stop openvswitch service. Run:
      ip link set dev p0 down
      ip link set dev p1 down
      ip link set dev pf0hpf down
      ip link set dev pf1hpf down
      ip link set dev vxlan_sys_4789 down
      ip x s f ;
      ip x p f ;
      tc filter del dev p0 ingress
      tc filter del dev p1 ingress
      tc qdisc show dev p0
      tc qdisc show dev p1
      tc qdisc del dev p0 ingress
      tc qdisc del dev p1 ingress
      tc qdisc show dev p0
      tc qdisc show dev p1
      systemctl stop openvswitch-switch
DPU appears as two interfaces
What this looks like:
            
            # sudo /opt/mellanox/iproute2/sbin/rdma link
link mlx5_0/1 state ACTIVE physical_state LINK_UP netdev p0
link mlx5_1/1 state ACTIVE physical_state LINK_UP netdev p1
    
- Check if you are working in legacy mode. Run:
      # devlink dev eswitch show pci/0000:03:00.<0|1>
  If the following line is printed, this means that you are working in legacy mode:
      pci/0000:03:00.<0|1>: mode legacy inline-mode none encap enable
  Please configure the DPU to work in switchdev mode. Run:
      devlink dev eswitch set pci/0000:03:00.<0|1> mode switchdev
- Check if you are working in separated mode. Run:
      # mlxconfig -d /dev/mst/mt41686_pciconf0 q | grep -i cpu
      * INTERNAL_CPU_MODEL     SEPERATED_HOST(0)
  Please configure the DPU to work in embedded mode. Run:
      # mlxconfig -d /dev/mst/mt41686_pciconf0 s INTERNAL_CPU_MODEL=1
DOCA Applications Troubleshooting
This chapter deals with troubleshooting issues related to DOCA applications.
EAL Initialization Failure
EAL initialization failure is a common error that may appear while running various DPDK-related applications.
Error
The error looks like this:
            
            [DOCA][ERR][NUTILS]: EAL initialization failed
    
There may be many causes for this error. Some of them are as follows:
- The application requires huge pages and none were allocated 
- The application requires root privileges to run and it was run without elevated privileges 
Solution
The following solutions are respective to the possible causes listed above:
- Allocate huge pages. For example, run (on the host or the DPU, depending on where you are running the application):
      echo 2048 | sudo tee /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
- Run the application using sudo (or as root):
      sudo <run_command>
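After allocating huge pages, the allocation can be verified before re-running the application. A minimal sketch:
      grep -i huge /proc/meminfo
      cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages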
Ring Memory Issue
This is a common memory issue when running an application on the host.
Error
The error looks as follows:
            
            RING: Cannot reserve memory
[13:00:57:290147][DOCA][ERR][UFLTR::Core:156]: DPI init failed
    
The most common cause for this error is lack of memory (i.e., not enough huge pages for the worker threads).
Solution
Possible solutions:
- Recommended: Increase the amount of allocated huge pages. Instructions for allocating huge pages can be found here.
  Note: For an SFT application with 64 cores, it is recommended to increase the allocation from 2048 to 8192.
- Alternatively, one can also limit the number of cores used by the application:
  - -c <core-mask> – set the hexadecimal bitmask of the cores to run on
  - -l <core-list> – list of cores to run on
  For example:
      /opt/mellanox/doca/applications/<app_name>/bin/doca_<app_name> -a 3b:00.3 -a 3b:00.4 -l 0-64 -- -l 60
DOCA Apps Using DPDK in Parallel Issue
When running two DOCA apps in parallel that use DPDK, the first app runs but the second one fails.
Error
The following error is received:
            
            Failed to start URL Filter with output: EAL: Detected 16 lcore(s)
EAL: Detected 1 NUMA nodes
EAL: RTE Version: 'MLNX_DPDK 20.11.4.0.3'
EAL: Detected shared linkage of DPDK
EAL: Cannot create lock on '/var/run/dpdk/rte/config'. Is another primary process running?
EAL: FATAL: Cannot init config
EAL: Cannot init config
[15:01:57:246339][DOCA][ERR][NUTILS]: EAL initialization failed
    
The cause of the error is that the second application is using /var/run/dpdk/rte/config when the first application is already using it.
Solution
To run two applications in parallel, the second application needs to be run with DPDK EAL option --file-prefix <name>.
For example, after running the first application (without the EAL option), run the second application with the --file-prefix EAL option as follows:
            
            /opt/mellanox/doca/applications/<app_name>/bin/doca_<app_name> --file-prefix second -a 0000:01:00.6,sft_en=1 -a 0000:01:00.7,sft_en=1 -v -c 0xff -- -l 60
    
Compilation of DOCA Apps on CentOS
When compiling gRPC-enabled applications on old (7.6) CentOS machines, there is a conflict between the libstdc++ version available out-of-the-box and the one used by DOCA's SDK when building the gRPC packages.
Error
Compiling the gRPC-enabled application results in the following errors:
            
            $ meson /tmp/build -Denable_grpc_support=true ; ninja -C /tmp/build
...
l_log_severity.a -Wl,--end-group
/opt/mellanox/grpc/lib/libgrpc++.a(server_cc.cc.o): In function `grpc::Server::RegisterService(std::string const*, grpc::Service*)':
(.text+0x2467): undefined reference to `std::basic_ios<char, std::char_traits<char> >::operator bool() const'
/opt/mellanox/grpc/lib/libgrpc++.a(server_cc.cc.o): In function `grpc::Server::RegisterService(std::string const*, grpc::Service*)':
(.text+0x249e): undefined reference to `std::basic_ios<char, std::char_traits<char> >::operator bool() const'
collect2: error: ld returned 1 exit status
    
Solution
Upgrading the devtoolset on the machine to the one used when building the gRPC package resolves the version conflict.
            
            $ sudo yum install epel-release
$ sudo yum install centos-release-scl-rh
$ sudo yum install devtoolset-8
# This will enable the use of devtoolset-8 for the *current* bash session
$ source /opt/rh/devtoolset-8/enable
    
Failure to Set Huge Pages
When trying to configure the huge pages from an unprivileged user account, a permission error is raised.
Error
Trying to set the number of huge pages from an unprivileged user account results in the following error:
            
            $ sudo echo 600 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
-bash: /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages: Permission denied
    
Solution
Using sudo with echo works differently than users usually expect. Instead, the command should be as follows:
            
            $ echo '600' | sudo tee -a /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
    
DOCA Libraries Troubleshooting
This chapter deals with troubleshooting issues related to DOCA libraries.
DOCA Flow Error
When trying to add a new entry to the pipe, an error is received.
Error
The error occurs when calling the add entry function. The error message would look similar to the following:
            
            mlx5_common: Failed to create TIR using DevX
mlx5_net: Port 0 cannot create DevX TIR.
[10:26:39:622581][DOCA][ERR][dpdk_engine]: create pipe entry fail on index:1, error=Port 0 create flow fail, type 1 message: cannot get hash queue, type=8 
    
The issue here is likely caused by the SF/port configuration.
Solution
To fix the issue, apply the following commands on the DPU:
            
            dpu# /opt/mellanox/iproute2/sbin/devlink dev eswitch set pci/0000:03:00.0 mode legacy
dpu# /opt/mellanox/iproute2/sbin/devlink dev eswitch set pci/0000:03:00.1 mode legacy
dpu# echo none > /sys/class/net/p0/compat/devlink/encap
dpu# echo none > /sys/class/net/p1/compat/devlink/encap
dpu# /opt/mellanox/iproute2/sbin/devlink dev eswitch set pci/0000:03:00.0 mode switchdev
dpu# /opt/mellanox/iproute2/sbin/devlink dev eswitch set pci/0000:03:00.1 mode switchdev
    
DOCA SDK Compilation Troubleshooting
This chapter deals with troubleshooting issues related to compiling DOCA-based programs to use the DOCA SDK (e.g., missing dependencies).
Meson Complains About Missing Dependencies
As part of DOCA's installation, a basic set of environment variables is defined so that projects (such as DOCA applications) can easily compile against the DOCA SDK, and so that users have easy access to the various DOCA tools. In addition, some DOCA applications rely on various third-party dependencies, some of which require specific environment variables in order to be correctly found by the compilation environment (meson).
Error
There are multiple forms this error may appear in, such as:
- DOCA libraries are missing:
      Dependency doca found: NO (tried pkgconfig and cmake)
      meson.build:13:1: ERROR: Dependency "doca" not found, tried pkgconfig and cmake
- DPDK definitions are missing:
      Dependency libdpdk found: NO (tried pkgconfig and cmake)
      meson.build:41:1: ERROR: Dependency "libdpdk" not found, tried pkgconfig and cmake
- mpicc is missing for the DPA All to All application:
      ==================== Skipped Applications ====================
        * dpa_all_to_all: Missing mpicc
- gRPC definitions are missing (when gRPC support is activated):
      Dependency protobuf found: NO (tried pkgconfig and cmake)
      meson.build:47:1: ERROR: Dependency "protobuf" not found, tried pkgconfig and cmake
- gRPC compiler definitions are missing (when gRPC support is activated):
      Dependency protobuf found: YES 3.15.8.0
      Dependency grpc++ found: YES 1.39.0
      Program protoc found: NO
      meson.build:50:1: ERROR: Program(s) ['protoc'] not found or not executable
Solution
All the dependencies mentioned above are installed as part of DOCA's installation, and yet it is recommended to check that the packages themselves were installed correctly. The packages that install each dependency define the environment variables needed by it, and apply these settings per user login session:
- If DOCA was just installed (on the host or DPU), user session restart is required to apply these definitions (i.e., log off and log in). 
- It is important to compile DOCA using the same logged-in user. Logging in as ubuntu and then switching users with sudo su, or compiling using sudo, will not work. 
If restarting the user session is not possible (e.g., automated non-interactive session), the following is a list of the needed environment variables:
All the following examples use the required environment variables for the DPU. For the host, the values should be adjusted accordingly (aarch64 is for the DPU and x86 is for the host): aarch64-linux-gnu → x86_64-linux-gnu.
It is recommended to define all of the following settings so as to not have to remember which DOCA application requires which module (whether DPDK, gRPC, FlexIO, etc).
DOCA Libraries & Tools:
- For Ubuntu: - export PKG_CONFIG_PATH=${PKG_CONFIG_PATH}:/opt/mellanox/doca/lib/aarch64-linux-gnu/pkgconfig export PATH=${PATH}:/opt/mellanox/doca/tools 
- For CentOS: - export PKG_CONFIG_PATH=${PKG_CONFIG_PATH}:/opt/mellanox/doca/lib64/pkgconfig export PATH=${PATH}:/opt/mellanox/doca/tools 
DOCA Applications:
- For Ubuntu and CentOS - export PATH=${PATH}:/usr/mpi/gcc/openmpi- - 4.1.7a1/bin export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/mpi/gcc/openmpi-- 4.1.7a1/lib
DPDK:
- For Ubuntu: - export PKG_CONFIG_PATH=${PKG_CONFIG_PATH}:/opt/mellanox/dpdk/lib/aarch64-linux-gnu/pkgconfig 
- For CentOS: - export PKG_CONFIG_PATH=${PKG_CONFIG_PATH}:/opt/mellanox/dpdk/lib64/pkgconfig 
gRPC:
- For Ubuntu: - export PKG_CONFIG_PATH=${PKG_CONFIG_PATH}:/opt/mellanox/grpc/lib/pkgconfig export PATH=${PATH}:/opt/mellanox/grpc/bin 
- For CentOS: - export PKG_CONFIG_PATH=${PKG_CONFIG_PATH}:/opt/mellanox/grpc/lib/pkgconfig:/opt/mellanox/grpc/lib64/pkgconfig export PATH=${PATH}:/opt/mellanox/grpc/bin 
FlexIO:
- For Ubuntu: - export PKG_CONFIG_PATH=${PKG_CONFIG_PATH}:/opt/mellanox/flexio/lib/pkgconfig 
- For CentOS: - export PKG_CONFIG_PATH=${PKG_CONFIG_PATH}:/opt/mellanox/flexio/lib/pkgconfig 
CollectX:
- For Ubuntu and CentOS: - export PKG_CONFIG_PATH=${PKG_CONFIG_PATH}:/opt/mellanox/collectx/lib/aarch64-linux-gnu/pkgconfig 
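After exporting the relevant variables, it is easy to confirm that meson's dependency lookup will succeed before re-running the build. A minimal sketch (the pkg-config module names match the ones reported in the errors above):
      pkg-config --modversion doca
      pkg-config --modversion libdpdk
      # For gRPC-enabled builds:
      pkg-config --modversion protobuf grpc++
      which protoc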
Static Compilation on CentOS: Undefined References to C++
When statically compiling against the DOCA SDK on RHEL 7.x machines, there could be a conflict between the libstdc++ version available out-of-the-box and the one used when building DOCA's SDK libraries.
Error
There are multiple forms this error may appear in, such as:
            
            $ cc test.o -o test_out `pkg-config --libs --static doca`
/opt/mellanox/doca/lib64/libdoca_common.a(doca_common_core_src_doca_dev.cpp.o): In function `doca_devinfo_rep_list_create':
(.text.experimental+0x2193): undefined reference to `__cxa_throw_bad_array_new_length'
/opt/mellanox/doca/lib64/libdoca_common.a(doca_common_core_src_doca_dev.cpp.o): In function `doca_devinfo_rep_list_create':
(.text.experimental+0x2198): undefined reference to `__cxa_throw_bad_array_new_length'
collect2: error: ld returned 1 exit status
    
Solution
Upgrading the devtoolset on the machine to the one used when building the DOCA SDK resolves the undefined references issue:
            
            $ sudo yum install epel-release
$ sudo yum install centos-release-scl-rh
$ sudo yum install devtoolset-8
# This will enable the use of devtoolset-8 for the *current* bash session
$ source /opt/rh/devtoolset-8/enable
    
Static Compilation on CentOS: Unresolved Symbols
When statically compiling against the DOCA SDK on RHEL 7.x machines, a known issue in the default pkg-config version (0.27) causes a linking error.
Error
There are multiple forms this error may appear in. For example:
            
            $ cc test.o -o test_out `pkg-config --libs --static doca`
...
/opt/mellanox/dpdk/lib64/librte_net_mlx5.a(net_mlx5_mlx5_sft.c.o): In function `mlx5_sft_start':
mlx5_sft.c:(.text+0x1827): undefined reference to `mlx5_malloc'
...
    
Solution
Use an updated version of pkg-config or pkgconf instead when building applications (as is recommended in DPDK's compilation instructions).
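For example, assuming a newer pkgconf is installed and available in the PATH (this exact invocation is illustrative only), the same link line can be built with it instead of the system pkg-config:
      cc test.o -o test_out `pkgconf --libs --static doca`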
DOCA-CUDA Compilation Troubleshooting
This chapter deals with troubleshooting issues related to DOCA-CUDA cross-compilation.
Application Build Error
When trying to build with meson, an architecture-related error is received.
Error
The error may happen when trying to build DOCA or DOCA-CUDA applications.
            
            cc1: error: unknown value 'corei7' for -march
    
It indicates that some dependency (usually libdpdk) is not taken from the host machine (i.e., the machine the executable file should be running on). This dependency should be taken from the Arm dependencies directories (the path is specified in the cross file) but is skipped if the host's PKG_CONFIG_PATH environment variable is used instead.
Solution
Make sure that the cross file contains the following PKG_CONFIG related definitions:
            
            [built-in options]
pkg_config_path = ''
 
[properties]
pkg_config_libdir = … // Some content here
    
In addition, verify that pkg_config_libdir properly points to all pkgconfig-related directories under your cross-build root directory, and that the dependency reported in the error is not missing.
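One way to check that a given dependency resolves from the cross-build root rather than from the host is to query it with PKG_CONFIG_LIBDIR pointed at the Arm pkgconfig directories. A minimal sketch, using a hypothetical cross-root path:
      # Should print the Arm (cross) libdpdk version rather than fail or return host flags:
      PKG_CONFIG_LIBDIR=/path/to/arm-cross-root/opt/mellanox/dpdk/lib/aarch64-linux-gnu/pkgconfig \
          pkg-config --modversion libdpdk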
DOCA Containers Troubleshooting
This section deals with troubleshooting issues related to DOCA-based containers.
YAML Syntax Error #1
When deploying the container using the respective YAML file, the pod fails to start.
Error
The error may happen after modifying a service's YAML file, or after copying an example YAML file from one of the guides.
            
            $ crictl pods
POD ID              CREATED             STATE               NAME                NAMESPACE           ATTEMPT             RUNTIME
$ journalctl -u kubelet
...
Oct 06 12:10:08 dpu-name kubelet[3260]: E1006 12:10:08.552306    3260 file.go:108] "Unable to process watch event" err="can't process config file \"/etc/kubelet.d/file_name.yaml\": invalid pod: [metadata.name: Invalid value: \"-dpu-name\": a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*') spec.containers: Required value]"
...
    
This indicates that some of the fields in the YAML file fail to comply with RFC 1123.
Solution
Both the pod name and container name have strict alphabet restrictions (RFC 1123). This means that users can only use dash ("-") and not underscore ("_"), as the latter is an illegal character and cannot be used in the pod/container name. However, for the container's image name, use underscore ("_") instead of dash ("-") to help differentiate the two.
YAML Syntax Error #2
When deploying the container using the respective YAML file, the pod fails to start.
Error
The error may happen after modifying a service's YAML file, or after copying an example YAML file from one of the guides.
This error can occur due to a whitespace issue, for example when the YAML file was copied from one of the guides and a formatting mistake was introduced. It is important to ensure that the space characters used in the file are indeed spaces (" ") and not some other whitespace character.
            
            $ crictl pods
POD ID              CREATED             STATE               NAME                NAMESPACE           ATTEMPT             RUNTIME
$ journalctl -u kubelet
...
Oct 04 12:35:58 dpu-name kubelet[3046]: E1004 12:35:58.744406    3046 file.go:187] "Could not process manifest file" err="/etc/kubelet.d/file_name.yaml: couldn't parse as pod(yaml: line 48: did not find expected '-' indicator), please check config file" path="/etc/kubelet.d/file_name.yaml"
...
    
This indicates that there is a probable indentation issue in line 48 or in the line above it.
Solution
Go over the file and make sure that the file only uses spaces (" ") for indentations (2 per indent). Using any other number of spaces causes undefined behavior.
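A quick way to spot tabs or other non-space whitespace is to scan the manifest before copying it into the deployment directory. A minimal sketch, assuming GNU grep and the example file name used above:
      # Flag any tab characters with line numbers; the output should be empty:
      grep -nP '\t' /etc/kubelet.d/file_name.yaml
      # Make all whitespace visible (tabs show up as ^I):
      cat -A /etc/kubelet.d/file_name.yaml | less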
Missing Huge Pages
When deploying the container using the respective YAML file, the pod fails to start.
Error
            
            $ crictl pods
POD ID              CREATED             STATE               NAME                NAMESPACE           ATTEMPT             RUNTIME
$ journalctl -u kubelet
...
Oct 04 12:39:41 dpu-name kubelet[3046]: I1004 12:39:41.643621    3046 predicate.go:103] "Failed to admit pod, unexpected error while attempting to recover from admission failure" pod="default/file_name" err="preemption: error finding a set of pods to preempt: no set of running pods found to reclaim resources: [(res: hugepages-2Mi, q: 1021313024), ]"
...
    
This error indicates that the service expected 1GB (1021313024 bytes) of huge pages of size 2MB per page, and could not find them.
Solution
- Remove the YAML file of the service from the deployment directory (/etc/kubelet.d). 
- Allocate huge pages as described in the service's prerequisites steps:
  - Make sure that the huge pages are allocated as required per the desired container.
  - Both the amount and size of the pages are important and must match precisely.
- Restart the container infrastructure daemons:
      sudo systemctl restart kubelet.service
      sudo systemctl restart containerd.service
- Once the above operations are completed successfully, the container can be deployed (the YAML file can be copied back to /etc/kubelet.d). 
Failed to Reserve Sandbox Name
After rebooting the DPU, the respective pods start. However, the containers repeatedly fail to spawn and their "attempt" counter is not incrementing.
Error
            
            $ crictl pods
POD ID              CREATED                  STATE               NAME                                      NAMESPACE           ATTEMPT             RUNTIME
bee147792a85b       Less than a second ago   Ready               doca-hbn-service-my-dpu                   default             0                   (default)
ea66ee46e75a5       Less than a second ago   Ready               doca-telemetry-service-my-dpu             default             0                   (default)
 
$ crictl ps -a
CONTAINER           IMAGE               CREATED                  STATE               NAME                       ATTEMPT             POD ID              POD
6a35c025a3590       ce4c0cafd583e       Less than a second ago   Exited              init-sfs                   0                   bee147792a85b       doca-hbn-service-my-dpu
9048f4c7b8f3c       095a5833a3f80       Less than a second ago   Running             doca-telemetry-service     0                   ea66ee46e75a5       doca-telemetry-service-my-dpu
059d0aa8a3199       095a5833a3f80       Less than a second ago   Exited              init-telemetry-service     0                   ea66ee46e75a5       doca-telemetry-service-my-dpu
bcfbe536271ea       ce4c0cafd583e       33 seconds ago           Running             init-sfs                   1                   bee147792a85b       doca-hbn-service-my-dpu
 
$ journalctl -u containerd
...
"2023-11-28T08:43:42.408173348+02:00" level=error msg="RunPodSandbox for &PodSandboxMetadata{Name:doca-hbn-service-my-dpu,Uid:823b1ad0e241a10475edde26e905856b,Namespace:default,Attempt:0,} failed, error" error="failed to reserve sandbox name \"doca-hbn-service-my-dpu_default_823b1ad0e241a10475edde26e905856b_0\": name \"doca-hbn-service-my-dpu_default_823b1ad0e241a10475edde26e905856b_0\" is reserved for \"bee147792a85bc23a3629a9fcd0a5f388794f6b67ef552c959d4d5e49d04f5b2\""
...
    
This error indicates that there was some collision with prior instances of the "doca-hbn-service" container, probably pre-reboot.
This issue indicates irregularities in the time of the machine, and usually that the DPU's time pre-reboot was later than the time post-reboot. This leads to bugs in the recovery of the container infrastructure daemons. It is of utmost importance that the time of the system does not jump backwards.
Solution
- Remove all YAML files from the deployment directory (/etc/kubelet.d). 
- Stop all pods:
      sudo crictl stopp $(crictl pods | tail -n +2 | awk '{ print $1 }')
- Clear all containers:
      sudo ctr -n k8s.io container rm $(ctr -n k8s.io container ls | tail -n +2 | awk '{ print $1 }')
- Make sure the system's time is correct, and adjust it if needed:
      date
- Restart the container infrastructure daemons:
      sudo systemctl restart kubelet.service
      sudo systemctl restart containerd.service
- Once the above operations are completed successfully, the container can be deployed (the YAML files can be copied back to /etc/kubelet.d). 
Graceful Shutdown
Before powering off or power cycling the DPU, it is strongly recommended to perform graceful shutdown of the DPU Arm OS.
Graceful shutdown of the Arm OS ensures that data within the eMMC/NVMe cache is properly written to storage, and helps prevent filesystem inconsistencies and file corruption.
There are several ways to gracefully shutdown the DPU Arm OS:
- Log into the DPU Arm OS and run a shutdown command prior to power cycling the host server. For example:
      sudo shutdown -h now
- Assuming the DPU BMC can issue NC-SI OEM commands to the DPU, issue the Shutdown Smart NIC OS NC-SI OEM command.
- After DPU Arm OS shutdown, it is recommended to issue a DPU Arm OS state query, which indicates whether the shutdown has completed (standby indication). This can be done by issuing the Get Smart NIC OS State NC-SI OEM command.