GPUNetIO Installation and Setup
DOCA GPUNetIO is included in the doca-all package, which is available from the DOCA downloads portal for all supported operating systems.
To install the required DOCA GPUNetIO components, use the package manager for your OS.
For Ubuntu/Debian:
apt install doca-all doca-sdk-gpunetio libdoca-sdk-gpunetio-dev
For RHEL:
yum install doca-all doca-sdk-gpunetio doca-sdk-gpunetio-devel
To achieve the best performance when building any DOCA GPUNetIO sample or application, you must set the buildtype to release in the meson.build file (e.g., buildtype = 'release'). Building in the default debug mode will result in significantly lower performance.
To run a DOCA GPUNetIO application, the system must be configured with both a GPU and a NIC (either NVIDIA® ConnectX® or NVIDIA® BlueField®), connected to the system via PCIe.
The system's internal hardware topology should be GPUDirect-RDMA-friendly to maximize the internal throughput between the GPU and the NIC. To verify the type of connection between the GPU and NIC:
$ nvidia-smi topo -m
GPU0 NIC0 NIC1 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NODE NODE 12-23,36-47 1 N/A
NIC0 NODE X PIX
NIC1 NODE PIX X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
To maximize throughput between the GPU and NIC, the system should have a PIX (or PXB) topology with a dedicated PCIe connection. A PHB topology is still acceptable if the GPU and NIC are on the same PCIe Host Bridge and NUMA node, although performance may vary depending on the platform. For optimal performance, it's recommended to avoid NODE and SYS topologies, as they may negatively impact performance despite the application remaining functional.
DOCA GPUNetIO has been fully tested on bare-metal systems and within Docker containers. Support for virtualized environments is currently considered experimental.
ConnectX NIC
Ensure the ConnectX firmware is compatible with the current DOCA release. NVIDIA recommends using ConnectX-6 Dx or later adapters.
Start MST:
$ sudo mst start

Check MST status to get the MST device identifier:

$ sudo mst status -v

Example output:

MST modules:
------------
MST PCI module is not loaded
MST PCI configuration module loaded
PCI devices:
------------
DEVICE_TYPE          MST                         PCI      RDMA    NET          NUMA
ConnectX6DX(rev:0)   /dev/mst/mt4125_pciconf0.1  b5:00.1  mlx5_1  net-ens6f1   0
ConnectX6DX(rev:0)   /dev/mst/mt4125_pciconf0    b5:00.0  mlx5_0  net-ens6f0   0
Configure ConnectX NIC:
For Ethernet transport, run the following commands, replacing <mst_device> with the actual MST device name (e.g., /dev/mst/mt4125_pciconf0):
mlxconfig -d <mst_device> s KEEP_ETH_LINK_UP_P1=1 KEEP_ETH_LINK_UP_P2=1 KEEP_IB_LINK_UP_P1=0 KEEP_IB_LINK_UP_P2=0
# This is required only if the application uses the Accurate Send Scheduling feature
mlxconfig -d <mst_device> --yes set ACCURATE_TX_SCHEDULER=1 REAL_TIME_CLOCK_ENABLE=1

Info: This example assumes a dual-port adapter. For a single-port adapter, only the P1 options apply.
For InfiniBand transport, run:
mlxconfig -d <mst_device> s KEEP_ETH_LINK_UP_P1=0 KEEP_ETH_LINK_UP_P2=0 KEEP_IB_LINK_UP_P1=1 KEEP_IB_LINK_UP_P2=1
# The Accurate Send Scheduling feature cannot be used with InfiniBand

Info: This example assumes a dual-port adapter. For a single-port adapter, only the P1 options apply.
Perform a cold reboot to apply the changes:
ipmitool power cycle
BlueField NIC
To use NVIDIA BlueField-2 or BlueField-3 with DOCA GPUNetIO, the DPU must be in NIC mode to expose the internal ConnectX to the host application.
Start MST:
$ sudo mst start

Check MST status to get the MST device identifier:

$ sudo mst status -v

Example output:

MST modules:
------------
MST PCI module is not loaded
MST PCI configuration module loaded
PCI devices:
------------
DEVICE_TYPE         MST                          PCI      RDMA    NET            NUMA
BlueField3(rev:1)   /dev/mst/mt41692_pciconf0.1  9f:00.1  mlx5_1  net-ens6f1np1  1
BlueField3(rev:1)   /dev/mst/mt41692_pciconf0    9f:00.0  mlx5_0  net-ens6f0np0  1
Configure BlueField NIC:
For Ethernet transport:
sudo mlxconfig -d /dev/mst/mt41692_pciconf0 --yes set LINK_TYPE_P1=2 LINK_TYPE_P2=2 INTERNAL_CPU_MODEL=1 INTERNAL_CPU_PAGE_SUPPLIER=1 INTERNAL_CPU_ESWITCH_MANAGER=1 INTERNAL_CPU_IB_VPORT0=1 INTERNAL_CPU_OFFLOAD_ENGINE=DISABLED

# This is required only if the application uses the Accurate Send Scheduling feature
sudo mlxconfig -d /dev/mst/mt41692_pciconf0 --yes set ACCURATE_TX_SCHEDULER=1 REAL_TIME_CLOCK_ENABLE=1

For InfiniBand transport:
sudo mlxconfig -d /dev/mst/mt41692_pciconf0 --yes set LINK_TYPE_P1=1 LINK_TYPE_P2=1 INTERNAL_CPU_MODEL=1 INTERNAL_CPU_PAGE_SUPPLIER=1 INTERNAL_CPU_ESWITCH_MANAGER=1 INTERNAL_CPU_IB_VPORT0=1 INTERNAL_CPU_OFFLOAD_ENGINE=DISABLED

# The Accurate Send Scheduling feature cannot be used with InfiniBand
Perform a cold reboot to apply the changes:
ipmitool power cycle
Example verification command for Ethernet:
sudo mlxconfig -d /dev/mst/mt41692_pciconf0 q LINK_TYPE_P1 LINK_TYPE_P2 INTERNAL_CPU_MODEL INTERNAL_CPU_PAGE_SUPPLIER INTERNAL_CPU_ESWITCH_MANAGER INTERNAL_CPU_IB_VPORT0 INTERNAL_CPU_OFFLOAD_ENGINE ACCURATE_TX_SCHEDULER REAL_TIME_CLOCK_ENABLE
Example output (Ethernet):
LINK_TYPE_P1                    ETH(2)
LINK_TYPE_P2                    ETH(2)
INTERNAL_CPU_MODEL              EMBEDDED_CPU(1)
INTERNAL_CPU_PAGE_SUPPLIER      EXT_HOST_PF(1)
INTERNAL_CPU_ESWITCH_MANAGER    EXT_HOST_PF(1)
INTERNAL_CPU_IB_VPORT0          EXT_HOST_PF(1)
INTERNAL_CPU_OFFLOAD_ENGINE     DISABLED(1)
ACCURATE_TX_SCHEDULER           True(1)
REAL_TIME_CLOCK_ENABLE          True(1)
On some x86 systems, the Access Control Services (ACS) must be disabled to ensure direct communication between the NIC and GPU, whether they reside on the same converged accelerator DPU or on different PCIe slots in the system. The recommended solution is to disable ACS control via BIOS (e.g., Supermicro or HPE) on your PCIe bridge. Alternatively, it is also possible to disable it via command line, but it may not be as effective as the BIOS option.
The following lspci -tvvv output illustrates a typical system topology:
$ lspci -tvvv
...
+-[0000:b0]-+-00.0 Intel Corporation Device 09a2
| +-00.1 Intel Corporation Device 09a4
| +-00.2 Intel Corporation Device 09a3
| +-00.4 Intel Corporation Device 0998
| \-02.0-[b1-b6]----00.0-[b2-b6]--+-00.0-[b3]--+-00.0 Mellanox Technologies MT42822 BlueField-2 integrated ConnectX-6 Dx network controller
| | +-00.1 Mellanox Technologies MT42822 BlueField-2 integrated ConnectX-6 Dx network controller
| | \-00.2 Mellanox Technologies MT42822 BlueField-2 SoC Management Interface
| \-01.0-[b4-b6]----00.0-[b5-b6]----08.0-[b6]----00.0 NVIDIA Corporation Device 20b8
The PCIe switch address to consider is b2:00.0 (the entry point of the DPU). All ACSCtl flags must be negative (reported with a trailing `-`, meaning disabled):
PCIe set
setpci -s b2:00.0 ECAP_ACS+0x6.w=0000
To verify that the setting has been applied correctly:
PCIe check
$ sudo lspci -s b2:00.0 -vvvv | grep -i ACSCtl
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
If the application still does not report any received packets, try disabling the IOMMU. On some systems, this can be done from the BIOS by locating the VT-d or IOMMU setting (often under the NorthBridge configuration), changing it to Disabled, and saving. The system may also require adding intel_iommu=off or amd_iommu=off to the kernel options, which can be done through the GRUB configuration as follows:
IOMMU
$ sudo vim /etc/default/grub
# GRUB_CMDLINE_LINUX_DEFAULT="iommu=off intel_iommu=off <more options>"
$ sudo update-grub
$ sudo reboot
CUDA Dependency
DOCA GPUNetIO components have a dependency on CUDA. These dependencies differ for the CPU-side shared library versus the GPU-side datapath components.
CPU Shared Library (libdoca_gpunetio.so)
This library depends on libcuda.so (the CUDA Driver API). Because it does not use the CUDA Runtime API, it is not subject to the versioning issues associated with the runtime.
GPU Datapath Components
The datapath functions are delivered as both header files and a static library, which have different requirements:
Header-only APIs (GPUNetIO Ethernet, GPUNetIO Verbs): These are inlined functions. Since they are compiled with your application, they are flexible and can be used with any recent CUDA version (e.g., CUDA 12.x or 13.x).
Static Library APIs (GPUNetIO DMA, CommCh, RDMA): This library is pre-built with CUDA 13.0. Therefore, any application using functions from this static library must be built with CUDA 13.0 or newer.
It is generally recommended to use CUDA 12.6 or newer wherever possible to take advantage of new features.
To decrease initial application startup latency, it is highly recommended to enable NVIDIA driver persistence mode:
nvidia-smi -pm 1
GDRCopy Installation
To enable direct CPU access to GPU memory without using CUDA APIs, DOCA requires the GDRCopy kernel module and library.
Install necessary packages:
sudo apt install -y check kmod
Clone the GDRCopy repository:
git clone https://github.com/NVIDIA/gdrcopy.git /opt/mellanox/gdrcopy

Build GDRCopy:
cd /opt/mellanox/gdrcopy && make
Load the GDRCopy kernel module:
./insmod.sh
Check if the gdrdrv and nvidia-peermem modules are loaded:
lsmod | egrep gdrdrv
Example output:
gdrdrv                 24576  0
nvidia              55726080  4 nvidia_uvm,nvidia_peermem,gdrdrv,nvidia_modeset

Export the GDRCopy library path:
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/opt/mellanox/gdrcopy/src
Ensure CUDA library paths are in the environment variables:
export PATH="/usr/local/cuda/bin:${PATH}"
export LD_LIBRARY_PATH="/usr/local/cuda/lib:/usr/local/cuda/lib64:${LD_LIBRARY_PATH}"
export CPATH="$(echo /usr/local/cuda/targets/{x86_64,sbsa}-linux/include | sed 's/ /:/'):${CPATH}"
GDRCopy is optional. If not installed, DOCA GPUNetIO cannot allocate memory using the DOCA_GPU_MEM_TYPE_GPU_CPU flag. If GDRCopy is not detected, DOCA GPUNetIO will log warning messages.
If GDRCopy is not required for your application, you can safely ignore the related warning messages. To use GDRCopy, ensure its installation path is included in the LD_LIBRARY_PATH environment variable or specified using the GDRCOPY_PATH_L environment variable.
GPU Memory Mapping
To enable the NIC to send and receive packets using GPU memory, a memory mapping mechanism must be used. DOCA supports two methods:
dmabuf (default method): The preferred, modern method for mapping GPU memory.
nvidia-peermem (fallback method): A legacy method used if dmabuf is not available or fails.
Using dmabuf
This is the primary method for mapping GPU memory. The prerequisites for this approach are:
Linux Kernel version 6.2 or later
libibverbs version 1.14.44 or later
CUDA Toolkit:
Version 12.5 or older: must be installed with the -m=kernel-open flag (implying open-source NVIDIA driver mode).
Version 12.6 or newer: open kernel mode is enabled by default.
Ensure your system has the nvidia-open drivers installed. If it reports cuda-drivers instead, the NVIDIA driver is installed as the closed-source version, and dmabuf cannot be used.
Using nvidia-peermem
This method is used if dmabuf is unavailable. It requires the nvidia-peermem kernel module, which is installed with the CUDA Toolkit, to be loaded:
Launch nvidia-peermem
sudo modprobe nvidia-peermem
Implementation and Fallback Logic
The recommended implementation is to attempt to get a dmabuf file descriptor first. If that fails, the application should fall back to the nvidia-peermem method.
The following code snippet demonstrates how to use dmabuf for GPU memory mapping with DOCA mmap, including the fallback logic:
GPU Configuration
/* Get the dmabuf file-descriptor for the GPU memory buffer from CUDA */
result = doca_gpu_dmabuf_fd(gpu_dev, gpu_buffer_addr, gpu_buffer_size, &(dmabuf_fd));
if (result != DOCA_SUCCESS) {
/* Fallback to nvidia-peermem legacy method if dmabuf fails */
doca_mmap_set_memrange(gpu_buffer_mmap, gpu_buffer_addr, gpu_buffer_size);
} else {
/* Create DOCA mmap using dmabuf */
doca_mmap_set_dmabuf_memrange(gpu_buffer_mmap, dmabuf_fd, gpu_buffer_addr, 0, gpu_buffer_size);
}
Handling dmabuf Failure
A failure in doca_gpu_dmabuf_fd (the if block in the example) likely indicates that the NVIDIA driver is not in open-source mode.
When doca_mmap_start is subsequently called, DOCA will attempt to map the GPU memory. If dmabuf was not set, it will automatically fall back to the legacy nvidia-peermem method. In this case, the following warning message is logged:
GPU Configuration
[DOCA][WRN][linux_devx_adapter.cpp:374] devx adapter 0x5566a16018e0: Registration using dmabuf is not supported, falling back to legacy registration
If your application can rely on nvidia-peermem and does not strictly require dmabuf, this warning message can be safely ignored.
Sample Implementations
GPUNetIO Ethernet samples use DOCA mmap with dmabuf and nvidia-peermem as the fallback (following the logic in the code example above).
GPUNetIO Verbs samples show an alternative verbs-based method, using ibv_reg_dmabuf_mr (for dmabuf) and ibv_reg_mr (as the fallback).
GPU BAR1 Size
Every time a GPU buffer is mapped to the NIC (e.g., buffers associated with send or receive queues), a portion of the GPU BAR1 mapping space is used. Therefore, it is important to check that the BAR1 mapping is large enough to hold all the bytes the DOCA GPUNetIO application is trying to map. To verify the BAR1 mapping space of a GPU you can use nvidia-smi:
BAR1 mapping
$ nvidia-smi -q
==============NVSMI LOG==============
.....
Attached GPUs : 1
GPU 00000000:CA:00.0
Product Name : NVIDIA A100 80GB PCIe
Product Architecture : Ampere
Persistence Mode : Enabled
.....
BAR1 Memory Usage
Total : 131072 MiB
Used : 1 MiB
Free : 131071 MiB
By default, some GPUs (e.g. RTX models) may have a very small BAR1 size:
BAR1 mapping
$ nvidia-smi -q | grep -i bar -A 3
BAR1 Memory Usage
Total : 256 MiB
Used : 6 MiB
Free : 250 MiB
If the BAR1 size is insufficient, DOCA GPUNetIO applications may exit with errors because DOCA mmap fails to map the GPU memory buffers to the NIC (e.g., Failed to start mmap DOCA Driver call failure). To overcome this issue, the GPU BAR1 size must be increased from the BIOS; the system must support and enable the "Resizable BAR" option.
Running without Root Privileges
All DOCA GPUNetIO samples and applications using Ethernet rely on DOCA Flow. Therefore, they must be executed with sudo or root privileges.
However, Verbs, RDMA and DMA samples can be run without sudo privileges if a specific option is enabled in the NVIDIA driver:
Create a configuration file for the NVIDIA driver:
cat <<EOF | sudo tee /etc/modprobe.d/nvidia.conf
options nvidia NVreg_RegistryDwords="PeerMappingOverride=1;"
EOF

Perform a cold reboot to ensure the changes take effect.
Verify that the configuration has been applied using the following command:
$ grep RegistryDwords /proc/driver/nvidia/params

You should see the following output confirming the setting:

RegistryDwords: "PeerMappingOverride=1;"
Due to hardware topology limitations, DGX Spark does not support GPUDirect RDMA. However, DOCA GPUNetIO applications can still execute on these systems by utilizing CPU pinned memory (DOCA_GPU_MEM_TYPE_CPU_GPU) instead of GPU memory. This applies to both queue allocation and packet/data memory.
Ethernet Configuration
When creating Rx or Tx queues with DOCA Ethernet, you must use the setters in doca_eth_rxq_gpu_data_path.h and doca_eth_txq_gpu_data_path.h to allocate queues on the CPU-GPU shared memory:
DOCA Ethernet queue memory setters
doca_error_t doca_eth_rxq_gpu_set_rq_mem_type(struct doca_eth_rxq *eth_rxq, enum doca_gpu_mem_type rq_mem_type);
doca_error_t doca_eth_txq_gpu_set_sq_mem_type(struct doca_eth_txq *eth_txq, enum doca_gpu_mem_type sq_mem_type);
For Tx queues, you must also enable CPU proxy mode to handle transmission:
DOCA Ethernet CPU proxy mode
doca_error_t doca_eth_txq_gpu_set_uar_on_cpu(struct doca_eth_txq *eth_txq);
See the gpunetio_simple_send_sample for an implementation of CPU proxy on the data path.
Verbs RDMA Configuration
A similar approach applies to GPUNetIO Verbs applications, with the distinction that the application is responsible for explicitly allocating the queue memory (QP UMEM) using the DOCA_GPU_MEM_TYPE_CPU_GPU memory type.
For the send side, CPU proxy mode is also supported. Refer to the following GPUNetIO Verbs examples for implementation details:
gpunetio_verbs_put_bw_sample
gpunetio_verbs_put_counter_bw_sample
gpunetio_verbs_write_bw_sample
gpunetio_verbs_write_lat_sample