GPUNetIO Installation and Setup
DOCA GPUNetIO is included in the doca-all package, which is available from the DOCA downloads portal for all supported operating systems.
To install the required DOCA GPUNetIO components, use the package manager for your OS.
For Ubuntu/Debian:
apt install doca-all doca-sdk-gpunetio libdoca-sdk-gpunetio-dev
For RHEL:
yum install doca-all doca-sdk-gpunetio doca-sdk-gpunetio-devel
To achieve the best performance when building any DOCA GPUNetIO sample or application, you must set the buildtype to release in the meson.build file (e.g., buildtype = 'release'). Building in the default debug mode will result in significantly lower performance.
To run a DOCA GPUNetIO application, the system must be configured with both a GPU and a NIC (either NVIDIA® ConnectX® or NVIDIA® BlueField®), connected to the system via PCIe.
The system's internal hardware topology should be GPUDirect-RDMA-friendly to maximize the internal throughput between the GPU and the NIC. To verify the type of connection between the GPU and NIC:
$ nvidia-smi topo -m
GPU0 NIC0 NIC1 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NODE NODE 12-23,36-47 1 N/A
NIC0 NODE X PIX
NIC1 NODE PIX X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
To maximize throughput between the GPU and NIC, the system should have a PIX (or PXB) topology with a dedicated PCIe connection. A PHB topology is still acceptable if the GPU and NIC are on the same PCIe Host Bridge and NUMA node, although performance may vary depending on the platform. For optimal performance, it's recommended to avoid NODE and SYS topologies, as they may negatively impact performance despite the application remaining functional.
DOCA GPUNetIO has been fully tested on bare-metal systems and within Docker containers. Support for virtualized environments is currently considered experimental.
ConnectX NIC
Ensure the ConnectX firmware is compatible with the current DOCA release. NVIDIA recommends using ConnectX-6 Dx or later adapters.
Start MST:
$ sudo mst start

Check MST status to get the MST device identifier:

$ sudo mst status -v

Example output:

MST modules:
------------
MST PCI module is not loaded
MST PCI configuration module loaded
PCI devices:
------------
DEVICE_TYPE          MST                         PCI      RDMA    NET          NUMA
ConnectX6DX(rev:0)   /dev/mst/mt4125_pciconf0.1  b5:00.1  mlx5_1  net-ens6f1   0
ConnectX6DX(rev:0)   /dev/mst/mt4125_pciconf0    b5:00.0  mlx5_0  net-ens6f0   0
Configure ConnectX NIC:
For Ethernet transport, run the following commands, replacing <mst_device> with the actual MST device name (e.g., /dev/mst/mt4125_pciconf0):
mlxconfig -d <mst_device> s KEEP_ETH_LINK_UP_P1=1 KEEP_ETH_LINK_UP_P2=1 KEEP_IB_LINK_UP_P1=0 KEEP_IB_LINK_UP_P2=0
# This is required only if the application uses the Accurate Send Scheduling feature
mlxconfig -d <mst_device> --yes set ACCURATE_TX_SCHEDULER=1 REAL_TIME_CLOCK_ENABLE=1

Info: This example assumes a dual-port adapter. For a single-port adapter, only the P1 options apply.
For InfiniBand transport, run:
mlxconfig -d <mst_device> s KEEP_ETH_LINK_UP_P1=0 KEEP_ETH_LINK_UP_P2=0 KEEP_IB_LINK_UP_P1=1 KEEP_IB_LINK_UP_P2=1
# The Accurate Send Scheduling feature cannot be used with InfiniBand

Info: This example assumes a dual-port adapter. For a single-port adapter, only the P1 options apply.
Perform a cold reboot to apply the changes:
ipmitool power cycle
BlueField NIC
To use NVIDIA BlueField-2 or BlueField-3 with DOCA GPUNetIO, the DPU must be in NIC mode to expose the internal ConnectX to the host application.
Start MST:
$ sudo mst start

Check MST status to get the MST device identifier:

$ sudo mst status -v

Example output:

MST modules:
------------
MST PCI module is not loaded
MST PCI configuration module loaded
PCI devices:
------------
DEVICE_TYPE         MST                          PCI      RDMA    NET            NUMA
BlueField3(rev:1)   /dev/mst/mt41692_pciconf0.1  9f:00.1  mlx5_1  net-ens6f1np1  1
BlueField3(rev:1)   /dev/mst/mt41692_pciconf0    9f:00.0  mlx5_0  net-ens6f0np0  1
Configure BlueField NIC:
For Ethernet transport:
sudo mlxconfig -d /dev/mst/mt41692_pciconf0 --yes set LINK_TYPE_P1=2 LINK_TYPE_P2=2 INTERNAL_CPU_MODEL=1 INTERNAL_CPU_PAGE_SUPPLIER=1 INTERNAL_CPU_ESWITCH_MANAGER=1 INTERNAL_CPU_IB_VPORT0=1 INTERNAL_CPU_OFFLOAD_ENGINE=DISABLED

# This is required only if the application uses the Accurate Send Scheduling feature
sudo mlxconfig -d /dev/mst/mt41692_pciconf0 --yes set ACCURATE_TX_SCHEDULER=1 REAL_TIME_CLOCK_ENABLE=1

For InfiniBand transport:
sudo mlxconfig -d /dev/mst/mt41692_pciconf0 --yes set LINK_TYPE_P1=1 LINK_TYPE_P2=1 INTERNAL_CPU_MODEL=1 INTERNAL_CPU_PAGE_SUPPLIER=1 INTERNAL_CPU_ESWITCH_MANAGER=1 INTERNAL_CPU_IB_VPORT0=1 INTERNAL_CPU_OFFLOAD_ENGINE=DISABLED

# The Accurate Send Scheduling feature cannot be used with InfiniBand
Perform a cold reboot to apply the changes:
ipmitool power cycle
Example verification command for Ethernet:
sudo mlxconfig -d /dev/mst/mt41692_pciconf0 q LINK_TYPE_P1 LINK_TYPE_P2 INTERNAL_CPU_MODEL INTERNAL_CPU_PAGE_SUPPLIER INTERNAL_CPU_ESWITCH_MANAGER INTERNAL_CPU_IB_VPORT0 INTERNAL_CPU_OFFLOAD_ENGINE ACCURATE_TX_SCHEDULER REAL_TIME_CLOCK_ENABLE
Example output (Ethernet):
LINK_TYPE_P1                    ETH(2)
LINK_TYPE_P2                    ETH(2)
INTERNAL_CPU_MODEL              EMBEDDED_CPU(1)
INTERNAL_CPU_PAGE_SUPPLIER      EXT_HOST_PF(1)
INTERNAL_CPU_ESWITCH_MANAGER    EXT_HOST_PF(1)
INTERNAL_CPU_IB_VPORT0          EXT_HOST_PF(1)
INTERNAL_CPU_OFFLOAD_ENGINE     DISABLED(1)
ACCURATE_TX_SCHEDULER           True(1)
REAL_TIME_CLOCK_ENABLE          True(1)
On some x86 systems, the Access Control Services (ACS) must be disabled to ensure direct communication between the NIC and GPU, whether they reside on the same converged accelerator DPU or on different PCIe slots in the system. The recommended solution is to disable ACS control via BIOS (e.g., Supermicro or HPE) on your PCIe bridge. Alternatively, it is also possible to disable it via command line, but it may not be as effective as the BIOS option.
The following lspci -tvvv output illustrates a typical system topology:
$ lspci -tvvv
...
+-[0000:b0]-+-00.0 Intel Corporation Device 09a2
| +-00.1 Intel Corporation Device 09a4
| +-00.2 Intel Corporation Device 09a3
| +-00.4 Intel Corporation Device 0998
| \-02.0-[b1-b6]----00.0-[b2-b6]--+-00.0-[b3]--+-00.0 Mellanox Technologies MT42822 BlueField-2 integrated ConnectX-6 Dx network controller
| | +-00.1 Mellanox Technologies MT42822 BlueField-2 integrated ConnectX-6 Dx network controller
| | \-00.2 Mellanox Technologies MT42822 BlueField-2 SoC Management Interface
| \-01.0-[b4-b6]----00.0-[b5-b6]----08.0-[b6]----00.0 NVIDIA Corporation Device 20b8
The PCIe switch address to consider is b2:00.0 (the entry point of the DPU). All ACSCtl flags must be negative (reported with a trailing `-`, meaning disabled):
PCIe set
setpci -s b2:00.0 ECAP_ACS+0x6.w=0000
To verify that the setting has been applied correctly:
PCIe check
$ sudo lspci -s b2:00.0 -vvvv | grep -i ACSCtl
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
If the application still does not report any received packets, try disabling the IOMMU. On some systems, this can be done from the BIOS by locating the VT-d or IOMMU setting (often under the NorthBridge configuration), changing it to Disabled, and saving. The system may also require adding intel_iommu=off or amd_iommu=off to the kernel options, which can be done through the GRUB configuration as follows:
IOMMU
$ sudo vim /etc/default/grub
# GRUB_CMDLINE_LINUX_DEFAULT="iommu=off intel_iommu=off <more options>"
$ sudo update-grub
$ sudo reboot
CUDA Dependency
DOCA GPUNetIO components have a dependency on CUDA. These dependencies differ for the CPU-side shared library versus the GPU-side datapath components.
CPU Shared Library (libdoca_gpunetio.so)
This library depends on libcuda.so (the CUDA Driver API). Because it does not use the CUDA Runtime API, it is not subject to the versioning issues associated with the runtime.
GPU Datapath Components
The datapath functions are delivered as both header files and a static library, which have different requirements:
Header-only APIs (GPUNetIO Ethernet, GPUNetIO Verbs): These are inlined functions. Since they are compiled with your application, they are flexible and can be used with any recent CUDA version (e.g., CUDA 12.x or 13.x).
Static Library APIs (GPUNetIO DMA, CommCh, RDMA): This library is pre-built with CUDA 13.0. Therefore, any application using functions from this static library must be built with CUDA 13.0 or newer.
It is generally recommended to use CUDA 12.6 or newer wherever possible to take advantage of new features.
To decrease initial application startup latency, it is highly recommended to enable NVIDIA driver persistence mode:
nvidia-smi -pm 1
GDRCopy Installation
To enable direct CPU access to GPU memory without using CUDA APIs, DOCA requires the GDRCopy kernel module and library.
Install necessary packages:
sudo apt install -y check kmod
Clone the GDRCopy repository:
git clone https://github.com/NVIDIA/gdrcopy.git /opt/mellanox/gdrcopy

Build GDRCopy:
cd /opt/mellanox/gdrcopy && make
Load the GDRCopy kernel module:
./insmod.sh
Check if the gdrdrv and nvidia-peermem modules are loaded:
lsmod | egrep gdrdrv
Example output:
gdrdrv                 24576  0
nvidia              55726080  4 nvidia_uvm,nvidia_peermem,gdrdrv,nvidia_modeset

Export the GDRCopy library path:
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/opt/mellanox/gdrcopy/src
Ensure CUDA library paths are in the environment variables:
export PATH="/usr/local/cuda/bin:${PATH}"
export LD_LIBRARY_PATH="/usr/local/cuda/lib:/usr/local/cuda/lib64:${LD_LIBRARY_PATH}"
export CPATH="$(echo /usr/local/cuda/targets/{x86_64,sbsa}-linux/include | sed 's/ /:/'):${CPATH}"
GDRCopy is optional. If not installed, DOCA GPUNetIO cannot allocate memory using the DOCA_GPU_MEM_TYPE_GPU_CPU flag. If GDRCopy is not detected, DOCA GPUNetIO will log warning messages.
If GDRCopy is not required for your application, you can safely ignore the related warning messages. To use GDRCopy, ensure its installation path is included in the LD_LIBRARY_PATH environment variable or specified using the GDRCOPY_PATH_L environment variable.
GPU Memory Mapping
To enable the NIC to send and receive packets using GPU memory, a memory mapping mechanism must be used. DOCA supports two methods:
dmabuf (default method): The preferred, modern method for mapping GPU memory.
nvidia-peermem (fallback method): A legacy method used if dmabuf is not available or fails.
Using dmabuf
This is the primary method for mapping GPU memory. The prerequisites for this approach are:
Linux Kernel version 6.2 or later
libibverbs version 1.14.44 or later
CUDA Toolkit:
Version 12.5 or older: must be installed with the -m=kernel-open flag (implying open-source NVIDIA driver mode).
Version 12.6 or newer: open kernel mode is enabled by default.
Ensure your system has the nvidia-open drivers installed. If it reports cuda-drivers instead, the NVIDIA driver is installed as the closed-source version, and dmabuf cannot be used.
Using nvidia-peermem
This method is used if dmabuf is unavailable. It requires the nvidia-peermem kernel module, which is installed with the CUDA Toolkit, to be loaded:
Launch nvidia-peermem
sudo modprobe nvidia-peermem
Implementation and Fallback Logic
The recommended implementation is to attempt to get a dmabuf file descriptor first. If that fails, the application should fall back to the nvidia-peermem method.
The following code snippet demonstrates how to use dmabuf for GPU memory mapping with DOCA mmap, including the fallback logic:
GPU Configuration
/* Get the dmabuf file-descriptor for the GPU memory buffer from CUDA */
result = doca_gpu_dmabuf_fd(gpu_dev, gpu_buffer_addr, gpu_buffer_size, &(dmabuf_fd));
if (result != DOCA_SUCCESS) {
/* Fallback to nvidia-peermem legacy method if dmabuf fails */
doca_mmap_set_memrange(gpu_buffer_mmap, gpu_buffer_addr, gpu_buffer_size);
} else {
/* Create DOCA mmap using dmabuf */
doca_mmap_set_dmabuf_memrange(gpu_buffer_mmap, dmabuf_fd, gpu_buffer_addr, 0, gpu_buffer_size);
}
Handling dmabuf Failure
A failure in doca_gpu_dmabuf_fd (the if block in the example) likely indicates that the NVIDIA driver is not in open-source mode.
When doca_mmap_start is subsequently called, DOCA will attempt to map the GPU memory. If dmabuf was not set, it will automatically fall back to the legacy nvidia-peermem method. In this case, the following warning message is logged:
GPU Configuration
[DOCA][WRN][linux_devx_adapter.cpp:374] devx adapter 0x5566a16018e0: Registration using dmabuf is not supported, falling back to legacy registration
If your application can rely on nvidia-peermem and does not strictly require dmabuf, this warning message can be safely ignored.
Sample Implementations
GPUNetIO Ethernet samples use DOCA mmap with dmabuf and nvidia-peermem as the fallback (following the logic in the code example above).
GPUNetIO Verbs samples show an alternative verbs-based method, using ibv_reg_dmabuf_mr (for dmabuf) and ibv_reg_mr (as the fallback).
GPU BAR1 Size
Every time a GPU buffer is mapped to the NIC (e.g., buffers associated with send or receive queues), a portion of the GPU BAR1 mapping space is used. Therefore, it is important to check that the BAR1 mapping is large enough to hold all the bytes the DOCA GPUNetIO application is trying to map. To verify the BAR1 mapping space of a GPU you can use nvidia-smi:
BAR1 mapping
$ nvidia-smi -q
==============NVSMI LOG==============
.....
Attached GPUs : 1
GPU 00000000:CA:00.0
Product Name : NVIDIA A100 80GB PCIe
Product Architecture : Ampere
Persistence Mode : Enabled
.....
BAR1 Memory Usage
Total : 131072 MiB
Used : 1 MiB
Free : 131071 MiB
By default, some GPUs (e.g. RTX models) may have a very small BAR1 size:
BAR1 mapping
$ nvidia-smi -q | grep -i bar -A 3
BAR1 Memory Usage
Total : 256 MiB
Used : 6 MiB
Free : 250 MiB
If the BAR1 size is insufficient, DOCA GPUNetIO applications may exit with errors because DOCA mmap fails to map the GPU memory buffers to the NIC (e.g., Failed to start mmap DOCA Driver call failure). To overcome this issue, the GPU BAR1 size must be increased from the BIOS; the system must support and enable the "Resizable BAR" option.
Running without Root Privileges
All DOCA GPUNetIO samples and applications using Ethernet rely on DOCA Flow. Therefore, they must be executed with sudo or root privileges.
However, Verbs, RDMA and DMA samples can be run without sudo privileges if a specific option is enabled in the NVIDIA driver:
Create a configuration file for the NVIDIA driver:
cat <<EOF | sudo tee /etc/modprobe.d/nvidia.conf
options nvidia NVreg_RegistryDwords="PeerMappingOverride=1;"
EOF

Perform a cold reboot to ensure the changes take effect.
Verify that the configuration has been applied using the following command:
$ grep RegistryDwords /proc/driver/nvidia/params

You should see the following output confirming the setting:

RegistryDwords: "PeerMappingOverride=1;"
Due to hardware topology limitations, DGX Spark does not support GPUDirect RDMA. However, DOCA GPUNetIO applications can still execute on these systems by utilizing CPU pinned memory (DOCA_GPU_MEM_TYPE_CPU_GPU) instead of GPU memory. This applies to both queue allocation and packet/data memory.
Ethernet Configuration
When creating Rx or Tx queues with DOCA Ethernet, you must use the setters in doca_eth_rxq_gpu_data_path.h and doca_eth_txq_gpu_data_path.h to allocate queues on the CPU-GPU shared memory:
DOCA Ethernet queue memory setters
doca_error_t doca_eth_rxq_gpu_set_rq_mem_type(struct doca_eth_rxq *eth_rxq, enum doca_gpu_mem_type rq_mem_type);
doca_error_t doca_eth_txq_gpu_set_sq_mem_type(struct doca_eth_txq *eth_txq, enum doca_gpu_mem_type sq_mem_type);
For Tx queues, you must also enable CPU proxy mode to handle transmission:
DOCA Ethernet CPU proxy mode
doca_error_t doca_eth_txq_gpu_set_uar_on_cpu(struct doca_eth_txq *eth_txq);
See the gpunetio_simple_send_sample for an implementation of CPU proxy on the data path.
Verbs RDMA Configuration
A similar approach applies to GPUNetIO Verbs applications, with the distinction that the application is responsible for explicitly allocating the queue memory (QP UMEM) using the DOCA_GPU_MEM_TYPE_CPU_GPU memory type.
For the send side, CPU proxy mode is also supported. Refer to the following GPUNetIO Verbs examples for implementation details:
gpunetio_verbs_put_bw_sample
gpunetio_verbs_put_counter_bw_sample
gpunetio_verbs_write_bw_sample
gpunetio_verbs_write_lat_sample