DOCA Documentation v3.2.0

DOCA DPU GPU Remote Offload Application Guide

The DOCA DPU GPU Remote Offload applications serve as a reference implementation, demonstrating how to develop a GPU offload application using DOCA Comch and DOCA GPUNetIO on the NVIDIA BlueField platform with NVIDIA GPUs.

These applications are simplified for clarity. A real-world solution would require more complexity tailored to a specific use case.

The general architecture involves:

  • A server application running on the DPU.

  • An orchestrator application running on the host CPU, which launches a CUDA Kernel.

The server receives TCP requests from remote clients and forwards them to the GPU via DOCA Comch. The GPU processes the message (in this reference, by reversing the byte order) and returns a response to the server. The server then forwards this response to the remote clients over TCP.

DOCA DPU GPU Remote Offload consists of three separate applications that work together to offload work from a DPU to a host-side GPU.

[Figure: DOCA DPU GPU Remote Offload architecture]

doca_dpu_gpu_remote_offload_server

This application runs on the BlueField DPU and acts as both a DOCA Comch server and a TCP server.

  • DOCA Comch: Used to communicate with the Orchestrator and GPU.

  • TCP: Used to communicate with remote clients for requests and responses.

doca_dpu_gpu_remote_offload_orchestrator

This application configures and launches the CUDA Kernel for processing. It connects to the server via DOCA Comch, exchanges configuration information, and then launches the kernel. The Orchestrator application continues to run, monitoring the Comch connection and the CUDA Kernel, and handles a clean shutdown.

doca_dpu_gpu_remote_offload_client

This application sends requests to and receives responses from the server. It validates the responses and displays throughput figures.

doca_dpu_gpu_remote_offload_server

The server application executes in two main phases: a Control Phase and a TCP Server Phase.

Control Phase

This phase begins when the application is launched. It configures the DOCA Comch server, waits for the orchestrator application to connect, and spawns threads to handle incoming messages. Once producers and consumers are connected, it creates a TCP listening socket.

A typical flow is:

  1. Open the required DOCA device.

  2. Create the DOCA Comch Server.

  3. Wait for the Orchestrator application to connect.

  4. Prepare the appropriate number of Socket threads. Each thread will:

    • Create DOCA Comch Producers and Consumers.

    • Wait for Producers and Consumers to be connected.

    • Create and pool doca_comch_producer_send_tasks.

    • Create and submit doca_comch_consumer_post_recv_tasks.

  5. Create a TCP socket to listen for connections.

  6. Start the TCP Server phase.

TCP Server Phase

In this phase, the application listens on the configured TCP port for incoming connections. When a client connects, the thread handling that connection forwards messages to the GPU via DOCA Comch. It continues sending until the socket has no more data or the send_task pool is exhausted. It then waits for a response from the GPU, which it forwards back to the remote client.

A typical per-thread flow loop is:

  1. Poll the DOCA PE

    • If a doca_comch_producer_send_task completion is received, add the task back to the task pool.

    • If a doca_comch_consumer_post_recv_task completion is received:

      • Extract the response data.

      • Write the response to the TCP socket.

  2. Poll TCP socket

    • If data can be read from the socket:

      • Read data from the socket.

      • Get the next available doca_comch_producer_send_task.

      • Copy message contents into the send task's data buffer.

      • Submit the send task.

In this phase, the application also listens for DOCA Comch control messages from the orchestrator or a CTRL-C signal to ensure a clean shutdown.

doca_dpu_gpu_remote_offload_orchestrator

This application executes in two phases: a Control Phase and a GPU Processing Phase.

Control Phase

This phase configures the DOCA Comch client and connects to the server. It creates producers and consumers, allocates GPU memory, waits for connections, and then launches the CUDA Kernel.

  1. Open the required DOCA device.

  2. Open the required GPU device.

  3. Create the DOCA Comch client.

  4. Connect to the DOCA Comch server.

  5. Allocate the appropriate GPU memory.

  6. Prepare the appropriate number of Producers and Consumers.

  7. Launch the CUDA Kernel.

GPU Processing Phase

Once the CUDA Kernel is launched, the CPU portion of the application remains running. It monitors the kernel's execution and listens for DOCA Comch control shutdown messages (from the server or a user CTRL-C signal). If a shutdown is detected, it stops the kernel and cleans up memory.

The GPU processing portion initially submits multiple post_recv buffers. It then enters a loop, polling for messages. When a message is received, it processes it (reverses the bytes), sends the response back to the server, and resubmits the buffer. This continues until a fatal error or a global stop flag is set by the CPU.

A typical CUDA thread loop is:

  1. Poll for post_recv messages

    • If a message is received:

      • Extract the data.

      • Verify the message is a client request.

      • Reverse the order of the bytes in the data.

      • Record the buffer in the inflight_messages array.

      • Submit the response to the server using doca_dev_gpu_comch_producer_send.

  2. Poll for producer send message completions

    • If a send completion is indicated:

      • Use user_msg_id to determine which buffer was sent.

      • Submit this buffer to receive a new message with doca_dev_gpu_comch_consumer_post_recv.

doca_dpu_gpu_remote_offload_client

This application is simpler than the other two. Its purpose is to initiate TCP connections to the doca_dpu_gpu_remote_offload_server, using one connection per thread.

Each thread sends a specified number of request messages while recording throughput. Upon receiving a response, the client validates its content against the expected response; if they do not match, the application exits with an error.

If all requests are sent and successfully validated, the application outputs final statistics for the run, including:

  • Run length

  • Operation count

  • Transmit and receive data rates

  • IO rate

Server

The doca_dpu_gpu_remote_offload_server application leverages the following DOCA libraries:

  • DOCA Comch

  • DOCA Arg Parser

Orchestrator

The doca_dpu_gpu_remote_offload_orchestrator application leverages the following DOCA libraries:

  • DOCA Comch

  • DOCA GPUNetIO

  • DOCA Arg Parser

Client

The doca_dpu_gpu_remote_offload_client application does not leverage any DOCA data-path libraries; however, it does utilize DOCA Arg Parser.

Warning

The doca_dpu_gpu_remote_offload_orchestrator application requires CUDA version 13. Compilation will fail if an older version is used.

The doca_dpu_gpu_remote_offload_server application will only compile on the DPU, and the doca_dpu_gpu_remote_offload_orchestrator will only compile on x86 hosts. The doca_dpu_gpu_remote_offload_client application compiles on all architectures.

Compiling All Applications

To build all DOCA applications together:

cd /opt/mellanox/doca/applications/
meson /tmp/build
ninja -C /tmp/build

Info

The applications are created under /tmp/build/dpu_gpu_remote_offload/.


Compiling Only DPU GPU Remote Offload Applications

To directly build only the DPU GPU Remote Offload applications:

cd /opt/mellanox/doca/applications/
meson /tmp/build -Denable_all_applications=false -Denable_dpu_gpu_remote_offload=true
ninja -C /tmp/build

Alternatively, you can edit /opt/mellanox/doca/applications/meson_options.txt:

  1. Set enable_all_applications to false.

  2. Set enable_dpu_gpu_remote_offload to true.

  3. Run the compilation commands:

    cd /opt/mellanox/doca/applications/
    meson /tmp/build
    ninja -C /tmp/build

Info

The applications are created under /tmp/build/dpu_gpu_remote_offload/.


The DOCA DPU GPU Remote Offload applications are provided in source form and must be compiled before execution.

Server

Note

The server application must run on the NVIDIA BlueField DPU and must be started before the orchestrator or client.

Application Usage Instructions

Usage: doca_dpu_gpu_remote_offload_server [DOCA Flags] [Program Flags]

DOCA Flags:
  -h, --help                                Print a help synopsis
  -v, --version                             Print program version information
  -l, --log-level                           Set the (numeric) log level for the program <10=DISABLE, 20=CRITICAL, 30=ERROR, 40=WARNING, 50=INFO, 60=DEBUG, 70=TRACE>
  --sdk-log-level                           Set the SDK (numeric) log level for the program <10=DISABLE, 20=CRITICAL, 30=ERROR, 40=WARNING, 50=INFO, 60=DEBUG, 70=TRACE>
  -j, --json <path>                         Parse command line flags from an input json file

Program Flags:
  -d, --device-id <DEV ID>                  Device ID (mandatory).
  -r, --representor-id <REPRESENTOR ID>     Representor ID (mandatory).
  -c, --comch-channel-name <DEV ID>         Comch channel name (optional).
  -p, --server-listen-port <PORT>           Server listen port (mandatory).
  --cpu <CPU>                               CPU to use when executing data path operations (mandatory). May be repeated for more cores
  --max-concurrent-messages <NUM_MESSAGES>  Set the maximum number of concurrent messages that can be processed per thread (optional).
  --max-message-length <LENGTH>             Set the maximum length of a message that can be processed (exclusive of header) (optional).


CLI Example of the Application

./doca_dpu_gpu_remote_offload_server --device-id 03:00.0 --representor-id d8:00.0 \
    -p 12345 --max-concurrent-messages 128 --cpu 1 --cpu 2

Note

The DOCA Comch device PCIe address (03:00.0) and the representor PCIe address (d8:00.0) must match the addresses of the desired devices.


Command Line Flags

Flag Type | Short Flag | Long Flag | Description
General flags | h | help | Prints a help synopsis
General flags | v | version | Prints program version information
General flags | l | log-level | Set the log level for the application: DISABLE=10, CRITICAL=20, ERROR=30, WARNING=40, INFO=50, DEBUG=60, TRACE=70 (TRACE requires compilation with TRACE log level support)
General flags | N/A | sdk-log-level | Set the log level for the SDK: DISABLE=10, CRITICAL=20, ERROR=30, WARNING=40, INFO=50, DEBUG=60, TRACE=70
General flags | j | json | Parse command line flags from an input JSON file as well as from the CLI (if provided)
Program flags | d | device-id | Comm Channel DOCA device PCIe address (mandatory)
Program flags | r | representor-id | Comm Channel DOCA device representor PCIe address (mandatory)
Program flags | c | comch-channel-name | A custom name for the DOCA Comch connection; must be the same on both server and orchestrator (optional)
Program flags | p | server-listen-port | The TCP port the server uses to listen for client connections (mandatory)
Program flags | N/A | cpu | CPU to use when executing data path operations; may be repeated to use multiple CPU cores (mandatory)
Program flags | N/A | max-concurrent-messages | The maximum number of concurrent messages that can be processed by each thread (optional)
Program flags | N/A | max-message-length | The maximum length of a message, in bytes, that can be supported (optional)

Orchestrator

Note

The orchestrator application runs on the host CPU and must be started after the server application.

Application Usage Instructions

Usage: doca_dpu_gpu_remote_offload_orchestrator [DOCA Flags] [Program Flags]

DOCA Flags:
  -h, --help                                Print a help synopsis
  -v, --version                             Print program version information
  -l, --log-level                           Set the (numeric) log level for the program <10=DISABLE, 20=CRITICAL, 30=ERROR, 40=WARNING, 50=INFO, 60=DEBUG, 70=TRACE>
  --sdk-log-level                           Set the SDK (numeric) log level for the program <10=DISABLE, 20=CRITICAL, 30=ERROR, 40=WARNING, 50=INFO, 60=DEBUG, 70=TRACE>
  -j, --json <path>                         Parse command line flags from an input json file

Program Flags:
  -d, --device-id <DEV ID>                  Device ID (mandatory).
  -g, --gpu-pci-addr <PCI ADDRESS>          GPU PCIe Address (mandatory).
  -c, --comch-channel-name <DEV ID>         Comch channel name (optional).
  -t, --num-gpu-threads <NUM>               Number of GPU threads to use when executing data path operations (mandatory). Must be the same as core count on server.
  --max-concurrent-messages <NUM_MESSAGES>  Set the maximum number of concurrent messages that can be processed per thread (optional).
  --max-message-length <LENGTH>             Set the maximum length of a message that can be processed (exclusive of header) (optional).


CLI Example of the Application

./doca_dpu_gpu_remote_offload_orchestrator --device-id d8:00.0 --gpu-pci-addr b5:00.0 \
    --max-concurrent-messages 128 --num-gpu-threads 2

Note

The device PCIe address (d8:00.0) and GPU PCIe address (b5:00.0) must match your devices.

Note

The number of GPU threads should be the same as the total number of CPU cores provided to the server.


Command Line Flags

Flag Type | Short Flag | Long Flag | Description
General flags | h | help | Prints a help synopsis
General flags | v | version | Prints program version information
General flags | l | log-level | Set the log level for the application: DISABLE=10, CRITICAL=20, ERROR=30, WARNING=40, INFO=50, DEBUG=60, TRACE=70 (TRACE requires compilation with TRACE log level support)
General flags | N/A | sdk-log-level | Set the log level for the SDK: DISABLE=10, CRITICAL=20, ERROR=30, WARNING=40, INFO=50, DEBUG=60, TRACE=70
General flags | j | json | Parse command line flags from an input JSON file as well as from the CLI (if provided)
Program flags | d | device-id | Comm Channel DOCA device PCIe address (mandatory)
Program flags | g | gpu-pci-addr | The PCIe address of the GPU that processes messages (mandatory)
Program flags | c | comch-channel-name | A custom name for the DOCA Comch connection; must be the same on both server and orchestrator (optional)
Program flags | t | num-gpu-threads | The number of GPU threads to use in processing messages; should equal the total number of CPU cores assigned to the server (mandatory)
Program flags | N/A | max-concurrent-messages | The maximum number of concurrent messages that can be processed by each thread (optional)
Program flags | N/A | max-message-length | The maximum length of a message, in bytes, that can be supported (optional)

Client

Note

The client must be started after the server and orchestrator have established their Comch connection.

Application Usage Instructions

Usage: doca_dpu_gpu_remote_offload_client [DOCA Flags] [Program Flags]

DOCA Flags:
  -h, --help                                Print a help synopsis
  -v, --version                             Print program version information
  -l, --log-level                           Set the (numeric) log level for the program <10=DISABLE, 20=CRITICAL, 30=ERROR, 40=WARNING, 50=INFO, 60=DEBUG, 70=TRACE>
  --sdk-log-level                           Set the SDK (numeric) log level for the program <10=DISABLE, 20=CRITICAL, 30=ERROR, 40=WARNING, 50=INFO, 60=DEBUG, 70=TRACE>
  -j, --json <path>                         Parse command line flags from an input json file

Program Flags:
  -s, --server-ip-address <IP ADDR>         Server IP address (mandatory).
  -p, --server-ip-port <IP PORT>            Server IP port (mandatory).
  -t, --thread-count <THREAD_COUNT>         Thread count (mandatory).
  -i, --iteration-count <ITERATION_COUNT>   Iteration count (mandatory).
  -m, --message-string <MESSAGE>            Message string (mandatory).
  -e, --expected-response <EXPECTED RESPONSE>  Expected response (mandatory).
  --max-concurrent-messages <NUM_MESSAGES>  Set the maximum number of concurrent messages that can be processed per thread (optional).


CLI Example of the Application


./doca_dpu_gpu_remote_offload_client -s 172.17.0.1 -p 12345 -t 1 -i 1 -m "ABCD" -e "DCBA"

Info

The GPU message response will contain the bytes in the request in reverse order.


Command Line Flags

Flag Type | Short Flag | Long Flag | Description
General flags | h | help | Prints a help synopsis
General flags | v | version | Prints program version information
General flags | l | log-level | Set the log level for the application: DISABLE=10, CRITICAL=20, ERROR=30, WARNING=40, INFO=50, DEBUG=60, TRACE=70 (TRACE requires compilation with TRACE log level support)
General flags | N/A | sdk-log-level | Set the log level for the SDK: DISABLE=10, CRITICAL=20, ERROR=30, WARNING=40, INFO=50, DEBUG=60, TRACE=70
General flags | j | json | Parse command line flags from an input JSON file as well as from the CLI (if provided)
Program flags | s | server-ip-address | The IP address of the doca_dpu_gpu_remote_offload_server (mandatory)
Program flags | p | server-ip-port | The TCP port assigned to the doca_dpu_gpu_remote_offload_server (mandatory)
Program flags | t | thread-count | The number of threads to use in execution; each thread sends iteration-count messages (mandatory)
Program flags | i | iteration-count | The number of requests to be sent from each execution thread (mandatory)
Program flags | m | message-string | The message to be sent to the server (mandatory)
Program flags | e | expected-response | The expected response from the server: the bytes of the initial message in reverse order (mandatory)
Program flags | N/A | max-concurrent-messages | The maximum number of concurrent messages that can be processed by each thread (optional)

© Copyright 2025, NVIDIA. Last updated on Nov 20, 2025