DOCA Documentation v3.2.0

DOCA DPU GPU Remote Offload Application Guide

The DOCA DPU GPU Remote Offload applications serve as a reference implementation, demonstrating how to develop a GPU offload application using DOCA Comch and DOCA GPUNetIO on the NVIDIA BlueField platform with NVIDIA GPUs.

These applications are simplified for clarity. A real-world solution would require more complexity tailored to a specific use case.

The general architecture involves:

  • A server application running on the DPU.

  • An orchestrator application running on the host CPU, which launches a CUDA Kernel.

The server receives TCP requests from remote clients and forwards them to the GPU via DOCA Comch. The GPU processes the message (in this reference, by reversing the byte order) and returns a response to the server. The server then forwards this response to the remote clients over TCP.

DOCA DPU GPU Remote Offload consists of three separate applications that work together to offload work from a DPU to a host-side GPU.

[Figure: DOCA DPU GPU Remote Offload architecture]

doca_dpu_gpu_remote_offload_server

This application runs on the BlueField DPU and acts as both a DOCA Comch server and a TCP server.

  • DOCA Comch: Used to communicate with the Orchestrator and GPU.

  • TCP: Used to communicate with remote clients for requests and responses.

doca_dpu_gpu_remote_offload_orchestrator

This application configures and launches the CUDA Kernel for processing. It connects to the server via DOCA Comch, exchanges configuration information, and then launches the kernel. The Orchestrator application continues to run, monitoring the Comch connection and the CUDA Kernel, and handles a clean shutdown.

doca_dpu_gpu_remote_offload_client

This application sends requests to and receives responses from the server. It validates the responses and displays throughput figures.

doca_dpu_gpu_remote_offload_server

The server application executes in two main phases: a Control Phase and a TCP Server Phase.

Control Phase

This phase begins when the application is launched. It configures the DOCA Comch server, waits for the orchestrator application to connect, and spawns threads to handle incoming messages. Once producers and consumers are connected, it creates a TCP listening socket.

A typical flow is:

  1. Open the required DOCA device.

  2. Create the DOCA Comch Server.

  3. Wait for the Orchestrator application to connect.

  4. Prepare the appropriate number of Socket threads. Each thread will:

    • Create DOCA Comch Producers and Consumers.

    • Wait for Producers and Consumers to be connected.

    • Create and pool doca_comch_producer_send_tasks.

    • Create and submit doca_comch_consumer_post_recv_tasks.

  5. Create a TCP socket to listen for connections.

  6. Start the TCP Server phase.

TCP Server Phase

In this phase, the application listens on the configured TCP port for incoming connections. When a client connects, the thread handling that connection forwards messages to the GPU via DOCA Comch. It continues sending until the socket has no more data or the send_task pool is exhausted. It then waits for a response from the GPU, which it forwards back to the remote client.

A typical per-thread flow loop is:

  1. Poll the DOCA PE

    • If a doca_comch_producer_send_task completion is received, add the task back to the task pool.

    • If a doca_comch_consumer_post_recv_task completion is received:

      • Extract the response data.

      • Write the response to the TCP socket.

  2. Poll TCP socket

    • If data can be read from the socket:

      • Read data from the socket.

      • Get the next available doca_comch_producer_send_task.

      • Copy message contents into the send task's data buffer.

      • Submit the send task.

In this phase, the application also listens for DOCA Comch control messages from the orchestrator or a CTRL-C signal to ensure a clean shutdown.

doca_dpu_gpu_remote_offload_orchestrator

This application executes in two phases: a Control Phase and a GPU Processing Phase.

Control Phase

This phase configures the DOCA Comch client and connects to the server. It creates producers and consumers, allocates GPU memory, waits for connections, and then launches the CUDA Kernel.

  1. Open the required DOCA device.

  2. Open the required GPU device.

  3. Create the DOCA Comch client.

  4. Connect to the DOCA Comch server.

  5. Allocate the appropriate GPU memory.

  6. Prepare the appropriate number of Producers and Consumers.

  7. Launch the CUDA Kernel.

GPU Processing Phase

Once the CUDA Kernel is launched, the CPU portion of the application remains running. It monitors the kernel's execution and listens for DOCA Comch control shutdown messages (from the server or a user CTRL-C signal). If a shutdown is detected, it stops the kernel and cleans up memory.

The GPU processing portion initially submits multiple post_recv buffers. It then enters a loop, polling for messages. When a message is received, it processes it (reverses the bytes), sends the response back to the server, and resubmits the buffer. This continues until a fatal error or a global stop flag is set by the CPU.

A typical CUDA thread loop is:

  1. Poll for post_recv messages

    • If a message is received:

      • Extract the data.

      • Verify the message is a client request.

      • Reverse the order of the bytes in the data.

      • Record the buffer in the inflight_messages array.

      • Submit the response to the server using doca_dev_gpu_comch_producer_send.

  2. Poll for producer send message completions

    • If a send completion is indicated:

      • Use user_msg_id to determine which buffer was sent.

      • Submit this buffer to receive a new message with doca_dev_gpu_comch_consumer_post_recv.

doca_dpu_gpu_remote_offload_client

This application is simpler than the other two. Its purpose is to initiate TCP connections to the doca_dpu_gpu_remote_offload_server, using one connection per thread.

Each thread sends a specified number of request messages while recording throughput. Upon receiving a response, the client validates its content against the expected response; if they do not match, the application exits with an error.

If all requests are sent and successfully validated, the application outputs final statistics for the run, including:

  • Run length

  • Operation count

  • Transmit and receive data rates

  • IO rate

Server

The doca_dpu_gpu_remote_offload_server application leverages the following DOCA libraries:

  • DOCA Comch

  • DOCA Arg Parser

Orchestrator

The doca_dpu_gpu_remote_offload_orchestrator application leverages the following DOCA libraries:

  • DOCA Comch

  • DOCA GPUNetIO

  • DOCA Arg Parser

Client

The doca_dpu_gpu_remote_offload_client application does not leverage any DOCA data-path libraries; however, it does utilize DOCA Arg Parser.

Warning

The doca_dpu_gpu_remote_offload_orchestrator application requires CUDA version 13. Compilation will fail if an older version is used.

The doca_dpu_gpu_remote_offload_server application will only compile on the DPU, and the doca_dpu_gpu_remote_offload_orchestrator will only compile on x86 hosts. The doca_dpu_gpu_remote_offload_client application compiles on all architectures.

Compiling All Applications

To build all DOCA applications together:

cd /opt/mellanox/doca/applications/
meson /tmp/build
ninja -C /tmp/build

Info

The applications are created under /tmp/build/dpu_gpu_remote_offload/.


Compiling Only DPU GPU Remote Offload Applications

To directly build only the DPU GPU Remote Offload applications:

cd /opt/mellanox/doca/applications/
meson /tmp/build -Denable_all_applications=false -Denable_dpu_gpu_remote_offload=true
ninja -C /tmp/build

Alternatively, you can edit /opt/mellanox/doca/applications/meson_options.txt:

  1. Set enable_all_applications to false.

  2. Set enable_dpu_gpu_remote_offload to true.

  3. Run the compilation commands:

    cd /opt/mellanox/doca/applications/
    meson /tmp/build
    ninja -C /tmp/build

Info

The applications are created under /tmp/build/dpu_gpu_remote_offload/.


The DOCA DPU GPU Remote Offload applications are provided in source form and must be compiled before execution.

Server

Note

The server application must run on the NVIDIA BlueField DPU and must be started before the orchestrator or client.

Application Usage Instructions

Usage: doca_dpu_gpu_remote_offload_server [DOCA Flags] [Program Flags]

DOCA Flags:
  -h, --help                                Print a help synopsis
  -v, --version                             Print program version information
  -l, --log-level                           Set the (numeric) log level for the program <10=DISABLE, 20=CRITICAL, 30=ERROR, 40=WARNING, 50=INFO, 60=DEBUG, 70=TRACE>
  --sdk-log-level                           Set the SDK (numeric) log level for the program <10=DISABLE, 20=CRITICAL, 30=ERROR, 40=WARNING, 50=INFO, 60=DEBUG, 70=TRACE>
  -j, --json <path>                         Parse command line flags from an input json file

Program Flags:
  -d, --device-id <DEV ID>                  Device ID (mandatory).
  -r, --representor-id <REPRESENTOR ID>     Representor ID (mandatory).
  -c, --comch-channel-name <DEV ID>         Comch channel name (optional).
  -p, --server-listen-port <PORT>           Server listen port (mandatory).
  --cpu <CPU>                               CPU to use when executing data path operations (mandatory). May be repeated for more cores
  --max-concurrent-messages <NUM_MESSAGES>  Set the maximum number of concurrent messages that can be processed per thread (optional).
  --max-message-length <LENGTH>             Set the maximum length of a message that can be processed (exclusive of header) (optional).


CLI Example of the Application

./doca_dpu_gpu_remote_offload_server --device-id 03:00.0 --representor-id d8:00.0 \
    -p 12345 --max-concurrent-messages 128 --cpu 1 --cpu 2

Note

The DOCA Comch device PCIe address (03:00.0) and the representor PCIe address (d8:00.0) must match the addresses of the desired devices.


Command Line Flags

Flag Type | Short Flag | Long Flag | Description
General flags | h | help | Prints a help synopsis
General flags | v | version | Prints program version information
General flags | l | log-level | Set the log level for the application: DISABLE=10, CRITICAL=20, ERROR=30, WARNING=40, INFO=50, DEBUG=60, TRACE=70 (TRACE requires compilation with TRACE log level support)
General flags | N/A | sdk-log-level | Set the log level for the SDK: DISABLE=10, CRITICAL=20, ERROR=30, WARNING=40, INFO=50, DEBUG=60, TRACE=70
General flags | j | json | Parse command line flags from an input JSON file as well as from the CLI (if provided)
Program flags | d | device-id | Comm Channel DOCA device PCIe address (mandatory)
Program flags | r | representor-id | Comm Channel DOCA device representor PCIe address (mandatory)
Program flags | c | comch-channel-name | A custom name for the DOCA Comch connection; must be the same on both server and orchestrator (optional)
Program flags | p | server-listen-port | The TCP port the server uses to listen for client connections (mandatory)
Program flags | N/A | cpu | CPU to use when executing data path operations; may be repeated to use multiple CPU cores (mandatory)
Program flags | N/A | max-concurrent-messages | The maximum number of concurrent messages that can be processed by each thread (optional)
Program flags | N/A | max-message-length | The maximum length of a message, in bytes, that can be supported (optional)

Orchestrator

Note

The orchestrator application runs on the host CPU and must be started after the server application.

Application Usage Instructions

Usage: doca_dpu_gpu_remote_offload_orchestrator [DOCA Flags] [Program Flags]

DOCA Flags:
  -h, --help                                Print a help synopsis
  -v, --version                             Print program version information
  -l, --log-level                           Set the (numeric) log level for the program <10=DISABLE, 20=CRITICAL, 30=ERROR, 40=WARNING, 50=INFO, 60=DEBUG, 70=TRACE>
  --sdk-log-level                           Set the SDK (numeric) log level for the program <10=DISABLE, 20=CRITICAL, 30=ERROR, 40=WARNING, 50=INFO, 60=DEBUG, 70=TRACE>
  -j, --json <path>                         Parse command line flags from an input json file

Program Flags:
  -d, --device-id <DEV ID>                  Device ID (mandatory).
  -g, --gpu-pci-addr <PCI ADDRESS>          GPU PCIe Address (mandatory).
  -c, --comch-channel-name <DEV ID>         Comch channel name (optional).
  -t, --num-gpu-threads <NUM>               Number of GPU threads to use when executing data path operations (mandatory). Must be the same as core count on server.
  --max-concurrent-messages <NUM_MESSAGES>  Set the maximum number of concurrent messages that can be processed per thread (optional).
  --max-message-length <LENGTH>             Set the maximum length of a message that can be processed (exclusive of header) (optional).


CLI Example of the Application

./doca_dpu_gpu_remote_offload_orchestrator --device-id d8:00.0 --gpu-pci-addr b5:00.0 \
    --max-concurrent-messages 128 --num-gpu-threads 2

Note

The device PCIe address (d8:00.0) and GPU PCIe address (b5:00.0) must match your devices.

Note

The number of GPU threads should be the same as the total number of CPU cores provided to the server.


Command Line Flags

Flag Type | Short Flag | Long Flag | Description
General flags | h | help | Prints a help synopsis
General flags | v | version | Prints program version information
General flags | l | log-level | Set the log level for the application: DISABLE=10, CRITICAL=20, ERROR=30, WARNING=40, INFO=50, DEBUG=60, TRACE=70 (TRACE requires compilation with TRACE log level support)
General flags | N/A | sdk-log-level | Set the log level for the SDK: DISABLE=10, CRITICAL=20, ERROR=30, WARNING=40, INFO=50, DEBUG=60, TRACE=70
General flags | j | json | Parse command line flags from an input JSON file as well as from the CLI (if provided)
Program flags | d | device-id | Comm Channel DOCA device PCIe address (mandatory)
Program flags | g | gpu-pci-addr | The PCIe address of the GPU that processes messages (mandatory)
Program flags | c | comch-channel-name | A custom name for the DOCA Comch connection; must be the same on both server and orchestrator (optional)
Program flags | t | num-gpu-threads | The number of GPU threads to use in processing messages; should equal the total number of CPU cores assigned to the server (mandatory)
Program flags | N/A | max-concurrent-messages | The maximum number of concurrent messages that can be processed by each thread (optional)
Program flags | N/A | max-message-length | The maximum length of a message, in bytes, that can be supported (optional)

Client

Note

The client must be started after the server and orchestrator have established their Comch connection.

Application Usage Instructions

Usage: doca_dpu_gpu_remote_offload_client [DOCA Flags] [Program Flags]

DOCA Flags:
  -h, --help                                Print a help synopsis
  -v, --version                             Print program version information
  -l, --log-level                           Set the (numeric) log level for the program <10=DISABLE, 20=CRITICAL, 30=ERROR, 40=WARNING, 50=INFO, 60=DEBUG, 70=TRACE>
  --sdk-log-level                           Set the SDK (numeric) log level for the program <10=DISABLE, 20=CRITICAL, 30=ERROR, 40=WARNING, 50=INFO, 60=DEBUG, 70=TRACE>
  -j, --json <path>                         Parse command line flags from an input json file

Program Flags:
  -s, --server-ip-address <IP ADDR>         Server IP address (mandatory).
  -p, --server-ip-port <IP PORT>            Server IP port (mandatory).
  -t, --thread-count <THREAD_COUNT>         Thread count (mandatory).
  -i, --iteration-count <ITERATION_COUNT>   Iteration count (mandatory).
  -m, --message-string <MESSAGE>            Message string (mandatory).
  -e, --expected-response <EXPECTED RESPONSE>  Expected response (mandatory).
  --max-concurrent-messages <NUM_MESSAGES>  Set the maximum number of concurrent messages that can be processed per thread (optional).


CLI Example of the Application


./doca_dpu_gpu_remote_offload_client -s 172.17.0.1 -p 12345 -t 1 -i 1 -m "ABCD" -e "DCBA"

Info

The GPU message response will contain the bytes in the request in reverse order.


Command Line Flags

Flag Type | Short Flag | Long Flag | Description
General flags | h | help | Prints a help synopsis
General flags | v | version | Prints program version information
General flags | l | log-level | Set the log level for the application: DISABLE=10, CRITICAL=20, ERROR=30, WARNING=40, INFO=50, DEBUG=60, TRACE=70 (TRACE requires compilation with TRACE log level support)
General flags | N/A | sdk-log-level | Set the log level for the SDK: DISABLE=10, CRITICAL=20, ERROR=30, WARNING=40, INFO=50, DEBUG=60, TRACE=70
General flags | j | json | Parse command line flags from an input JSON file as well as from the CLI (if provided)
Program flags | s | server-ip-address | The IP address of the doca_dpu_gpu_remote_offload_server (mandatory)
Program flags | p | server-ip-port | The TCP port assigned to the doca_dpu_gpu_remote_offload_server (mandatory)
Program flags | t | thread-count | The number of threads to use in execution; each thread sends iteration-count messages (mandatory)
Program flags | i | iteration-count | The number of requests to be sent from each execution thread (mandatory)
Program flags | m | message-string | The message to be sent to the server (mandatory)
Program flags | e | expected-response | The expected response from the server: the bytes of the initial message in reverse order (mandatory)
Program flags | N/A | max-concurrent-messages | The maximum number of concurrent messages that can be processed by each thread (optional)

© Copyright 2025, NVIDIA. Last updated on Nov 20, 2025