DOCA DPU GPU Remote Offload Application Guide
The DOCA DPU GPU Remote Offload applications serve as a reference implementation, demonstrating how to develop a GPU offload application using DOCA Comch and DOCA GPUNetIO on the NVIDIA BlueField platform with NVIDIA GPUs.
These applications are simplified for clarity. A real-world solution would require more complexity tailored to a specific use case.
The general architecture involves:
A server application running on the DPU.
An orchestrator application running on the host CPU, which launches a CUDA Kernel.
The server receives TCP requests from remote clients and forwards them to the GPU via DOCA Comch. The GPU processes the message (in this reference, by reversing the byte order) and returns a response to the server. The server then forwards this response to the remote clients over TCP.
DOCA DPU GPU Remote Offload consists of three separate applications that work together to offload work from a DPU to a host-side GPU.
doca_dpu_gpu_remote_offload_server
This application runs on the BlueField DPU and acts as both a DOCA Comch server and a TCP server.
DOCA Comch: Used to communicate with the Orchestrator and GPU.
TCP: Used to communicate with remote clients for requests and responses.
doca_dpu_gpu_remote_offload_orchestrator
This application configures and launches the CUDA Kernel for processing. It connects to the server via DOCA Comch, exchanges configuration information, and then launches the kernel. The Orchestrator application continues to run, monitoring the Comch connection and CUDA Kernel, and handles clean shutdown.
doca_dpu_gpu_remote_offload_client
This application sends requests to and receives responses from the server. It validates the responses and displays throughput figures.
doca_dpu_gpu_remote_offload_server
The server application executes in two main phases: a Control Phase and a TCP Server Phase.
Control Phase
This phase begins when the application is launched. It configures the DOCA Comch server, waits for the orchestrator application to connect, and spawns threads to handle incoming messages. Once producers and consumers are connected, it creates a TCP listening socket.
A typical flow is:
Open the required DOCA device.
Create the DOCA Comch Server.
Wait for the Orchestrator application to connect.
Prepare the appropriate number of Socket threads. Each thread will:
Create DOCA Comch Producers and Consumers.
Wait for Producers and Consumers to be connected.
Create and pool doca_comch_producer_send_tasks.
Create and submit doca_comch_consumer_post_recv_tasks.
Create a TCP socket to listen for connections.
Start the TCP Server phase.
TCP Server Phase
In this phase, the application listens on the configured TCP port for incoming connections. When a client connects, its thread forwards messages to the GPU via DOCA Comch. It continues sending until the socket has no more data or the send_task pool is empty. It then waits for a response from the GPU, which is forwarded back to the remote client.
A typical per-thread flow loop is:
Poll the DOCA PE
If a doca_comch_producer_send_task completion is received, add the task back to the task pool.
If a doca_comch_consumer_post_recv_task completion is received:
Extract the response data.
Write the response to the TCP socket.
Poll the TCP socket.
If data can be read from the socket:
Read data from the socket.
Get the next available doca_comch_producer_send_task.
Copy the message contents into the send task's data buffer.
Submit the send task.
In this phase, the application also listens for DOCA Comch control messages from the orchestrator or a CTRL-C signal to ensure a clean shutdown.
doca_dpu_gpu_remote_offload_orchestrator
This application executes in two phases: a Control Phase and a GPU Processing Phase.
Control Phase
This phase configures the DOCA Comch client and connects to the server. It creates producers and consumers, allocates GPU memory, waits for connections, and then launches the CUDA Kernel.
Open the required DOCA device.
Open the required GPU device.
Create the DOCA Comch client.
Connect to the DOCA Comch server.
Allocate the appropriate GPU memory.
Prepare the appropriate number of Producers and Consumers.
Launch the CUDA Kernel.
GPU Processing Phase
Once the CUDA Kernel is launched, the CPU portion of the application remains running. It monitors the kernel's execution and listens for DOCA Comch control shutdown messages (from the server or a user CTRL-C signal). If a shutdown is detected, it stops the kernel and cleans up memory.
The GPU processing portion initially submits multiple post_recv buffers. It then enters a loop, polling for messages. When a message is received, it processes it (reverses the bytes), sends the response back to the server, and resubmits the buffer. This continues until a fatal error or a global stop flag is set by the CPU.
A typical CUDA thread loop is:
Poll for post_recv messages.
If a message is received:
Extract the data.
Verify the message is a client request.
Reverse the order of the bytes in the data.
Record the buffer in the inflight_messages array.
Submit the response to the server using doca_dev_gpu_comch_producer_send.
Poll for producer send message completions
If a send completion is indicated:
Use user_msg_id to determine which buffer was sent.
Submit this buffer to receive a new message with doca_dev_gpu_comch_consumer_post_recv.
doca_dpu_gpu_remote_offload_client
This application is simpler than the other two. Its purpose is to initiate TCP connections to the doca_dpu_gpu_remote_offload_server, using one connection per thread.
Each thread sends a specified number of request messages while recording throughput. Upon receiving a response, the client validates its content against the expected response; if they do not match, the application exits with an error.
If all requests are sent and successfully validated, the application outputs final statistics for the run, including:
Run length
Operation count
Transmit and receive data rates
IO rate
Server
The doca_dpu_gpu_remote_offload_server application leverages the DOCA Comch library.
Orchestrator
The doca_dpu_gpu_remote_offload_orchestrator application leverages the DOCA Comch and DOCA GPUNetIO libraries.
Client
The doca_dpu_gpu_remote_offload_client application does not leverage any DOCA data-path libraries; however, it does utilize DOCA Arg Parser.
The doca_dpu_gpu_remote_offload_orchestrator application requires CUDA version 13. Compilation will fail if an older version is used.
The doca_dpu_gpu_remote_offload_server application will only compile on the DPU, and the doca_dpu_gpu_remote_offload_orchestrator will only compile on x86 hosts. The doca_dpu_gpu_remote_offload_client application compiles on all architectures.
Compiling All Applications
To build all DOCA applications together:
cd /opt/mellanox/doca/applications/
meson /tmp/build
ninja -C /tmp/build
The applications are created under /tmp/build/dpu_gpu_remote_offload/.
Compiling Only DPU GPU Remote Offload Applications
To directly build only the DPU GPU Remote Offload applications:
cd /opt/mellanox/doca/applications/
meson /tmp/build -Denable_all_applications=false -Denable_dpu_gpu_remote_offload=true
ninja -C /tmp/build
Alternatively, you can edit /opt/mellanox/doca/applications/meson_options.txt:
Set enable_all_applications to false.
Set enable_dpu_gpu_remote_offload to true.
Run the compilation commands:
cd /opt/mellanox/doca/applications/
meson /tmp/build
ninja -C /tmp/build
The applications are created under /tmp/build/dpu_gpu_remote_offload/.
The DOCA DPU GPU Remote Offload applications are provided in source form and must be compiled before execution.
Server
The server application must run on the NVIDIA BlueField DPU and must be started before the orchestrator or client.
Application Usage Instructions
Usage: doca_dpu_gpu_remote_offload_server [DOCA Flags] [Program Flags]
DOCA Flags:
-h, --help Print a help synopsis
-v, --version Print program version information
-l, --log-level Set the (numeric) log level for the program <10=DISABLE, 20=CRITICAL, 30=ERROR, 40=WARNING, 50=INFO, 60=DEBUG, 70=TRACE>
--sdk-log-level Set the SDK (numeric) log level for the program <10=DISABLE, 20=CRITICAL, 30=ERROR, 40=WARNING, 50=INFO, 60=DEBUG, 70=TRACE>
-j, --json <path> Parse command line flags from an input json file
Program Flags:
-d, --device-id <DEV ID> Device ID (mandatory).
-r, --representor-id <REPRESENTOR ID> Representor ID (mandatory).
  -c, --comch-channel-name <NAME>          Comch channel name (optional).
-p, --server-listen-port <PORT> Server listen port (mandatory).
--cpu <CPU> CPU to use when executing data path operations (mandatory). May be repeated for more cores
--max-concurrent-messages <NUM_MESSAGES> Set the maximum number of concurrent messages that can be processed per thread (optional).
--max-message-length <LENGTH> Set the maximum length of a message that can be processed (exclusive of header) (optional).
CLI Example of the Application
./doca_dpu_gpu_remote_offload_server --device-id 03:00.0 --representor-id d8:00.0 \
-p 12345 --max-concurrent-messages 128 --cpu 1 --cpu 2
The DOCA Comch device PCIe address (03:00.0) and the representor PCIe address (d8:00.0) must match the addresses of the desired devices.
Command Line Flags
Flag Type | Short Flag | Long Flag | Description
General flags | -h | --help | Prints a help synopsis
General flags | -v | --version | Prints program version information
General flags | -l | --log-level | Sets the log level for the application: 10=DISABLE, 20=CRITICAL, 30=ERROR, 40=WARNING, 50=INFO, 60=DEBUG, 70=TRACE
General flags | N/A | --sdk-log-level | Sets the log level for the SDK: 10=DISABLE, 20=CRITICAL, 30=ERROR, 40=WARNING, 50=INFO, 60=DEBUG, 70=TRACE
General flags | -j | --json | Parses command line flags from an input JSON file as well as from the CLI (if provided)
Program flags | -d | --device-id | Comm Channel DOCA device PCIe address. This is a mandatory flag.
Program flags | -r | --representor-id | Comm Channel DOCA device representor PCIe address. This is a mandatory flag.
Program flags | -c | --comch-channel-name | A custom name for the DOCA Comch connection; this must be the same on both server and orchestrator.
Program flags | -p | --server-listen-port | The TCP port the server will use to listen for client connections. This is a mandatory flag.
Program flags | N/A | --cpu | CPU to use when executing data path operations; this option can be repeated to use multiple CPU cores in execution. This is a mandatory flag.
Program flags | N/A | --max-concurrent-messages | The maximum number of concurrent messages that can be processed by each thread.
Program flags | N/A | --max-message-length | The maximum length of a message, in bytes, that can be supported.
Orchestrator
The orchestrator application runs on the host CPU and must be started after the server application.
Application Usage Instructions
Usage: doca_dpu_gpu_remote_offload_orchestrator [DOCA Flags] [Program Flags]
DOCA Flags:
-h, --help Print a help synopsis
-v, --version Print program version information
-l, --log-level Set the (numeric) log level for the program <10=DISABLE, 20=CRITICAL, 30=ERROR, 40=WARNING, 50=INFO, 60=DEBUG, 70=TRACE>
--sdk-log-level Set the SDK (numeric) log level for the program <10=DISABLE, 20=CRITICAL, 30=ERROR, 40=WARNING, 50=INFO, 60=DEBUG, 70=TRACE>
-j, --json <path> Parse command line flags from an input json file
Program Flags:
-d, --device-id <DEV ID> Device ID (mandatory).
-g, --gpu-pci-addr <PCI ADDRESS> GPU PCIe Address (mandatory).
  -c, --comch-channel-name <NAME>          Comch channel name (optional).
-t, --num-gpu-threads <NUM> Number of GPU threads to use when executing data path operations (mandatory). Must be the same as core count on server.
--max-concurrent-messages <NUM_MESSAGES> Set the maximum number of concurrent messages that can be processed per thread (optional).
--max-message-length <LENGTH> Set the maximum length of a message that can be processed (exclusive of header) (optional).
CLI Example of the Application
./doca_dpu_gpu_remote_offload_orchestrator --device-id d8:00.0 --gpu-pci-addr b5:00.0 \
--max-concurrent-messages 128 --num-gpu-threads 2
The device PCIe address (d8:00.0) and GPU PCIe address (b5:00.0) must match your devices.
The number of GPU threads should be the same as the total number of CPU cores provided to the server.
Command Line Flags
Flag Type | Short Flag | Long Flag | Description
General flags | -h | --help | Prints a help synopsis
General flags | -v | --version | Prints program version information
General flags | -l | --log-level | Sets the log level for the application: 10=DISABLE, 20=CRITICAL, 30=ERROR, 40=WARNING, 50=INFO, 60=DEBUG, 70=TRACE
General flags | N/A | --sdk-log-level | Sets the log level for the SDK: 10=DISABLE, 20=CRITICAL, 30=ERROR, 40=WARNING, 50=INFO, 60=DEBUG, 70=TRACE
General flags | -j | --json | Parses command line flags from an input JSON file as well as from the CLI (if provided)
Program flags | -d | --device-id | Comm Channel DOCA device PCIe address. This is a mandatory flag.
Program flags | -g | --gpu-pci-addr | The PCIe address of the GPU to process messages. This is a mandatory flag.
Program flags | -c | --comch-channel-name | A custom name for the DOCA Comch connection; this must be the same on both server and orchestrator.
Program flags | -t | --num-gpu-threads | The number of GPU threads to use in processing messages; this should be the same as the total number of CPU cores assigned to the server. This is a mandatory flag.
Program flags | N/A | --max-concurrent-messages | The maximum number of concurrent messages that can be processed by each thread.
Program flags | N/A | --max-message-length | The maximum length of a message, in bytes, that can be supported.
Client
The client must be started after the server and orchestrator have established their Comch connection.
Application Usage Instructions
Usage: doca_dpu_gpu_remote_offload_client [DOCA Flags] [Program Flags]
DOCA Flags:
-h, --help Print a help synopsis
-v, --version Print program version information
-l, --log-level Set the (numeric) log level for the program <10=DISABLE, 20=CRITICAL, 30=ERROR, 40=WARNING, 50=INFO, 60=DEBUG, 70=TRACE>
--sdk-log-level Set the SDK (numeric) log level for the program <10=DISABLE, 20=CRITICAL, 30=ERROR, 40=WARNING, 50=INFO, 60=DEBUG, 70=TRACE>
-j, --json <path> Parse command line flags from an input json file
Program Flags:
-s, --server-ip-address <IP ADDR> Server IP address (mandatory).
-p, --server-ip-port <IP PORT> Server IP port (mandatory).
-t, --thread-count <THREAD_COUNT> Thread count (mandatory).
-i, --iteration-count <ITERATION_COUNT> Iteration count (mandatory).
  -m, --message-string <MESSAGE>           Message string (mandatory).
-e, --expected-response <EXPECTED RESPONSE> Expected response (mandatory).
--max-concurrent-messages <NUM_MESSAGES> Set the maximum number of concurrent messages that can be processed per thread (optional).
CLI Example of the Application
./doca_dpu_gpu_remote_offload_client -s 172.17.0.1 -p 12345 -t 1 -i 1 -m "ABCD" -e "DCBA"
The GPU message response will contain the bytes in the request in reverse order.
Command Line Flags
Flag Type | Short Flag | Long Flag | Description
General flags | -h | --help | Prints a help synopsis
General flags | -v | --version | Prints program version information
General flags | -l | --log-level | Sets the log level for the application: 10=DISABLE, 20=CRITICAL, 30=ERROR, 40=WARNING, 50=INFO, 60=DEBUG, 70=TRACE
General flags | N/A | --sdk-log-level | Sets the log level for the SDK: 10=DISABLE, 20=CRITICAL, 30=ERROR, 40=WARNING, 50=INFO, 60=DEBUG, 70=TRACE
General flags | -j | --json | Parses command line flags from an input JSON file as well as from the CLI (if provided)
Program flags | -s | --server-ip-address | The IP address of the server. This is a mandatory flag.
Program flags | -p | --server-ip-port | The TCP port assigned to the server. This is a mandatory flag.
Program flags | -t | --thread-count | The number of threads to use in execution; each thread opens its own connection and sends its own requests. This is a mandatory flag.
Program flags | -i | --iteration-count | The number of requests to be sent from each execution thread. This is a mandatory flag.
Program flags | -m | --message-string | The message to be sent to the server. This is a mandatory flag.
Program flags | -e | --expected-response | The expected response that should be received back from the server; this should be the bytes of the initial message in reverse order. This is a mandatory flag.
Program flags | N/A | --max-concurrent-messages | The maximum number of concurrent messages that can be processed by each thread.