Date: 2025-10-09

The application handles each type of traffic differently, dedicating up to 4 receive queues to each type and using DOCA Flow in RSS mode to assign each packet to the right queue. The more queues the application uses, the higher the degree of parallelism in processing received data, which in turn affects how long processing takes.

Tip: It is highly recommended to use more than one receive queue for network traffic throughput of 100Gb/s or higher.

If the network interface used for the application has an IP address, it is possible to ping that interface. ICMP packets are received by a dedicated CUDA kernel (file gpu_kernels/receive_icmp.cu) which:

Receives packets using the DOCA GPUNetIO CUDA warp-level function doca_gpu_dev_eth_rxq_receive_warp.

Checks if the packet is an ICMP echo request.

Forwards the same packet, modifying some header info (e.g., swapping MAC and IP addresses, changing the ICMP packet type).

Pushes the modified packet into the send queue using the DOCA GPUNetIO thread-level function doca_gpu_dev_eth_txq_send_enqueue_strong.

Sends the packet using the DOCA GPUNetIO thread-level functions doca_gpu_dev_eth_txq_commit_strong and doca_gpu_dev_eth_txq_push.
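The header rewrite in these steps can be sketched in plain host-side C. This is not the code of the CUDA kernel: the structures below are a simplified, hypothetical in-memory view of the headers (no IP options, not a wire-accurate layout), and checksum recomputation is deliberately left out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ICMP_ECHO_REQUEST 8
#define ICMP_ECHO_REPLY   0
#define REPLY_TTL         128 /* TTL used by the application for its replies */

/* Simplified header views (hypothetical; fields hold raw header values). */
struct eth_hdr  { uint8_t dst[6]; uint8_t src[6]; uint16_t ether_type; };
struct ipv4_hdr { uint8_t ver_ihl, tos; uint16_t total_len, id, frag_off;
                  uint8_t ttl, proto; uint16_t checksum; uint32_t src, dst; };
struct icmp_hdr { uint8_t type, code; uint16_t checksum, id, seq; };
struct icmp_pkt { struct eth_hdr eth; struct ipv4_hdr ip; struct icmp_hdr icmp; };

/*
 * Rewrites an echo request in place so it can be sent back as an echo reply.
 * Returns 1 if the packet was rewritten, 0 if it was not an echo request.
 * A real reply must also fix the IPv4 and ICMP checksums (omitted here).
 */
int icmp_make_echo_reply(struct icmp_pkt *pkt)
{
    assert(pkt != NULL);
    if (pkt->icmp.type != ICMP_ECHO_REQUEST)
        return 0;

    /* Swap source and destination MAC addresses. */
    uint8_t mac[6];
    memcpy(mac, pkt->eth.dst, sizeof(mac));
    memcpy(pkt->eth.dst, pkt->eth.src, sizeof(mac));
    memcpy(pkt->eth.src, mac, sizeof(mac));

    /* Swap source and destination IP addresses. */
    uint32_t ip = pkt->ip.dst;
    pkt->ip.dst = pkt->ip.src;
    pkt->ip.src = ip;

    pkt->ip.ttl = REPLY_TTL;
    pkt->icmp.type = ICMP_ECHO_REPLY;
    return 1;
}
```

Rewriting the packet in place means the reply can be sent from the buffer that received the request, so no copy is needed.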

Info: This is not a compute-intensive use case, so a single CUDA warp with only one receive queue and one send queue is enough to keep up with decent latency.

By default, ping replies generated by the OS on the CPU have a TTL of 64. Therefore, to make it evident that the GPU is actually replying to ICMP ping requests, the application sets the TTL of its replies to 128.

The following are motivations for this use case:

Providing an easy tool to check connectivity between the packet generator machine and the DOCA application machine

Having a sense of network latency between the two machines using a well-known tool like ping

Showing an easy way to receive and forward modified packets

Providing a warp-level implementation of a CUDA kernel receiving and forwarding traffic

Assuming the IP address of the network interface to ping is 192.168.1.1, this is the expected output (replies with ttl=128 come from the application running on the GPU):

$ ping 192.168.1.1
PING 192.168.1.1 (192.168.1.1) 56(84) bytes of data.
64 bytes from 192.168.1.1: icmp_seq=1 ttl=64 time=0.324 ms
64 bytes from 192.168.1.1: icmp_seq=2 ttl=64 time=0.332 ms
64 bytes from 192.168.1.1: icmp_seq=3 ttl=64 time=0.299 ms
64 bytes from 192.168.1.1: icmp_seq=4 ttl=64 time=0.309 ms
64 bytes from 192.168.1.1: icmp_seq=5 ttl=64 time=0.323 ms
64 bytes from 192.168.1.1: icmp_seq=6 ttl=64 time=0.300 ms
64 bytes from 192.168.1.1: icmp_seq=7 ttl=64 time=0.274 ms
64 bytes from 192.168.1.1: icmp_seq=8 ttl=64 time=0.314 ms
64 bytes from 192.168.1.1: icmp_seq=9 ttl=64 time=0.327 ms
64 bytes from 192.168.1.1: icmp_seq=10 ttl=64 time=0.384 ms
64 bytes from 192.168.1.1: icmp_seq=11 ttl=128 time=0.346 ms
64 bytes from 192.168.1.1: icmp_seq=12 ttl=128 time=0.274 ms
64 bytes from 192.168.1.1: icmp_seq=13 ttl=128 time=0.294 ms
64 bytes from 192.168.1.1: icmp_seq=14 ttl=128 time=0.240 ms
64 bytes from 192.168.1.1: icmp_seq=15 ttl=128 time=0.273 ms
64 bytes from 192.168.1.1: icmp_seq=16 ttl=128 time=0.238 ms
64 bytes from 192.168.1.1: icmp_seq=17 ttl=128 time=0.252 ms
64 bytes from 192.168.1.1: icmp_seq=18 ttl=128 time=0.232 ms
64 bytes from 192.168.1.1: icmp_seq=19 ttl=128 time=0.278 ms
......

A DOCA Progress Engine is attached to the DOCA Ethernet Txq context used to forward ICMP packets. Those packets are sent from the GPU with the DOCA_GPU_SEND_FLAG_NOTIFY flag, which results in a notification being created after every packet is sent by the NIC.

All the notifications are then analyzed by the CPU through the doca_pe_progress function. As a result, the application prints the time elapsed, in seconds, between two consecutive pings. The following is an example with a ping every 0.5 seconds:

$ ping -i 0.5 192.168.1.1
PING 192.168.1.1 (192.168.1.1) 56(84) bytes of data.
64 bytes from 192.168.1.1: icmp_seq=1 ttl=128 time=0.202 ms
64 bytes from 192.168.1.1: icmp_seq=2 ttl=128 time=0.179 ms
64 bytes from 192.168.1.1: icmp_seq=3 ttl=128 time=0.199 ms
64 bytes from 192.168.1.1: icmp_seq=4 ttl=128 time=0.180 ms
64 bytes from 192.168.1.1: icmp_seq=5 ttl=128 time=0.200 ms
64 bytes from 192.168.1.1: icmp_seq=6 ttl=128 time=0.189 ms
......

On the DOCA side, the application should print a log for all the ICMP packets received and retransmitted:

Seconds 5
[UDP] QUEUE: 0 DNS: 0 OTHER: 0 TOTAL: 0
[TCP] QUEUE: 0 HTTP: 0 HTTP HEAD: 0 HTTP GET: 0 HTTP POST: 0 TCP [SYN: 0 FIN: 0 ACK: 0] OTHER: 0 TOTAL: 0
[13:54:19:202061][2688665][DOCA][INF][gpu_packet_processing.c:77][debug_send_packet_icmp_cb] ICMP debug event: Queue 0 packet 3 sent at 1702302859201997120 time from last ICMP is 0.512025 sec
[13:54:19:713960][2688665][DOCA][INF][gpu_packet_processing.c:77][debug_send_packet_icmp_cb] ICMP debug event: Queue 0 packet 4 sent at 1702302859713896620 time from last ICMP is 0.511899 sec
[13:54:20:225891][2688665][DOCA][INF][gpu_packet_processing.c:77][debug_send_packet_icmp_cb] ICMP debug event: Queue 0 packet 5 sent at 1702302860225868072 time from last ICMP is 0.511971 sec
[13:54:20:737823][2688665][DOCA][INF][gpu_packet_processing.c:77][debug_send_packet_icmp_cb] ICMP debug event: Queue 0 packet 6 sent at 1702302860737781760 time from last ICMP is 0.511914 sec
[13:54:21:249763][2688665][DOCA][INF][gpu_packet_processing.c:77][debug_send_packet_icmp_cb] ICMP debug event: Queue 0 packet 7 sent at 1702302861249723044 time from last ICMP is 0.511941 sec
[13:54:21:761614][2688665][DOCA][INF][gpu_packet_processing.c:77][debug_send_packet_icmp_cb] ICMP debug event: Queue 0 packet 8 sent at 1702302861761588848 time from last ICMP is 0.511866 sec
[13:54:22:273689][2688665][DOCA][INF][gpu_packet_processing.c:77][debug_send_packet_icmp_cb] ICMP debug event: Queue 0 packet 9 sent at 1702302862273643536 time from last ICMP is 0.512055 sec
[13:54:22:785543][2688665][DOCA][INF][gpu_packet_processing.c:77][debug_send_packet_icmp_cb] ICMP debug event: Queue 0 packet 10 sent at 1702302862785527576 time from last ICMP is 0.511884 sec
[13:54:23:297545][2688665][DOCA][INF][gpu_packet_processing.c:77][debug_send_packet_icmp_cb] ICMP debug event: Queue 0 packet 11 sent at 1702302863297501448 time from last ICMP is 0.511974 sec
[13:54:23:809406][2688665][DOCA][INF][gpu_packet_processing.c:77][debug_send_packet_icmp_cb] ICMP debug event: Queue 0 packet 12 sent at 1702302863809350664 time from last ICMP is 0.511849 sec
Seconds 10
[UDP] QUEUE: 0 DNS: 0 OTHER: 0 TOTAL: 0
[TCP] QUEUE: 0 HTTP: 0 HTTP HEAD: 0 HTTP GET: 0 HTTP POST: 0 TCP [SYN: 0 FIN: 0 ACK: 0] OTHER: 0 TOTAL: 0
[13:54:24:321405][2688665][DOCA][INF][gpu_packet_processing.c:77][debug_send_packet_icmp_cb] ICMP debug event: Queue 0 packet 13 sent at 1702302864321391148 time from last ICMP is 0.512040 sec
[13:54:24:833338][2688665][DOCA][INF][gpu_packet_processing.c:77][debug_send_packet_icmp_cb] ICMP debug event: Queue 0 packet 14 sent at 1702302864833270356 time from last ICMP is 0.511879 sec
[13:54:25:345302][2688665][DOCA][INF][gpu_packet_processing.c:77][debug_send_packet_icmp_cb] ICMP debug event: Queue 0 packet 15 sent at 1702302865345282728 time from last ICMP is 0.512012 sec
[13:54:25:857199][2688665][DOCA][INF][gpu_packet_processing.c:77][debug_send_packet_icmp_cb] ICMP debug event: Queue 0 packet 16 sent at 1702302865857133664 time from last ICMP is 0.511851 sec
[13:54:26:369131][2688665][DOCA][INF][gpu_packet_processing.c:77][debug_send_packet_icmp_cb] ICMP debug event: Queue 0 packet 17 sent at 1702302866369128728 time from last ICMP is 0.511995 sec
......





This is the most generic use case: receiving packets and analyzing their headers. Designed to keep up with 100Gb/s of incoming network traffic, the CUDA kernel responsible for UDP traffic (file gpu_kernels/receive_udp.cu) dedicates one CUDA block of 512 CUDA threads to each Ethernet UDP receive queue.

The data path loop is:

Receive packets using the DOCA GPUNetIO CUDA block-level function doca_gpu_dev_eth_rxq_receive_block.

Each CUDA thread works on a subset of the received packets.

The DOCA buffer containing the packet is retrieved.

The packet payload is analyzed to differentiate DNS packets from other generic UDP packets.

The packet payload is wiped out to ensure that old stale packets are not analyzed again.

Each CUDA block reports statistics about the types of received packets to the CPU thread through a DOCA GPUNetIO semaphore.

The CPU thread polls the semaphores to retrieve the statistics and print them to the console.
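The per-thread split and the DNS/other classification in this loop can be modeled on the host in plain C. The sketch rests on stated assumptions: DNS is recognized only by the well-known UDP port 53, the threads of a block are emulated sequentially, and udp_pkt_view, udp_stats, and classify_batch are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DNS_UDP_PORT 53u /* assumption: DNS is recognized by its well-known port */

/* Hypothetical parsed view of a UDP packet (ports in host byte order). */
struct udp_pkt_view { uint16_t src_port; uint16_t dst_port; };

/* Counters mirroring the DNS / OTHER / TOTAL columns of the console output. */
struct udp_stats { uint32_t dns; uint32_t other; uint32_t total; };

/* Work of one emulated thread: packets tid, tid + nthreads, tid + 2 * nthreads, ... */
static void classify_subset(const struct udp_pkt_view *pkts, size_t n,
                            unsigned tid, unsigned nthreads, struct udp_stats *st)
{
    for (size_t i = tid; i < n; i += nthreads) {
        if (pkts[i].dst_port == DNS_UDP_PORT || pkts[i].src_port == DNS_UDP_PORT)
            st->dns++;
        else
            st->other++;
        st->total++;
    }
}

/*
 * Runs every emulated thread in turn and returns the statistics for the batch.
 * On a GPU the threads run concurrently, so shared counters would need
 * atomic updates; the sequential emulation here does not.
 */
struct udp_stats classify_batch(const struct udp_pkt_view *pkts, size_t n, unsigned nthreads)
{
    assert(pkts != NULL && nthreads > 0);
    struct udp_stats st = {0, 0, 0};
    for (unsigned tid = 0; tid < nthreads; tid++)
        classify_subset(pkts, n, tid, nthreads, &st);
    return st;
}
```

The strided indexing gives each thread a disjoint subset of the batch, which is why no packet is counted twice.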

The motivation for this use case is mostly to provide an application template to:

Receive and analyze packet headers to differentiate across different UDP protocols

Report statistics to the CPU through the DOCA GPUNetIO semaphore

Several well-known packet generators, such as T-Rex or DPDK testpmd, can be used to test this mode.

By default, TCP flow management is the same as for UDP: the application receives TCP packets and analyzes their headers to report statistics about the types of received packets to the CPU. This is good for passive traffic analyzers or sniffers, but sometimes a packet processing application needs to receive packets directly from TCP peers, which implies establishing a reliable TCP connection through the 3-way handshake. Therefore, a TCP "server" mode can be enabled through the -s command-line flag. This turns on an "HTTP echo server" mode in which the CPU and GPU cooperate to establish a TCP connection and process TCP data packets.

Specifically, in this case there are two different sets of receive queues:

CPU DPDK receive queues which receive TCP "control" packets (e.g., SYN, FIN, or RST)

DOCA GPUNetIO receive queues to receive TCP "data" packets

This distinction is possible thanks to DOCA Flow capabilities.

The application's flow requires CPU and GPU collaboration as described in the following subsections.

A CPU thread receives a TCP SYN packet from a remote TCP peer through the DPDK queues. The CPU thread establishes a reliable TCP connection with the peer (replying with a TCP SYN-ACK packet) and uses DOCA Flow to create a new steering rule that redirects TCP data packets to one of the DOCA GPUNetIO receive queues. The new steering rule excludes control packets (e.g., SYN, FIN, or RST).

The CUDA kernel responsible for TCP processing receives TCP data packets and performs TCP packet header analysis. If it receives an HTTP GET request, it stores the relevant packet's info in the next item of a DOCA GPUNetIO semaphore, setting it to READY.

A second CUDA kernel responsible for HTTP processing polls the DOCA GPUNetIO semaphore. Once it detects the update of the next item to READY, it reads the HTTP GET packet info and crafts an HTTP response packet with an HTML page.

If the request is for index.html or contacts.html, the CUDA kernel replies with the appropriate HTML page using a 200 OK code. For all other requests, it returns a "Page not found" page with a 404 error code.
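The page selection can be sketched as a small routing function over the request line. This is a host-side illustration under stated assumptions, not the CUDA kernel's code: the request is assumed to start with "GET " followed by the path, and http_route and http_status are hypothetical names.

```c
#include <assert.h>
#include <string.h>

enum http_page { PAGE_INDEX, PAGE_CONTACTS, PAGE_NOT_FOUND };

/* Returns 1 if the path of length path_len equals name. */
static int path_is(const char *path, size_t path_len, const char *name)
{
    return path_len == strlen(name) && memcmp(path, name, path_len) == 0;
}

/* Picks the page to serve from the first line of an HTTP GET request. */
enum http_page http_route(const char *payload, size_t len)
{
    assert(payload != NULL);
    /* Only GET requests are expected here; anything else is treated as not found. */
    if (len < 4 || memcmp(payload, "GET ", 4) != 0)
        return PAGE_NOT_FOUND;

    /* The path runs from after "GET " to the next space (or end of payload). */
    const char *path = payload + 4;
    size_t path_len = 0;
    while (4 + path_len < len && path[path_len] != ' ')
        path_len++;

    if (path_is(path, path_len, "/index.html"))
        return PAGE_INDEX;
    if (path_is(path, path_len, "/contacts.html"))
        return PAGE_CONTACTS;
    return PAGE_NOT_FOUND;
}

/* Status code that goes with the selected page. */
int http_status(enum http_page page)
{
    return page == PAGE_NOT_FOUND ? 404 : 200;
}
```

Comparing both length and content avoids matching a longer path such as /index.html.bak against /index.html.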

HTTP response packets are sent by this second HTTP CUDA kernel using DOCA GPUNetIO.

Note: Care must be taken to maintain TCP sequence/ack numbers in the packet headers.





If the CPU receives a TCP FIN packet through the DPDK queues, it closes the connection with the remote TCP peer and removes the DOCA Flow rule from the DOCA GPUNetIO queues so that the CUDA kernel can no longer receive packets from that TCP peer.

Motivations for this use case:

Receiving and analyzing packet headers to differentiate across different TCP protocols

Processing TCP packets on GPU in passive mode (sniffing) and active mode (reliable connection)

Having a DOCA-DPDK application able to establish a reliable TCP connection without using any OS socket and bypassing kernel routines

Having CUDA-kernel-to-CUDA-kernel communication through a DOCA GPUNetIO semaphore

Showing how to create and send a packet from scratch with DOCA GPUNetIO

Assuming the network interface used to run the application has the IP address 192.168.1.1, it is possible to test this HTTP echo server mode using simple tools like curl or wget.

Example with curl: