The NVIDIA DOCA framework is the key for unlocking the potential of the BlueField DPU.

DOCA's software environment allows developers to program the DPA to accelerate workloads. Specifically, DOCA includes:

DOCA DPA SDK – a high-level SDK for application-level protocol acceleration

DOCA FlexIO SDK – a low-level SDK to load DPA programs into the DPA, manage the DPA memory, create the execution handlers and the needed hardware rings and contexts

DPACC – DPA toolchain for compiling and ELF file manipulation of the DPA code

The DPA is intended to accelerate datapath operations for the DPU and host CPU. The accelerated portion of the application using the DPA is presented as a library to the host application. The code within the library is invoked in an event-driven manner in the context of a process that is running on the DPA. One or many DPA execution units may handle the work associated with network events. The programmer specifies the conditions under which each function is called using the appropriate SDK APIs on the host or DPU.

The DPA cannot be used as a standalone CPU.

Management of the DPA, such as loading processes and allocating memory, is performed from a host or DPU process. The host process discovers the DPA capabilities on the device and drives the control plane to set up the different DPA objects. The DPA objects exist as long as the host process exists. When the host process is destroyed, the DPA objects are freed. The host process decides which functions it wants to accelerate using the DPA: either its entire data plane or only a part of it.

The following diagram illustrates the different processes that exist in the system:

DPACC is a compiler for the DPA processor. It compiles code targeted for the DPA processor into an executable and generates a DPA program. A DPA program is a host library with interfaces encapsulating the DPA executable.

This DPA program is linked with the host application to generate a host executable. The host executable can invoke the DPA code through the DPA SDK's runtime.

DPACC implements the following keywords:

__dpa_global__ – Application usage: annotate all event handlers that execute on the DPA and all common user-defined datatypes (including user-defined structures) that are passed from the host to the DPA as arguments. Comment: used by the compiler to generate entry points in the DPA executable and to automatically replicate user-defined datatypes between the host and DPA.

__dpa_rpc__ – Application usage: annotate all RPC calls that are invoked by the host and execute on the DPA. RPC calls return a value of type uint64_t. Comment: used by the compiler to generate RPC-specific entry points.

Please refer to NVIDIA DOCA DPACC Compiler for more details.

FlexIO is supported at beta level.

FlexIO is a low-level event-driven library to program and accelerate functions on the DPA.

To load an application onto the DPA, the user must create a process on the DPA, called a FlexIO process. FlexIO processes are isolated from each other like standard host OS processes.

FlexIO supports the following options for executing a user-defined function on the DPA:

FlexIO event handler – the event handler executes its function each time an event occurs. An event in this context is a completion event (CQE) received on the NIC completion queue (CQ) while the CQ is in the armed state. The event triggers an internal DPA interrupt that activates the event handler. When the event handler is activated, it is provided with a user-defined argument. In most cases, the argument is a pointer to the software execution context of the event handler. The following pseudo-code example describes how to create an event handler and attach it to a CQ:

    // Device code
    __dpa_global__ void myFunc(flexio_uintptr_t myArg)
    {
        struct my_db *db = (struct my_db *)myArg;

        get_completion(db->myCq);
        work();
        arm_cq(db->myCq);
        // reschedule the thread
        flexio_dev_thread_reschedule();
    }

    // Host code
    main()
    {
        /* Load the application code into the DPA */
        flexio_process_create(device, application, &myProcess);
        /* Create event handler to run myFunc with myArg */
        flexio_event_handler_create(myProcess, myFunc, myArg, &myEventHandler);
        /* Associate the event handler with a specific CQ */
        create_cq(&myCQ, …, myEventHandler);
        /* Start the event handler */
        flexio_event_handler_run(myEventHandler);
        …
    }

RPC – remote, synchronous, one-time call of a specific function. RPC is mainly used for the control path to update DPA memory contexts of a process. The RPC's return value is reported back to the host application. The following pseudo-code example describes how to use the RPC:

    // Device code
    __dpa_rpc__ uint64_t myFunc(myArg)
    {
        struct my_db *db = (struct my_db *)myArg;

        if (db->flag)
            return 1;
        db->flag = 1;
        return 0;
    }

    // Host code
    main()
    {
        …
        /* Load the application code into the DPA */
        flexio_process_create(device, application, &myProcess);
        /* Run the function */
        flexio_process_call(myProcess, myFunc, myArg, &returnValue);
        …
    }
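
The behavior of the RPC body above can be checked on an ordinary host without any DPA tooling. The following is a minimal sketch that strips the `__dpa_rpc__` annotation and mocks the context structure; `my_func_mock` and the layout of `struct my_db` are hypothetical names chosen for illustration and are not part of FlexIO.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the context the RPC updates in DPA memory. */
struct my_db {
    int flag;
};

/* Host-only mock of the RPC body: the first call claims the flag and
 * returns 0; every later call sees the flag already set and returns 1. */
uint64_t my_func_mock(uintptr_t my_arg)
{
    struct my_db *db = (struct my_db *)my_arg;

    assert(db != NULL);
    if (db->flag)
        return 1;
    db->flag = 1;
    return 0;
}
```

Calling `my_func_mock()` twice on the same zero-initialized context returns 0 and then 1, which is the value the host would see reported back through the RPC return value.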

The DPA process can access several memory locations:

Global variables defined in the DPA process.

Stack memory – local to the DPA execution unit. Stack memory is not guaranteed to be preserved between different executions of the same handler.

Heap memory – this is the process' main memory. The heap memory contents are preserved as long as the DPA process is active.

External registered memory – remote to the DPA but local to the server. The DPA can access any memory location that can be registered to the local NIC using the provided API. This includes BlueField DRAM, external host DRAM, GPU memory, and more.

The heap and external registered memory locations are managed from the host process. The DPA execution units can load/store from stack/heap and external memory locations. Note that for external memory locations, the window should be configured appropriately using FlexIO Window APIs.

FlexIO allows the user to allocate and populate heap memory on the DPA. The memory can later be used by the DPA application as an argument to the execution context (RPC and event handler):

    /* Load the application code into the DPA */
    flexio_process_create(device, application, &myProcess);
    /* Allocate some memory */
    flexio_buf_dev_alloc(myProcess, size, ptr);
    /* Populate it with user-defined data */
    flexio_host2dev_memcpy(myProcess, src, size, ptr);
    /* Run the function */
    flexio_process_call(myProcess, function, ptr, &returnValue);

FlexIO allows accessing external registered memory from the DPA execution units using FlexIO Window. FlexIO Window maps a memory region from the DPA process address space to an external registered memory. A memory key for the external memory region is required to be associated with the window. The memory key is used for address translation and protection. FlexIO window is created by the host process and is configured and used by the DPA handler during execution. Once configured, LD/ST from the DPA execution units access the external memory directly.

Access to external memory is not coherent. As such, explicit memory fencing is required to flush the cached data and maintain consistency. See section "Memory Fences" for more.

The following example code demonstrates the window management:

    // Device code
    __dpa_rpc__ uint64_t myFunc(arg1, arg2, arg3)
    {
        struct flexio_dev_thread_ctx *dtctx;
        flexio_dev_get_thread_ctx(&dtctx);
        uint32_t windowId = arg1;
        uint32_t mkey = arg2;
        uint64_t *dev_ptr;

        flexio_dev_window_config(dtctx, windowId, mkey);
        /* Get ptr to the external memory (arg3) from the DPA process address space */
        flexio_dev_status status = flexio_dev_window_ptr_acquire(dtctx, arg3, &dev_ptr);
        /* Will set the external memory */
        *dev_ptr = 0xff;
        /* Flush the data out */
        __dpa_thread_window_writeback();
        return 0;
    }

    // Host code
    main()
    {
        /* Load the application code into the DPA */
        flexio_process_create(device, application, &myProcess);
        /* Define a buffer on the host */
        uint64_t var = {0};
        /* Register host buffer */
        mkey = ibv_reg_mr(&var, …);
        /* Create the window */
        flexio_window_create(myProcess, doca_device->pd, mkey, &window_ctx);
        /* Run the function */
        flexio_process_call(myProcess, myFunc, flexio_window_get_id(window_ctx), mkey, &var, &returnValue);
    }

A DPA process can initiate send and receive operations using the FlexIO outbox object. The FlexIO outbox contains memory-mapped IO registers that enable the DPA application to issue device doorbells to manage the send and receive planes. The DPA outbox can be configured at run time to perform send and receive from a specific NIC function exposed by the DPU. This capability is not available to host CPUs, which can only access their assigned NIC function.

Each DPA execution engine has its own outbox. As such, each handler can efficiently use the outbox without needing to lock to protect against accesses from other handlers. To enforce the required security and isolation, the DPA outbox enables the DPA application to send and receive only for queues created by the DPA host process and only for NIC functions the process is allowed to access.

Like the FlexIO window, the FlexIO outbox is created by the host process and configured and used at run time by the DPA process.

    // Device code
    __dpa_rpc__ uint64_t myFunc(arg1, arg2, arg3)
    {
        struct flexio_dev_thread_ctx *dtctx;
        flexio_dev_get_thread_ctx(&dtctx);
        uint32_t outbox = arg1;

        flexio_dev_outbox_config(dtctx, outbox);
        /* Create some WQE and post it on the SQ */
        /* Send DB on the SQ */
        flexio_dev_qp_sq_ring_db(dtctx, sq_pi, arg3);
        /* Poll CQ (CQ number is in arg2) */
        return 0;
    }

    // Host code
    main()
    {
        /* Load the application code into the DPA */
        flexio_process_create(device, application, &myProcess);
        /* Allocate UAR */
        uar = ibv_alloc_uar(ibv_ctx);
        /* Create queues */
        flexio_cq_create(myProcess, ibv_ctx, uar, cq_attr, &myCQ);
        my_hwcq = flexio_cq_get_hw_cq(myCQ);
        flexio_sq_create(myProcess, ibv_ctx, myCQ, uar, sq_attr, &mySQ);
        my_hwsq = flexio_sq_get_hw_sq(mySQ);
        /* Outbox will allow access only for queues created with the same UAR */
        flexio_outbox_create(myProcess, ibv_ctx, uar, &myOutbox);
        /* Run the function */
        flexio_process_call(myProcess, myFunc, myOutbox, my_hwcq->cq_num, my_hwsq->sq_num, &returnValue);
    }

The DPA execution units support atomic instructions to protect from concurrent access to the DPA process heap memory. Using those instructions, multiple synchronization primitives can be designed.

FlexIO currently supports basic spin lock primitives. More advanced thread pipelining can be achieved using DOCA DPA events.
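
To illustrate how a synchronization primitive can be built from an atomic instruction, here is a minimal spin lock sketch in portable C11. It is not the FlexIO spin lock API: the `toy_spin_*` names and the `struct toy_spinlock` type are hypothetical, and the sketch only assumes that an atomic test-and-set operation is available.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical lock type built on a single atomic flag. */
struct toy_spinlock {
    atomic_flag busy;
};

/* Put the lock into the unlocked state. */
void toy_spin_init(struct toy_spinlock *l)
{
    assert(l != NULL);
    atomic_flag_clear_explicit(&l->busy, memory_order_release);
}

/* Returns true if the lock was free and is now held by the caller. */
bool toy_spin_trylock(struct toy_spinlock *l)
{
    return !atomic_flag_test_and_set_explicit(&l->busy, memory_order_acquire);
}

/* Busy-wait until the lock is acquired. */
void toy_spin_lock(struct toy_spinlock *l)
{
    while (!toy_spin_trylock(l))
        ; /* spin */
}

/* Release the lock so another handler can take it. */
void toy_spin_unlock(struct toy_spinlock *l)
{
    atomic_flag_clear_explicit(&l->busy, memory_order_release);
}
```

The acquire ordering on lock and release ordering on unlock keep accesses to the protected data inside the critical section, which is the property a lock guarding shared heap memory needs.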

DOCA DPA is supported at alpha level.

The DOCA DPA SDK eases DPA code management by providing high-level primitives for DPA work offloading, synchronization, and communication. This leads to simpler code but lacks the low-level control that FlexIO SDK provides.

User-level applications and libraries wishing to utilize the DPA to offload their code may choose DOCA DPA. Use-cases closer to the driver level and requiring access to low-level NIC features would be better served using FlexIO.

The implementation of DOCA DPA is based on the FlexIO API. The higher level of abstraction enables the user to focus on their program logic and not the low-level mechanics.

The work submission APIs enable a host application to invoke a function on the DPA and supply it with arguments. The work is executed in an asynchronous manner only when the work's dependencies are satisfied. Using this model, the user can define an arbitrary sequence of self-triggering work on the DPA. To borrow common CUDA terminology, these functions are called "kernels". This frees up the host's CPU to focus on its tasks after submitting the list of work to the DPA.

The following is an example where the host submits three functions, func1 , func2 , and func3 that execute one after the other. The functions are chained using a DPA sync event, which is an abstract data type that contains a 64-bit counter.

    doca_dpa_kernel_launch_update_add(dpa, event, 0, event, 1, nthreads, func1, <args>);
    doca_dpa_kernel_launch_update_add(dpa, event, 1, event, 1, nthreads, func2, <args>);
    doca_dpa_kernel_launch_update_add(dpa, event, 2, event, 1, nthreads, func3, <args>);
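
The chaining described above can be modeled on a host with a plain counter. The sketch below is a toy simulation, not DOCA code: every `toy_*` name is hypothetical, and it assumes that a queued launch fires once the counter is strictly greater than its wait value (mirroring the `wait_gt` naming used by the sync event calls) and then adds its completion count to the same counter. The real triggering rules are defined by DOCA.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_LAUNCHES 8

typedef void (*toy_kernel_fn)(void);

/* One queued launch: run fn once counter > wait_thresh, then add comp_count. */
struct toy_launch {
    uint64_t wait_thresh;
    uint64_t comp_count;
    toy_kernel_fn fn;
    int done;
};

/* Toy event: a 64-bit counter plus the launches waiting on it. */
struct toy_event {
    uint64_t counter;
    struct toy_launch q[MAX_LAUNCHES];
    size_t n;
};

/* Queue a launch that both waits on and completes to the same event. */
void toy_launch_update_add(struct toy_event *ev, uint64_t wait_thresh,
                           uint64_t comp_count, toy_kernel_fn fn)
{
    assert(ev->n < MAX_LAUNCHES);
    ev->q[ev->n++] = (struct toy_launch){ wait_thresh, comp_count, fn, 0 };
}

/* Run every launch whose wait condition holds, repeating until nothing
 * new fires. Returns the number of kernels that ran. */
size_t toy_event_progress(struct toy_event *ev)
{
    size_t ran = 0;
    int fired;

    do {
        fired = 0;
        for (size_t i = 0; i < ev->n; i++) {
            struct toy_launch *l = &ev->q[i];

            if (!l->done && ev->counter > l->wait_thresh) {
                l->fn();
                l->done = 1;
                ev->counter += l->comp_count;
                fired = 1;
                ran++;
            }
        }
    } while (fired);
    return ran;
}

/* Signal the event from outside the chain, then let ready work run. */
size_t toy_event_signal_add(struct toy_event *ev, uint64_t value)
{
    ev->counter += value;
    return toy_event_progress(ev);
}

/* Example kernels that record the order in which they ran. */
int toy_order[MAX_LAUNCHES];
size_t toy_order_len;

void func1(void) { toy_order[toy_order_len++] = 1; }
void func2(void) { toy_order[toy_order_len++] = 2; }
void func3(void) { toy_order[toy_order_len++] = 3; }
```

Under this toy rule, queuing the three functions with wait values 0, 1, and 2 runs nothing while the counter is 0. A single external `toy_event_signal_add(&ev, 1)` then releases func1, whose completion releases func2, which in turn releases func3.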

For more information, refer to DOCA Sync Event in the DOCA Core.

The previous example demonstrates how events can be used to chain functions together for execution on the DPA. In addition to triggered scheduling, events can be signaled directly by the CPU, GPU, DPA, or remote nodes. This provides flexibility in coordinating work on the DPA.

The following are some examples of how events can be used:

Signaling and waiting from CPU (host or DPU's Arm) – the CPU thread signals the event by updating its counter. The event can control the execution flow on the DPA. Using the wait operation, the CPU thread can wait in either polling or blocking mode until the corresponding event is signaled.

CPU signals an event:

    doca_sync_event_update_<add|set>(event, value)

CPU waits for an event:

    doca_sync_event_wait_<gt|gt_yield>(event, value, mask)

Signaling from the DPA from within a kernel – the event is written to in the user's kernel during its execution. When waiting, the DPA kernel thread waits until the event value is satisfied (e.g., greater than).

DPA kernel signals an event:

    doca_dpa_dev_sync_event_update_<add|set>(event, value)

DPA kernel waits for an event:

    doca_dpa_dev_sync_event_wait_gt(event, value, mask)

Signaling from remote nodes – the event is written by the remote side. The remote node may update the count of a sync event in the target node that it has connected to via an RDMA handle.

Remote node signals an event:

    doca_dpa_dev_rdma_signal_<add|set>(rdma, event, count);

The following example demonstrates how to construct a pipeline of functions on the DPA using events:

    /* Host */
    main()
    {
        // Create event for usage on DPA, i.e., subscriber=DPA, publisher=DPA
        doca_sync_event_create(&event);
        doca_sync_event_publisher_add_location_dpa(event, ctx);
        doca_sync_event_subscriber_add_location_dpa(event, ctx);
        doca_sync_event_start(event);
        // Export a handle representing the event for DPA to use
        doca_sync_event_export_to_dpa(event, ctx, &event_handle);
    }

    /* DPA: func1 -> func2 */
    __dpa_global__ void func1(args)
    {
        work1();
        // Signal next thread by adding to event counter
        doca_dpa_dev_sync_event_update_add(event_handle, 1);
    }

    __dpa_global__ void func2(args)
    {
        // Wait for event counter to be greater than 0 (i.e., 1)
        doca_dpa_dev_sync_event_wait_gt(event_handle, 0, ...);
        work2();
        // Signal next thread by adding to event counter
        doca_dpa_dev_sync_event_update_add(event_handle, 1);
    }

The DPA program can access several memory spaces using the provided APIs. The following presents models to access DPA process heap, host memory, GPU memory, and NIC device memory:

DPA process heap – this is the DPA process' main memory. The memory may either be in the stack or on the heap. Heap allocations must be obtained using the doca_dpa_mem_alloc() API. The low-level memory model in this space is determined by the processor architecture.

Host memory – this is the address space of the host program. Handles to this memory space can be obtained by DOCA Buff Array API. DMA access to this space is provided from the DPA using the doca_dpa_dev_memcpy() routine. For more information, refer to DOCA Core Inventories.

Communication APIs are currently implemented for InfiniBand only.

The communication APIs enable the application to initiate and be a target of remote memory operations. The communication APIs are implemented over RDMA transport on InfiniBand.

All communications are initiated on an RDMA context. An RDMA context is an opaque representation of a queue pair (QP). RDMAs can be either Reliable Connected or Reliable Dynamic Connected Transport. RDMAs are created on the host-side of the application and then a handle to the RDMA can be passed to the DPA-side of the application.

The following code demonstrates a ping-pong operation between two processes, each with their code offloaded to a DPA. The program uses remote memory access semantics to transfer host memory from one node to the other. The DPA initiates the data transfer and detects its completion.

For more information related to the following code section, refer to DOCA Core Inventories and DOCA RDMA.

    /* Host */
    main()
    {
        // Create an RDMA instance
        doca_rdma_create(doca_dev, &rdma);
        // Set RDMA datapath to DPA
        doca_ctx_set_datapath_on_dpa(rdma_as_doca_ctx, dpa_ctx);
        // Set other RDMA attributes
        // Start RDMA ctx
        doca_ctx_start(rdma_as_doca_ctx);
        // Export RDMA to get connection info to pass to the remote side
        doca_rdma_export(rdma, &connection_info);
        // Application does out-of-band passing of address to remote side
        // Assume connection info is now in 'rem_rdma_info'
        // Connect my RDMA to the remote RDMA
        doca_rdma_connect(rdma, rem_rdma_info);
        // Get the RDMA DPA handle to pass to the kernel
        doca_rdma_get_dpa_handle(rdma, &rdma_dpa_handle);
        // Allocate local buffer in host memory
        local_buf = malloc(size);
        // Register buffer for remote access and obtain an object representing the memory
        doca_mmap_create(&local_mmap);
        doca_mmap_set_memrange(local_mmap, local_buf, size);
        // Obtain memory export to pass to remote side
        doca_mmap_export_rdma(local_mmap, &mem_export);
        // Create buff inventory from local mmap
        doca_buf_arr_create(local_mmap, &buf_arr);
        // Set the target of memory representor to DPA
        doca_buf_arr_set_target_dpa(buf_arr, dpa_ctx);
        // Get DPA handle to pass to kernel
        doca_buf_arr_start(buf_arr);
        doca_buf_arr_get_dpa_handle(buf_arr, &buf_arr_dpa_handle);
        // Allocate an event that can be signaled by the remote side
        // to indicate that message is ready to be read
        doca_sync_event_create(&event);
        doca_sync_event_publisher_add_location_dpa(event, ctx);
        doca_sync_event_subscriber_add_location_dpa(event, ctx);
        doca_sync_event_start(event);
        // Obtain a handle to event that can be passed to the remote side
        doca_sync_event_export_remote(event, &remote_event_handle);
        // OOB Pass the remote event handle to the other side
        // OOB Pass mmap export to remote side
        …
    }

    /* DPA */
    func()
    {
        // Write contents of local_buf to remote_buf
        doca_dpa_dev_rdma_write(rdma_handle, remote_buf, local_buf, size);
        // Signal the remote side by atomically setting remote_event to 1
        doca_dpa_dev_rdma_signal_set(rdma_handle, remote_event, 1);
        // Wait for my partner (remote node) write to complete
        doca_dpa_dev_sync_event_wait_gt(local_event, 0);
        […]
    }

The DPA offers a coherent but weakly ordered memory model. The application is required to use fences to impose the desired memory ordering. Additionally, where applicable, the application is required to write back data for the data to be visible to NIC engines (see the coherency table).

The memory model offers "same address ordering" within a thread. This means that, if a thread writes to a memory location and subsequently reads that memory location, the read returns the contents that have previously been written.

The memory model offers 8-byte atomicity for aligned accesses to atomic datatypes. This means that all eight bytes of read and write are performed in one indivisible transaction.

The DPA does not support unaligned accesses, such as accessing N bytes of data from an address not evenly divisible by N .
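
The alignment rule can be expressed as a one-line check. The sketch below is generic C rather than anything provided by the DPA toolchain: `is_naturally_aligned` and `load_u64_any` are hypothetical helper names. The second helper shows the usual portable way to read a value from a buffer whose alignment is unknown, by copying the bytes into an aligned local variable instead of dereferencing a misaligned pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* True if an n-byte access at addr is naturally aligned,
 * i.e., addr is evenly divisible by n. */
bool is_naturally_aligned(uintptr_t addr, size_t n)
{
    assert(n != 0);
    return addr % n == 0;
}

/* Read 8 bytes from a possibly unaligned location by copying them
 * into an aligned local variable. */
uint64_t load_u64_any(const void *p)
{
    uint64_t v;

    memcpy(&v, p, sizeof(v));
    return v;
}
```

For example, an 8-byte access at address 0x1004 fails the check, while a 4-byte access at the same address passes.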

The DPA process memory can be divided into the following memory spaces:

Heap – memory locations within the DPA process heap. Referenced as __DPA_HEAP in the code.

Memory – memory locations belonging to the DPA process (including stack, heap, BSS, and data segment) except the memory-mapped IO. Referenced as __DPA_MEMORY in the code.

MMIO (memory-mapped I/O) – external memory outside the DPA process accessed via memory-mapped IO. Window and Outbox accesses are considered MMIO. Referenced as __DPA_MMIO in the code.

System – all memory locations accessible to the thread within the Memory and MMIO spaces as described above. Referenced as __DPA_SYSTEM in the code.

The coherency between the DPA threads and NIC engines is described in the following table:

Producer: DPA thread; observer: NIC engine – not coherent. Data to be read by the NIC must be written back using the appropriate intrinsic (see section "Memory Fence and Cache Control Usage Examples").

Producer: NIC engine; observer: DPA thread – coherent. Data written by the NIC is eventually visible to the DPA threads. The order in which the writes are visible to the DPA threads is influenced by the ordering configuration of the memory region (see IBV_ACCESS_RELAXED_ORDERING). In a typical example of the NIC writing data and generating a completion entry (CQE), it is guaranteed that when the write to the CQE is visible, the DPA thread can read the data without additional fences.

Producer: DPA thread; observer: DPA thread – coherent. Data written by a DPA thread is eventually visible to the other DPA threads without additional fences. The order in which writes made by a thread are visible to other threads is undefined when fences are not used. Programmers can enforce ordering of updates using fences (see section "Memory Fences").

Fence APIs are intended to impose memory access ordering. The fence operations are defined on the different memory spaces. See information on memory spaces under section "Memory Model".

The fence APIs apply ordering between the operations issued by the calling thread. As a performance note, the fence APIs also have a side effect of writing back data to the memory space used in the fence operation. However, programmers should not rely on this side effect. See section "Cache Control" for explicit cache control operations. The fence APIs have an effect of a compiler-barrier which means that memory accesses are not reordered around the fence API invocation by the compiler.

A fence applies between the "predecessor" and the "successor" operations. The predecessor and successor operations can be referenced using __DPA_R, __DPA_W, and __DPA_RW in the code.

The generic memory fence operation can operate on any memory space and any set of predecessor and successor operations. The other fence operations are provided as convenient shortcuts that are specific to the use case. It is preferable for programmers to use the shortcuts when possible.

Fence operations can be included using the dpaintrin.h header file.

    void __dpa_thread_fence(memory_space, pred_op, succ_op);

This fence can apply to any DPA thread memory space. Memory spaces are defined under section "Memory Model". The fence ensures that all operations ( pred_op ) performed by the calling thread, before the call to __dpa_thread_fence() , are performed and made visible to all threads in the DPA, host, NIC engines, and peer devices as occurring before all operations ( succ_op ) to the memory space after the call to __dpa_thread_fence() .

    void __dpa_thread_system_fence();

This is equivalent to calling __dpa_thread_fence(__DPA_SYSTEM, __DPA_RW, __DPA_RW) .

    void __dpa_thread_outbox_fence(pred_op, succ_op);

This is equivalent to calling __dpa_thread_fence(__DPA_MMIO, pred_op, succ_op) .

    void __dpa_thread_window_fence(pred_op, succ_op);

This is equivalent to calling __dpa_thread_fence(__DPA_MMIO, pred_op, succ_op) .

    void __dpa_thread_memory_fence(pred_op, succ_op);

This is equivalent to calling __dpa_thread_fence(__DPA_MEMORY, pred_op, succ_op) .

Cache control operations allow the programmer to exercise fine-grained control over data resident in the DPA's caches. They have an effect of a compiler-barrier. The operations can be included using the dpaintrin.h header file.

    void __dpa_thread_window_read_inv();

The DPA can cache data that was fetched from external memory using a window. Subsequent memory accesses to the window memory location may return the data that is already cached. In some cases, the programmer must force a read of external memory (see example under "Polling Externally Set Flag"). In such a situation, the cached window contents must be dropped.

This function ensures that contents in the window memory space of the thread before the call to __dpa_thread_window_read_inv() are invalidated before read operations made by the calling thread after the call to __dpa_thread_window_read_inv() .

    void __dpa_thread_window_writeback();

Writes to external memory must be explicitly written back to be visible to external entities.

This function ensures that contents in the window space of the thread before the call to __dpa_thread_window_writeback() are performed and made visible to all threads in the DPA, host, NIC engines, and peer devices as occurring before any write operation after the call to __dpa_thread_window_writeback() .

    void __dpa_thread_memory_writeback();

Writes to DPA memory space may need to be written back. For example, the data must be written back before the NIC engines can read it. Refer to the coherency table for more.

This function ensures that the contents in the memory space of the thread before the call to __dpa_thread_memory_writeback() are performed and made visible to all threads in the DPA, host, NIC engines, and peer devices as occurring before any write operation after the call to __dpa_thread_memory_writeback().

These examples illustrate situations in which programmers must use fences and cache control operations.

In most situations, such direct usage of fences is not required by the application using FlexIO or DOCA DPA SDKs as fences are used within the APIs.

In this example, a thread on the DPA prepares a work queue element (WQE) that is read by the NIC to perform the desired operation.

The ordering requirement is to ensure that the WQE data contents are visible to the NIC engines before they read it. The NIC only reads the WQE after the doorbell (MMIO operation) is performed. Refer to the coherency table.

User code – WQE present in DPA memory:

1. Write WQE – write to memory locations in the DPA (memory space = __DPA_MEMORY).
2. __dpa_thread_memory_writeback(); – cache control operation.
3. Write doorbell – MMIO operation via Outbox.

In some cases, the WQE may be present in external memory. See the description of flexio_qmem above. The table of operations in such a case is below.

User code – WQE present in external memory:

1. Write WQE – write to memory locations in the DPA (memory space = __DPA_MMIO).
2. __dpa_thread_window_writeback(); – cache control operation.
3. Write doorbell – MMIO operation via Outbox.

In this example, a thread on the DPA is writing a WQE for a receive queue and advancing the queue's producer index. The DPA thread must order its writes and write back the doorbell record contents so that the NIC engine can read them.

User code – WQE present in DPA memory:

1. Write WQE – write to memory locations in the DPA (memory space = __DPA_MEMORY).
2. __dpa_thread_memory_fence(__DPA_W, __DPA_W); – order the write to the doorbell record with respect to the WQE.
3. Write doorbell record – write to memory locations in the DPA (memory space = __DPA_MEMORY).
4. __dpa_thread_memory_writeback(); – ensure that the contents of the doorbell record are visible to the NIC engine.

In this example, a thread on the DPA is polling on a flag that will be updated by the host or other peer device. The memory is accessed by the DPA thread via a window. The DPA thread must invalidate the contents so that the underlying hardware performs a read.

User code – flag present in external memory (flag is a memory location read using a window):

    while (!flag) {
        __dpa_thread_window_read_inv();
    }

In this example, a thread on the DPA is writing a data value and communicating that the data is written to another thread via a flag write. The data and flag are both in DPA memory.

Initial condition: flag = 0.

User code – thread 1:

    var1 = x;                                       // write to var1
    __dpa_thread_memory_fence(__DPA_W, __DPA_W);    // write to flag cannot bypass write to var1
    flag = 1;

User code – thread 2:

    while (*((volatile int *)&flag) != 1);          // flag is accessed as a volatile variable, so the
                                                    // compiler preserves the intended program order of reads
    var_t2 = var1;
    assert(var_t2 == x);                            // var_t2 must be equal to x
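
For readers who want to experiment with this pattern on an ordinary host, the following sketch expresses the same data-then-flag handoff with standard C11 atomics. It is an analogy under stated assumptions, not DPA code: `producer` and `consumer` are hypothetical names, and the acquire fence on the reader side is what the C11 memory model requires, which is not necessarily what the DPA requires.

```c
#include <assert.h>
#include <stdatomic.h>

int var1;          /* payload, plain (non-atomic) data */
atomic_int flag;   /* 0 = not ready, 1 = ready */

/* Thread 1: publish the payload, then raise the flag. */
void producer(int x)
{
    var1 = x;
    /* Keep the payload write ordered before the flag write. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&flag, 1, memory_order_relaxed);
}

/* Thread 2: wait for the flag, then read the payload. */
int consumer(void)
{
    while (atomic_load_explicit(&flag, memory_order_relaxed) != 1)
        ; /* spin until the producer raises the flag */
    /* Pairs with the release fence in producer() under C11 rules. */
    atomic_thread_fence(memory_order_acquire);
    return var1;
}
```

The release fence plays the role of the write-to-write fence in thread 1: without it, the flag write could become visible before the payload write.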

In this example, a thread on the DPA sets a flag that is observed by a peer device. The flag is written using a window.

User code – flag present in external memory:

    flag = data;                        // flag is updated in local DPA memory
    __dpa_thread_window_writeback();    // contents from DPA memory for the window are written to external memory

In this example, a thread on the DPA reads a NIC completion queue and updates its consumer index.

First, the DPA thread polls the memory location for the next expected CQE. When the CQE is visible, the DPA thread processes it. After processing is complete, the DPA thread updates the CQ's consumer index. The consumer index is read by the NIC to determine whether a completion queue entry has been read by the DPA thread. The consumer index is used by the NIC to monitor a potential completion queue overflow situation.

User code – CQE in DPA memory:

    while (((*(volatile uint8_t *)&cq->op_own) & 0x1) == hw_owner);

Poll the CQ owner bit in DPA memory until the value indicates the CQE is in software ownership. The coherency model ensures that the update to the CQ is visible to the DPA execution unit without additional fences or cache control operations. The coherency model also ensures that data in the CQE or referenced by it is visible when the CQE changes ownership to software.

    process_cqe();

The user processes the CQE according to the application's logic.

    cq->cq_index++;    // next CQ index. Handle wraparound if necessary

Calculate the next CQ index, taking into account any wraparound of the CQ depth.

    update_cq_dbr(cq, cq_index);    // writes cq_index to DPA memory

Memory operation to write the new consumer index.

    __dpa_thread_memory_writeback();

Ensures that the write to the CQ's consumer index is visible to the NIC. Depending on the application's logic, the __dpa_thread_memory_writeback() may be coalesced or eliminated if the CQ is configured in overrun ignore mode.

    arm_cq();

Arm the CQ to generate an event if this handler is going to call flexio_dev_thread_reschedule(). Arming the CQ is not required if the handler calls flexio_dev_thread_finish().
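
The owner-bit polling and consumer-index bookkeeping above can be tried out with a small host-side model. This is a toy simulation, not FlexIO or NIC code: all `toy_*` names are hypothetical, and it assumes the owner bit written by the producer toggles on every pass over the ring, which is a common convention for such rings and not something the text above specifies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_LOG_DEPTH 2
#define TOY_DEPTH (1u << TOY_LOG_DEPTH)

struct toy_cqe {
    uint32_t data;
    uint8_t op_own;   /* bit 0 is the owner bit */
};

struct toy_cq {
    struct toy_cqe ring[TOY_DEPTH];
    uint32_t pi;      /* producer index (the "NIC" side) */
    uint32_t ci;      /* consumer index (the handler side) */
};

/* Owner bit value for the pass over the ring that index idx belongs to. */
uint8_t toy_owner_for(uint32_t idx)
{
    return (uint8_t)((idx >> TOY_LOG_DEPTH) & 0x1);
}

/* Mark every entry as not yet valid for the first pass. */
void toy_cq_init(struct toy_cq *cq)
{
    cq->pi = 0;
    cq->ci = 0;
    for (uint32_t i = 0; i < TOY_DEPTH; i++) {
        cq->ring[i].data = 0;
        cq->ring[i].op_own = 1;
    }
}

/* "NIC" side: write a completion. Fails if the ring would overflow,
 * which is why the producer needs to know the consumer index. */
bool toy_cq_produce(struct toy_cq *cq, uint32_t data)
{
    if (cq->pi - cq->ci == TOY_DEPTH)
        return false;
    struct toy_cqe *e = &cq->ring[cq->pi & (TOY_DEPTH - 1)];
    e->data = data;
    e->op_own = toy_owner_for(cq->pi);
    cq->pi++;
    return true;
}

/* Handler side: non-blocking poll of the next expected entry. */
bool toy_cq_poll(struct toy_cq *cq, uint32_t *data)
{
    const struct toy_cqe *e = &cq->ring[cq->ci & (TOY_DEPTH - 1)];

    if ((e->op_own & 0x1) != toy_owner_for(cq->ci))
        return false;   /* entry not yet written for this pass */
    *data = e->data;
    cq->ci++;           /* advance consumer index, wrapping via the mask */
    return true;
}
```

In this model a stale entry from the previous pass carries the opposite owner bit, so the poll rejects it without any separate "valid" flag having to be cleared.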

The DPA supports some platform-specific operations. These can be accessed using the functions described in the following subsections. The operations can be included using the dpaintrin.h header file.

    uint64_t __dpa_thread_cycles();

Returns a counter containing the number of cycles from an arbitrary start point in the past on the execution unit the thread is currently scheduled on.

Note that the value returned by this function is meaningful only while the thread remains associated with this execution unit.

This function also acts as a compiler barrier, preventing the compiler from moving instructions around the location where it is used.

    uint64_t __dpa_thread_time();

Returns the number of timer ticks from an arbitrary start point in the past on the execution unit the thread is currently scheduled on.

Note that the value returned by this function is meaningful only while the thread remains associated with this execution unit.

This intrinsic also acts as a compiler barrier, preventing the compiler from moving instructions around the location where the intrinsic is used.

    uint64_t __dpa_thread_inst_ret();

Returns a counter containing the number of instructions retired from an arbitrary start point in the past by the execution unit the thread is currently scheduled on.

Note that the value returned by this function is meaningful only while the thread remains associated with this execution unit.

This intrinsic also acts as a compiler barrier, preventing the compiler from moving instructions around the location where the intrinsic is used.

    int __dpa_fxp_log2(unsigned int);

This function evaluates the fixed point Q16.16 base 2 logarithm. The input is an unsigned integer.

    int __dpa_fxp_rcp(int);

This function evaluates the fixed point Q16.16 reciprocal (1/x) of the value provided.

    int __dpa_fxp_pow2(int);

This function evaluates the fixed point Q16.16 power of 2 of the provided value.
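
For readers unfamiliar with the Q16.16 format used by these functions, the following host-side sketch shows how values are encoded: a 32-bit integer whose upper 16 bits hold the integer part and whose lower 16 bits hold the fraction, so 1.0 is stored as 65536. The `q16_*` helpers are hypothetical and are not part of dpaintrin.h. The reciprocal helper is only a software reference for sanity checks and is not claimed to match the intrinsic bit for bit.

```c
#include <assert.h>
#include <stdint.h>

#define Q16_ONE (1 << 16)   /* 1.0 in Q16.16 */

/* Convert a small integer to Q16.16. */
int32_t q16_from_int(int16_t v)
{
    return (int32_t)v * Q16_ONE;
}

/* Convert Q16.16 to double, e.g., for printing or debugging. */
double q16_to_double(int32_t q)
{
    return (double)q / Q16_ONE;
}

/* Software reference for 1/x in Q16.16: (1.0 in Q32.32) divided by the
 * raw Q16.16 value, truncated toward zero. */
int32_t q16_rcp_ref(int32_t q)
{
    assert(q != 0);
    int64_t r = ((int64_t)1 << 32) / q;
    /* Inputs very close to zero have reciprocals outside the Q16.16 range. */
    assert(r >= INT32_MIN && r <= INT32_MAX);
    return (int32_t)r;
}
```

For example, 2.0 is encoded as 131072, and its reference reciprocal is 32768, which decodes to 0.5.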