PVA Accelerated Primitives Library

A library of device-side APIs to accomplish common operations on data resident in VMEM.

Each of the algorithms in PVA APL uses a handle prefixed with PvaApl and exposes three APIs:

pvaAplInit: initializes the algorithm parameters that stay the same across tiles
pvaAplUpdate: retargets the algorithm to different buffers between DMA tiles
pvaAplExec: starts the algorithm

In addition, the global API pvaAplWait stalls the VPU until the currently running algorithm has completed.
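The call order described above can be sketched with a host-side mock. Everything prefixed with `mock` or `Mock` below is hypothetical and is not part of PVA APL; the real handle types and function signatures are defined per algorithm and are not shown in this section. The sketch only illustrates the intended lifecycle: initialize once, then update, execute, and wait for each tile.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a PvaApl* handle; not a real PVA APL type. */
typedef struct {
    int gain;        /* parameter fixed at init, shared by all tiles */
    size_t count;    /* elements per tile, also fixed at init */
    const int *src;  /* input buffer of the current tile */
    int *dst;        /* output buffer of the current tile */
} MockAplHandle;

/* Init: set the parameters that stay the same for every tile. */
static void mockAplInit(MockAplHandle *h, int gain, size_t count) {
    h->gain = gain;
    h->count = count;
    h->src = NULL;
    h->dst = NULL;
}

/* Update: point the handle at the buffers of the next tile. */
static void mockAplUpdate(MockAplHandle *h, const int *src, int *dst) {
    h->src = src;
    h->dst = dst;
}

/* Exec: start the algorithm. This mock runs synchronously and simply
 * scales each element by the gain chosen at init. */
static void mockAplExec(MockAplHandle *h) {
    assert(h->src != NULL && h->dst != NULL); /* Update must come first */
    for (size_t i = 0; i < h->count; ++i) {
        h->dst[i] = h->gain * h->src[i];
    }
}

/* Wait: nothing is in flight in this synchronous mock, so it returns at once. */
static void mockAplWait(void) {}

/* Process several equally sized tiles with one handle. */
static void processTiles(const int *in, int *out,
                         size_t tiles, size_t tileLen, int gain) {
    MockAplHandle h;
    mockAplInit(&h, gain, tileLen);               /* once per algorithm */
    for (size_t t = 0; t < tiles; ++t) {
        mockAplUpdate(&h, in + t * tileLen, out + t * tileLen);
        mockAplExec(&h);
        mockAplWait();                            /* output valid after this */
    }
}
```

Keeping the tile-invariant parameters in Init and only the buffer pointers in Update means the per-tile cost of retargeting stays small, which is the split the three APIs suggest.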

The API is designed around asynchronous operation, allowing the user to pipeline a single APL algorithm with a concurrent VPU workload. However, true asynchronous operation depends on the PVA generation. Starting with Thor, the VPU has a coprocessor that allows APL executions to run concurrently with VPU code. On earlier PVA generations, where this hardware is not available, the algorithm runs synchronously on the VPU so that the same code remains compatible.
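A minimal sketch of the overlap pattern, again as a host-side mock with hypothetical names (`mockAplExec`, `mockAplWait`, `overlappedSum`). The mock defers the work until the wait call to model the rule that results are only guaranteed after waiting; real hardware may finish earlier, or run everything inside Exec on generations without the coprocessor.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record of the single job that may be in flight. */
typedef struct {
    const int *src;
    int *dst;
    size_t count;
} MockAsyncJob;

static MockAsyncJob g_pending;
static int g_busy = 0;

/* Exec: record the job and return immediately, as an asynchronous
 * accelerator would. Only one algorithm may be in flight at a time. */
static void mockAplExec(const int *src, int *dst, size_t count) {
    assert(!g_busy);
    g_pending = (MockAsyncJob){src, dst, count};
    g_busy = 1;
}

/* Wait: block until the job is done. The mock performs the work here
 * (adding 1 to each element), so the output is valid only afterwards. */
static void mockAplWait(void) {
    if (!g_busy) {
        return;
    }
    for (size_t i = 0; i < g_pending.count; ++i) {
        g_pending.dst[i] = g_pending.src[i] + 1;
    }
    g_busy = 0;
}

/* Start the mock algorithm, do unrelated work, then wait.
 * The unrelated work must not read aplOut before the wait. */
static int overlappedSum(const int *aplIn, int *aplOut,
                         const int *vpuIn, size_t count) {
    int sum = 0;
    mockAplExec(aplIn, aplOut, count);
    for (size_t i = 0; i < count; ++i) {
        sum += vpuIn[i];          /* stands in for the concurrent VPU workload */
    }
    mockAplWait();                /* aplOut is safe to read from here on */
    return sum;
}
```

Code written this way behaves the same whether Exec is asynchronous or falls back to synchronous execution, because it never touches the output before the wait.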

Each algorithm has an equivalent set of APIs with the suffix 'Vpu'. These APIs are functionally equivalent but are guaranteed to execute on the VPU, even on hardware where an asynchronous accelerator is available. Ensure that only one variant is used for a given handle. For example, if you initialized the handle with the Vpu variant, do not call Exec with the generic variant.
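One way to catch accidental mixing during development is to tag the handle with the variant that initialized it. The sketch below is a host-side illustration only: the `Mock` types and `mock` functions are hypothetical, and PVA APL itself is not stated to perform any such check.

```c
#include <assert.h>

/* Which family of calls initialized the handle (hypothetical). */
typedef enum {
    MOCK_VARIANT_GENERIC,
    MOCK_VARIANT_VPU
} MockVariant;

/* Hypothetical handle that remembers its initializing variant. */
typedef struct {
    MockVariant initVariant;
} MockAplHandle;

static void mockAplInit(MockAplHandle *h) {
    h->initVariant = MOCK_VARIANT_GENERIC;
}

static void mockAplInitVpu(MockAplHandle *h) {
    h->initVariant = MOCK_VARIANT_VPU;
}

/* Returns 1 when the calling variant matches the one used at init. */
static int mockVariantMatches(const MockAplHandle *h, MockVariant caller) {
    return h->initVariant == caller;
}

/* Each Exec variant asserts that the handle belongs to its own family. */
static void mockAplExec(const MockAplHandle *h) {
    assert(mockVariantMatches(h, MOCK_VARIANT_GENERIC));
}

static void mockAplExecVpu(const MockAplHandle *h) {
    assert(mockVariantMatches(h, MOCK_VARIANT_VPU));
}
```

With this guard, initializing through `mockAplInitVpu` and then calling `mockAplExec` trips the assertion in a debug build instead of silently mixing variants.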

Executing PVA APL APIs on Thor results in a small number of reads targeting VMEM superbank D and the superbank in which the handle is allocated. The reads occur immediately after the PVA APL API is called. These reads have lower priority than VPU-based reads and can be locked out by them, preventing further progress. Users should plan their VPU workload accordingly.

Additionally, reads and writes on input and output buffers can conflict with parallel VPU or DMA accesses of the same type. Choose VMEM superbanks carefully and plan workloads to avoid stalls caused by memory contention.

To link a VPU executable against PVA APL, pass pva_apl to the LIBS argument of the pva_device CMake function.

Groups

Harris Corner: Compute the Harris response for each input pixel in a tile.
Non-maximum suppression: Set non-maximum values in a sliding window to zero.
Separable convolution: Apply a separable convolution to a tile.

Functions

inline void pvaAplWait()

Wait until the preceding algorithm execution completes.

This is a blocking call. The output image should be ready when the call returns.