VPI - Vision Programming Interface

3.0 Release

Basic Concepts

VPI is used to implement asynchronous computing pipelines suited for real-time image processing applications. Pipelines are composed of one or more asynchronous compute streams that run algorithms on buffers in the available compute backends. Synchronization between streams is done using events.

All of these elements are described below.

Streams

A VPIStream is an asynchronous queue that executes algorithms in sequence on a given backend device. To achieve a high degree of parallelism across backends, a processing pipeline can be configured with several stages running concurrently, each one in its own VPIStream. VPI streams can collaborate with each other by exchanging data structures, with the help of the synchronization primitives provided by VPI.
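
For illustration, here is a minimal sketch in C of creating a stream, submitting one algorithm to it, and waiting for completion. The Gaussian filter, kernel size, and sigma values are arbitrary choices for the example, and error checking is omitted (every vpi* call returns a VPIStatus that a real application should inspect).

    #include <vpi/Stream.h>
    #include <vpi/Image.h>
    #include <vpi/algo/GaussianFilter.h>

    void blur_on_cuda(VPIImage input, VPIImage output)
    {
        VPIStream stream = NULL;

        /* The flags select which backends the stream may use; 0 enables all
           backends available on the device. */
        vpiStreamCreate(0, &stream);

        /* Submission is asynchronous: the call enqueues the task and returns
           immediately while the CUDA backend executes it in the background. */
        vpiSubmitGaussianFilter(stream, VPI_BACKEND_CUDA, input, output,
                                5, 5, 1.0f, 1.0f, VPI_BORDER_ZERO);

        /* Block the calling thread until all work queued so far has finished. */
        vpiStreamSync(stream);

        vpiStreamDestroy(stream);
    }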

Backends

A backend comprises the compute hardware that ultimately runs an algorithm. VPI supports the following backends: CPU, CUDA (GPU), PVA (Programmable Vision Accelerator), VIC (Video and Image Compositor), and OFA (Optical Flow Accelerator).

See the following table for information about which backends are supported on which devices.

Backend  Device/platform
CPU      All devices on x86 (Linux) and Jetson aarch64 platforms
CUDA     All devices on x86 (Linux) with a Maxwell or newer NVIDIA GPU, and all Jetson aarch64 platforms
PVA      All Jetson AGX Orin and Jetson Orin NX devices
VIC      All Jetson devices
OFA      All Jetson Orin devices
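
The backend is chosen at submission time. The sketch below (same Gaussian filter call as above, error checking omitted) restricts a stream to the CPU and CUDA backends and submits the same algorithm to each; note that tasks submitted to a single stream still execute in sequence.

    #include <vpi/Stream.h>
    #include <vpi/Image.h>
    #include <vpi/algo/GaussianFilter.h>

    void blur_on_two_backends(VPIImage in, VPIImage outCPU, VPIImage outCUDA)
    {
        VPIStream stream = NULL;

        /* Only the backends enabled here can be targeted by submissions. */
        vpiStreamCreate(VPI_BACKEND_CPU | VPI_BACKEND_CUDA, &stream);

        /* Same algorithm, two different backends; the two tasks still run one
           after the other because they share a stream. */
        vpiSubmitGaussianFilter(stream, VPI_BACKEND_CPU, in, outCPU,
                                5, 5, 1.0f, 1.0f, VPI_BORDER_ZERO);
        vpiSubmitGaussianFilter(stream, VPI_BACKEND_CUDA, in, outCUDA,
                                5, 5, 1.0f, 1.0f, VPI_BORDER_ZERO);

        vpiStreamSync(stream);
        vpiStreamDestroy(stream);
    }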

Algorithms

VPI supports computer vision algorithms for several purposes, such as stereo disparity estimation, Harris keypoint detection, and image blurring. Some algorithms require temporary resources, encapsulated in a VPIPayload, to perform their processing. A payload can be created once and then reused every time the algorithm is submitted to a stream. Payloads are often created for a given input image size; in that case, the payload cannot be reused with inputs of a different size.
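
The sketch below illustrates the payload pattern with the Harris corner detector: the payload is created once for a fixed input size and then reused on every submission. Treat it as an outline under the assumption that the detector's creation and submission signatures match recent VPI releases; error checking is omitted and the input image and output arrays are assumed to be created elsewhere.

    #include <vpi/Stream.h>
    #include <vpi/Image.h>
    #include <vpi/Array.h>
    #include <vpi/algo/HarrisCorners.h>

    void detect_corners_every_frame(VPIStream stream, VPIImage frame,
                                    VPIArray keypoints, VPIArray scores,
                                    int width, int height, int numFrames)
    {
        /* The payload holds temporary buffers sized for width x height inputs;
           it cannot be reused with inputs of a different size. */
        VPIPayload harris = NULL;
        vpiCreateHarrisCornerDetector(VPI_BACKEND_CUDA, width, height, &harris);

        VPIHarrisCornerDetectorParams params;
        vpiInitHarrisCornerDetectorParams(&params);

        for (int i = 0; i < numFrames; ++i)
        {
            /* ... update 'frame' with the current camera image ... */

            /* The same payload is reused for every frame. */
            vpiSubmitHarrisCornerDetector(stream, VPI_BACKEND_CUDA, harris,
                                          frame, keypoints, scores, &params);
            vpiStreamSync(stream);
        }

        vpiPayloadDestroy(harris);
    }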

Data Buffers

VPI encapsulates the data that algorithms work with into buffers. It provides abstractions for 2D images, 1D data arrays, and 2D image pyramids, and it can allocate and manage these buffers itself. Additionally, for images and arrays, VPI can wrap externally allocated memory so that algorithms use it directly. In both cases, VPI attempts to achieve high throughput by means of zero-copy (shared) memory mapping to the target backend. If VPI cannot use zero-copy memory mapping, usually due to alignment issues or other memory characteristics, it seamlessly performs deep memory copies as needed.

2D Images

VPI represents a 2D image as a single block of memory with a specified width, height, and image format. The size and format are defined at construction time and cannot be changed afterwards.
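
For example, a VPI-managed 640x480 grayscale image can be allocated as sketched below (dimensions and format are arbitrary; error checking omitted):

    #include <vpi/Image.h>

    VPIImage image = NULL;

    /* Width, height, and format are fixed for the lifetime of the image. */
    vpiImageCreate(640, 480, VPI_IMAGE_FORMAT_U8, 0, &image);

    /* ... use the image as algorithm input and/or output ... */

    vpiImageDestroy(image);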

1D Arrays

VPI 1D arrays are linear blocks of memory with a given element type and capacity. Capacity is measured in elements of that type. Unlike an image, an array's size can vary over time, although it must not exceed the array's capacity.
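
A sketch of allocating an array follows. The capacity of 8192 elements is arbitrary, and the VPI_ARRAY_TYPE_KEYPOINT_F32 type name is assumed from recent VPI releases; error checking is omitted.

    #include <vpi/Array.h>

    VPIArray keypoints = NULL;

    /* Capacity (8192 keypoints here) is fixed at creation time. */
    vpiArrayCreate(8192, VPI_ARRAY_TYPE_KEYPOINT_F32, 0, &keypoints);

    /* The size (number of valid elements) can change over time; algorithms
       update it, and it can also be set explicitly. */
    vpiArraySetSize(keypoints, 0);

    vpiArrayDestroy(keypoints);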

2D Image Pyramids

A pyramid is a collection of 2D images with the same format. Image pyramids are defined by the following parameters (a creation sketch follows the list):

  • The number of levels, from fine to coarse
  • The width and height of the finest level
  • The scale from one level to the next
  • The image format
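
As a sketch, the call below allocates a 4-level pyramid whose finest level is 640x480, with each coarser level scaled by 0.5 (the values are arbitrary; error checking omitted):

    #include <vpi/Pyramid.h>

    VPIPyramid pyramid = NULL;

    /* Finest-level width and height, format, number of levels, and scale. */
    vpiPyramidCreate(640, 480, VPI_IMAGE_FORMAT_U8, 4, 0.5f, 0, &pyramid);

    /* ... e.g. fill it with the Gaussian pyramid generator algorithm ... */

    vpiPyramidDestroy(pyramid);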

User-Allocated Memory Wrapping

If a user application has existing memory buffers that must serve as inputs and/or outputs of VPI algorithms, you can wrap them into a VPIImage or a VPIArray. This is the case, for example, when the main loop grabs a frame from a camera device and feeds it to a VPI processing pipeline. Depending on the frame characteristics, VPI uses this memory block directly, without making any copies. This is another instance of zero-copy memory mapping.
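
The sketch below wraps an existing pitch-linear grayscale buffer in host memory as a VPIImage. The VPIImageData field names follow recent VPI releases and should be checked against the headers of the installed version; error checking is omitted.

    #include <string.h>
    #include <vpi/Image.h>

    VPIImage wrap_grayscale_frame(void *pixels, int width, int height, int pitchBytes)
    {
        VPIImageData data;
        memset(&data, 0, sizeof(data));

        /* Describe the externally allocated, pitch-linear host buffer. */
        data.bufferType                        = VPI_IMAGE_BUFFER_HOST_PITCH_LINEAR;
        data.buffer.pitch.format               = VPI_IMAGE_FORMAT_U8;
        data.buffer.pitch.numPlanes            = 1;
        data.buffer.pitch.planes[0].pixelType  = VPI_PIXEL_TYPE_U8;
        data.buffer.pitch.planes[0].width      = width;
        data.buffer.pitch.planes[0].height     = height;
        data.buffer.pitch.planes[0].pitchBytes = pitchBytes;
        data.buffer.pitch.planes[0].data       = pixels;

        /* VPI maps the buffer directly (zero-copy) whenever possible. */
        VPIImage image = NULL;
        vpiImageCreateWrapper(&data, NULL, 0, &image);
        return image;
    }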

User-allocated wrapped memory should be avoided when possible, especially for temporary buffers used in a sequence of algorithm invocations. Instead, it is recommended to use VPI-allocated buffers. They are allocated in a way that makes zero-copy mapping more likely to happen, increasing pipeline performance.

Synchronization Primitives

VPI offers several ways to coordinate work among different streams and to ensure that tasks are executed in the proper order (see the event sketch after this list):

  • You can synchronize a given stream to the calling thread, making the calling thread wait until all work submitted to the stream so far is finished. The application can inspect and/or forward the final results to another stage, such as visualization.
  • For more fine-grained coordination between streams, you can use VPIEvent to make one stream, or the calling thread, wait for a particular task to finish on one or more streams, effectively implementing a barrier synchronization mechanism.
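
The sketch below shows the event-based barrier between two streams: an event records a point in the producer stream, and the consumer stream (or the calling thread) waits on it. Error checking is omitted.

    #include <vpi/Stream.h>
    #include <vpi/Event.h>

    void chain_streams(VPIStream producer, VPIStream consumer)
    {
        VPIEvent event = NULL;
        vpiEventCreate(0, &event);

        /* ... submit producer algorithms to 'producer' here ... */

        /* The event captures the point reached so far in 'producer'. */
        vpiEventRecord(event, producer);

        /* Tasks submitted to 'consumer' after this call only start once the
           recorded point in 'producer' completes. The CPU does not block. */
        vpiStreamWaitEvent(consumer, event);

        /* ... submit consumer algorithms to 'consumer' here ... */

        /* Alternatively, block the calling thread on the event itself. */
        vpiEventSync(event);

        vpiEventDestroy(event);
    }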

VPI Applications

VPI applications consist of three major stages (a skeleton is sketched after the list below):

  1. Initialization, in which memory is allocated, VPI objects such as streams, images, arrays, and contexts are created, and other one-time setup tasks take place.
  2. Processing loop, in which external data is wrapped for use by VPI. The application spends most of its time in this stage. The processing loop submits the payloads created during initialization to streams, reads back the results, and passes them to other stages for further processing or visualization.
  3. Cleanup, in which all objects allocated during initialization are destroyed.
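
A skeleton of these three stages is sketched below (the Gaussian filter, image sizes, and the frame-capture step are placeholders; error checking omitted):

    #include <vpi/Stream.h>
    #include <vpi/Image.h>
    #include <vpi/algo/GaussianFilter.h>

    int main(void)
    {
        /* 1. Initialization: create streams, buffers, and payloads once. */
        VPIStream stream = NULL;
        VPIImage input = NULL, output = NULL;
        vpiStreamCreate(0, &stream);
        vpiImageCreate(640, 480, VPI_IMAGE_FORMAT_U8, 0, &input);
        vpiImageCreate(640, 480, VPI_IMAGE_FORMAT_U8, 0, &output);

        /* 2. Processing loop: submit work, wait, consume the results. */
        for (int frame = 0; frame < 100; ++frame)
        {
            /* ... fill 'input' with the current frame (capture step omitted) ... */

            vpiSubmitGaussianFilter(stream, VPI_BACKEND_CUDA, input, output,
                                    5, 5, 1.0f, 1.0f, VPI_BORDER_ZERO);
            vpiStreamSync(stream);

            /* ... read 'output' and hand it to the next stage or visualization ... */
        }

        /* 3. Cleanup: destroy everything created during initialization. */
        vpiImageDestroy(output);
        vpiImageDestroy(input);
        vpiStreamDestroy(stream);
        return 0;
    }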

Go to the tutorial to learn how to build your first VPI application.