VPI is used to implement asynchronous computing pipelines suited for real-time image processing applications. Pipelines are composed of one or more asynchronous compute streams that run algorithms on buffers in the available compute backends. Synchronization between streams is done using events.

All of these elements are described below.

Streams

A VPIStream is an asynchronous queue that executes algorithms in sequence on a given backend device. To achieve a high degree of parallelism across backends, a processing pipeline can be configured with several processing stages running concurrently, each in its own VPI stream. VPI streams can collaborate by exchanging data structures, coordinated through the synchronization primitives that VPI provides.
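The queue behaviour described above can be modelled in a few lines of plain Python. This is only a conceptual sketch, not the VPI API: `ToyStream`, `submit`, `sync`, and `run_two_stage_pipeline` are hypothetical names, and a worker thread stands in for a backend device.

```python
import queue
import threading


class ToyStream:
    """Conceptual stand-in for a stream: one worker runs tasks in submission order."""

    def __init__(self):
        self.errors = []
        self._tasks = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def _run(self):
        while True:
            task = self._tasks.get()
            try:
                task()
            except Exception as exc:  # keep the worker alive and remember the failure
                self.errors.append(exc)
            finally:
                self._tasks.task_done()

    def submit(self, task):
        """Enqueue a task and return immediately, without waiting for it to run."""
        self._tasks.put(task)

    def sync(self):
        """Block the calling thread until every task submitted so far has finished."""
        self._tasks.join()


def run_two_stage_pipeline():
    """Run two independent stages concurrently, each in its own toy stream."""
    results = {"left": [], "right": []}
    left, right = ToyStream(), ToyStream()
    for i in range(3):
        left.submit(lambda i=i: results["left"].append(("blur", i)))
        right.submit(lambda i=i: results["right"].append(("detect", i)))
    left.sync()
    right.sync()
    return results
```

Within each toy stream the tasks complete in submission order, while the two streams make progress independently of each other.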

Backends

A backend comprises the compute hardware that ultimately runs an algorithm. VPI supports the backends CPU, GPU (using CUDA), PVA (Programmable Vision Accelerator), VIC (Video and Image Compositor) and OFA (Optical Flow Accelerator).

See the following table for the backends supported on each device or platform.

Backend	Device/platform
CPU	All devices on x86 (Linux) and Jetson aarch64 platforms
CUDA	All devices on x86 (Linux) with a Maxwell or newer NVIDIA GPU, and Jetson aarch64 platforms
PVA	All Jetson AGX Orin, Jetson AGX Thor, and Jetson Orin NX devices
VIC	All Jetson devices
OFA	All Jetson Orin and Thor devices

Algorithms

VPI supports computer vision algorithms for several purposes, such as calculating disparity between stereo images, Harris keypoint detection, and image blurring. Some algorithms use temporary buffers, called a VPIPayload, to perform the processing. A payload can be created once and then reused each time the algorithm is submitted to a stream. Some payloads are created for a specific input image size; such a payload cannot be reused with input images of a different size.
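The create-once, reuse-many pattern and the size restriction can be illustrated with a small model. It is a sketch under stated assumptions rather than VPI code: `ToyPayload` and `invert` are hypothetical, and the scratch buffer merely stands in for whatever temporary memory a real algorithm needs.

```python
class ToyPayload:
    """Hypothetical payload: scratch memory sized for one input resolution."""

    def __init__(self, width, height):
        self.width = width
        self.height = height
        self.scratch = bytearray(width * height)  # allocated once, at creation

    def check_input(self, width, height):
        """Reject inputs whose size differs from the size the payload was built for."""
        if (width, height) != (self.width, self.height):
            raise ValueError(
                f"payload was created for {self.width}x{self.height}, "
                f"got {width}x{height}"
            )


def invert(payload, pixels, width, height):
    """Toy algorithm: invert 8-bit pixels, reusing the payload's scratch buffer."""
    payload.check_input(width, height)
    if len(pixels) != width * height:
        raise ValueError("pixel count does not match the stated image size")
    for i, value in enumerate(pixels):
        payload.scratch[i] = 255 - value
    return bytes(payload.scratch)
```

A payload built with `ToyPayload(2, 2)` can serve any number of 2x2 inputs, but passing it a 4x4 input raises `ValueError`; in this model a second payload would be needed for the new size.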

Data Buffers

VPI encapsulates data into buffers for each algorithm that it works with. VPI provides abstractions for 2D images, 1D data arrays, and 2D image pyramids. VPI can allocate and manage these abstractions. Additionally, for images and arrays, VPI can wrap externally allocated memory to be used directly by algorithms. In both cases, VPI attempts to achieve high throughput by means of zero-copy (shared) memory mapping to the target backend. If VPI cannot use zero-copy memory mapping, usually due to alignment issues or other memory characteristics, it seamlessly performs deep memory copies as needed.

2D Images

VPI represents a 2D image as one block of memory of a specified width, height, and image format. Once an image's size and format are defined during construction, they cannot be changed.

1D Arrays

VPI 1D arrays are linear blocks of memory of a given type and capacity. Capacity is measured in units determined by the array's type. Unlike an image, an array's size can vary over time, although it must not exceed the array's capacity.
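The distinction between capacity (fixed) and size (variable) can be sketched as follows. `ToyArray` and its methods are hypothetical and only model the rule stated above; they are not the VPIArray interface.

```python
class ToyArray:
    """Hypothetical fixed-capacity array whose size can change after creation."""

    def __init__(self, capacity):
        self.capacity = capacity           # fixed at creation, counted in elements
        self._storage = [None] * capacity  # backing memory, allocated once
        self.size = 0                      # number of currently valid elements

    def set_size(self, new_size):
        """Change the number of valid elements without reallocating."""
        if not 0 <= new_size <= self.capacity:
            raise ValueError(f"size must be between 0 and {self.capacity}")
        self.size = new_size

    def append(self, value):
        """Store one more element, failing if the capacity is exhausted."""
        if self.size == self.capacity:
            raise OverflowError("array is full")
        self._storage[self.size] = value
        self.size += 1

    def valid_elements(self):
        """Return only the elements covered by the current size."""
        return self._storage[:self.size]
```

An array of keypoints is a natural fit for this model: the capacity bounds how many keypoints a detector may report, while the size reflects how many it actually found in a given frame.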

2D Image Pyramids

A pyramid is a collection of 2D images with the same format. Image pyramids are defined by:

The number of levels, from fine to coarse
The width and height of the finest level
The scale from one level to the next
The image format
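Given the four parameters above, the dimensions of every level follow from the finest level and the scale. The sketch below computes them under two assumptions that are not taken from the VPI documentation: the scale is between 0 and 1, and each dimension is rounded half up and clamped to at least one pixel. VPI's exact rounding rule may differ, and `pyramid_level_sizes` is a hypothetical helper.

```python
import math


def pyramid_level_sizes(width, height, num_levels, scale):
    """Return [(w, h), ...] from the finest level (index 0) to the coarsest."""
    if num_levels < 1:
        raise ValueError("a pyramid needs at least one level")
    if not 0.0 < scale < 1.0:
        raise ValueError("this sketch assumes each level shrinks: 0 < scale < 1")
    sizes = [(width, height)]
    for _ in range(num_levels - 1):
        prev_w, prev_h = sizes[-1]
        # Assumed rounding rule: round half up, never below one pixel.
        next_w = max(1, math.floor(prev_w * scale + 0.5))
        next_h = max(1, math.floor(prev_h * scale + 0.5))
        sizes.append((next_w, next_h))
    return sizes
```

For example, `pyramid_level_sizes(640, 480, 4, 0.5)` returns `[(640, 480), (320, 240), (160, 120), (80, 60)]`.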

User-Allocated Memory Wrapping

If your application has existing memory buffers that must serve as input or output buffers to VPI algorithms, you can wrap them into a VPIImage or a VPIArray. This is the case, for example, when the main loop grabs a frame from a camera device and feeds it to a VPI processing pipeline. Depending on the frame characteristics, VPI can use this memory block without making any copies. This is another instance of zero-copy memory mapping.

Wrapping external memory is not a guarantee that every enabled backend will access that original allocation directly. VPI maps each image to the memory representation required by the backend that consumes it, and it falls back to an internal copy when a shared mapping is not available. For example, wrapping a CUDA pitch-linear pointer in a VPIImage is direct for CUDA algorithms, but using that same image from PVA, VIC, OFA, or another Tegra engine can require VPI to stage the data in a backend-compatible allocation. Wrapping an NvBufSurface can likewise require an internal copy when VPI needs an image backed by NvSciBuf for another backend. The copy is handled by VPI, but it adds memory traffic and synchronization to the pipeline.

For mixed VPI/CUDA or VPI/multimedia pipelines, prefer creating the image with vpiImageCreate and passing the VPIBackend flags for each backend that will consume the image. Then export the interop handle needed by the non-VPI code with vpiImageLockData. For CUDA interop, lock the VPI-owned image as VPI_IMAGE_BUFFER_CUDA_PITCH_LINEAR and use the returned VPIImageBufferPitchLinear data while the image is locked. This allocation direction lets VPI choose memory that is compatible with the selected backends, which can allow zero-copy use across those backends when the platform and image format support it. It is more likely to avoid internal staging copies than wrapping a pre-existing CUDA pointer or NvBufSurface and later using it from a different backend. See Lock And Extract Interop Handles for a complete example.

Avoid wrapping user-allocated memory when possible, especially for temporary buffers used in a sequence of algorithm invocations. Instead, use VPI-allocated buffers so VPI can choose memory that is compatible with the intended backend set and with the interop handles that the application will lock later.
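The practical difference between a zero-copy mapping and a staged copy can be shown with ordinary Python buffers. This is an analogy only, with no VPI calls: `memoryview` plays the role of a wrapper that shares the application's memory, and `bytes(...)` plays the role of an internal deep copy. The function name `wrap_versus_copy` is hypothetical.

```python
def wrap_versus_copy():
    """Show that a wrapper tracks the original memory while a copy does not."""
    camera_frame = bytearray([10, 20, 30, 40])  # memory owned by the application

    wrapped = memoryview(camera_frame)  # zero-copy: a view onto the same bytes
    staged = bytes(camera_frame)        # deep copy: an independent snapshot

    camera_frame[0] = 99                # the application writes a new pixel value

    # The wrapper observes the write; the staged copy still holds the old value.
    return wrapped[0], staged[0]
```

Calling `wrap_versus_copy()` returns `(99, 10)`. The staged copy costs extra memory traffic and must be refreshed whenever the source changes, which is the overhead the section above recommends avoiding.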

Synchronization Primitives

VPI offers several ways to coordinate work among different streams and to ensure that tasks are executed in the proper order:

You can synchronize a given stream to the calling thread, making the calling thread wait until all work submitted to the stream so far is finished. The application can then inspect the final results or forward them to another stage, such as visualization.
For more fine-grained coordination between streams, you can use VPIEvent to make one stream, or the calling thread, wait for a particular task to finish on one or more streams, effectively implementing a barrier synchronization mechanism.
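The event-based coordination described above can be modelled with the Python standard library. In this sketch, which does not use the VPI API, two threads stand in for two streams and a `threading.Event` stands in for a VPIEvent; `run_producer_consumer` and the log strings are hypothetical.

```python
import threading


def run_producer_consumer():
    """Make one toy stream wait for a specific task on another toy stream."""
    log = []
    frame_filtered = threading.Event()  # stands in for an event shared by two streams

    def producer():
        log.append("producer: filter frame")
        frame_filtered.set()  # signal that the work queued before this point is done

    def consumer():
        frame_filtered.wait()  # do not start until the producer has signalled
        log.append("consumer: detect keypoints on filtered frame")

    consumer_thread = threading.Thread(target=consumer)
    producer_thread = threading.Thread(target=producer)
    consumer_thread.start()  # started first on purpose; it still has to wait
    producer_thread.start()
    producer_thread.join()   # the calling thread waits for both toy streams
    consumer_thread.join()
    return log
```

Even though the consumer is started first, its log entry always comes second, because it blocks on the event until the producer signals. The final `join` calls correspond to the coarser option of synchronizing a whole stream with the calling thread.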

VPI Applications

VPI applications consist of three major stages:

Initialization, in which memory is allocated, VPI objects such as streams, images, arrays, and contexts are created, and other one-time setup tasks take place.
Processing loop, in which external data is wrapped for use by VPI. The application spends most of its time in this stage. The processing loop submits algorithms to streams, reusing the payloads created during initialization. It then reads the results and passes them to other stages for further processing or visualization.
Cleanup, in which all objects allocated during initialization are destroyed.

Go to the tutorial to learn how to build your first VPI application.

VPI - Vision Programming Interface

4.1 Release