VPI - Vision Programming Interface

0.2.0 Release

Basic concepts

VPI consists of asynchronous compute streams that run algorithms on buffers in the available compute backends. Streams are synchronized using events. All these objects are owned by a particular context.

The components are described in the following sections.

Backends

A backend represents the compute hardware that ultimately runs an algorithm. VPI supports the CPU, the GPU (using CUDA), and the PVA (Programmable Vision Accelerator) on devices that provide it, such as Jetson AGX Xavier and Jetson Xavier NX.

Streams

A VPI stream is an asynchronous queue that executes algorithms in sequence on a given backend. To achieve a high degree of parallelism across backends, a processing pipeline can be configured with stages running concurrently, each one in its own VPI stream. Streams can collaborate with each other by exchanging data structures, with the help of the synchronization primitives provided by VPI.
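As a hedged pseudocode sketch (the function signatures and stage names are illustrative, not necessarily the exact 0.2.0 API), a pipeline can dedicate one stream per backend so that the stages run concurrently:

```
/* Illustrative pseudocode, not the exact VPI 0.2.0 API. */
VPIStream streamPVA, streamCUDA;
vpiStreamCreate(/* PVA backend */ ..., &streamPVA);
vpiStreamCreate(/* CUDA backend */ ..., &streamCUDA);

/* Each submission is asynchronous: it is queued on the stream and the
   call returns immediately, so both stages can execute in parallel. */
submitStageA(streamPVA, inputA, outputA);   /* hypothetical stage */
submitStageB(streamCUDA, inputB, outputB);  /* hypothetical stage */

/* Block the calling thread until both streams drain. */
vpiStreamSync(streamPVA);
vpiStreamSync(streamCUDA);
```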

Algorithms

VPI provides several computer vision algorithms, such as stereo disparity estimation, Harris keypoint detection, and image blurring. Some algorithms need temporary buffers, called a VPI payload, to perform their processing. A payload can be created once and then reused each time the algorithm is submitted to the same stream. Some payloads are created for a given input image size; such a payload cannot be reused for inputs of a different size.
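The create-once, submit-many pattern can be sketched as pseudocode (the Harris detector function names are illustrative, not necessarily the exact 0.2.0 API):

```
/* Illustrative pseudocode: the payload is created once for a fixed
   input size, then reused on every submission to the same stream. */
VPIPayload harris;
vpiCreateHarrisCornerDetector(..., width, height, &harris);

while (haveFrames()) {
    /* Reusing the payload avoids reallocating temporary buffers. */
    vpiSubmitHarrisCornerDetector(stream, harris, frame, keypoints, scores, ...);
    vpiStreamSync(stream);
}

vpiPayloadDestroy(harris);  /* inputs of a different size need a new payload */
```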

Buffers

VPI abstracts the data that algorithms work on into buffers. It provides abstractions for 2D images, 1D data arrays, and 2D image pyramids, all of which can be allocated and managed by VPI. Additionally, for images and arrays, VPI can wrap externally allocated memory to be used directly by algorithms. In both cases, to achieve high throughput, VPI attempts a shared, or zero-copy, mapping of the memory to the target backend. If that fails, usually due to alignment issues or other memory characteristics, VPI seamlessly performs deep memory copies as needed.

2D Images

A 2D image is represented by a block of memory of a given width, height, and image type. Once the image size and type are defined during construction, they cannot be changed. The image type is a value of the VPIImageType enum, and can be:

  • Simple unsigned 1-byte: each pixel value ranges from 0 to 255
  • Signed: ranging from -128 to 127
  • Complex: 2- or 4-byte pixels
  • Multi-planar types such as NV12

Not all algorithms support every image type, but most support several.

The 2D image is laid out in memory row by row, one after the other. Each row can be larger than necessary, with some padding added to the end to have properly aligned row start addresses.

1D Arrays

In VPI, a 1D array is a linear block of memory with a given type and capacity, measured in elements. Unlike images, an array's size can vary over time, as long as it does not exceed the allocated capacity.

Array types are drawn from the VPIArrayType enum. Algorithms that take array inputs or outputs, such as the KLT template tracker, usually accept one specific array type.

2D Image Pyramids

Conceptually, a pyramid is a collection of 2D images of the same type at successively coarser resolutions. An image pyramid is defined by:

  • The number of levels, from fine to coarse
  • The width and height of the finest level
  • The scale from one level to the next
  • The image type

Wrapping User-allocated Memory vs VPI-allocated Memory

If the user application has existing memory buffers that must serve as inputs and/or outputs to VPI algorithms, it can wrap them into VPI images. This is the case, for example, when the main loop grabs a frame from a camera device and feeds it into a VPI processing pipeline. Depending on the frame characteristics, the memory block is used by VPI without any copies being made: the so-called zero-copy shared memory mapping.

In other situations, such as for temporary buffers used in a sequence of algorithm invocations, the user can allocate the buffers using VPI. Buffers created this way are laid out in such a way that zero-copy shared mapping is more likely to succeed, thereby increasing pipeline performance.
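The two allocation paths can be sketched as pseudocode (the wrapper and creation function names are illustrative, not necessarily the exact 0.2.0 API):

```
/* Illustrative pseudocode, not the exact VPI 0.2.0 API. */
VPIImage wrapped, owned;

/* Wrap an existing camera buffer: no copy is made if its layout
   allows zero-copy mapping to the target backend. */
vpiImageWrapHostMem(width, height, type, cameraFrameData, &wrapped);

/* Let VPI allocate: the layout is chosen so that zero-copy shared
   mapping is likely on the backends in use. */
vpiImageCreate(width, height, type, ..., &owned);
```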

Synchronization Primitives

There are several facilities to coordinate work among different streams and guarantee proper task execution ordering:

  • Synchronize a given stream to the calling thread, making the latter wait until all work submitted to the stream so far has finished. The application can then inspect the results and/or forward them to another stage, such as visualization.
  • For more fine-grained coordination between streams, use VPI Events to have one stream, or the calling thread, wait for a particular task to finish on one or more streams, effectively implementing a barrier synchronization mechanism.
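A barrier between two streams can be sketched as pseudocode (the event function names are illustrative, not necessarily the exact 0.2.0 API):

```
/* Illustrative pseudocode: stream B waits for work recorded on stream A. */
VPIEvent ev;
vpiEventCreate(..., &ev);

submitProducer(streamA, ...);   /* hypothetical producer stage */
vpiEventRecord(ev, streamA);    /* ev fires once the producer finishes */

vpiStreamWaitFor(streamB, ev);  /* queue a wait; stream B stalls until ev */
submitConsumer(streamB, ...);   /* hypothetical consumer, safely ordered */
```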

VPI Applications

A VPI application typically has three stages:

  1. Initialization: memory is allocated; VPI objects such as streams, images, arrays, and contexts are created and set up; and any other one-time initialization tasks take place.
  2. Processing loop: where the application spends most of its time. External data is wrapped for use by VPI, algorithms (with the payloads created during initialization) are submitted to streams, and results are read back and passed to other stages for further processing or visualization.
  3. Clean up: all objects allocated during initialization are destroyed.

Consult the tutorial to build your first VPI application.