VPI is used to implement asynchronous computing pipelines suited for real-time image processing applications. Pipelines are composed of one or more asynchronous compute streams that run algorithms on buffers in the available compute backends. Synchronization between streams is done using events.
All of these elements are described below.
A VPIStream is an asynchronous queue that executes algorithms in sequence on a given backend device. To achieve a high degree of parallelism across backends, a processing pipeline can be configured with several processing stages running concurrently, each in its own VPI stream. VPI streams can collaborate by exchanging data structures, using the synchronization primitives that VPI provides.
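The queueing behavior described above can be illustrated with a small model in plain Python. This is a conceptual sketch only, not the VPI API: `ToyStream`, `submit`, `sync`, and `stage` are hypothetical names, and a worker thread stands in for a backend device.

```python
import queue
import threading


class ToyStream:
    """Hypothetical stand-in for a stream: a FIFO queue drained by one worker thread."""

    def __init__(self, name):
        self.name = name
        self._tasks = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def _run(self):
        # Tasks run strictly in submission order, one at a time.
        while True:
            task = self._tasks.get()
            try:
                task()
            finally:
                self._tasks.task_done()

    def submit(self, task):
        # Returns immediately; the caller does not wait for the task to run.
        self._tasks.put(task)

    def sync(self):
        # Blocks until every task submitted so far has finished.
        self._tasks.join()


log = []
log_lock = threading.Lock()


def stage(label):
    # Builds a task that records its label when it runs.
    def task():
        with log_lock:
            log.append(label)
    return task


# Two pipeline stages, each in its own stream, run concurrently.
stream_a = ToyStream("a")
stream_b = ToyStream("b")
for i in range(3):
    stream_a.submit(stage(("a", i)))
    stream_b.submit(stage(("b", i)))

stream_a.sync()
stream_b.sync()

# Order is preserved within each stream; interleaving across streams is not defined.
order_a = [i for (s, i) in log if s == "a"]
order_b = [i for (s, i) in log if s == "b"]
```

In this model, as in the description above, submission never blocks the caller, and only an explicit synchronization call waits for the queued work to finish.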
A backend comprises the compute hardware that ultimately runs an algorithm. VPI supports the following backends: CPU, GPU (using CUDA), PVA (Programmable Vision Accelerator), VIC (Video and Image Compositor), and OFA (Optical Flow Accelerator).
See the following sections for information about which backends are supported by which devices.
| Backend | Device/platform |
|---|---|
| CPU | All devices on x86 (Linux) and Jetson aarch64 platforms |
| CUDA | All devices on x86 (Linux) with a Maxwell or superior NVIDIA GPU, and Jetson aarch64 platforms |
| PVA | All Jetson AGX Orin, Jetson AGX Thor and Jetson Orin NX devices |
| VIC | All Jetson devices |
| OFA | All Jetson Orin and Thor devices |
VPI supports computer vision algorithms for several purposes, such as calculating disparity between stereo images, Harris keypoint detection, and image blurring. Some algorithms use temporary buffers, called a VPIPayload, to perform the processing. Payloads can be created once, then reused each time the algorithm is submitted to a stream. Sometimes a payload is created for a given input image size. In this case, the payload cannot be reused for different input image sizes.
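The create-once, reuse-many pattern for payloads can be modeled in a few lines of plain Python. This is a conceptual sketch under stated assumptions, not the VPI API: `ToyBlurPayload` and `submit_toy_blur` are hypothetical, and a 3-tap horizontal box blur stands in for a real algorithm.

```python
class ToyBlurPayload:
    """Hypothetical payload: scratch memory sized for one input resolution."""

    def __init__(self, width, height):
        self.width = width
        self.height = height
        # Allocated once, then reused on every submission.
        self.scratch = bytearray(width * height)


def submit_toy_blur(payload, image, width, height):
    # A payload built for one image size cannot serve another.
    if (width, height) != (payload.width, payload.height):
        raise ValueError("payload was created for a different image size")
    # 3-tap horizontal box blur with clamped borders, written into the scratch buffer.
    for y in range(height):
        row = y * width
        for x in range(width):
            left = image[row + max(x - 1, 0)]
            right = image[row + min(x + 1, width - 1)]
            payload.scratch[row + x] = (left + image[row + x] + right) // 3
    return bytes(payload.scratch)


payload = ToyBlurPayload(4, 2)                    # create once
frame1 = bytes([0, 30, 60, 90, 10, 10, 10, 10])
frame2 = bytes([9] * 8)
out1 = submit_toy_blur(payload, frame1, 4, 2)     # reuse for every frame
out2 = submit_toy_blur(payload, frame2, 4, 2)     # of the same size

try:
    submit_toy_blur(payload, bytes(9), 3, 3)      # different size: rejected
    size_mismatch_rejected = False
except ValueError:
    size_mismatch_rejected = True
```

The point of the model is the allocation lifetime: the scratch buffer is paid for once, outside the per-frame loop, and the size check mirrors the restriction on size-specific payloads.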
VPI encapsulates data into buffers for each algorithm that it works with. VPI provides abstractions for 2D images, 1D data arrays, and 2D image pyramids. VPI can allocate and manage these abstractions. Additionally, for images and arrays, VPI can wrap externally allocated memory to be used directly by algorithms. In both cases, VPI attempts to achieve high throughput by means of zero-copy (shared) memory mapping to the target backend. If VPI cannot use zero-copy memory mapping, usually due to alignment issues or other memory characteristics, it seamlessly performs deep memory copies as needed.
VPI represents a 2D image as one block of memory with a specified width, height, and image format. Once an image's size and format are defined during construction, they cannot be changed.
VPI 1D arrays are linear blocks of memory of a given type and capacity. Capacity is measured in units determined by the array's type. Unlike an image, an array's size can vary over time, although it must not exceed the array's capacity.
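The distinction between an array's capacity and its size can be shown with a small model in plain Python. This is a conceptual sketch, not the VPI API: `ToyArray` and its methods are hypothetical, and the keypoint tuples are made-up sample data.

```python
class ToyArray:
    """Hypothetical array: capacity fixed at creation, size variable afterwards."""

    def __init__(self, capacity):
        self.capacity = capacity            # measured in elements, not bytes
        self._storage = [None] * capacity   # allocated once
        self.size = 0                       # number of valid elements

    def set_size(self, new_size):
        if not 0 <= new_size <= self.capacity:
            raise ValueError("size must stay within the array's capacity")
        self.size = new_size

    def write(self, values):
        # Overwrites the valid region; the underlying storage is never reallocated.
        values = list(values)
        self.set_size(len(values))
        self._storage[:self.size] = values

    def valid(self):
        # Only the first `size` elements carry meaningful data.
        return self._storage[:self.size]


keypoints = ToyArray(capacity=4)
keypoints.write([(1.0, 2.0), (3.5, 0.5)])   # frame 1: two keypoints
first = keypoints.valid()
keypoints.write([(0.0, 0.0)])               # frame 2: one keypoint, same storage
second = keypoints.valid()

try:
    keypoints.write([(0.0, 0.0)] * 5)       # more elements than the capacity
    overflow_rejected = False
except ValueError:
    overflow_rejected = True
```

A design like this suits outputs whose element count changes per frame, such as detected keypoints: the storage is sized for the worst case, and only the size field changes between frames.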
A pyramid is a collection of 2D images with the same format. Image pyramids are defined by:
If your application has existing memory buffers that must serve as input or output buffers for VPI algorithms, you can wrap them in a VPIImage or a VPIArray. This is the case, for example, when the main loop grabs a frame from a camera device and feeds it to a VPI processing pipeline. Depending on the frame characteristics, VPI can use this memory block without making any memory copies. This is another instance of zero-copy memory mapping.
Wrapping external memory is not a guarantee that every enabled backend will access that original allocation directly. VPI maps each image to the memory representation required by the backend that consumes it, and it falls back to an internal copy when a shared mapping is not available. For example, wrapping a CUDA pitch-linear pointer in a VPIImage is direct for CUDA algorithms, but using that same image from PVA, VIC, OFA, or another Tegra engine can require VPI to stage the data in a backend-compatible allocation. Wrapping an NvBufSurface can likewise require an internal copy when VPI needs an image backed by NvSciBuf for another backend. The copy is handled by VPI, but it adds memory traffic and synchronization to the pipeline.
For mixed VPI/CUDA or VPI/multimedia pipelines, prefer creating the image with vpiImageCreate and passing the VPIBackend flags for each backend that will consume the image. Then export the interop handle needed by the non-VPI code with vpiImageLockData. For CUDA interop, lock the VPI-owned image as VPI_IMAGE_BUFFER_CUDA_PITCH_LINEAR and use the returned VPIImageBufferPitchLinear data while the image is locked. This allocation direction lets VPI choose memory that is compatible with the selected backends, which can allow zero-copy use across those backends when the platform and image format support it. It is more likely to avoid internal staging copies than wrapping a pre-existing CUDA pointer or NvBufSurface and later using it from a different backend. See Lock And Extract Interop Handles for a complete example.
Avoid wrapping user-allocated memory when possible, especially for temporary buffers used in a sequence of algorithm invocations. Instead, use VPI-allocated buffers so VPI can choose memory that is compatible with the intended backend set and with the interop handles that the application will lock later.
VPI offers several ways to coordinate work among different streams and to ensure that tasks are executed in the proper order.
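Event-based ordering between two streams, the mechanism named in the introduction, can be modeled in plain Python. This is a conceptual sketch, not the VPI API: `ToyStream`, `record`, and `wait` are hypothetical names, a stream is modeled as a queue drained by one worker thread, and `threading.Event` stands in for a VPI event.

```python
import queue
import threading
import time


class ToyStream:
    """Hypothetical stream: a FIFO queue drained by one worker thread."""

    def __init__(self):
        self._tasks = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            task = self._tasks.get()
            try:
                task()
            finally:
                self._tasks.task_done()

    def submit(self, task):
        self._tasks.put(task)

    def record(self, event):
        # The event is signaled once all previously submitted tasks have run.
        self._tasks.put(event.set)

    def wait(self, event):
        # Tasks submitted after this call run only after the event is signaled.
        self._tasks.put(event.wait)

    def sync(self):
        self._tasks.join()


shared = {}
trace = []


def produce():
    time.sleep(0.05)                 # simulate a slow upstream stage
    shared["frame"] = [1, 2, 3]
    trace.append("produced")


def consume():
    trace.append("consumed")
    shared["total"] = sum(shared["frame"])


producer, consumer = ToyStream(), ToyStream()
frame_ready = threading.Event()

producer.submit(produce)
producer.record(frame_ready)         # signal after produce() completes
consumer.wait(frame_ready)           # hold the consumer stream until then
consumer.submit(consume)

consumer.sync()
```

Neither `record` nor `wait` blocks the calling thread in this model. The wait is queued in the consumer stream, so only that stream stalls while the producer finishes.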
VPI applications consist of three major stages:
Go to the tutorial to learn how to build your first VPI application.