VPI is a software library that provides a collection of computer vision and image processing algorithms that can be seamlessly executed on a variety of hardware accelerators. These accelerators are called backends.
The goal of VPI is to provide a uniform interface to the computing backends while maintaining high performance. It achieves this by exposing a thin, but effective, software abstraction of the underlying hardware and the data it manipulates.
This diagram illustrates the architecture of VPI:
The API follows a paradigm in which object allocation and setup take place in an initialization phase. An application loop follows, where the main processing occurs using the objects created during initialization. When main processing is complete, the created objects are destroyed and the environment is cleaned up. The control over memory allocation and lifetime that VPI provides is especially beneficial in resource-constrained embedded environments, where memory allocations are limited in both time and space.
The core components of VPI include:
VPI can be used on the following platforms/devices:
Algorithms represent actual computing operations. They act on one or more input buffers and write their results to output buffers provided by the user. They run asynchronously with respect to the application thread. For a list of supported algorithms, see the Algorithms section.
There are two classes of algorithms:
Some algorithm implementations, such as FFT or KLT Feature Tracker, require temporary resources to function properly. These resources are encapsulated by a VPIPayload object associated with the algorithm.
Before you can execute an algorithm you must create the corresponding payload at initialization time, passing parameters that are used to allocate temporary resources and to specify the backends which may execute the algorithm. In the main loop, where computation is performed, you submit an algorithm instance to a stream for execution, providing the corresponding payload along with input and output parameters. You can reuse the payload in multiple algorithm instances, but you must ensure that the payload is used by only one instance at a time.
When a payload is no longer needed, you must destroy it by calling vpiPayloadDestroy. This function deallocates any resources encapsulated in the payload.
Examples:
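A minimal sketch of this workflow, using FFT on the CUDA backend (stream, input, and spectrum are assumed to be pre-created; the exact creation parameters are described in the FFT algorithm documentation):

```c
// Initialization: create the payload, which allocates the temporary
// resources the CUDA FFT implementation needs for a 640x480 F32 input.
VPIPayload fft;
vpiCreateFFT(VPI_BACKEND_CUDA, 640, 480,
             VPI_IMAGE_FORMAT_F32, VPI_IMAGE_FORMAT_2F32, &fft);

// Main loop: submit the algorithm with its payload and I/O buffers.
vpiSubmitFFT(stream, VPI_BACKEND_CUDA, fft, input, spectrum, 0);
vpiStreamSync(stream);

// Cleanup: release the temporary resources held by the payload.
vpiPayloadDestroy(fft);
```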
Some algorithms do not require temporary resources. Such algorithms include Box Filter and Rescale, among others. For these algorithms no payload handling is necessary, and the sequence of operations is simplified. All the required data is sent during algorithm submission.
Example:
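For instance, a sketch of submitting Rescale directly, with no payload involved (stream, input, and output are assumed to exist; the interpolation and border values are illustrative):

```c
// Rescale needs no payload: all the data it needs is passed at
// submission time.
vpiSubmitRescale(stream, VPI_BACKEND_VIC, input, output,
                 VPI_INTERP_LINEAR, VPI_BORDER_CLAMP, 0);
```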
Every algorithm supported by VPI is implemented in one or more backends. Different implementations of the same algorithm return similar results when given the same inputs, but small variations between their results may occur. This is mostly due to optimizations tailored to a particular backend, such as use of fixed-point instead of floating-point arithmetic.
This backend represents the device's CPU. It may create a set of background worker threads and data structures supporting efficient parallel execution across multiple cores. These worker threads might be shared among different streams and/or context instances.
VPI provides mechanisms that allow you to define your own CPU task scheduling scheme by calling vpiContextSetParallelFor with a VPIParallelForCallback function that VPI calls whenever CPU tasks need to be executed.
The CUDA backend has an explicit affinity with a particular CUDA-enabled GPU, defined during construction of a stream. This means that algorithms submitted for execution on this stream are handled by this GPU.
The CUDA backend manages a cudaStream_t handle and other CUDA device information that allow it to launch the underlying CUDA kernels.
VPI takes advantage of the asynchronous nature of CUDA kernel launches to optimize their launching. In some situations, especially when no user-defined functions have been submitted to the stream, the backend launches a CUDA task directly from the caller thread, bypassing the worker thread entirely.
When only CUDA algorithms are involved, VPI generally acts as an efficient thin layer on top of the CUDA SDK.
You must set up the CUDA context for the calling thread properly before you construct the API context. The resulting context object uses the corresponding CUDA context for internal kernel calls.
The Programmable Vision Accelerator (PVA) is a processor in NVIDIA® Jetson AGX Orin™ and NVIDIA® Jetson Orin™ NX devices that is specialized for image processing and computer vision algorithms.
Use the PVA backend when you need to leave the GPU free for tasks that only it can perform, such as deep learning inference stages and algorithms implemented only on the CUDA backend.
PVA hardware is much more power-efficient than CPU and CUDA hardware. Therefore, use the PVA backend where possible if power is at a premium.
Each Jetson AGX Orin or Jetson Orin NX device contains one PVA processor, which itself contains two vector processors. The device can therefore execute at most two independent PVA tasks concurrently.
When multiple VPI streams have the PVA backend enabled, they each choose one available PVA vector processor in round-robin succession.
The Video Image Compositor (VIC) is a fixed-functionality processor in Jetson devices that is specialized for low-level image processing tasks, such as rescaling, color space conversion, noise reduction, and compositing.
Like a PVA backend, a VIC backend allows you to offload tasks from the GPU, leaving it free for other processing, if performance is not at a premium.
The NVIDIA Optical Flow Accelerator (OFA) is a specialized processor in Jetson AGX Orin devices for calculating the optical flow between images. It is currently used as a backend for the Stereo Disparity Estimator.
The VPIStream object is the main entry point to the API. It is loosely based on CUDA's cudaStream_t. This object represents a FIFO command queue which stores a list of commands to be executed by some backend. The commands might run a particular computer vision algorithm, perform a host function (using vpiSubmitHostFunction), or signal an event.
At initialization, a stream is configured to use the backends that are to execute the tasks submitted to it. By default, it uses the backends enabled by the current context. When you create a stream you can set flags to further limit the number of available backends and reduce resource usage.
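For example, a stream restricted to the CPU and CUDA backends might be created as follows (a sketch; passing 0 instead enables all backends allowed by the current context):

```c
VPIStream stream;
vpiStreamCreate(VPI_BACKEND_CPU | VPI_BACKEND_CUDA, &stream);
```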
Each stream launches an internal worker thread for dispatching tasks, allowing task execution to be asynchronous with respect to the calling (user) thread. This means that when an algorithm is submitted to a VPI stream for a particular backend, the function pushes a corresponding command to the VPIStream worker thread and immediately returns execution to the calling thread.
Tasks pushed to a worker thread aren't processed immediately. They are initially gathered in a staging queue and are only processed when the stream is flushed. This operation moves all tasks from the staging queue into a processing queue, which eventually submits the tasks to the backends associated with them.
The following events trigger a stream flush:
The staging queue allows for processing pipeline optimization opportunities, such as minimization of memory mapping operations, among others.
For more information, see Stream in the "C API Reference" section of VPI - Vision Programming Interface.
Buffers represent the data that VPI algorithms work with. VPI supports abstractions for three kinds of data:
VPI can allocate all three types of buffers. For images and arrays, it can also wrap existing data stored in pre-allocated memory into a VPI buffer. This is useful when an application requires interoperability with libraries other than VPI, as when it uses an OpenCV cv::Mat buffer as input to a VPI algorithm.
All buffer types share the attributes of size and element type.
VPI images represent any kind of 2D data, such as actual images, vector fields embedded in a 2D space, and 2D heat maps.
VPI images are characterized by their size (width and height) and format.
When an application creates a VPIImage object, it passes flags that specify which backend the image can work with. You can set the flags with one of the VPIBackend enums, or with two or more enums OR'ed together. When no backend flags are passed, VPI enables all of the backends allowed by the current context, which by default are all available backends.
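For example (a sketch; the size and format are arbitrary):

```c
// 640x480 grayscale image usable only by the CPU and CUDA backends.
VPIImage image;
vpiImageCreate(640, 480, VPI_IMAGE_FORMAT_U8,
               VPI_BACKEND_CPU | VPI_BACKEND_CUDA, &image);
```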
For more information, see Image in the "C API Reference" section of VPI - Vision Programming Interface.
VPI image views represent a rectangular region within the 2D data of an existing VPI image (see the description of images above).
VPI image views are created from an existing VPI image and are characterized by their clip region, i.e., a rectangle defined by a start position (x, y) and a size (width, height). They share the same context and format as the original source image, but their size (width and height) is that of the rectangular region.
When an application creates a VPIImage object to be an image view, it passes flags that specify the backend the same way it does with regular images.
For more information, refer to the reference documentation of vpiImageCreateView and vpiImageSetView functions.
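A minimal sketch, assuming image is an existing VPIImage and that zero flags make the view inherit the source image's backends:

```c
// View of the 100x100 region whose top-left corner is at (16, 32).
VPIRectangleI clip = { 16, 32, 100, 100 };
VPIImage view;
vpiImageCreateView(image, &clip, 0, &view);
```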
To make image contents available for access outside VPI, you must lock the image buffer by calling the vpiImageLockData function. This ensures that all changes made to the memory are committed and made available for access outside VPI.
Depending on the buffer type that is being made available, the image must have certain backends enabled; see the vpiImageLockData documentation for details. vpiImageLockData fills the VPIImageData object with image information that allows you to address and interpret all image pixels properly. When you are done working on the image data on the host, call vpiImageUnlock.
Images must also be locked when they wrap buffers allocated outside VPI and those buffers are being accessed externally to VPI. In this case, you aren't interested in retrieving the image contents via VPI calls; instead, call vpiImageLock to lock the image contents. Only then can they be accessed directly via the wrapped buffer. Call vpiImageUnlock when done, to let VPI access the buffers again. Again, while the buffer is locked, streams trying to access its contents will fail with VPI_ERROR_BUFFER_LOCKED.
When an image is locked, it cannot be accessed by an algorithm running asynchronously. It can, however, be locked recursively by the same thread that locked it initially. Remember to pair each vpiImageLockData call with a corresponding vpiImageUnlock.
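A sketch of the lock/unlock pattern for host access to a pitch-linear image:

```c
// Lock the image for read-only host access as pitch-linear data.
VPIImageData data;
vpiImageLockData(image, VPI_LOCK_READ,
                 VPI_IMAGE_BUFFER_HOST_PITCH_LINEAR, &data);

// data.buffer.pitch describes each plane's pointer, size, and pitch.
uint8_t *pixels = (uint8_t *)data.buffer.pitch.planes[0].data;

// Every vpiImageLockData call must be paired with vpiImageUnlock.
vpiImageUnlock(image);
```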
VPI supports a variety of image formats representing different pixel types, such as single-channel 8, 16, or 32-bit, unsigned and signed, multi-channel RGB and RGBA, semi-planar NV12, etc.
The image format is represented by the VPIImageFormat enum. Each format is defined by several attributes, such as color space, number of planes, and data layout. There are functions to extract each component from the image format, as well as to modify an existing one.
Not all algorithms support all recognized image formats. Most offer a choice of several formats, though. The supported formats are listed in each algorithm's API reference documentation.
2D images are most commonly laid out in memory in pitch-linear format, i.e. row by row, one after another. Each row can be larger than necessary to hold the image's data to conform with row address alignment restrictions.
You can also create or wrap memory using a proprietary block-linear layout. For some algorithms and backends it can be more efficient to create 2D memories using this format.
For more information, see Image Formats in the "C API Reference" section of VPI - Vision Programming Interface.
You can create images that wrap externally allocated memory using the function vpiImageCreateWrapper. In each case, you must fill a VPIImageData structure with the required information and pass it to the function. Please see its API reference documentation for information on the memory types that can be wrapped.
In all of these cases, the VPIImage object does not own the memory buffer. When the VPIImage is destroyed, the buffer is not deallocated.
Like the function for creating image buffers managed by VPI, these wrapping functions accept flags that specify which backends they can be used with.
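A sketch of wrapping a pitch-linear host buffer, assuming the VPI 2.x VPIImageData layout (ptr and pitch come from the application; additional fields such as the plane pixel type may also need to be set):

```c
VPIImageData data;
memset(&data, 0, sizeof(data));
data.bufferType = VPI_IMAGE_BUFFER_HOST_PITCH_LINEAR;
data.buffer.pitch.format    = VPI_IMAGE_FORMAT_U8;
data.buffer.pitch.numPlanes = 1;
data.buffer.pitch.planes[0].width      = 640;
data.buffer.pitch.planes[0].height     = 480;
data.buffer.pitch.planes[0].pitchBytes = pitch;
data.buffer.pitch.planes[0].data       = ptr;

// The VPIImage does not own ptr; destroying it won't deallocate it.
VPIImage wrapper;
vpiImageCreateWrapper(&data, NULL, 0, &wrapper);
```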
VPI arrays represent 1D data, such as keypoint lists, bounding boxes, and transforms.
Arrays are characterized by their capacity, size, and element type. As with images, the flags are used to specify which backends they can work with.
Array types are drawn from enum VPIArrayType. Algorithms that require arrays for input or output, such as the KLT template tracker, usually accept one specific array type.
VPIArray's behavior is slightly different from other memory buffers: while the capacity of an array is fixed for the lifetime of the object, its size can change. Any API that writes to an array must set the size parameter to the number of valid elements in the array. You can use vpiArrayGetSize and vpiArraySetSize to query and modify the size of an array.
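For example (a sketch; the array type name follows current VPI releases):

```c
// Room for up to 100 keypoints; the size starts at 0 and is set by
// whichever algorithm writes to the array.
VPIArray corners;
vpiArrayCreate(100, VPI_ARRAY_TYPE_KEYPOINT_F32, 0, &corners);

int32_t size;
vpiArrayGetSize(corners, &size);  // number of valid elements
```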
For more information, see Array in the "C API Reference" section of VPI - Vision Programming Interface.
Array data can be accessed outside VPI using the vpiArrayLockData function. This function works like its image counterpart. It too supports recursive locking by the same thread.
You can also create arrays that wrap externally allocated CUDA and host memory using the function vpiArrayCreateWrapper. In both cases, you must fill a VPIArrayData structure with the required information and pass it to the function.
VPI pyramids represent a collection of VPI images stacked together, all with the same format, but possibly with different dimensions.
A pyramid is characterized by its number of levels, base level dimensions, scale factor, and image format. The scale factor represents the ratio of one level's dimension over the prior level's dimension. For instance, when scale=0.5, the pyramid is dyadic, i.e., each level's dimensions are half those of the level below it.
It is often necessary to process one pyramid level as the input or output of a VPI algorithm. Then you must use vpiImageCreateWrapperPyramidLevel to identify the pyramid and its level to be wrapped. The resulting image inherits the pyramid's enabled backends. You can use the returned VPIImage handle like any other image. When you are done using the image, you must destroy it with vpiImageDestroy.
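A sketch of both operations (the exact parameter list of vpiImageCreateWrapperPyramidLevel, e.g. whether it takes flags, is in its reference documentation):

```c
// 5-level dyadic (scale=0.5) pyramid of 640x480 U8 images.
VPIPyramid pyr;
vpiPyramidCreate(640, 480, VPI_IMAGE_FORMAT_U8, 5, 0.5f, 0, &pyr);

// Wrap the base level so it can be used as a regular image.
VPIImage base;
vpiImageCreateWrapperPyramidLevel(pyr, 0, 0, &base);

// ... use base like any other VPIImage ...
vpiImageDestroy(base);
```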
For more information, see Pyramid in the "C API Reference" section of VPI - Vision Programming Interface.
As with images and arrays, you can access a whole pyramid's contents outside VPI using the function vpiPyramidLockData, provided that the pyramid has enabled the backend corresponding to the returned buffer type. See vpiPyramidLockData for more information. This function fills a VPIPyramidData structure that contains an array of VPIImageData. When you are done using the VPIPyramidData, call vpiPyramidUnlock to unmap the pyramid from the host and free its resources.
Recursive locking works for pyramids just as it does for images and arrays.
Each compute function in the API is executed asynchronously with respect to the calling thread; that is, it returns immediately rather than waiting for the operation to complete. There are two ways to synchronize the operation with the backend.
One method is to wait until all of the commands in the VPIStream queue are finished by calling vpiStreamSync. This method is simple, but it can't provide synchronization that is fine-grained (e.g., "wait until function X is completed") or inter-stream (e.g., "wait until function C in stream D completes before running function A in stream B").
The other method provides more flexible synchronization by using VPIEvent objects. These objects are conceptually like binary semaphores and are designed to closely mimic events in the CUDA API.
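A sketch of the basic event pattern, assuming two existing streams:

```c
VPIEvent ev;
vpiEventCreate(0, &ev);

vpiEventRecord(ev, streamA);      // capture stream A's current state
vpiStreamWaitEvent(streamB, ev);  // stream B waits for that state

vpiEventSync(ev);                 // the host can also block on it
```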
For more information, see Event in the "C API Reference" section of VPI - Vision Programming Interface.
A context encapsulates all resources used by VPI to perform operations. These resources are automatically cleaned up when the context is destroyed.
Every application CPU thread has an active context. Each context owns the VPI objects created while it is active.
By default, all application threads are associated with the same global context, which is created automatically by VPI when the first VPI resource is created. You do not need to perform any explicit context management in this case; everything is handled by VPI under the hood.
When finer control of contexts is needed, user-created contexts are an option. Once created, a context can be pushed to the current application thread's context stack, or can replace the current context. Both actions make the created context active. Refer to Context Stack for more information on how to manipulate contexts.
You can specify several properties associated with a context when you create it, such as which backends are supported by created objects when the context is active. This effectively allows you to mask support for a particular backend. For example, stream creation for a CUDA backend fails if the current context doesn't have the VPI_BACKEND_CUDA flag set. If you don't pass backend flags, the context inspects the running platform and enables the backends associated with all available hardware engines.
Objects (buffers, payloads, events, etc.) cannot be shared among different contexts.
There is no limit to the number of created contexts except available memory.
For more information, see Context in the "C API Reference" section of VPI - Vision Programming Interface.
By default, VPI creates a single global context before it creates any VPI objects. This global context is initially shared among all application threads, and cannot be destroyed by the user.
For most use cases, an application can use the global context. When it requires finer control of how objects are grouped together, or needs a degree of independence between pipelines, you may want to create and manipulate contexts explicitly.
Each application thread has a context stack not shared with other threads.
The top context in the stack is the current context for that thread.
By default, the context stack has one context in it, the global context. Consequently, all new threads have the same global context set as their current context.
Setting a context current in a given stack amounts to replacing the top context, either the global context or the most recently pushed context, with the given context. The replaced context does not belong to the stack anymore.
However, pushing a context onto a stack does not replace anything. The top context is kept in the stack, and the newly pushed context is put at the top, thereby becoming the new current context.
The user can push and pop contexts from the stack at will. This allows for temporarily creating pipelines in a new context without disturbing the existing context.
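A sketch of that pattern:

```c
// Run a pipeline in its own context without disturbing the current one.
VPIContext ctx;
vpiContextCreate(0, &ctx);
vpiContextPush(ctx);  // ctx becomes the current context

// ... create streams/buffers and run the pipeline here ...

VPIContext popped;
vpiContextPop(&popped);  // the previous context is current again
vpiContextDestroy(ctx);  // destroys every object created under ctx
```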
To avoid leakage, it is important to match the number of pushes and pops on a given context stack. Be aware that the context stack can have at most eight contexts in it.
All API functions are thread-safe. Concurrent host access to API objects is serialized and executed in an unspecified order. All API calls use a VPIContext instance that is thread-specific and is stored in Thread Local Storage (TLS). If the context pointer for the current thread is NULL (no context is set), all API calls use the default global context created during library initialization.
API objects have no concept of thread affinity; that is, if several threads use the same context instance, an object created in one thread can safely be destroyed by another thread.
Most of the API functions are non-blocking. The functions that can block when called are vpiStreamSync, vpiStreamDestroy, vpiContextDestroy, vpiEventSync, and the several vpiSubmit* functions (which block when the stream command queue is full). Since implicit synchronization in the API implementation is minimal, you must ensure that the resulting order of dependent function calls is legal.
Pipeline examples, and how to implement them using VPI, are explained in the following sections.
In this example, a pipeline with a simple box filter operation is implemented to process an input image. This is quite similar to the Image Blurring tutorial.
The code for implementing the pipeline is as follows:
Import the vpi module.
Create the input image buffer to be used.
The example creates a 640x480 1-channel (grayscale) input image with unsigned 8-bit pixel elements, which are represented by vpi.Format.U8. VPI initializes images with zeros upon creation.
Within a Python context that defines vpi.Backend.CUDA as the default backend, call the box_filter method on the input image. The 3x3 box filter algorithm will be executed by the CUDA backend on the default stream, and the result will be returned in a new image, output.
Include the necessary headers. This example needs headers for image buffers, a stream, and the Box Filter algorithm.
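For instance (header paths as in current VPI releases):

```c
#include <vpi/Image.h>
#include <vpi/Stream.h>
#include <vpi/algo/BoxFilter.h>
```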
Create the image buffers to be used.
The example creates a 640x480 1-channel (grayscale) input image with unsigned 8-bit pixel elements, represented by the VPI_IMAGE_FORMAT_U8 enum. VPI initializes images with zeros upon creation. Pass all-zero image flags to indicate that this image may be used in all available hardware backends. This makes it easier to submit algorithms to different backends later on, at the cost of using more resources. The output image is created the same way.
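A sketch of this step:

```c
VPIImage input, output;
vpiImageCreate(640, 480, VPI_IMAGE_FORMAT_U8, 0, &input);
vpiImageCreate(640, 480, VPI_IMAGE_FORMAT_U8, 0, &output);
```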
Create a stream to execute the algorithm. Pass all-zero stream flags to indicate that the algorithm may be executed in any available hardware backend, to be specified later.
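In code:

```c
VPIStream stream;
vpiStreamCreate(0, &stream);
```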
Submit the box filter algorithm to the stream, along with the input and output images and other parameters. In this case, the filter algorithm is a 3x3 box filter with clamp boundary condition. It is to be executed by the CUDA backend.
In general, because of the asynchronous nature of streams, the algorithm is enqueued on the stream's work thread, and the function returns immediately. Later it is submitted for execution in the backend. Using a work thread allows the program to continue assembling the processing pipeline, or do some other task, while the algorithm executes in parallel.
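The corresponding submission looks like this (a sketch; the parameter order follows the vpiSubmitBoxFilter reference documentation):

```c
// 3x3 box filter on the CUDA backend, clamp boundary condition.
vpiSubmitBoxFilter(stream, VPI_BACKEND_CUDA, input, output,
                   3, 3, VPI_BORDER_CLAMP);
```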
Wait until the stream finishes processing.
This function blocks until all algorithms submitted to the stream finish executing. The pipeline must do this before it can display the output, or save it to disk, etc.
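In code:

```c
vpiStreamSync(stream);
```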
Destroy created objects.
When the pipeline finishes using the created objects, it destroys them to prevent memory leaks. Destroying a stream forces it to synchronize, but destroying an image that is still being used by an algorithm leads to undefined behavior, most likely resulting in a program crash.
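A sketch of the teardown:

```c
// Destroying the stream first implicitly synchronizes it, so the
// images are no longer in use when they are destroyed.
vpiStreamDestroy(stream);
vpiImageDestroy(input);
vpiImageDestroy(output);
```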
For this example, it is worth examining how the several VPI objects work together and inspecting the ownership relationships between them.
This is a conceptual structure of the provided C/C++ example:
Where:
More complex scenarios may take advantage of different acceleration processors on the device and create a pipeline that best utilizes its full computational power. To do that, the pipeline must have parallelizable stages.
The next example implements a full stereo disparity estimation and Harris corners extraction pipeline, which presents plenty of opportunities for parallelization.
The diagram reveals three opportunities for stage parallelization: the independent left and right image preprocessing stages, and the Harris corners extraction. The pipeline uses a different backend for each processing stage, depending on each backend's processing speed, power requirements, input and output restrictions, and availability. In this example, processing is split among the following backends:
This choice of backends keeps the GPU free for processing other tasks, such as deep learning inference stages. The image format conversion operation is quite fast on CUDA, and does not interfere much. The CPU is kept busy extracting Harris keypoints undisturbed.
The following diagram shows how the algorithms are split into streams and how the streams are synchronized.
Both the left and right streams start preprocessing the stereo pair, while the keypoints stream waits until the right grayscale image is ready. Once it is, Harris corner detection starts while the right stream continues preprocessing. When preprocessing ends on the left stream, that stream waits until the right downscaled image is ready. Finally, stereo disparity estimation starts with its two stereo inputs. At any point, the host thread can issue vpiStreamSync calls on the left and keypoints streams to wait until the disparity and keypoints data are ready for further processing or display.
The following outline explains the code that implements this pipeline:
Create a context and make it active.
Although you can use the default context that is created automatically to manage the VPI state, it may be more convenient to create a context and use it to handle all objects linked to a particular pipeline throughout their lifetimes. In the end, context destruction triggers destruction of the objects created under it. Using a dedicated context also yields better isolation between this pipeline and others that the application might use.
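A sketch of this step:

```c
VPIContext ctx;
vpiContextCreate(0, &ctx);
vpiContextSetCurrent(ctx);  // objects created below belong to ctx
```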
Create the streams.
Create the streams with all zero flags, meaning that they can handle tasks for all backends.
There are two streams to handle stereo pair preprocessing, and a third for Harris corner detection. When preprocessing is finished, stream_left is reused for stereo disparity estimation.
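A sketch:

```c
VPIStream stream_left, stream_right, stream_keypoints;
vpiStreamCreate(0, &stream_left);
vpiStreamCreate(0, &stream_right);
vpiStreamCreate(0, &stream_keypoints);
```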
Create the input image buffer wrappers.
Assuming that the input comes from a capture pipeline as EGLImage, you can wrap the buffers in a VPIImage to be used in a VPI pipeline. All the pipeline requires is one frame (usually the first) from each stereo input.
Create the image buffers to be used.
Like the simple pipeline, this pipeline creates empty input images. These input images must be populated either by wrapping images existing in memory, or from the output of an earlier VPI pipeline.
The input is a 640x480 NV12 (color) stereo pair, typically output by camera capture pipelines. The temporary images are needed for storing intermediate results. The format conversion is necessary because the stereo disparity estimator and Harris corner extractor expect grayscale images. Moreover, stereo disparity expects its input to be exactly 480x270. This is accomplished by the rescale stage in the diagram above.
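A sketch of the temporaries for one side of the stereo pair (names are illustrative, and U16 is assumed as the grayscale format accepted by the PVA stereo implementation):

```c
VPIImage left_rect, left_gray, left_reduced;
vpiImageCreate(640, 480, VPI_IMAGE_FORMAT_NV12_ER, 0, &left_rect);
vpiImageCreate(640, 480, VPI_IMAGE_FORMAT_U16, 0, &left_gray);
vpiImageCreate(480, 270, VPI_IMAGE_FORMAT_U16, 0, &left_reduced);
// ... the right-side temporaries are created the same way ...

// Output disparity map, same size as the stereo inputs.
VPIImage disparity;
vpiImageCreate(480, 270, VPI_IMAGE_FORMAT_U16, 0, &disparity);
```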
Define stereo disparity algorithm parameters and create the payload.
Stereo disparity processing requires some temporary data. VPI calls this data a payload. In this example, vpiCreateStereoDisparityEstimator is called and passed all of the parameters required by the internal allocator to specify the size of the temporary data.
Because the temporary data is allocated on a backend device, the payload is tightly coupled to the backend. If the same algorithm is to be executed in different backends, or concurrently using the same backend in different streams, it requires a payload for each backend or stream. In this example, the payload is created for execution by the PVA backend.
As for algorithm parameters, the VPI stereo disparity estimator is implemented by a semi-global stereo matching algorithm. The estimator requires the census transform window size, specified as 5, and the maximum number of disparity levels, specified as 64. For more information, see Stereo Disparity Estimator.
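A sketch, assuming VPI 2.x-style creation parameters (check the Stereo Disparity Estimator documentation for the exact structure and fields):

```c
VPIStereoDisparityEstimatorCreationParams stereo_params;
vpiInitStereoDisparityEstimatorCreationParams(&stereo_params);
stereo_params.maxDisparity = 64;

// Payload tied to the PVA backend and to the 480x270 input size.
VPIPayload stereo;
vpiCreateStereoDisparityEstimator(VPI_BACKEND_PVA, 480, 270,
                                  VPI_IMAGE_FORMAT_U16,
                                  &stereo_params, &stereo);
```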
Create the image rectification payloads and corresponding parameters. Rectification performs lens distortion correction using the Remap algorithm. Here the stereo lens parameters are specified. Because they are different for the left and right lenses, two remap payloads are created. For more details, see Lens Distortion Correction.
Create output buffers for Harris Corner Detector.
This algorithm receives an image and outputs two arrays, one with the keypoints themselves and another with the score of each keypoint. A maximum of 8192 keypoints are returned, and this must be the array capacity. Keypoints are represented by VPIKeypointF32 structures and scores by 32-bit unsigned values. For more information, see Harris Corner Detector.
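A sketch:

```c
VPIArray keypoints, scores;
vpiArrayCreate(8192, VPI_ARRAY_TYPE_KEYPOINT_F32, 0, &keypoints);
vpiArrayCreate(8192, VPI_ARRAY_TYPE_U32, 0, &scores);
```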
Define Harris detector parameters and create the detector's payload.
Fill the VPIHarrisCornerDetectorParams structure with the required parameters. See the structure documentation for more information about each parameter.
Like stereo disparity, the Harris detector requires a payload. This time, only the input size (640x480) is needed. The resulting payload only accepts inputs of this size.
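A sketch of both steps:

```c
VPIHarrisCornerDetectorParams harris_params;
vpiInitHarrisCornerDetectorParams(&harris_params);
// ... adjust harris_params fields as needed ...

// Payload tied to the 640x480 input size; CPU backend in this pipeline.
VPIPayload harris;
vpiCreateHarrisCornerDetector(VPI_BACKEND_CPU, 640, 480, &harris);
```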
Create events to implement barrier synchronization.
Events are used for inter-stream synchronization. They are implemented with VPIEvent. The pipeline needs two barriers: one to wait for the input to Harris corner extraction to be ready, and the other for the preprocessed right image.
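A sketch:

```c
VPIEvent barrier_right_grayscale, barrier_right_reduced;
vpiEventCreate(0, &barrier_right_grayscale);
vpiEventCreate(0, &barrier_right_reduced);
```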
Submit the left frame processing stages.
Lens distortion correction, image format conversion, and downscaling are submitted to the left stream. Note again that the submit operations are non-blocking and return immediately.
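A sketch of these submissions (buffer and payload names are illustrative; remap_left is the Remap payload created earlier from the left lens parameters, and the backend choices follow the split described above):

```c
// Lens distortion correction on VIC.
vpiSubmitRemap(stream_left, VPI_BACKEND_VIC, remap_left,
               left_in, left_rect,
               VPI_INTERP_CATMULL_ROM, VPI_BORDER_ZERO, 0);

// NV12 -> grayscale conversion on CUDA.
vpiSubmitConvertImageFormat(stream_left, VPI_BACKEND_CUDA,
                            left_rect, left_gray, NULL);

// Downscale to the 480x270 stereo input size on VIC.
vpiSubmitRescale(stream_left, VPI_BACKEND_VIC, left_gray, left_reduced,
                 VPI_INTERP_LINEAR, VPI_BORDER_CLAMP, 0);
```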
Submit the first few stages of the right frame preprocessing.
The lens distortion correction and image format conversion stages result in a grayscale image for input to Harris corner extraction.
Record the right stream state so that the keypoints stream can synchronize to it.
The keypoint stream can only start when its input is ready. First, the barrier_right_grayscale event must record the right stream state by submitting a task to it that will signal the event when format conversion completes.
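In code:

```c
// Signal the event once all work submitted to stream_right so far
// (i.e., up to the format conversion) has completed.
vpiEventRecord(barrier_right_grayscale, stream_right);
```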
Finish the right frame preprocessing with a downscale operation.
Record the right stream state so that the left stream can synchronize to it.
With the whole of right preprocessing submitted, the stream state must be recorded again so that the left stream can wait until the right frame is ready.
Make the left stream wait until the right frame is ready.
Stereo disparity requires the left and right frames to be ready. The pipeline uses vpiStreamWaitEvent to submit a task to the left stream that will wait until the barrier_right_reduced event is signaled on the right stream, meaning that right frame preprocessing is finished.
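In code (barrier_right_reduced was recorded on the right stream after its downscale stage):

```c
vpiStreamWaitEvent(stream_left, barrier_right_reduced);
```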
Submit the stereo disparity algorithm.
The input images are now ready. Call vpiSubmitStereoDisparityEstimator to submit the disparity estimator.
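A sketch, assuming the confidence-map and extra-parameter arguments are optional and may be NULL:

```c
vpiSubmitStereoDisparityEstimator(stream_left, VPI_BACKEND_PVA, stereo,
                                  left_reduced, right_reduced,
                                  disparity, NULL, NULL);
```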
Submit the keypoint detector pipeline.
For keypoint detection, first submit a wait operation on the barrier_right_grayscale event to make the pipeline wait until the input is ready. Then submit the Harris corner detector on it.
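A sketch:

```c
vpiStreamWaitEvent(stream_keypoints, barrier_right_grayscale);
vpiSubmitHarrisCornerDetector(stream_keypoints, VPI_BACKEND_CPU,
                              harris, right_gray,
                              keypoints, scores, &harris_params);
```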
Synchronize the streams to use the disparity map and keypoints detected.
Remember that the functions called so far in the processing phase are all asynchronous; they return immediately once the job is queued on the stream for later execution.
More processing can now be performed on the main thread, such as updating GUI status information or displaying the previous frame. This occurs while VPI is executing the pipeline. Once this additional processing is performed, the streams that process the final result from the current frame must be synchronized using vpiStreamSync. Then the resulting buffers can be accessed.
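In code:

```c
vpiStreamSync(stream_left);       // disparity map is ready
vpiStreamSync(stream_keypoints);  // keypoints and scores are ready
```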
Fetch the next frame and update the input wrappers.
The existing input VPI image wrappers can be redefined to wrap the next two stereo pair frames, provided that their dimensions and format are the same. This operation is quite efficient, as it is done without heap memory allocations.
Destroy the context.
This example has created many objects under the current context. Once all processing is completed and the pipeline is no longer needed, destroy the context. All streams are then synchronized and destroyed, along with all other objects used. No memory leaks are possible.
Destroying the current context reactivates the context that was active before the current one became active.
Important takeaways from these examples: