VPI is a library that provides a collection of computer vision and image processing algorithms that can be seamlessly executed in a variety of hardware accelerators, called backends.
The goal is to provide a uniform interface to these backends, while maintaining high performance. To achieve that, several shared memory mapping mechanisms between backends are used, depending on memory characteristics, coupled with high performance implementations of algorithms and availability of backend-agnostic event synchronization mechanisms.
The VPI architectural overview is as follows:
The API follows the paradigm where object allocation and setup take place in an initialization phase. The application loop, where the main processing occurs, then follows, using the objects created during initialization. Once completed, the created objects are destroyed and the environment is cleaned up. For robotics software applications where memory allocations are limited in both time and space, the amount of memory management control provided by VPI is beneficial.
The core components of VPI include:
VPI contexts serve as a container of other VPI objects along with some configurations that apply to them.
Every host thread has an active context. VPI objects created while a context is active are owned by it.
By default all host threads use the same default context, which is created automatically by VPI. There's no need for explicit context management by the user in this case.
When finer control over contexts is needed, the user can create their own contexts. This lets the user specify, among other things, which backends the context supports at creation time, effectively allowing support for particular hardware to be masked. For example, creating a stream for the CUDA backend fails if the current context doesn't have the VPI_BACKEND_CUDA flag set. When 0 is passed as flags, VPI inspects the running platform and enables the available backends.
Sharing objects (buffers, payloads, events, ...) among different contexts is not permitted.
There is no limit other than available memory for the number of created contexts.
The current context can be manipulated by the user if needed.
Refer to context API reference for more information.
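For illustration, a minimal sketch of creating a CUDA-only context and making it current on the calling thread. Error handling is omitted and header names may vary slightly between VPI releases:

```c
#include <vpi/Context.h>

VPIContext ctx = NULL;

/* Create a context that only allows the CUDA backend. */
vpiContextCreate(VPI_BACKEND_CUDA, &ctx);

/* Make it the current context of this thread; objects created
   from now on are owned by it. */
vpiContextSetCurrent(ctx);

/* ... create streams and buffers, submit work ... */

/* Destroying the context also destroys the objects it owns. */
vpiContextDestroy(ctx);
```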
By default, there is a single global context created before the first VPI object is created.
This global context is initially shared among all user threads and cannot be destroyed by the user.
For most applications, the user can use the global context. When finer control over how objects are grouped together is required, or when some level of independence between pipelines is needed, the user may wish to explicitly create and manipulate contexts.
Each user thread has a context stack not shared with other threads.
The top context in the stack is the current context for that thread.
By default, the context stack has one context in it, the global context. Consequently, all new threads have the same global context set as their current context.
Making a context current in a given stack amounts to replacing the top context, either the global context or the most recently pushed context, with the given context. The replaced context does not belong to the stack anymore.
However, pushing a context into a stack does not replace anything. The top context is kept in the stack and the new pushed context is put at the top, thereby becoming the new current context.
The user can push and pop contexts from the stack at will. This allows for temporarily creating pipelines in a new context without disturbing the existing context.
To avoid leakage, it is important to match the number of pushes and pops on a given context stack. Be aware that the context stack can have at most 8 contexts in it.
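A short sketch of the push/pop pattern for temporarily working under a separate context; the calls follow the context API described above:

```c
VPIContext tmpCtx = NULL;
vpiContextCreate(0, &tmpCtx);     /* enable whatever backends are available */

/* tmpCtx becomes the new current context; the previous top stays in the stack. */
vpiContextPush(tmpCtx);

/* ... build a temporary pipeline; its objects are owned by tmpCtx ... */

/* Restore the previous current context. Pops must match pushes. */
VPIContext popped = NULL;
vpiContextPop(&popped);

vpiContextDestroy(tmpCtx);
```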
The main entry-point to the API is the VPIStream object. This object represents a FIFO command queue storing a list of commands to be executed by some backend. Commands may consist of running a particular CV algorithm, executing a host function (via vpiSubmitHostFunction), or signaling an event.
At creation time, it is configured with the backends that will eventually execute the tasks submitted to it. By default, when passing 0 as flags, it'll use the backends enabled by the current context. Limiting the number of available backends helps minimize resource usage.
Each stream launches an internal worker thread that implements a task queue to handle asynchronous task execution. When exactly the thread is created is not specified, but it usually exists from stream creation until the stream is destroyed.
Invoking any CV function on a particular backend pushes a corresponding command to the VPIStream worker thread and immediately returns. The queued commands are then dispatched to the hardware backend assigned to them for execution. This allows the API functions to be executed asynchronously with respect to the calling thread.
Refer to stream API reference for more information.
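A minimal sketch of the stream lifecycle, assuming the creation and synchronization functions named in this document:

```c
#include <vpi/Stream.h>

VPIStream stream = NULL;
vpiStreamCreate(0, &stream);   /* 0: allow all backends enabled by the current context */

/* ... vpiSubmit* calls enqueue work here and return immediately ... */

vpiStreamSync(stream);         /* block until all queued work has finished */
vpiStreamDestroy(stream);
```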
Every algorithm provided by VPI is implemented in one or more backends. Different implementations of the same algorithm return functionally similar results given the same inputs. Small variations between backends might occur, mostly due to optimizations tailored to a particular backend, e.g. use of fixed-point instead of floating-point math. Bit-exact equality between backends should not be relied upon.
For example, the CUDA backend implementation holds a cudaStream_t handle and other CUDA device information that allows launching of the underlying CUDA kernels.
Buffers represent the data VPI algorithms work with. Abstractions for three kinds of data are provided:
Users can have VPI manage allocation of all three types of buffers. Or, for images and arrays, existing memory can be wrapped into a VPI buffer. This is useful when interoperability with other libraries is required, such as using an OpenCV cv::Mat buffer as input to a VPI algorithm.
Common attributes for all buffer types are their size and the element type.
VPI images represent any kind of 2D data, such as images themselves, vector fields embedded in a 2D space, 2D heat maps, etc.
The images are characterized by their width, height and format.
When creating a VPIImage object, the flags passed during creation specify which backends the image can work with. One or more VPIBackend enums can be or-ed together. Passing 0 (or no backend flag) enables the set of backends allowed by the current context, which by default is all available backends.
Refer to image API reference for more information.
Image data can be accessed from the host using the vpiImageLock function. The function requires that the image have the CPU backend enabled. It fills a VPIImageData structure with image information that allows the user to properly address and interpret all image pixels. Once the user is done working on the image data from the host, vpiImageUnlock must be called. While the image is locked, it can't be accessed by an algorithm running asynchronously. It can, however, be locked recursively by the same thread that locked it in the first place. Just remember to pair each vpiImageLock call with a corresponding vpiImageUnlock.
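A small sketch of the lock/unlock pair; the VPI_LOCK_READ_WRITE mode name follows the VPI 1.x API and may differ in other releases:

```c
VPIImageData data;

/* Map the image into host memory; the image must have the CPU backend enabled. */
vpiImageLock(image, VPI_LOCK_READ_WRITE, &data);

/* 'data' now describes each plane (dimensions, row pitch and a pointer to the
   pixels), so the host can address every pixel directly. */

vpiImageUnlock(image);
```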
VPI supports a variety of image formats representing different pixel types such as single-channel 8-, 16- or 32-bit, unsigned and signed, multi-channel RGB and RGBA, semi-planar NV12, etc. Not all algorithms support images with all types.
The image format is represented by the VPIImageFormat enum. Each format is defined by several components, such as color space, number of planes, data layout, etc. There are functions to extract each component from the image format, as well as to modify an existing format.
Not all algorithms support all image formats provided; however, usually several are supported.
2D images are most commonly laid out in memory in pitch-linear format, i.e. row by row, one after the other. Each row can be larger than necessary, with some padding added to the end to have properly aligned row start addresses.
There's also the option for creating or wrapping memory using a proprietary block-linear layout. Depending on the algorithm and the backend, it might be more efficient to create 2D memories using this format.
See Image formats for more information.
Users can create images that wrap externally allocated CUDA and host (CPU) memory using the functions vpiImageCreateCudaMemWrapper and vpiImageCreateHostMemWrapper respectively. In both cases, the user must fill a VPIImageData structure with the required information and pass it to the function.
It's also possible to wrap an EGLImage handle using vpiImageCreateEglImageWrapper and an NvBuffer using vpiImageCreateNvBufferWrapper.
In all these cases, the VPIImage object doesn't own the memory buffer. When the VPIImage is destroyed, the buffer isn't deallocated.
As with image buffers managed by VPI, these wrapping functions accept flags that define which backends they can be used with.
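The following sketch wraps a tightly packed host buffer. The VPIImageData field names follow the VPI 1.x layout and may differ in other releases, so treat them as assumptions for illustration:

```c
#include <stdlib.h>
#include <string.h>
#include <vpi/Image.h>

int width = 640, height = 480;
uint8_t *hostPixels = malloc(width * height);   /* application-owned 8-bit pixels */

VPIImageData data;
memset(&data, 0, sizeof(data));
data.format               = VPI_IMAGE_FORMAT_U8;
data.numPlanes            = 1;
data.planes[0].width      = width;
data.planes[0].height     = height;
data.planes[0].pitchBytes = width;              /* no padding at the end of rows */
data.planes[0].data       = hostPixels;

VPIImage wrapper = NULL;
vpiImageCreateHostMemWrapper(&data, 0, &wrapper);

/* ... use 'wrapper' as any other VPIImage ... */

vpiImageDestroy(wrapper);   /* does NOT free hostPixels */
free(hostPixels);
```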
VPI arrays represent 1D data, such as keypoint lists, bounding boxes, transforms, etc.
Arrays are characterized by their capacity, size and element format. As with images, the flags are used to specify which backend they can work with.
Array formats are drawn from the VPIArrayType enum. Algorithms that require array inputs/outputs, such as the KLT template tracker, usually accept one specific array format.
A unique feature of VPIArray is that, while the capacity of the array is fixed for the lifetime of the object, its size can change. Any API that outputs to an array sets the size parameter to the number of valid elements contained in the array. The user can also use vpiArrayGetSize and vpiArraySetSize to query and modify the size of an array.
Refer to array API reference for more information.
Array data can be accessed from host using the vpiArrayLock function. It works like its image counterpart, including recursive locking by the same thread.
Users can also create arrays that wrap externally allocated CUDA and host memory using the functions vpiArrayCreateCudaMemWrapper and vpiArrayCreateHostMemWrapper respectively. In both cases, the user must fill a VPIArrayData structure with the required information and pass it to the function.
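A short sketch of array creation, size query and host access; enum and struct names follow the VPI 1.x API and may differ in other releases:

```c
#include <vpi/Array.h>

VPIArray keypoints = NULL;

/* Capacity is fixed at creation; the size starts at 0 and is updated by algorithms. */
vpiArrayCreate(8192, VPI_ARRAY_TYPE_KEYPOINT, 0, &keypoints);

int32_t size = 0;
vpiArrayGetSize(keypoints, &size);   /* number of valid elements currently stored */

VPIArrayData data;
vpiArrayLock(keypoints, VPI_LOCK_READ, &data);
/* data.data points to the valid elements of the array's element type */
vpiArrayUnlock(keypoints);

vpiArrayDestroy(keypoints);
```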
VPI pyramids represent a collection of VPI images stacked together, all having the same format, but possibly different dimensions.
Pyramids are characterized by their number of levels, base level dimensions, scale factor and image format. The scale factor represents the ratio of one level's dimensions over the prior level's dimensions. For instance, when scale=0.5 the pyramid is dyadic, i.e., each level has half the dimensions of the previous one.
Often it's required to process one pyramid level as input or output to a VPI algorithm. The user must then use vpiImageCreatePyramidLevelWrapper, specifying the pyramid and which level is to be wrapped. The returned VPIImage handle can be used like any other image. The resulting image inherits the enabled backends from the pyramid. Once work on this image is done, it must be destroyed with vpiImageDestroy.
Refer to pyramid API reference for more information.
As with images and arrays, the user can access the whole pyramid data from host using the function vpiPyramidLock, provided that the pyramid is enabled for CPU backend. This function fills a VPIPyramidData structure that is basically an array of VPIImageData. Once work with VPIPyramidData is done, call vpiPyramidUnlock to unmap the pyramid from host and free resources. Recursive locking works just like images and arrays.
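A rough sketch of pyramid creation and level wrapping; the exact argument order of these functions varies between VPI releases, so it is assumed here for illustration:

```c
#include <vpi/Pyramid.h>

VPIPyramid pyr = NULL;

/* 4-level dyadic (scale=0.5) pyramid with a 640x480 U8 base level. */
vpiPyramidCreate(640, 480, VPI_IMAGE_FORMAT_U8, 4, 0.5f, 0, &pyr);

/* Wrap level 2 as a regular VPIImage so it can be fed to an algorithm
   (parameter order assumed). */
VPIImage level2 = NULL;
vpiImageCreatePyramidLevelWrapper(pyr, 2, &level2);

/* ... submit algorithms using 'level2' ... */

vpiImageDestroy(level2);   /* destroy the wrapper; the pyramid still owns the data */
vpiPyramidDestroy(pyr);
```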
Each compute function in the API is executed asynchronously with respect to the calling thread, i.e., it returns immediately without waiting for completion. There are two ways of synchronizing with the backend. One is to wait until all the commands in the VPIStream queue are finished by using the vpiStreamSync call. This approach, while simple, doesn't allow for fine-grained (i.e. "wait until function X is completed") or inter-stream (i.e. "before running function A in stream B, wait until function C in stream D finishes") synchronization. That's where VPIEvent objects come in. Conceptually they correspond to binary semaphores and are designed to closely mimic events in the CUDA API:
Refer to event API reference for more information.
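A small sketch of the event mechanism using the functions named in this document (vpiEventRecord, vpiStreamWaitFor, vpiEventSync):

```c
#include <vpi/Event.h>

VPIEvent ev = NULL;
vpiEventCreate(0, &ev);

/* ... submit work to streamA ... */

/* Mark streamA's current queue position. */
vpiEventRecord(ev, streamA);

/* Make streamB wait (asynchronously) until that position is reached... */
vpiStreamWaitFor(streamB, ev);

/* ...or make the host block until it is reached. */
vpiEventSync(ev);

vpiEventDestroy(ev);
```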
All API functions are thread-safe. Concurrent host access to API objects is serialized and executed in an unspecified order. All API calls use a VPIContext instance that is thread-specific and stored in TLS. If the context pointer for the current thread is NULL (no context is set), all API calls will use a default "global" context created during library initialization. API objects have no concept of thread affinity; in other words, if both threads use the same context instance, the object created in one thread can be safely destroyed by another thread.
Most of the API functions are non-blocking. Specifically, the set of functions that can block when called is limited to: vpiStreamSync, vpiStreamDestroy, vpiContextDestroy, vpiEventSync and the several vpiSubmit* functions when the stream command queue is full. Since implicit synchronization in the API implementation is minimal, it's up to the user to make sure the resulting order of dependent function calls is legal. Invalid calls, however, should always be handled gracefully (via an appropriate error code) and should not lead to application crashes or corruption of objects' internal state.
The device command queue model is loosely based on the CUDA Stream API, and can be summarized as follows:
Pipeline examples, and how to implement them using VPI, are explained in the following sections.
In this example, a pipeline with a simple box filter operation is implemented to process an input image. This is quite similar to the ImageBlurring tutorial.
The code for implementing the pipeline is as follows.
Include necessary headers. In this example, image buffers are used, a stream, and the Box Filter algorithm.
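A sketch of the includes this example needs; exact header paths may vary slightly between VPI releases:

```c
#include <vpi/Image.h>
#include <vpi/Stream.h>
#include <vpi/algo/BoxFilter.h>
```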
Create the image buffers to be used.
A 640x480 1-channel (grayscale) input image is created with unsigned 8-bit pixel elements, represented by the VPI_IMAGE_FORMAT_U8 enum. Images are initialized with zeros upon creation. By passing 0 as the image flags, we state the intent of possibly using the images in all available hardware backends. This makes it easier to submit algorithms to different backends later on, at the cost of using more resources. The output image is created the same way.
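A sketch of the buffer creation described above:

```c
VPIImage input = NULL, output = NULL;

/* 640x480, single channel, unsigned 8-bit pixels; flags=0 enables all available backends. */
vpiImageCreate(640, 480, VPI_IMAGE_FORMAT_U8, 0, &input);
vpiImageCreate(640, 480, VPI_IMAGE_FORMAT_U8, 0, &output);
```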
Create a stream to execute the algorithm. Passing 0 as the stream flags allows the user to submit algorithms for execution in any available hardware backend, specified later.
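For instance:

```c
VPIStream stream = NULL;
vpiStreamCreate(0, &stream);   /* flags=0: backend is chosen at algorithm submission time */
```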
Submit the box filter algorithm to the stream, along with the input and output images, and other parameters. In this case, it's a 3x3 box filter with clamp boundary condition. It'll be executed by the CUDA backend.
In general, because of the asynchronous nature of streams, the algorithm is enqueued onto the stream's work thread, and the function returns immediately. Later on it'll be submitted for execution in the actual backend. The use of a work thread allows the program to continue assembling the processing pipeline, or do something else, while the algorithm executes in parallel.
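A sketch of the submission, assuming a VPI release where the backend is selected at submission time; the argument order and the border/boundary enum name may differ between versions:

```c
/* 3x3 box filter with clamp boundary condition, executed by the CUDA backend.
   The call only enqueues the work and returns immediately. */
vpiSubmitBoxFilter(stream, VPI_BACKEND_CUDA, input, output, 3, 3, VPI_BORDER_CLAMP);
```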
Wait until the stream finishes processing.
This function blocks until all algorithms submitted to the stream finish executing. This must be done before the output can be displayed to the user, saved to disk, etc.
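```c
vpiStreamSync(stream);   /* blocks until the box filter has finished writing 'output' */
```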
Destroy created objects.
Upon completion, destroy the created objects to avoid memory leaks. Destroying a stream forces it to synchronize, but destroying images that are still being used by an algorithm leads to undefined behavior, likely resulting in a program crash.
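A sketch of the teardown, following the ordering constraint described above:

```c
vpiStreamDestroy(stream);   /* implicitly synchronizes the stream first */
vpiImageDestroy(input);
vpiImageDestroy(output);
```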
Examining how several VPI objects work together, and inspecting the ownership relationship between objects is a beneficial learning exercise.
A conceptual structure of the provided example is as follows.
Where:
More complex scenarios can be envisioned that take advantage of different acceleration processors on the device and create a pipeline that strives to fully utilize the computational power. To do that, the pipeline must have parallelizable stages.
This next example implements a full stereo disparity estimation and Harris corners extraction pipeline, which presents plenty of parallelization opportunities.
Three parallelization opportunities are identified: the independent left and right image pre-processing stages, and Harris corners extraction. A different backend is chosen for each processing stage, depending on the processing speed of each backend, power requirements, input and output restrictions, and availability. In this example, the whole processing is split among the following backends:
The rationale for this choice of backends is to keep the GPU free for other external processing, such as Deep Learning inference stages. The image format conversion operation is quite fast on CUDA and wouldn't interfere much. The CPU is left to perform Harris keypoint extraction undisturbed.
The following diagram shows how the algorithms are split into streams and how synchronization between streams works.
Both the left and right streams start stereo pair pre-processing, while the keypoints stream waits until the right grayscale image is ready. Once it is, Harris corners detection starts while the right stream continues pre-processing. Once pre-processing on the left stream ends, it waits until the right downscaled image is ready. Finally, stereo disparity estimation starts with its two stereo inputs. The host thread can at any point issue a vpiStreamSync call on both the left and keypoints streams to wait until the disparity and keypoints data are ready for further processing or display.
The code that implements this pipeline is explained as follows.
Create a context and make it active.
Although the default context that is automatically created to manage the VPI state can be used, sometimes it is more convenient to create a context and use it to handle the lifetime of all objects linked to a particular pipeline. In the end, context destruction will trigger destruction of all objects created under it. This also leads to better isolation between this pipeline and others that the application might use.
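For example:

```c
VPIContext ctx = NULL;
vpiContextCreate(0, &ctx);     /* 0: enable every backend available on this platform */
vpiContextSetCurrent(ctx);     /* objects created below are owned by 'ctx' */
```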
Create the streams.
The streams are created with flags 0, meaning that they can handle tasks for all backends.
There are two streams to handle the stereo pair preprocessing, and another for Harris corners detection. After preprocessing is done, stream_left is reused for stereo disparity estimation.
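A sketch of the three stream creations:

```c
VPIStream stream_left = NULL, stream_right = NULL, stream_keypoints = NULL;
vpiStreamCreate(0, &stream_left);
vpiStreamCreate(0, &stream_right);
vpiStreamCreate(0, &stream_keypoints);
```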
Create the input image buffer wrappers.
Assuming that the input comes from a capture pipeline as EGLImage, these can be wrapped into a VPIImage to be used in a VPI pipeline. All it requires is one frame (usually the first) from each stereo input.
Create the image buffers to be used.
Similar to the simple pipeline, here the input images are created empty. In reality these input images must be populated by either wrapping existing memory, or by being the result of an earlier VPI pipeline.
The input is a 640x480 NV12 (color) stereo pair, typically output by camera capture pipelines. The temporary images are needed for storing intermediate results. Stereo disparity and Harris expect grayscale images, hence the format conversion. Moreover, stereo disparity expects its input to be exactly 480x270. This is accomplished by the rescale stage in the diagram above.
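A sketch of the intermediate buffers; the image formats used here are illustrative assumptions, since the formats actually required depend on the algorithms and backends involved:

```c
/* Grayscale intermediates produced by the format conversion stage. */
VPIImage left_gray = NULL, right_gray = NULL;
vpiImageCreate(640, 480, VPI_IMAGE_FORMAT_U8, 0, &left_gray);
vpiImageCreate(640, 480, VPI_IMAGE_FORMAT_U8, 0, &right_gray);

/* Downscaled pair at the 480x270 resolution expected by stereo disparity,
   plus the disparity output. */
VPIImage left_reduced = NULL, right_reduced = NULL, disparity = NULL;
vpiImageCreate(480, 270, VPI_IMAGE_FORMAT_U8, 0, &left_reduced);
vpiImageCreate(480, 270, VPI_IMAGE_FORMAT_U8, 0, &right_reduced);
vpiImageCreate(480, 270, VPI_IMAGE_FORMAT_U16, 0, &disparity);
```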
Define stereo disparity algorithm parameters and create the payload.
Stereo disparity processing requires some temporary data. VPI calls it a payload. In this example, vpiCreateStereoDisparityEstimator is called with all the required parameters so that the internal allocator can decide the size of the temporary data.
Because the temporary data is allocated on a backend device, the payload is tightly coupled to the backend. If the same algorithm is meant to be executed in different backends, or concurrently using the same backend in different streams, it'll require one payload per backend/stream. In this example, the payload is created for execution by the PVA backend.
As for algorithm parameters, the VPI stereo disparity estimator is implemented by a semi-global stereo matching algorithm. The estimator requires the census transform window size, specified as 5, and the maximum disparity levels, specified as 64. For more information, consult Stereo Disparity Estimator.
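A rough sketch of the payload creation; the exact creation arguments and the parameter struct layout differ between VPI releases, so the struct and field names below (windowSize, maxDisparity) are assumptions used only for illustration:

```c
/* Algorithm parameters: 5x5 census transform window, 64 disparity levels
   (field names assumed for illustration). */
VPIStereoDisparityEstimatorParams stereoParams;
stereoParams.windowSize   = 5;
stereoParams.maxDisparity = 64;

/* Payload tied to the PVA backend and to the 480x270 input size
   (argument list assumed from a VPI 1.x-style signature). */
VPIPayload stereoPayload = NULL;
vpiCreateStereoDisparityEstimator(VPI_BACKEND_PVA, 480, 270, VPI_IMAGE_FORMAT_U8,
                                  &stereoParams, &stereoPayload);
```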
Create the image rectification payload and corresponding parameters. Rectification performs lens distortion correction using the Remap algorithm. Here the stereo lens parameters are specified and, because they are different for the left and right lenses, two remap payloads are created. For more details, consult Lens Distortion Correction.
Create output buffers for Harris keypoint detector.
This algorithm receives an image and outputs two arrays, one with the keypoints themselves and another with the score of each keypoint. At most 8192 keypoints are returned, so that must be the array capacity. Keypoints are represented by the VPIKeypoint structure and scores are 32-bit unsigned values. For more information, consult Harris Corner Detector.
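For example, assuming the array types named here exist in the VPIArrayType enum:

```c
VPIArray keypoints = NULL, scores = NULL;

/* Capacity of 8192 elements, matching the maximum number of keypoints returned. */
vpiArrayCreate(8192, VPI_ARRAY_TYPE_KEYPOINT, 0, &keypoints);
vpiArrayCreate(8192, VPI_ARRAY_TYPE_U32, 0, &scores);
```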
Define Harris detector parameters and create its payload.
Fill the VPIHarrisCornerDetectorParams structure with the required parameters. Refer to the structure documentation for more information about each parameter.
Like stereo disparity, the Harris detector requires a payload. This time only the input size, 640x480, is needed. When using this payload, only inputs of this size are accepted.
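A rough sketch of the parameter setup and payload creation; the VPIHarrisCornerDetectorParams field names and the creation signature are assumptions taken from the VPI 1.x API and may differ in other releases:

```c
/* Field names assumed for illustration; consult the structure documentation. */
VPIHarrisCornerDetectorParams harrisParams;
harrisParams.gradientSize   = 5;
harrisParams.blockSize      = 5;
harrisParams.strengthThresh = 20;
harrisParams.sensitivity    = 0.01f;
harrisParams.minNMSDistance = 8;

/* Payload bound to the CPU backend and the 640x480 input size (signature assumed). */
VPIPayload harrisPayload = NULL;
vpiCreateHarrisCornerDetector(VPI_BACKEND_CPU, 640, 480, &harrisPayload);
```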
Create the events to implement a barrier synchronization.
Events are used for inter-stream synchronization. They are implemented by using VPIEvent. Two barriers are needed: one to wait for the input to Harris corners extraction to be ready, and another for the pre-processed right image.
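For example:

```c
VPIEvent barrier_right_grayscale = NULL, barrier_right_reduced = NULL;
vpiEventCreate(0, &barrier_right_grayscale);
vpiEventCreate(0, &barrier_right_reduced);
```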
Submit the left frame processing stages.
The lens distortion correction, image format conversion and downscaling are submitted to the left stream. Note again that the submit operations are non-blocking and return immediately.
Submit the first few stages of the right frame pre-processing.
The lens distortion correction and image format conversion stages will result in the grayscale image that will be input to Harris corner extraction.
Record the right stream state so that keypoints stream can synchronize to it.
The keypoints stream can only start after its input is ready. For that, the barrier_right_grayscale event must record the right stream state by submitting a task to it that will signal the event right after the format conversion finishes.
Finish the right frame pre-processing with a downscale operation.
Record the right stream state so that left stream can synchronize to it.
With the whole right preprocessing submitted, the stream state must be recorded again so that the left stream can wait until the right frame is ready.
Make left stream wait until the right frame is ready.
Stereo disparity requires the left and right frames to be ready. vpiStreamWaitFor is used to submit a task to the left stream that will wait until the barrier_right_reduced event is signaled on the right stream, meaning that the right frame preprocessing is finished.
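Putting the barrier pattern together, a sketch of how the records and waits line up (the individual algorithm submissions are elided):

```c
/* After the format conversion is submitted to stream_right: */
vpiEventRecord(barrier_right_grayscale, stream_right);   /* input to Harris is ready here   */

/* ... submit the downscale operation to stream_right ... */
vpiEventRecord(barrier_right_reduced, stream_right);     /* right frame fully pre-processed */

/* Left stream waits (asynchronously) until the right frame is ready,
   then stereo disparity can be submitted to it. */
vpiStreamWaitFor(stream_left, barrier_right_reduced);

/* Keypoints stream waits until the grayscale right image is ready,
   then Harris corner detection can be submitted to it. */
vpiStreamWaitFor(stream_keypoints, barrier_right_grayscale);
```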
Submit the stereo disparity algorithm.
At this point the input images are ready. Call vpiSubmitStereoDisparityEstimator to submit the disparity estimator.
Submit the keypoint detector pipeline.
For keypoint detection, first submit a wait operation on the barrier_right_grayscale event so that the stream waits until the input is ready. Then submit the Harris corners detector on it.
Synchronize the streams to use the disparity map and keypoints detected.
Remember that the functions called so far in processing phase are all asynchronous; they return immediately once the job is queued on the stream for later execution.
Now, more processing can be performed on the main thread, such as updating some GUI status or showing the previous frame. This occurs while VPI is executing the pipeline. Once this additional processing is performed, synchronize the streams that are processing the final result from current frame using vpiStreamSync. Once completed, the resulting buffers can be accessed.
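For example:

```c
/* Block until the disparity map and the keypoint arrays are ready. */
vpiStreamSync(stream_left);        /* stereo disparity result     */
vpiStreamSync(stream_keypoints);   /* keypoints and scores arrays */
```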
Fetch the next frame and update the input wrappers.
The existing input VPI image wrappers can be redefined to wrap the next stereo pair frames, provided that their dimensions and format are the same. This is done quite efficiently, without heap memory allocations.
Context destruction.
In this example, many objects were created under the current context. Once all processing is completed and the pipeline is no longer required, destroy the context. All streams will be synchronized and destroyed, along with all other objects used. No memory leaks are possible.
Destroying the current context reactivates the context that was active just before it was made current.
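The teardown then reduces to a single call on the context created at the start of this example:

```c
/* Destroys every stream, buffer, payload and event created under 'ctx'. */
vpiContextDestroy(ctx);
```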
Important takeaways from these examples: