In this section we compare VPI's performance with other well-known Computer Vision libraries. Performance numbers were collected following the method described in Benchmarking Method.

Benchmarking was done on NVIDIA® Jetson AGX Xavier™ devices, with clock frequencies maxed out.

The numbers show that VPI provides a significant speed-up in many use cases.

OpenCV

The comparison uses OpenCV 4.1.1, the same version shipped with NVIDIA® JetPack™, but built with NVIDIA® CUDA® support enabled.

All plots use a logarithmic scale because performance numbers differ widely from one algorithm to another.
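
As a rough illustration of why a linear axis would not work, the sketch below takes the smallest and largest timings that appear in the OpenCV comparison tables of this section and computes how many orders of magnitude separate them. The variable names are illustrative only and are not part of VPI or OpenCV.

```python
import math

# Timings in milliseconds, copied from the tables in this section.
fastest_ms = 0.05   # VPI 1.0 CUDA, Image Rescale
slowest_ms = 99.8   # OpenCV 4.1.1 CPU, Inverse FFT

# Ratio between the extremes and its size in powers of ten.
ratio = slowest_ms / fastest_ms
orders = math.log10(ratio)

print(f"ratio {ratio:.0f}x, about {orders:.1f} orders of magnitude")
```

On a linear axis spanning more than three orders of magnitude, the sub-millisecond bars would be practically invisible next to the slowest ones.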

CPU Performance

Both OpenCV and VPI measurements use one dispatch thread. Once dispatched, many OpenCV algorithms use multiple CPU cores during execution. Similarly, VPI takes advantage of all available CPU cores.

The Jetson AGX Xavier CPU has eight cores.

VPI is currently slower than OpenCV on Convert Image Format, Perspective Warp and Remap; this will be addressed in a future VPI release.

OpenCV vs. VPI - CPU performance
Algorithm	Parameters	OpenCV 4.1.1 CPU	VPI 1.0 CPU	Speed-up
Gaussian Filter	1920x1080 U8 3x3	0.32 ms	0.27 ms	1.2x
Box Filter	1920x1080 U8 3x3	1.53 ms	0.38 ms	4x
Bilateral Filter	1920x1080 U8 5x5	8.8 ms	3.4 ms	2.6x
Convolution	1920x1080 U8 7x7	27.766 ms	1.45 ms	19x
Separable Convolution	1920x1080 U8 11x11	37.7 ms	1.286 ms	29x
Gaussian Pyramid	1920x1080 U8 5 levels, dyadic	2.18 ms	0.7 ms	3x
Image Rescale	1920x1080->1280x720 U8, linear interp.	1.94 ms	0.73 ms	2.7x
FFT	1920x1080 Real->Complex	42.2 ms	7.1 ms	6x
Inverse FFT	1920x1080 Complex->Real	99.8 ms	6.3 ms	16x
Harris Corner Detector	1920x1080 U8 grad=5x5, win=5x5	58.4 ms	8.1 ms	7.2x
Convert Image Format	1920x1080 NV12->RGBA8	1.35 ms	3.3 ms	0.4x
Perspective Warp	1920x1080 RGBA8, linear interp.	6.1 ms	7.7 ms	0.8x
Image Remap	1920x1080 RGBA8 dense, linear interp.	5.0 ms	11.36 ms	0.4x
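
The Speed-up column appears to be the baseline time divided by the VPI time, rounded. The sketch below reproduces that calculation for three rows of the table above; the `speed_up` helper is a hypothetical name used only for illustration.

```python
# (OpenCV 4.1.1 CPU, VPI 1.0 CPU) timings in milliseconds,
# copied from three rows of the CPU table above.
timings_ms = {
    "Box Filter": (1.53, 0.38),
    "Convolution": (27.766, 1.45),
    "Convert Image Format": (1.35, 3.3),
}


def speed_up(baseline_ms, vpi_ms):
    """Return how many times faster VPI is than the baseline.

    Values below 1 mean VPI is slower than the baseline.
    """
    return baseline_ms / vpi_ms


for name, (opencv_ms, vpi_ms) in timings_ms.items():
    print(f"{name}: {speed_up(opencv_ms, vpi_ms):.1f}x")
```

A result below 1, as for Convert Image Format, marks a case where OpenCV is currently the faster of the two.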

CUDA Performance

Both OpenCV and VPI benchmarking use one stream for algorithm execution.

OpenCV vs. VPI - CUDA performance
Algorithm	Parameters	OpenCV 4.1.1 CUDA	VPI 1.0 CUDA	Speed-up
Gaussian Filter	1920x1080 U8 3x3	0.27 ms	0.065 ms	4.2x
Box Filter	1920x1080 U8 3x3	0.28 ms	0.064 ms	4.3x
Bilateral Filter	1920x1080 U8 5x5	1.37 ms	0.22 ms	6.4x
Convolution	1920x1080 U8 7x7	1.89 ms	0.12 ms	16x
Separable Convolution	1920x1080 U8 11x11	0.42 ms	0.10 ms	4.2x
Gaussian Pyramid	1920x1080 U8 5 levels, dyadic	0.22 ms	0.08 ms	2.9x
Image Rescale	1920x1080->1280x720 U8, linear interp.	0.08 ms	0.05 ms	1.8x
FFT	1920x1080 Real->Complex	3.14 ms	0.80 ms	4.0x
Inverse FFT	1920x1080 Complex->Real	3.40 ms	0.82 ms	4.1x
Perspective Warp	1920x1080 RGBA8, linear interp.	0.44 ms	0.18 ms	2.4x
Image Remap	1920x1080 RGBA8 dense, linear interp.	0.49 ms	0.38 ms	1.5x

Note: A previous version of this document showed VPI/CUDA FFT performance as worse than OpenCV/CUDA. Those numbers were wrong because of a bug, which has been fixed, so the table now reflects the true performance difference. VPI/CUDA FFT performance itself has not changed since the prior version.

PVA Performance

Here PVA processing is compared against OpenCV algorithms implemented on CPU.

The PVA hardware in Jetson AGX Xavier devices can process 4 independent streams, whereas for some OpenCV CPU implementations the limit on these devices is the number of CPU cores, 8. For this reason, the comparison below uses 1, 4 and 8 parallel streams, to better capture each one's characteristics.

PVA saturates with 4 parallel streams, whereas the CPU saturates with 8 streams if the OpenCV implementation is single-threaded, or with just 1 if it is multi-threaded.

OpenCV/CPU vs. PVA
Algorithm	Parameters	OpenCV 4.1.1 CPU, 1 stream	OpenCV 4.1.1 CPU, 4 streams	OpenCV 4.1.1 CPU, 8 streams	VPI 1.0 PVA, 1 stream	VPI 1.0 PVA, 4 streams	VPI 1.0 PVA, 8 streams	Speed-up, 1 stream	Speed-up, 4 streams	Speed-up, 8 streams
Gaussian Filter	1920x1080 U8 3x3	0.57 ms	1.60 ms	3.00 ms	1.01 ms	1.25 ms	2.41 ms	0.6x	1.3x	1.2x
Box Filter	1920x1080 U8 3x3	1.50 ms	1.54 ms	2.29 ms	1.12 ms	1.26 ms	2.45 ms	1.3x	1.2x	0.9x
Convolution	1920x1080 U8 7x7	27.75 ms	27.90 ms	40.00 ms	1.86 ms	2.04 ms	3.99 ms	14.9x	13.7x	10x
Separable Convolution	1920x1080 S16 11x11	40.20 ms	40.40 ms	53.00 ms	4.81 ms	5.01 ms	9.89 ms	8.4x	8.1x	5.4x
Gaussian Pyramid	1920x1080 U16 5 levels, dyadic	2.02 ms	2.07 ms	3.10 ms	1.17 ms	1.38 ms	2.55 ms	1.7x	1.5x	1.2x
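
The saturation behavior can be read off the table by converting times into aggregate throughput. The sketch below does this for the Box Filter row, assuming each reported time is how long one stream takes to process one frame while the given number of streams run in parallel. That reading of the table, and the helper name `aggregate_throughput`, are assumptions made for illustration.

```python
# Box Filter timings in milliseconds, keyed by number of parallel streams,
# copied from the table above.
pva_ms = {1: 1.12, 4: 1.26, 8: 2.45}
cpu_ms = {1: 1.50, 4: 1.54, 8: 2.29}


def aggregate_throughput(times_ms):
    """Frames completed per millisecond, summed over all parallel streams."""
    return {streams: streams / t for streams, t in times_ms.items()}


pva_tp = aggregate_throughput(pva_ms)
cpu_tp = aggregate_throughput(cpu_ms)

for streams in (1, 4, 8):
    print(f"{streams} stream(s): PVA {pva_tp[streams]:.2f} frames/ms, "
          f"CPU {cpu_tp[streams]:.2f} frames/ms")
```

Under that assumption, PVA throughput roughly triples from 1 to 4 streams and then stays almost flat at 8, while the CPU figure keeps growing up to 8 streams, which matches the saturation points stated above.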

NVIDIA VisionWorks

The comparison uses NVIDIA® VisionWorks™ version 1.6. Both VisionWorks and VPI benchmarking use one stream for algorithm execution.

VPI is currently slower than VisionWorks on Convert Image Format and Box Filter; this will be addressed in a future VPI release.

VisionWorks vs. VPI - CUDA performance
Algorithm	Parameters	VisionWorks 1.6	VPI 1.0 CUDA	Speed-up
Gaussian Filter	1920x1080 U8 3x3	0.063 ms	0.065 ms	0.96x
Box Filter	1920x1080 U8 3x3	0.052 ms	0.064 ms	0.8x
Convolution	1920x1080 U8 7x7	0.62 ms	0.12 ms	5.2x
Gaussian Pyramid	1920x1080 U8 5 levels, dyadic	0.113 ms	0.077 ms	1.46x
Image Rescale	1920x1080->1280x720 U8, linear interp.	0.044 ms	0.045 ms	0.97x
Harris Corner Detector	1920x1080 U8 grad=5x5, win=5x5	5.51 ms	0.84 ms	6.54x
Convert Image Format	1920x1080 NV12->RGBA8	0.11 ms	0.19 ms	0.57x
Perspective Warp	1920x1080 RGBA8, linear interp.	0.19 ms	0.18 ms	1.06x
Image Remap	1920x1080 RGBA8 dense, linear interp.	0.36 ms	0.33 ms	1.08x
Stereo Disparity Estimator	480x270 U8, max disp=64, win=5x5	7.72 ms	5.91 ms	1.31x

VPI - Vision Programming Interface

1.2 Release
