VPI - Vision Programming Interface

1.2 Release

Performance Comparison

In this section we compare VPI's performance with other well-known Computer Vision libraries. Performance numbers were collected following the method described in Benchmarking Method.

Benchmarking was done on NVIDIA® Jetson AGX Xavier™ devices, with clock frequencies maxed out.

The numbers show that VPI provides a significant speed-up in many use cases.

OpenCV

The comparison uses OpenCV 4.1.1, the same version shipped with NVIDIA® JetPack™, but built with NVIDIA® CUDA® support enabled.

All plots use a logarithmic scale because performance numbers differ widely between algorithms.
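To illustrate why a logarithmic scale helps, the short sketch below takes the fastest and slowest timings from the CPU table later in this section and compares their linear ratio with their span on a base-10 logarithmic axis. The variable names are illustrative only and are not part of VPI or OpenCV.

```python
import math

# Fastest and slowest timings in the CPU table, in milliseconds.
fastest_ms = 0.27  # VPI Gaussian Filter
slowest_ms = 99.8  # OpenCV Inverse FFT

ratio = slowest_ms / fastest_ms  # how many times longer the slowest entry takes
decades = math.log10(ratio)      # span of the same data on a log10 axis

print(f"linear ratio: {ratio:.0f}x, log10 span: {decades:.2f} decades")
```

On a linear axis the shortest bar would be a tiny fraction of the longest one, while on a logarithmic axis both remain readable.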

CPU Performance

Both OpenCV and VPI measurements use one dispatch thread. Once dispatched, many OpenCV algorithms use multiple CPU cores during execution. Similarly, VPI takes advantage of all available CPU cores.

The Jetson AGX Xavier CPU has eight cores.

The performance drop on Convert Image Format, Perspective Warp and Remap will be addressed in a future VPI release.

OpenCV vs. VPI - CPU performance
| Algorithm | Parameters | OpenCV 4.1.1 CPU | VPI 1.0 CPU | Speed-up |
|---|---|---|---|---|
| Gaussian Filter | 1920x1080 U8 3x3 | 0.32 ms | 0.27 ms | 1.2x |
| Box Filter | 1920x1080 U8 3x3 | 1.53 ms | 0.38 ms | 4x |
| Bilateral Filter | 1920x1080 U8 5x5 | 8.8 ms | 3.4 ms | 2.6x |
| Convolution | 1920x1080 U8 7x7 | 27.766 ms | 1.45 ms | 19x |
| Separable Convolution | 1920x1080 U8 11x11 | 37.7 ms | 1.286 ms | 29x |
| Gaussian Pyramid | 1920x1080 U8 5 levels, dyadic | 2.18 ms | 0.7 ms | 3x |
| Image Rescale | 1920x1080->1280x720 U8, linear interp. | 1.94 ms | 0.73 ms | 2.7x |
| FFT | 1920x1080 Real->Complex | 42.2 ms | 7.1 ms | 6x |
| Inverse FFT | 1920x1080 Complex->Real | 99.8 ms | 6.3 ms | 16x |
| Harris Corner Detector | 1920x1080 U8 grad=5x5, win=5x5 | 58.4 ms | 8.1 ms | 7.2x |
| Convert Image Format | 1920x1080 NV12->RGBA8 | 1.35 ms | 3.3 ms | 0.4x |
| Perspective Warp | 1920x1080 RGBA8, linear interp. | 6.1 ms | 7.7 ms | 0.8x |
| Image Remap | 1920x1080 RGBA8 dense, linear interp. | 5.0 ms | 11.36 ms | 0.4x |
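The Speed-up column is consistent with dividing the baseline time by the VPI time, so values above 1 mean VPI is faster and values below 1 mean it is slower. The following is a minimal sketch of that arithmetic on a few rows from the table above; the function name `speed_up` and the list `cpu_rows` are illustrative and do not come from VPI or its benchmarking tools.

```python
def speed_up(baseline_ms: float, vpi_ms: float) -> float:
    """Return how many times faster VPI is than the baseline."""
    if vpi_ms <= 0:
        raise ValueError("vpi_ms must be positive")
    return baseline_ms / vpi_ms


# A few rows copied from the CPU table: (algorithm, OpenCV ms, VPI ms).
cpu_rows = [
    ("Box Filter", 1.53, 0.38),
    ("Convolution", 27.766, 1.45),
    ("Convert Image Format", 1.35, 3.3),
]

for name, opencv_ms, vpi_ms in cpu_rows:
    print(f"{name}: {speed_up(opencv_ms, vpi_ms):.1f}x")
```

The published figures are rounded, so recomputed values can differ slightly in the last digit.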

CUDA Performance

Both OpenCV and VPI benchmarking use one stream for algorithm execution.

OpenCV vs. VPI - CUDA performance
| Algorithm | Parameters | OpenCV 4.1.1 CUDA | VPI 1.0 CUDA | Speed-up |
|---|---|---|---|---|
| Gaussian Filter | 1920x1080 U8 3x3 | 0.27 ms | 0.065 ms | 4.2x |
| Box Filter | 1920x1080 U8 3x3 | 0.28 ms | 0.064 ms | 4.3x |
| Bilateral Filter | 1920x1080 U8 5x5 | 1.37 ms | 0.22 ms | 6.4x |
| Convolution | 1920x1080 U8 7x7 | 1.89 ms | 0.12 ms | 16x |
| Separable Convolution | 1920x1080 U8 11x11 | 0.42 ms | 0.10 ms | 4.2x |
| Gaussian Pyramid | 1920x1080 U8 5 levels, dyadic | 0.22 ms | 0.08 ms | 2.9x |
| Image Rescale | 1920x1080->1280x720 U8, linear interp. | 0.08 ms | 0.05 ms | 1.8x |
| FFT | 1920x1080 Real->Complex | 3.14 ms | 0.80 ms | 4.0x |
| Inverse FFT | 1920x1080 Complex->Real | 3.40 ms | 0.82 ms | 4.1x |
| Perspective Warp | 1920x1080 RGBA8, linear interp. | 0.44 ms | 0.18 ms | 2.4x |
| Image Remap | 1920x1080 RGBA8 dense, linear interp. | 0.49 ms | 0.38 ms | 1.5x |
Note
An earlier version of this document showed VPI/CUDA FFT performing worse than OpenCV/CUDA. This was caused by a bug in how the numbers were obtained. The bug is now fixed, so the numbers reflect the true performance difference. VPI/CUDA FFT performance itself has not changed from the prior version.

PVA Performance

Here, PVA processing is compared against OpenCV algorithms running on the CPU.

The PVA hardware in Jetson AGX Xavier devices can process 4 independent streams in parallel, whereas some OpenCV CPU implementations are limited by the number of CPU cores on these devices, which is 8. For this reason, the comparison below uses 1, 4 and 8 parallel streams to better capture the characteristics of each.

PVA saturates with 4 parallel streams, whereas the CPU saturates with 8 if the OpenCV implementation is single-threaded, or with just 1 if it is multi-threaded.
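The saturation behaviour can be sketched with an idealized model: a backend that runs up to `capacity` streams concurrently at no extra cost and serializes any additional streams in rounds. This is only an assumption used for illustration, not a description of how PVA schedules work, and the function name `ideal_batch_time` is hypothetical. It also assumes that each table entry is the time for every stream to process one frame.

```python
import math


def ideal_batch_time(single_stream_ms: float, streams: int, capacity: int) -> float:
    """Idealized time for each of `streams` parallel streams to process one frame.

    Assumes up to `capacity` streams run concurrently at no extra cost and
    any remaining streams are processed in additional rounds.
    """
    if streams < 1 or capacity < 1:
        raise ValueError("streams and capacity must be at least 1")
    rounds = math.ceil(streams / capacity)
    return single_stream_ms * rounds


# PVA Gaussian Filter takes 1.01 ms with 1 stream (from the table below).
for n in (1, 4, 8):
    print(f"{n} stream(s): {ideal_batch_time(1.01, n, capacity=4):.2f} ms")
```

Under this model the time stays flat up to 4 streams and doubles at 8. The measured PVA numbers (1.01 ms, 1.25 ms and 2.41 ms) follow the same shape, with some overhead on top of the ideal values.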

OpenCV/CPU vs. PVA
| Algorithm | Parameters | OpenCV 4.1.1 CPU (1 stream) | OpenCV 4.1.1 CPU (4 streams) | OpenCV 4.1.1 CPU (8 streams) | VPI 1.0 PVA (1 stream) | VPI 1.0 PVA (4 streams) | VPI 1.0 PVA (8 streams) | Speed-up (1 stream) | Speed-up (4 streams) | Speed-up (8 streams) |
|---|---|---|---|---|---|---|---|---|---|---|
| Gaussian Filter | 1920x1080 U8 3x3 | 0.57 ms | 0.16 ms | 3.00 ms | 1.01 ms | 1.25 ms | 2.41 ms | 0.6x | 1.3x | 1.2x |
| Box Filter | 1920x1080 U8 3x3 | 1.50 ms | 1.54 ms | 2.29 ms | 1.12 ms | 1.26 ms | 2.45 ms | 1.3x | 1.2x | 0.9x |
| Convolution | 1920x1080 U8 7x7 | 27.75 ms | 27.90 ms | 40.00 ms | 1.86 ms | 2.04 ms | 3.99 ms | 14.9x | 13.7x | 10x |
| Separable Convolution | 1920x1080 S16 11x11 | 40.20 ms | 40.40 ms | 53.00 ms | 4.81 ms | 5.01 ms | 9.89 ms | 8.4x | 8.1x | 5.4x |
| Gaussian Pyramid | 1920x1080 U16 5 levels, dyadic | 2.02 ms | 2.07 ms | 3.10 ms | 1.17 ms | 1.38 ms | 2.55 ms | 1.7x | 1.5x | 1.2x |

NVIDIA VisionWorks

The comparison is made against NVIDIA® VisionWorks™ version 1.6. Both VisionWorks and VPI benchmarking use one stream for algorithm execution.

The lower VPI performance on Convert Image Format and Box Filter will be addressed in a future VPI release.

VisionWorks vs. VPI - CUDA performance
| Algorithm | Parameters | VisionWorks 1.6 | VPI 1.0 CUDA | Speed-up |
|---|---|---|---|---|
| Gaussian Filter | 1920x1080 U8 3x3 | 0.063 ms | 0.065 ms | 0.96x |
| Box Filter | 1920x1080 U8 3x3 | 0.052 ms | 0.064 ms | 0.8x |
| Convolution | 1920x1080 U8 7x7 | 0.62 ms | 0.12 ms | 5.2x |
| Gaussian Pyramid | 1920x1080 U8 5 levels, dyadic | 0.113 ms | 0.077 ms | 1.46x |
| Image Rescale | 1920x1080->1280x720 U8, linear interp. | 0.044 ms | 0.045 ms | 0.97x |
| Harris Corner Detector | 1920x1080 U8 grad=5x5, win=5x5 | 5.51 ms | 0.84 ms | 6.54x |
| Convert Image Format | 1920x1080 NV12->RGBA8 | 0.11 ms | 0.19 ms | 0.57x |
| Perspective Warp | 1920x1080 RGBA8, linear interp. | 0.19 ms | 0.18 ms | 1.06x |
| Image Remap | 1920x1080 RGBA8 dense, linear interp. | 0.36 ms | 0.33 ms | 1.08x |
| Stereo Disparity Estimator | 480x270 U8, max disp=64, win=5x5 | 7.72 ms | 5.91 ms | 1.31x |
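As a rough way to see which rows show a meaningful slowdown rather than near parity, the sketch below recomputes the speed-up for every row of the VisionWorks table and lists those below a cut-off. The 0.9 threshold is a hypothetical choice made for this example, not a value used by NVIDIA.

```python
# Rows copied from the VisionWorks table: (algorithm, VisionWorks ms, VPI ms).
rows = [
    ("Gaussian Filter", 0.063, 0.065),
    ("Box Filter", 0.052, 0.064),
    ("Convolution", 0.62, 0.12),
    ("Gaussian Pyramid", 0.113, 0.077),
    ("Image Rescale", 0.044, 0.045),
    ("Harris Corner Detector", 5.51, 0.84),
    ("Convert Image Format", 0.11, 0.19),
    ("Perspective Warp", 0.19, 0.18),
    ("Image Remap", 0.36, 0.33),
    ("Stereo Disparity Estimator", 7.72, 5.91),
]

THRESHOLD = 0.9  # hypothetical cut-off: anything below counts as a slowdown

# Speed-up is VisionWorks time divided by VPI time; below 1 means VPI is slower.
slower = [name for name, vw_ms, vpi_ms in rows if vw_ms / vpi_ms < THRESHOLD]
print(slower)  # ['Box Filter', 'Convert Image Format']
```

With this cut-off, only Box Filter and Convert Image Format are flagged, while Gaussian Filter and Image Rescale sit within a few percent of parity.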