Performance Comparison

This section compares VPI's performance with other well-known computer vision libraries. Performance numbers were collected following the method described in Benchmarking Method.

Benchmarking was done on NVIDIA® Jetson AGX Xavier™ devices, with clock frequencies maxed out.

The numbers show that VPI provides a significant speed-up in many use cases, reaching 29x over OpenCV on the CPU for Separable Convolution.

The comparison was made with OpenCV 4.1.1, the same version shipped with NVIDIA® JetPack™, but built with NVIDIA® CUDA® support enabled.

All plots use a logarithmic scale because performance numbers differ widely from one algorithm to another.
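To illustrate why a linear axis would be hard to read, the sketch below takes three OpenCV 4.1.1 CPU timings from the CPU table in this section and computes how many orders of magnitude they span. The variable names are illustrative only and are not part of VPI or OpenCV.

```python
import math

# A few OpenCV 4.1.1 CPU timings in milliseconds, copied from the CPU table
# in this section. The dictionary itself is only an illustration.
timings_ms = {
    "Gaussian Filter": 0.32,
    "Bilateral Filter": 8.8,
    "Inverse FFT": 99.8,
}

fastest = min(timings_ms.values())
slowest = max(timings_ms.values())

# Number of powers of ten between the fastest and slowest measurement.
decades = math.log10(slowest / fastest)

print(f"slowest/fastest = {slowest / fastest:.0f}x, about {decades:.1f} decades")
```

On a linear axis scaled to fit the slowest entry, the fastest entries would be nearly invisible, which is the situation a logarithmic axis avoids.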

Both OpenCV and VPI measurements are done using one dispatch thread. Once dispatched, many OpenCV algorithms make use of multiple CPU cores during execution. Similarly, VPI takes advantage of all available CPU cores.

The Jetson AGX Xavier CPU has eight cores.

The performance drop on Convert Image Format, Perspective Warp and Remap will be addressed in a future VPI release.

Algorithm | Parameters | OpenCV 4.1.1 CPU | VPI 1.0 CPU | Speed-up |
---|---|---|---|---|
Gaussian Filter | 1920x1080 U8 3x3 | 0.32 ms | 0.27 ms | 1.2x |
Box Filter | 1920x1080 U8 3x3 | 1.53 ms | 0.38 ms | 4x |
Bilateral Filter | 1920x1080 U8 5x5 | 8.8 ms | 3.4 ms | 2.6x |
Convolution | 1920x1080 U8 7x7 | 27.766 ms | 1.45 ms | 19x |
Separable Convolution | 1920x1080 U8 11x11 | 37.7 ms | 1.286 ms | 29x |
Gaussian Pyramid | 1920x1080 U8 5 levels, dyadic | 2.18 ms | 0.7 ms | 3x |
Image Rescale | 1920x1080->1280x720 U8, linear interp. | 1.94 ms | 0.73 ms | 2.7x |
FFT | 1920x1080 Real->Complex | 42.2 ms | 7.1 ms | 6x |
Inverse FFT | 1920x1080 Complex->Real | 99.8 ms | 6.3 ms | 16x |
Harris Corner Detector | 1920x1080 U8 grad=5x5, win=5x5 | 58.4 ms | 8.1 ms | 7.2x |
Convert Image Format | 1920x1080 NV12->RGBA8 | 1.35 ms | 3.3 ms | 0.4x |
Perspective Warp | 1920x1080 RGBA8, linear interp. | 6.1 ms | 7.7 ms | 0.8x |
Image Remap | 1920x1080 RGBA8 dense, linear interp. | 5.0 ms | 11.36 ms | 0.4x |
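The Speed-up column is consistent with dividing the OpenCV time by the VPI time, so values above 1x mean VPI is faster and values below 1x mean it is slower. The formula is inferred from the table values rather than stated in the text, and the helper name `speed_up` is hypothetical. A minimal sketch using three rows from the table above:

```python
# (OpenCV 4.1.1 CPU, VPI 1.0 CPU) timings in milliseconds, copied from the
# table above. Only three rows are included to keep the example short.
cpu_timings_ms = {
    "Gaussian Filter": (0.32, 0.27),
    "Convolution": (27.766, 1.45),
    "Convert Image Format": (1.35, 3.3),
}


def speed_up(baseline_ms, vpi_ms):
    """Ratio of baseline time to VPI time; above 1 means VPI is faster."""
    return baseline_ms / vpi_ms


for name, (opencv_ms, vpi_ms) in cpu_timings_ms.items():
    print(f"{name}: {speed_up(opencv_ms, vpi_ms):.1f}x")
```

The published figures may have been computed from unrounded measurements, so recomputing them from the rounded table entries can differ in the last digit.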

Both OpenCV and VPI benchmarking use one stream for algorithm execution.

Algorithm | Parameters | OpenCV 4.1.1 CUDA | VPI 1.0 CUDA | Speed-up |
---|---|---|---|---|
Gaussian Filter | 1920x1080 U8 3x3 | 0.27 ms | 0.065 ms | 4.2x |
Box Filter | 1920x1080 U8 3x3 | 0.28 ms | 0.064 ms | 4.3x |
Bilateral Filter | 1920x1080 U8 5x5 | 1.37 ms | 0.22 ms | 6.4x |
Convolution | 1920x1080 U8 7x7 | 1.89 ms | 0.12 ms | 16x |
Separable Convolution | 1920x1080 U8 11x11 | 0.42 ms | 0.10 ms | 4.2x |
Gaussian Pyramid | 1920x1080 U8 5 levels, dyadic | 0.22 ms | 0.08 ms | 2.9x |
Image Rescale | 1920x1080->1280x720 U8, linear interp. | 0.08 ms | 0.05 ms | 1.8x |
FFT | 1920x1080 Real->Complex | 3.14 ms | 0.80 ms | 4.0x |
Inverse FFT | 1920x1080 Complex->Real | 3.40 ms | 0.82 ms | 4.1x |
Perspective Warp | 1920x1080 RGBA8, linear interp. | 0.44 ms | 0.18 ms | 2.4x |
Image Remap | 1920x1080 RGBA8 dense, linear interp. | 0.49 ms | 0.38 ms | 1.5x |

Note: In a previous version of this document, VPI/CUDA FFT performance numbers were shown to be worse than OpenCV/CUDA. The bug that caused this is now fixed, so the numbers accurately reflect the true performance difference. There has been no actual change in VPI/CUDA FFT performance from the prior version.

Here, VPI's PVA processing is compared against OpenCV algorithms implemented on the CPU.

The PVA hardware in Jetson AGX Xavier devices can process 4 independent streams in parallel, whereas some OpenCV CPU implementations are limited by the number of CPU cores on these devices, which is 8. For this reason, the comparison below uses 1, 4 and 8 parallel streams to better capture each one's characteristics.

The PVA saturates with 4 parallel streams, whereas the CPU saturates with 8 if the OpenCV implementation is single-threaded, or with just 1 if it is multi-threaded.

Algorithm | Parameters | OpenCV 4.1.1 CPU, 1 stream | OpenCV 4.1.1 CPU, 4 streams | OpenCV 4.1.1 CPU, 8 streams | VPI 1.0 PVA, 1 stream | VPI 1.0 PVA, 4 streams | VPI 1.0 PVA, 8 streams | Speed-up, 1 stream | Speed-up, 4 streams | Speed-up, 8 streams |
---|---|---|---|---|---|---|---|---|---|---|
Gaussian Filter | 1920x1080 U8 3x3 | 0.57 ms | 0.16 ms | 3.00 ms | 1.01 ms | 1.25 ms | 2.41 ms | 0.6x | 1.3x | 1.2x |
Box Filter | 1920x1080 U8 3x3 | 1.50 ms | 1.54 ms | 2.29 ms | 1.12 ms | 1.26 ms | 2.45 ms | 1.3x | 1.2x | 0.9x |
Convolution | 1920x1080 U8 7x7 | 27.75 ms | 27.90 ms | 40.00 ms | 1.86 ms | 2.04 ms | 3.99 ms | 14.9x | 13.7x | 10x |
Separable Convolution | 1920x1080 S16 11x11 | 40.20 ms | 40.40 ms | 53.00 ms | 4.81 ms | 5.01 ms | 9.89 ms | 8.4x | 8.1x | 5.4x |
Gaussian Pyramid | 1920x1080 U16 5 levels, dyadic | 2.02 ms | 2.07 ms | 3.10 ms | 1.17 ms | 1.38 ms | 2.55 ms | 1.7x | 1.5x | 1.2x |
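The saturation behaviour described above can be checked against the PVA timings in the table. The sketch below computes how much the reported time grows when going from 1 to 4 streams and from 4 to 8 streams. It assumes the reported value is the elapsed time per algorithm invocation under that load, which this section does not state explicitly; the helper name `growth` is hypothetical.

```python
# VPI 1.0 PVA timings in milliseconds for (1, 4, 8) parallel streams,
# copied from the table above.
pva_ms = {
    "Gaussian Filter": (1.01, 1.25, 2.41),
    "Box Filter": (1.12, 1.26, 2.45),
    "Convolution": (1.86, 2.04, 3.99),
    "Separable Convolution": (4.81, 5.01, 9.89),
    "Gaussian Pyramid": (1.17, 1.38, 2.55),
}


def growth(times):
    """Return the time growth factors for 1->4 streams and 4->8 streams."""
    one, four, eight = times
    return four / one, eight / four


for name, times in pva_ms.items():
    g_1_to_4, g_4_to_8 = growth(times)
    print(f"{name}: 1->4 streams x{g_1_to_4:.2f}, 4->8 streams x{g_4_to_8:.2f}")
```

For every row, the time grows only slightly from 1 to 4 streams but roughly doubles from 4 to 8 streams, which is what one would expect if the hardware runs 4 streams concurrently and queues the rest.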

The comparison was made against NVIDIA® VisionWorks™ version 1.6. Both VisionWorks and VPI benchmarking use one stream for algorithm execution.

The VPI performance drop on Convert Image Format and Box Filter will be addressed in a future VPI release.

Algorithm | Parameters | VisionWorks 1.6 | VPI 1.0 CUDA | Speed-up |
---|---|---|---|---|
Gaussian Filter | 1920x1080 U8 3x3 | 0.063 ms | 0.065 ms | 0.96x |
Box Filter | 1920x1080 U8 3x3 | 0.052 ms | 0.064 ms | 0.8x |
Convolution | 1920x1080 U8 7x7 | 0.62 ms | 0.12 ms | 5.2x |
Gaussian Pyramid | 1920x1080 U8 5 levels, dyadic | 0.113 ms | 0.077 ms | 1.46x |
Image Rescale | 1920x1080->1280x720 U8, linear interp. | 0.044 ms | 0.045 ms | 0.97x |
Harris Corner Detector | 1920x1080 U8 grad=5x5, win=5x5 | 5.51 ms | 0.84 ms | 6.54x |
Convert Image Format | 1920x1080 NV12->RGBA8 | 0.11 ms | 0.19 ms | 0.57x |
Perspective Warp | 1920x1080 RGBA8, linear interp. | 0.19 ms | 0.18 ms | 1.06x |
Image Remap | 1920x1080 RGBA8 dense, linear interp. | 0.36 ms | 0.33 ms | 1.08x |
Stereo Disparity Estimator | 480x270 U8, max disp=64, win=5x5 | 7.72 ms | 5.91 ms | 1.31x |