GaussianPyramid#

Overview#

The Gaussian Pyramid is a multi-scale representation technique that creates a hierarchical sequence of images by repeatedly applying Gaussian smoothing and downsampling. It is a fundamental building block in computer vision pipelines. The pyramid structure enables efficient processing at multiple resolutions while preserving important image features across different scales.

Figure: Input image and output images (Gaussian Pyramid parameters: levels = 3).

Reference Implementation#

The reference implementation of this Gaussian pyramid algorithm is based on VPI [1].

The function generates the Gaussian pyramid iteratively, from the base (level 0) to progressively coarser levels.

Level 0 Processing: Level 0 is identical to the input image. The input image contents are directly copied to level 0 of the image pyramid without any modification.
Coarser Level Generation: The coarser levels (level 1 and above) are generated through an iterative process that consists of two main steps:
1. Gaussian Convolution: Each level is generated by convolving the previous level using a clamp border extension with the following fixed 5x5 Gaussian kernel quantized in Q1.8 format:
  
  \[\begin{bmatrix} 1 & 4 & 6 & 4 & 1 \\ 4 & 16 & 24 & 16 & 4 \\ 6 & 24 & 36 & 24 & 6 \\ 4 & 16 & 24 & 16 & 4 \\ 1 & 4 & 6 & 4 & 1 \end{bmatrix} \tag{1}\]
2. Downsampling: The implementation only supports 2x downscaling, where the convolved result is downsampled by keeping all pixels with even coordinates (i.e., pixels at positions \((2i, 2j)\) where \(i\) and \(j\) are integers).

The algorithm repeats this convolution and downsampling process until all specified pyramid levels are generated. Each level has approximately one-fourth as many pixels as the previous level, creating a hierarchical multi-scale representation.
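
The steps above can be sketched in plain Python. This is a minimal illustration, not the VPI or PVA code. It evaluates the convolution only at even coordinates, which gives the same result as convolving the whole level and then discarding the other pixels. The function names, the round-to-nearest normalization by 256 (the sum of the coefficients in (1)), and the round-up of odd dimensions are assumptions made for this example.

```python
# Illustrative reference sketch; not the VPI or PVA implementation.
# The 5x5 kernel in (1) is the outer product of this row with itself.
KERNEL_1D = [1, 4, 6, 4, 1]


def clamp(value, low, high):
    """Clamp an index into [low, high] (clamp border extension)."""
    return max(low, min(value, high))


def downscale_once(image):
    """Apply the 5x5 kernel in (1) and keep the pixels at even coordinates."""
    height, width = len(image), len(image[0])
    # Assumption: odd dimensions round up when halved.
    out_height, out_width = (height + 1) // 2, (width + 1) // 2
    output = []
    for i in range(out_height):
        row = []
        for j in range(out_width):
            acc = 0
            for dy in range(-2, 3):
                y = clamp(2 * i + dy, 0, height - 1)
                for dx in range(-2, 3):
                    x = clamp(2 * j + dx, 0, width - 1)
                    acc += KERNEL_1D[dy + 2] * KERNEL_1D[dx + 2] * image[y][x]
            # The coefficients sum to 256.
            # Assumption: normalize with round-to-nearest.
            row.append((acc + 128) >> 8)
        output.append(row)
    return output


def gaussian_pyramid(image, levels):
    """Return `levels` images; level 0 is an unmodified copy of the input."""
    pyramid = [[row[:] for row in image]]
    for _ in range(levels - 1):
        pyramid.append(downscale_once(pyramid[-1]))
    return pyramid
```

Calling `gaussian_pyramid(image, 3)` on an 8x8 input returns images of size 8x8, 4x4, and 2x2.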

Implementation Details#

The PVA implementation fuses the computation of two consecutive Gaussian pyramid levels to reduce memory bandwidth usage. When levels=3, once the level 0 image is loaded into VMEM, the implementation goes on to compute and output both the level 1 and level 2 images. This avoids reloading level 1 into VMEM as input, which significantly reduces memory bandwidth consumption and improves overall performance.
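
To see the effect of fusion, the sketch below counts the DRAM bytes moved for a 3-level pyramid with and without reloading level 1. It is a simplified model, not a measurement. It ignores halo re-reads and any other transfer overhead, and the function name and the round-up of odd dimensions are assumptions made for this example.

```python
def dram_traffic_bytes(width, height, bytes_per_pixel=1, fused=True):
    """Estimate DRAM bytes moved for levels=3, ignoring halo overhead."""
    w1, h1 = (width + 1) // 2, (height + 1) // 2   # level 1 size (assumed round-up)
    w2, h2 = (w1 + 1) // 2, (h1 + 1) // 2          # level 2 size
    level0 = width * height * bytes_per_pixel
    level1 = w1 * h1 * bytes_per_pixel
    level2 = w2 * h2 * bytes_per_pixel
    # Both variants read level 0 once and write level 1 and level 2 once.
    traffic = level0 + level1 + level2
    if not fused:
        # Without fusion, level 1 is read back from DRAM to compute level 2.
        traffic += level1
    return traffic


fused = dram_traffic_bytes(1920, 1080, fused=True)
unfused = dram_traffic_bytes(1920, 1080, fused=False)
print(fused, unfused, (unfused - fused) / unfused)
```

Under this model, a 1920x1080 U8 image moves 2,721,600 bytes with fusion and 3,240,000 bytes without it. The difference is the one extra read of level 1.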

Limitations#

The current implementation has the following limitations:

Input/Output Image Formats:

Supported formats: NVCV_IMAGE_FORMAT_U8, NVCV_IMAGE_FORMAT_S8, NVCV_IMAGE_FORMAT_Y8, NVCV_IMAGE_FORMAT_Y8_ER, NVCV_IMAGE_FORMAT_U16, NVCV_IMAGE_FORMAT_S16, NVCV_IMAGE_FORMAT_Y16, NVCV_IMAGE_FORMAT_Y16_ER

Image Size Constraints:

Minimum input image size: 2x4 pixels
Image height modulo tile height cannot be 1.

Pyramid Parameters:

Supported pyramid levels range: 2-3 levels
Border mode: only NVCV_BORDER_CONSTANT with zero-padding (border value must be 0)

Gaussian Kernel:

Kernel size: only supports 5x5
Fixed 5x5 Gaussian kernel coefficients as defined in (1)
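
The limitations above can be collected into a small pre-flight check. The sketch below is hypothetical and is not part of the PVA API. The function name, its parameters, and the use of plain strings for format and border names are assumptions. The tile height is passed in by the caller because its value is not stated here, and the minimum size 2x4 is read as width x height, matching the ImageSize convention in the performance table.

```python
# Hypothetical helper; not part of the PVA API.
SUPPORTED_FORMATS = {
    "NVCV_IMAGE_FORMAT_U8", "NVCV_IMAGE_FORMAT_S8",
    "NVCV_IMAGE_FORMAT_Y8", "NVCV_IMAGE_FORMAT_Y8_ER",
    "NVCV_IMAGE_FORMAT_U16", "NVCV_IMAGE_FORMAT_S16",
    "NVCV_IMAGE_FORMAT_Y16", "NVCV_IMAGE_FORMAT_Y16_ER",
}


def check_gaussian_pyramid_params(image_format, width, height, levels,
                                  border_mode, border_value, tile_height):
    """Return a list of violated limitations (empty when all checks pass).

    Assumes tile_height is a positive integer supplied by the caller.
    """
    problems = []
    if image_format not in SUPPORTED_FORMATS:
        problems.append("unsupported image format")
    # Assumption: "2x4" means width 2, height 4.
    if width < 2 or height < 4:
        problems.append("image smaller than 2x4")
    if height % tile_height == 1:
        problems.append("image height modulo tile height is 1")
    if levels not in (2, 3):
        problems.append("levels must be 2 or 3")
    if border_mode != "NVCV_BORDER_CONSTANT" or border_value != 0:
        problems.append("border must be NVCV_BORDER_CONSTANT with value 0")
    return problems
```

An empty return value means that none of the listed limitations is violated for the given parameters.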

Dataflow Configuration#

5 RasterDataFlow (RDF) instances are needed:

Level 0 Processing:

Level 0 Input RDFs: Two RDFs transfer even and odd rows of the input image from DRAM to VMEM with circular buffering and halo padding.
Level 1 Output RDFs: Two RDFs transfer even and odd rows of the output level 1 image from VMEM to DRAM with double buffering.
Level 2 Output RDF: One RDF transfers the output level 2 image from VMEM to DRAM with double buffering.

Figure 1: Gaussian Pyramid dataflow architecture with two-level fusion strategy.

Buffer Allocation#

7 VMEM buffers are needed:

Input Buffers:

level0EvenBuffer: Circular buffer for even rows of level 0 input image
level0OddBuffer: Circular buffer for odd rows of level 0 input image

Output Buffers:

level1EvenBuffer: Double buffer for even rows of level 1 output image
level1OddBuffer: Double buffer for odd rows of level 1 output image
level2Buffer: Double buffer for level 2 output image

Kernel and Weight Buffers:

weightBuffer: Reordered Gaussian kernel weights
gaussianKernel: 5x5 Gaussian kernel coefficients

Kernel Implementation#

The implementation uses the same fixed 5x5 Gaussian kernel quantized in Q1.8 format as defined in (1). To prevent bank conflicts and efficiently utilize the vdotp4x2 instruction for computation, even and odd input data rows are placed in different VMEM banks.
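
Because only even-coordinate pixels are kept, the vertical taps of each output row fall on three even input rows and two odd input rows. The sketch below illustrates only this index mapping. It does not model the vdotp4x2 instruction or VMEM bank placement, it handles interior rows only, and the function names are assumptions made for this example.

```python
VERTICAL_WEIGHTS = [1, 4, 6, 4, 1]  # one column of the kernel in (1), up to scale


def split_rows(image):
    """Split an image into its even rows and its odd rows."""
    return image[0::2], image[1::2]


def vertical_pass_row(even_rows, odd_rows, i):
    """Vertical 1-D pass for output row i, centered on input row 2*i.

    Input rows 2i-2, 2i, 2i+2 are even rows i-1, i, i+1 (weights 1, 6, 1).
    Input rows 2i-1, 2i+1 are odd rows i-1, i (weights 4, 4).
    Valid for interior rows only; borders are not handled here.
    """
    width = len(even_rows[0])
    return [
        even_rows[i - 1][x] + 6 * even_rows[i][x] + even_rows[i + 1][x]
        + 4 * (odd_rows[i - 1][x] + odd_rows[i][x])
        for x in range(width)
    ]
```

With this split, each output row reads the even-row buffer and the odd-row buffer separately, which matches the separation of even and odd rows described above.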

To process 2-level or 3-level Gaussian pyramids without unnecessary data recomputation, different approaches are taken for levels=2 and levels=3.

For levels=2 (generating 1 coarse level image): Level 0 RDF reads are configured with 2x2 halo for Gaussian kernel computation, then level 1 results are written back to DRAM normally.
For levels=3 (generating 2 coarse level images): Level 0 RDF reads are configured with 6x6 halo. This ensures that when level 1 is generated, it will have 2x2 halo for level 2 computation. Finally, RDF is used to output both level 1 and level 2 images. The RDF’s tileArena API allows us to output level 1 images without including the 2x2 halo content.
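
The halo sizes above follow from a simple recurrence. A 5x5 kernel needs 2 pixels of halo on its input, and each pixel of halo required at a level corresponds to 2 pixels at the level below it because of the 2x downsampling. The sketch below works through this arithmetic. The function name and parameters are assumptions made for this example.

```python
def level0_halo(coarse_levels, kernel_radius=2, scale=2):
    """Halo (in pixels per side) needed on level 0 to compute
    `coarse_levels` fused coarser levels without re-reading data."""
    halo = 0  # the coarsest output level itself needs no halo
    for _ in range(coarse_levels):
        # Moving one level down: existing halo doubles, plus the kernel radius.
        halo = scale * halo + kernel_radius
    return halo


print(level0_halo(1))  # levels=2: one coarse level
print(level0_halo(2))  # levels=3: two coarse levels
```

For one coarse level the result is 2, and for two coarse levels it is 2 * 2 + 2 = 6, matching the 2x2 and 6x6 halo configurations described above.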

Performance#

Execution Time is the average time required to execute the operator on a single VPU core. Note that each PVA contains two VPU cores, which can operate in parallel to process two streams simultaneously, or reduce execution time by approximately half by splitting the workload between the two cores.

Total Power represents the average total power consumed by the module when the operator is executed concurrently on both VPU cores. Idle power is approximately 7W when the PVA is not processing data.

For detailed information on interpreting the performance table below and understanding the benchmarking setup, see Performance Benchmark.

ImageSize	ImageFormat	Levels	Execution Time	Submit Latency	Total Power
480x270	U8	2	0.032ms	0.038ms	12.504W
1920x1080	U8	2	0.165ms	0.194ms	16.4W
480x270	U8	3	0.038ms	0.038ms	12.604W
1920x1080	U8	3	0.214ms	0.205ms	16.611W
480x270	U16	3	0.048ms	0.050ms	13.568W
1920x1080	U16	3	0.322ms	0.427ms	15.637W

Reference#

[1] NVIDIA VPI Documentation: https://docs.nvidia.com/vpi/algo_gaussian_pyramid_generator.html