CornerSubPix#

Overview#

The CornerSubPix operator follows the OpenCV cornerSubPix function, which refines corner locations to sub-pixel accuracy. It is commonly used to sharpen corner detection results beyond pixel-level precision, especially in camera calibration and stereo vision applications, where high-accuracy corner positions are essential.

Algorithm#

Please refer to OpenCV cv::cornerSubPix for algorithm details.

Implementation Details#

The CornerSubPix operator converts several operations to fixed-point math. Specifically, the internal image is in UQ8.8, the image gradients are in Q9.8, the gradient multiplication results are in Q17.16, and the mask is in Q1.15. The equation \(q = G^{-1} \cdot b\) is solved in float32 math, and the resulting motion is then converted back to fixed-point Q17.15.

The input and output corners are floating point; they are converted to fixed-point Q17.15 at the beginning and back to floating point at the end.
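As a concrete illustration of these conversions, here is a minimal scalar Python sketch of Q-format quantization and the float32 2x2 solve. The helper names (`to_fixed`, `from_fixed`, `solve_2x2`) are illustrative, not part of the operator's API; the real kernel operates on vectors of 32 corners at once.

```python
def to_fixed(x: float, frac_bits: int) -> int:
    """Quantize a float to signed fixed point with `frac_bits` fractional
    bits (e.g. frac_bits=15 for Q17.15)."""
    return int(round(x * (1 << frac_bits)))

def from_fixed(v: int, frac_bits: int) -> float:
    """Convert a fixed-point integer back to float."""
    return v / (1 << frac_bits)

# Input corner coordinates are quantized to Q17.15 at the start;
# the round-trip error is bounded by half an LSB (2^-16).
corner_x = 123.456
qx = to_fixed(corner_x, 15)

# The 2x2 system q = G^{-1} * b is solved in float32. A scalar sketch
# using Cramer's rule on the symmetric gradient matrix G:
def solve_2x2(gxx, gxy, gyy, bx, by):
    det = gxx * gyy - gxy * gxy
    # A singular G (e.g. a flat window) would need special handling.
    return ((gyy * bx - gxy * by) / det,
            (gxx * by - gxy * bx) / det)

dx, dy = solve_2x2(4.0, 1.0, 3.0, 2.0, 1.0)
# The resulting motion is quantized back to Q17.15 before the update.
step = (to_fixed(dx, 15), to_fixed(dy, 15))
```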

We process a batch of 32 corners simultaneously. Early termination is possible, but a batch must wait until all of its corners have converged. To hide DMA latency, DMA prefetches the next two batches of windows while the VPU is processing the current one. The prefetch for the second batch ahead starts in the 6th iteration, since the first 5 iterations are enough to hide the DMA latency of fetching one batch. This schedule was derived by profiling the VPU kernel runtimes and assumes a DMA throughput of 10 B/cycle.
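The batched iteration and early-termination policy can be sketched as follows. `refine_step` stands in for the real VPU refinement kernel, and the scalar loop is only illustrative; `min_iters=6` reflects the minimum iteration count noted in the limitations below.

```python
BATCH_SIZE = 32  # corners processed together per batch

def refine_batch(corners, refine_step, eps, max_iters, min_iters=6):
    """Iterate one batch of corners together. The batch can only terminate
    early once *every* corner in it has moved by less than eps."""
    for it in range(max_iters):
        converged = True
        for i, c in enumerate(corners):
            new_c = refine_step(c)
            if abs(new_c[0] - c[0]) >= eps or abs(new_c[1] - c[1]) >= eps:
                converged = False
            corners[i] = new_c
        if converged and it + 1 >= min_iters:
            break
    return corners

# Hypothetical step that moves each corner halfway toward (10, 10),
# so every corner converges geometrically.
def toward(c):
    return (c[0] + 0.5 * (10.0 - c[0]), c[1] + 0.5 * (10.0 - c[1]))

out = refine_batch([(0.0, 0.0), (20.0, 20.0)], toward,
                   eps=1e-3, max_iters=50)
```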

Limitations#

There are some limitations on use cases due to the actual implementation:

  1. Only two values of winSize are supported in the interface: 3 or 4. Consistent with OpenCV, winSize is defined as the radius of the search window, and the implementation assumes a square window. This means only 7x7 or 9x9 search windows are supported.

  2. The zero zone must be smaller than the search window. The interface parameter zeroZone is defined as half the size of a “dead region” in the middle of the search zone; it is sometimes used to avoid possible singularities of the autocorrelation matrix, or to exclude noisy or unreliable pixel values immediately surrounding the corner point. The zero zone width is not required to equal the zero zone height.

  3. At least 6 iterations are always executed.

  4. The capacity of feature buffers is set to host a maximum of 4096 features.
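These constraints can be summarized in a small validation sketch. The helper and constant names are hypothetical, and the convention that a zeroZone of (-1, -1) disables the dead region is borrowed from OpenCV.

```python
MAX_FEATURES = 4096  # feature buffer capacity

def validate(win_size, zero_zone, num_corners):
    """win_size and zero_zone follow the OpenCV convention: both are
    half-sizes, so the full search window is (2*win_size + 1) per side."""
    if win_size not in (3, 4):  # only 7x7 or 9x9 windows are supported
        raise ValueError("win_size must be 3 or 4")
    zw, zh = zero_zone
    # (-1, -1) means no dead region, as in OpenCV.
    if (zw, zh) != (-1, -1) and (zw >= win_size or zh >= win_size):
        raise ValueError("zero zone must be smaller than the search window")
    if num_corners > MAX_FEATURES:
        raise ValueError("at most 4096 features are supported")
```

Note that the zero zone width and height are validated independently, matching the limitation that they need not be equal.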

Parameters#

These parameters are copied to VMEM when the program command is submitted.

  1. img_w: Image width.

  2. img_h: Image height.

  3. win_size: Window size. Unlike the interface parameter winSize, which is the radius of the search window, win_size is the full side length of the search window (2 * winSize + 1). Only square windows are supported, so win_size must be 7 or 9.

  4. num_corners: Number of corners per frame.

  5. max_iters: Maximum number of iterations. After this limit, the process of corner position refinement stops.

  6. eps: Stop criteria. When the corner position moves by less than eps, the process of corner position refinement stops.

  7. mask: Searching window mask. The size is the same as the searching window. It is generated during the operator creation stage. If a zero zone is defined, the mask incorporates the zero-zone region by zeroing out the mask elements inside the zero zone.
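For illustration, here is a sketch of how such a mask might be generated. The Gaussian weighting follows OpenCV's reference implementation of cornerSubPix; whether this operator uses identical weights is an assumption, and the quantization of the weights to Q1.15 is omitted here.

```python
import math

def make_mask(win_size, zero_zone):
    """Build a Gaussian weight mask over a (2*win_size+1)^2 search window,
    zeroing out the dead region given by the zero_zone half-sizes.
    Weight shape follows OpenCV's cornerSubPix (an assumption for this
    operator); the real mask would additionally be quantized to Q1.15."""
    n = 2 * win_size + 1
    zw, zh = zero_zone
    mask = [[0.0] * n for _ in range(n)]
    for i in range(n):
        y = (i - win_size) / win_size          # normalized to [-1, 1]
        for j in range(n):
            x = (j - win_size) / win_size
            mask[i][j] = math.exp(-(x * x + y * y))
    if (zw, zh) != (-1, -1):                   # (-1, -1) disables the zone
        for i in range(win_size - zh, win_size + zh + 1):
            for j in range(win_size - zw, win_size + zw + 1):
                mask[i][j] = 0.0
    return mask
```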

Dataflows#

SequenceDataFlow is used for transferring the input/output arrays of corners. The tile size is twice the runtime number of corners, since each corner is stored as a pair of x and y coordinates. GatherScatterDataFlow is used for DMA reads of the image windows; its numTilesPerTrigger equals the batch size of 32.

Cupva Sampler#

CupvaSampler is used for 2D lookup with bilinear interpolation at sub-pixel window locations. It performs the lookup for a whole batch of search windows in one run. The input datatype is U16: the U8 image is pre-shifted left by 8 bits. For conflict-free lookup, the line pitch in each subtable must be 4k+2 (for any integer k). SAMPLER_TRAVERSAL_TRANSPOSED is set, meaning tile pixels are looked up in a transposed pattern (y is the inner dimension). This transposed pattern lets each of the 2x2 lookups access its own memory bank, avoiding conflicts. The sampler output is stored into VMEM in T1 transposition mode, where the pixels of a batch of 32 windows are stored in an interleaved layout. This interleaved storage enables efficient vector processing of 32 windows at once.

Out-of-range lookups are guaranteed not to happen because the refined corners are clamped to the allowed boundaries during the iterative search. The output is U16 and is in transposed mode for later vectorized operations; each lane in a vector processes one window of the batch.
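The clamping can be sketched as below. The exact margins used here (window radius plus one pixel for the bilinear interpolation neighborhood) are an assumption for illustration, not the operator's documented bounds.

```python
def clamp_corner(x, y, win_size, img_w, img_h):
    """Keep a refined corner far enough from the image border that the
    whole (2*win_size+1) search window, plus a one-pixel bilinear
    interpolation margin, stays inside the image. The margins are an
    illustrative assumption."""
    lo = win_size + 1              # window radius + interpolation margin
    hi_x = img_w - win_size - 2
    hi_y = img_h - win_size - 2
    return (min(max(x, lo), hi_x), min(max(y, lo), hi_y))
```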

Reference Check#

PVA and CPU may exhibit minor floating-point differences after a single iteration, and these differences can accumulate over multiple iterations and become significant. The per-corner difference tolerance between PVA and CPU subpixel corners is set to 1e-3. Outliers usually occur with a large zero zone. If the outlier ratio is less than 1%, the reference check is considered to pass.
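The pass criterion described above can be expressed as a short sketch (the function name and signature are illustrative):

```python
def reference_check(pva, cpu, tol=1e-3, max_outlier_ratio=0.01):
    """Compare PVA and CPU subpixel corners element-wise: a corner is an
    outlier if either coordinate differs by more than tol. The check
    passes if at most max_outlier_ratio of corners are outliers."""
    outliers = sum(
        1 for (px, py), (cx, cy) in zip(pva, cpu)
        if abs(px - cx) > tol or abs(py - cy) > tol
    )
    return outliers / len(pva) <= max_outlier_ratio
```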

Performance#

Execution Time is the average time required to execute the operator on a single VPU core. Note that each PVA contains two VPU cores, which can operate in parallel to process two streams simultaneously, or to roughly halve execution time by splitting the workload between them.

Total Power represents the average total power consumed by the module when the operator is executed concurrently on both VPU cores.

For detailed information on interpreting the performance table below and understanding the benchmarking setup, see Performance Benchmark.

| ImageSize | DataType | NumCorners | WinSize | MaxIters | Execution Time | Submit Latency | Total Power |
|-----------|----------|------------|---------|----------|----------------|----------------|-------------|
| 1024x768  | U8       | 448        | 3x3     | 30       | 0.382ms        | 0.024ms        | 12.204W     |
| 1226x370  | U8       | 1071       | 3x3     | 30       | 0.837ms        | 0.023ms        | 12.586W     |
| 768x512   | U8       | 1480       | 3x3     | 30       | 1.157ms        | 0.023ms        | 12.586W     |