FASTCornerDetector#

Overview#

The Features from Accelerated Segment Test (FAST) algorithm is a widely used method for detecting corners. Its main advantage is low computational complexity. The algorithm’s details can be found in [1], [2], and [3].

One of the main disadvantages of the FAST corner detector is multiple detections of corners in proximity; therefore, Non-Maximum Suppression (NMS) is often used to refine the output. Since the original algorithm only classifies pixels into corners and non-corners and does not compute a corner response function, NMS cannot be directly applied to the resulting corners. The current implementation follows OpenCV’s choice, which calculates the maximum threshold that makes the target pixel a FAST corner and uses this threshold (called the score) as the input for NMS. More details can be found in [4]. Other methods to calculate scores, such as computing the sum of absolute differences between surrounding pixels and center pixels, are also available in the literature.

Design Requirements#

Data type supports uint16_t.
Supports a circle radius of 3 and an arc length of 9.
NMS can be ON or OFF, but the code is optimized for the ON case.
The output location is X-Y interleaved float. Note that the order of locations is implementation-specific; additional sorting may be needed to compare locations from different implementations.
The image border extension type can be constant or replicate.

Implementation Details#

Quick Rejection#

To further reduce computational complexity, a quick rejection method is adopted. This method conducts an initial test to eliminate pixels that are not corners, allowing only non-rejected pixels to undergo further testing and score calculation if they are corners. The quick rejection method described in “High-speed test” of [2] refers to a configuration with a circle radius of 3 and an arc length of 12.

The proposed scheme tests pixel P1, P5, P9, and P13 (shown in the following figure) and rejects the central pixel if at least three of their intensities are within the range \([I_{p}-t, I_{p}+t]\), where \(I_{p}\) is the central pixel’s intensity and \(t\) is the threshold. This scheme has a lower rejection ratio compared to that in “High-speed test” of [2]. For an arc length of 9, it is possible for a candidate pixel to be a corner even if only 2 out of 4 testing pixels are both brighter than \(I_{p}+t\) or darker than \(I_{p}-t\). As the current implementation uses SIMD instructions to process 1x32 pixels as a chunk, it is also necessary to test a contiguous block of 1x32 pixels and reject them as a whole if all satisfy the rejection criteria. In other words, if there is even one pixel that does not meet the rejection criteria, this 1x32 pixels chunk will not be rejected.

../../_images/fastcorner-quick-rejection.png — Figure 1: Quick Rejection Illustration#

The rationale behind quick rejection is that corners are quite sparse in practical scenarios, e.g., less than 1% of the image’s total pixels. The time consumption after adopting quick rejection becomes quick_rejection_time + non_rejection_ratio * score_calculation_time, leading to two implications:

If quick_rejection_time is smaller than score_calculation_time, overall time consumption will be reduced using quick rejection when non_rejection_ratio is small.
In rare cases where non_rejection_ratio is relatively large (e.g., 10%), overall time consumption may increase compared to not using quick rejection.

Dataflow Configuration#

Use RDF dataflow to transfer tiles from DRAM to VMEM, similar to a 7x7 2D filtering process. For each tile, calculate scores and transfer them back from VMEM to DRAM as a height * width image in tight layout, with values representing scores. This step prepares for overlapping tiles required by NMS.
Use RDF dataflow again to transfer the score-valued image from DRAM to VMEM, akin to a 3x3 2D filtering operation. For each tile, perform NMS; for each non-zero NMS output, calculate its location within the entire image and output it in X-Y interleaved format. SQDF dataflow transfers these locations from VMEM to DRAM. The number of locations varies among tiles, necessitating a similar mechanism to maintain related states on both device and host sides. More information can be found in the documentation regarding MinMaxLoc operator’s output.

Note that the current implementation is always split into two stages, regardless of whether NMS is ON or OFF; it distinguishes processing during NMS while letting output equal input when NMS is OFF. Future releases will consider simplifying score calculation to classification and outputting corners at the end of step 1.

Performance#

ImageSize	DataType	TestData	Execution Time	Submit Latency	Total Power
1024x1024	U16	Random	2.911ms	0.018ms	11.612W
1920x1080	U16	Random	5.740ms	0.020ms	11.612W
1920x1080	U16	kodim08_grayscale_1920x1080_u16	3.632ms	0.019ms	11.045W

For detailed information on interpreting the performance table above and understanding the benchmarking setup, see Performance Benchmark.

Reference#

https://docs.opencv.org/4.x/df/d0c/tutorial_py_fast.html

https://en.wikipedia.org/wiki/Features_from_accelerated_segment_test

https://docs.nvidia.com/vpi/algo_fast_corners_detector.html#algo_fast_corners_detector_overview

https://stackoverflow.com/questions/67306891/algorithm-behind-score-calculation-in-fast-corner-detector