HOG#
Overview#
The Histogram of Oriented Gradients (HOG) is a feature descriptor commonly used for object detection. It counts occurrences of gradient orientations within localized regions of an image. While it is related to other edge-orientation histogram descriptors, HOG is computed on a dense grid of uniformly spaced cells and uses overlapping local contrast normalization to improve robustness [1].
Reference Implementation#
This is a dedicated version of HOG intended to serve the DCF Tracker [2]. Each image is processed through the following stages: gradient magnitude and orientation bin computation, histogram voting, histogram consolidation, block normalization, and coefficient generation. The complete dataflow, with intermediate data and the corresponding PVA implementation, is illustrated in Figure 1.
Implementation Details#
Limitations#
Only supports RGB8p input image format and CHW output tensor format.
Only supports uint8 input and output data types.
Parameters and sizes are fixed (compile-time macros):
Orientations per cell: 18 (ORIENTATIONS_NUM)
Cell size: 4×4 pixels (CELL_SIZE)
Cells per image: 24×24 (FEATURE_WIDTH, FEATURE_HEIGHT)
Concatenated image grid: 6×6 (36 images) (B_WIDTH, B_HEIGHT)
Input size: 576×576
Output size: 18×144×144 (ORIENTATIONS_NUM, FULL_OUT_HEIGHT, FULL_OUT_WIDTH)
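The fixed sizes above are mutually consistent, which a few lines of arithmetic can confirm. The sketch below uses Python purely for illustration. The macro names are taken from the list above; the derived names (IMAGE_WIDTH, INPUT_WIDTH, and so on) and the way FULL_OUT_WIDTH is derived are assumptions of this sketch, not identifiers or formulas confirmed by the implementation.

```python
# Constants mirroring the compile-time macros listed above.
ORIENTATIONS_NUM = 18   # orientation bins per cell
CELL_SIZE = 4           # cell is CELL_SIZE x CELL_SIZE pixels
FEATURE_WIDTH = 24      # cells per image, horizontally
FEATURE_HEIGHT = 24     # cells per image, vertically
B_WIDTH = 6             # images per row in the concatenated grid
B_HEIGHT = 6            # images per column in the concatenated grid

# Derived sizes (illustrative names, assumed relationships).
IMAGE_WIDTH = FEATURE_WIDTH * CELL_SIZE      # pixels per image, horizontally
IMAGE_HEIGHT = FEATURE_HEIGHT * CELL_SIZE    # pixels per image, vertically
INPUT_WIDTH = B_WIDTH * IMAGE_WIDTH          # full input width in pixels
INPUT_HEIGHT = B_HEIGHT * IMAGE_HEIGHT       # full input height in pixels
FULL_OUT_WIDTH = B_WIDTH * FEATURE_WIDTH     # output width in cells
FULL_OUT_HEIGHT = B_HEIGHT * FEATURE_HEIGHT  # output height in cells
NUM_IMAGES = B_WIDTH * B_HEIGHT

output_shape = (ORIENTATIONS_NUM, FULL_OUT_HEIGHT, FULL_OUT_WIDTH)  # CHW
```

With these values each image is 96×96 pixels, the concatenated input is 576×576, and the CHW output is 18×144×144, matching the sizes listed above.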
Dataflow Configuration#
4 SequenceDataFlow (SQDF) instances are needed:
3 input image SQDFs are used to transfer input planes (R/G/B) from DRAM into VMEM.
1 output tensor SQDF is used to transfer the HOG features from VMEM back to DRAM.
Buffer Allocation#
9 VMEM buffers are needed:
input_r_v: double-buffered RGB input plane (R).
input_g_v: double-buffered RGB input plane (G).
input_b_v: double-buffered RGB input plane (B).
subHistOutput: buffer overlay for intermediate results (magnitude/bin and consolidated histogram).
norm: fp32 normalization buffer.
hist0: partial 8-way histogram buffer (one half, mapped to one superbank).
hist1: partial 8-way histogram buffer (the other half, mapped to another superbank).
coeff: normalization coefficients used to generate final HOG features.
output_v: double-buffered HOG feature output.
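The double-buffered entries above follow the usual ping-pong pattern: while one half of a buffer is being processed, the other half receives the next tile. The following is a minimal Python sketch of that pattern only. The function name process_tiles and the tile representation are hypothetical, and the copy that stands in for a DMA transfer runs sequentially here, whereas on hardware it would overlap with compute.

```python
def process_tiles(tiles, compute):
    """Ping-pong between two buffers: fill one while the other is consumed."""
    results = []
    if not tiles:
        return results
    buffers = [None, None]
    buffers[0] = list(tiles[0])  # prime the first buffer before the loop
    for i in range(len(tiles)):
        cur, nxt = i % 2, (i + 1) % 2
        if i + 1 < len(tiles):
            # Stand-in for an asynchronous fetch of the next tile.
            buffers[nxt] = list(tiles[i + 1])
        # Consume the buffer that was filled on the previous iteration.
        results.append(compute(buffers[cur]))
    return results
```

For example, `process_tiles([[1, 2], [3, 4], [5]], sum)` processes three tiles while only ever holding two of them in the buffers.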
Kernel implementation#
Figure 1: HOG dataflow.
The kernel processes 36 images concatenated in a 6×6 grid. For each image, the following stages are executed:
Boundary pixel extension (produceBPEExec): Perform padding in place for each RGB plane.
Gradient magnitude and binning (calcMagBinExec): Compute horizontal and vertical differences for R/G/B, select the channel with the maximum gradient magnitude per pixel, and compute the orientation bin for that channel.
Histogram voting (calcHistExec): Accumulate per-pixel magnitudes into 18-bin cell histograms with bilinear interpolation. The implementation uses 8-way parallel histogram updates and splits results into hist0 and hist1 to fit VMEM and utilize superbanks.
8-way consolidation (consolidateHistExec): Reduce the 8-way histograms into packed CHW uint32 histograms in subHistOutput.hist. The consolidated planes are padded to multiples of 64 bytes (HIST_SIZE_P) to simplify vector loads/stores and avoid bank issues.
Block normalization (normExec): Compute L2 norm terms for 2×2 cell blocks. Histogram values are stored in fixed-point and converted to fp32 for norm computation.
Coefficient generation (genCoeffExec): Generate four normalization coefficients from the neighborhood of norms.
Feature output (genHogFeaturesExec): Dequantize histograms, apply normalization and scaling.
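The core arithmetic of the stages above can be sketched in scalar Python. This is an illustration under stated assumptions, not the PVA kernel: it uses central differences with clamped borders as a stand-in for boundary pixel extension, assumes the 18 bins cover signed orientations over the full circle, assigns each pixel to a single orientation bin while interpolating bilinearly between neighboring cells, and works in floating point throughout. The 8-way parallel update, fixed-point storage, coefficient generation, and output scaling are omitted, and all function names are hypothetical.

```python
import math

ORIENTATIONS_NUM = 18
CELL_SIZE = 4

def mag_and_bin(planes, x, y):
    """Return (magnitude, bin) from the channel with the strongest gradient."""
    h, w = len(planes[0]), len(planes[0][0])
    # Clamp neighbour coordinates; a stand-in for boundary pixel extension.
    xl, xr = max(x - 1, 0), min(x + 1, w - 1)
    yu, yd = max(y - 1, 0), min(y + 1, h - 1)
    best = (-1, 0, 0)  # (squared magnitude, dx, dy)
    for p in planes:
        dx = p[y][xr] - p[y][xl]
        dy = p[yd][x] - p[yu][x]
        m2 = dx * dx + dy * dy
        if m2 > best[0]:
            best = (m2, dx, dy)
    m2, dx, dy = best
    angle = math.atan2(dy, dx) % (2.0 * math.pi)  # assumed signed orientation
    b = int(angle / (2.0 * math.pi) * ORIENTATIONS_NUM) % ORIENTATIONS_NUM
    return math.sqrt(m2), b

def cell_histograms(planes):
    """Vote magnitudes into per-cell histograms, bilinear across cells."""
    h, w = len(planes[0]), len(planes[0][0])
    ch, cw = h // CELL_SIZE, w // CELL_SIZE
    hist = [[[0.0] * ORIENTATIONS_NUM for _ in range(cw)] for _ in range(ch)]
    for y in range(h):
        for x in range(w):
            mag, b = mag_and_bin(planes, x, y)
            # Pixel centre expressed in cell-centre coordinates.
            fx = (x + 0.5) / CELL_SIZE - 0.5
            fy = (y + 0.5) / CELL_SIZE - 0.5
            x0, y0 = math.floor(fx), math.floor(fy)
            wx1, wy1 = fx - x0, fy - y0
            for cy, wy in ((y0, 1.0 - wy1), (y0 + 1, wy1)):
                for cx, wx in ((x0, 1.0 - wx1), (x0 + 1, wx1)):
                    if 0 <= cy < ch and 0 <= cx < cw:  # drop votes outside the grid
                        hist[cy][cx][b] += mag * wy * wx
    return hist

def block_norms(hist):
    """L2 norm of each 2x2 block of cells."""
    ch, cw = len(hist), len(hist[0])
    energy = [[sum(v * v for v in hist[cy][cx]) for cx in range(cw)]
              for cy in range(ch)]
    return [[math.sqrt(energy[cy][cx] + energy[cy][cx + 1]
                       + energy[cy + 1][cx] + energy[cy + 1][cx + 1])
             for cx in range(cw - 1)]
            for cy in range(ch - 1)]
```

Each function takes the three planes as a list of row-major 2D lists. An image of 8×8 pixels yields a 2×2 grid of 18-bin histograms and a single 2×2 block norm.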
Performance#
Execution Time is the average time required to execute the operator on a single VPU core.
Note that each PVA contains two VPU cores, which can operate in parallel to process two streams simultaneously, or reduce execution time by approximately half by splitting the workload between the two cores.
Total Power represents the average total power consumed by the module when the operator is executed concurrently on both VPU cores.
Idle power is approximately 7W when the PVA is not processing data.
For detailed information on interpreting the performance table below and understanding the benchmarking setup, see Performance Benchmark.
| ImageSize | ImageFormat | Execution Time | Submit Latency | Total Power |
|---|---|---|---|---|
| 576x576 | RGB8p | 1.703ms | 0.026ms | 10.427W |
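A few derived figures help put the table in context. The Python sketch below uses only the numbers reported above; the variable names are illustrative, submit latency is ignored, and the idle power is the approximate 7 W quoted earlier, so the results are rough estimates rather than measurements.

```python
EXEC_TIME_MS = 1.703    # execution time on a single VPU core, from the table
NUM_IMAGES = 36         # images in the 6x6 concatenated grid
TOTAL_POWER_W = 10.427  # total power with both VPU cores busy, from the table
IDLE_POWER_W = 7.0      # approximate idle power

# Average time spent per image within one 576x576 input (about 47.3 us).
time_per_image_us = EXEC_TIME_MS * 1000.0 / NUM_IMAGES

# Full 576x576 inputs processed per second on one core (about 587).
inputs_per_second_one_core = 1000.0 / EXEC_TIME_MS

# Power attributable to the workload above idle (about 3.4 W).
active_power_w = TOTAL_POWER_W - IDLE_POWER_W
```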
Reference#
NVIDIA VPI Documentation: https://docs.nvidia.com/vpi/algo_dcf_tracker.html