Direction of Arrival (DoA) 2D Angle FFT#

Overview#

2D Angle FFT is a radar signal processing operator that estimates the azimuth and elevation angles of detected targets from antenna array snapshots using 2D FFT-based beamforming. Unlike the DA-FFT operator, which uses dual-aperture interferometry (upper/lower apertures and phase difference for elevation), the 2D Angle FFT operator treats the virtual antenna array as a 2D spatial grid (azimuth × elevation) and performs a 2D spatial FFT on each detection snapshot. The peak in the 2D power spectrum corresponds to the target angle; sub-bin accuracy is achieved via quadratic interpolation.

The algorithm is based on the following principles:

2D Spatial Sampling: The snapshot is a 2D complex array (e.g., 16 × 4 virtual elements: azimuth × elevation). Each dimension is sampled at regular spacing, forming a spatial signal that can be analyzed with a 2D FFT.
Spatial Frequency and Angle: FFT bin indices in azimuth and elevation map to spatial frequencies, which correspond to physical angles via \(\theta = \arcsin((n/N) \cdot 2 - 1)\) (normalized bin index to angle).
Peak Detection and Interpolation: The maximum in the 2D power spectrum gives the coarse angle bin; quadratic interpolation on a 3×3 neighborhood yields sub-bin peak location, which is then converted to angle using pre-computed lookup tables.

Algorithm Description#

1. Fundamental Principle#

2D Spatial Array and FFT

The input snapshot is a 2D complex array of virtual antenna responses: azimuth (horizontal) and elevation (vertical). The array layout assumes that the distance between adjacent antennas in both azimuth and elevation dimensions is half the wavelength (λ/2). This spacing is critical for unambiguous angle estimation and directly links the spatial sampling to the FFT angle mapping.

Key relationship (per dimension): θ → spatial frequency k → FFT bin index n

Angle from FFT bin (per dimension)

For a padded FFT size \(N\) and bin index \(i\) (0 to N-1), the normalized index used for angle is:

(1)#\[\text{i_normalized} = 1 - 2 \cdot \frac{i}{N}\]

Then:

(2)#\[\theta = \arcsin(\text{i_normalized}) \quad \text{(radians)} \quad \Rightarrow \quad \theta_{\text{deg}} = \theta \cdot 180/\pi\]

So each FFT bin index maps to a unique angle in the field of view; the mapping is pre-computed in the angle bins lookup table (Q16 fixed-point).

2. Algorithm Stages#

2D FFT and power spectrum

Windowing: A 2D Hanning window is applied to the snapshot to reduce sidelobes (element-wise multiply, fixed-point).
Azimuth FFT: 1D FFT along the azimuth (horizontal) dimension (Radix-4 DIF, fixed-point).
Transpose: Data is transposed so that the elevation dimension is contiguous for the next FFT.
Elevation FFT: 1D FFT along the elevation (vertical) dimension.
Power spectrum: Magnitude squared is computed at each 2D bin: \(P = |\text{Re}|^2 + |\text{Im}|^2\).

Peak location and quadratic interpolation

The bin with maximum power gives the coarse azimuth and elevation indices \((i_{\text{az}}, i_{\text{el}})\). For sub-bin accuracy, a 3×3 neighborhood around the peak is used for quadratic interpolation in each dimension:

(3)#\[p = -\frac{b}{2a} = \frac{Y_{-1} - Y_{+1}}{2(2Y_0 - Y_{-1} - Y_{+1})}\]

where \(Y_{-1}, Y_0, Y_{+1}\) are the power values at the previous, center, and next bins along the dimension. The interpolated index is \(i + p\) (in fractional form, stored as Q16). This is done separately for azimuth and elevation.

Angle and power output

Angle bins LUT: Pre-computed arrays map FFT bin index (or interpolated fractional index) to angle in degrees (Q16). The VPU uses 1D samplers to convert interpolated indices to angles.
Power: The peak power value is converted to dB scale for the output target angles (azimuth, elevation, power_dB).

3. Algorithm Workflow#

The processing pipeline is:

Load coefficients: Twiddle factors (azimuth and elevation FFT), 2D Hanning window, and angle bins LUT are transferred to VMEM via SQDF.
For each detection snapshot (double-buffered):
1. Transfer snapshot: One snapshot tile (e.g., 16 × 4 complex = 4 × 16 in height × width for VPU layout) is loaded via SQDF.
2. 2D Hanning window: Element-wise multiply with pre-computed 2D windowing coefficients.
3. Azimuth FFT: 1D Radix-4 FFT along azimuth, then transpose.
4. Elevation FFT: 1D Radix-4 FFT along elevation.
5. Power spectrum: Compute magnitude squared at each 2D bin.
6. Peak detection: Find the maximum value and its 2D index (azimuth bin, elevation bin).
7. Gather 3×3 neighborhood: Read power values around the peak for interpolation.
8. Quadratic interpolation: Compute sub-bin offset for azimuth and elevation using the parabola formula above.
9. Angle conversion: Use angle bins LUT (via samplers) to convert interpolated indices to azimuth and elevation in degrees; convert peak power to dB.
10. Write outputs: Target angles (azimuth, elevation, power_dB) and target index map (azimuth index, elevation index) are written to DRAM via SQDF.

Design Requirements#

The number of virtual azimuth elements is fixed to 16.
The number of virtual elevation elements is fixed to 4.
The number of Tx antennas must be 8 and the number of Rx antennas must be 8 (snapshot dimensions 4 × 16).
Angle estimation precision selects the FFT padding factor: LOW = 2, DEFAULT = 4, HIGH = 8. Thus padded azimuth FFT size is 32, 64, or 128 and padded elevation FFT size is 8, 16, or 32.
Maximum number of targets (detections) per call is 8192.

Implementation Details#

Dataflow Configuration#

Use 1 SQDF to transfer coefficients: twiddle factors (azimuth + elevation), 2D windowing coefficients, angle bins LUT, and detection count.
Use 1 SQDF to transfer input snapshots from DRAM to VMEM (one tile per detection).
Use 1 SQDF to transfer output target angles (azimuth, elevation, power_dB) from VMEM to DRAM.
Use 1 SQDF to transfer output target index map (azimuth index, elevation index) from VMEM to DRAM.
Use 1 SQDF to transfer output target count to DRAM.

All dataflows are Sequence Data Flow (SQDF); there is no GSDF in the 2D Angle FFT kernel itself.

Batch size and invocation#

The 2D Angle FFT kernel is designed for a batch size of 8 detections (OUTPUT_BATCH_SIZE = 8). All angle-FFT stages except 2D FFT like peak detection, quadratic interpolation, angle conversion operate on 8 detections at a time for vectorization. When the number of detections exceeds 8, the same batched kernel logic runs multiple times within a single VPU program: the main loop advances in steps of 8, and each iteration processes one batch of 8 snapshots and writes one batch of 8 target angles and 8 target index map entries. For example, with 64 detections, the kernel runs 8 batch iterations (64 ÷ 8 = 8); with 71 detections, it runs 9 iterations (the last iteration processes 7 detections and the batch is padded to 8 for the inner vectorized loops). Only one VPU command is submitted; batching is internal to the kernel.

Within each snapshot, the 2D FFT also uses a FFT batch size of 8 (KERNEL_BATCH = 8): the azimuth FFT processes the snapshot in groups of 8 elevation rows at a time, and the elevation FFT processes in groups of 8 azimuth columns (lines) at a time. So for a single snapshot, the number of inner FFT iterations is (elevation size ÷ 8) for the azimuth pass and (azimuth size ÷ 8) for the elevation pass (e.g. 4 and 4 for 32×32, or 2 and 8 for 16×64). This keeps vector and memory access patterns efficient per snapshot.

Input and output tiling#

Input snapshots: Tiled per detection; tile dimensions match the 2D snapshot size (e.g., elevation × azimuth in elements, with pitch and padding for VMEM). One tile is one snapshot (e.g., width = 4 × 2 = 8 for 4 elevation × complex, height = 16 for azimuth, with IN_TILE_WIDTH_PAD / IN_TILE_HEIGHT_PAD as in the implementation).
Output target angles: Batched in groups of 8 (OUTPUT_BATCH_SIZE); tile width 8, tile height 3 (azimuth, elevation, power_dB).
Output target index map: Batched in groups of 8; tile width 8, tile height 1 (azimuth index, elevation index per target).
Output target count: Single scalar; tile 1×1.

Processing is done in batches of 8 detections for vectorization; partial batches are padded to 8 for the inner loops.

2D FFT and windowing#

Windowing: AGEN (address generators) are configured for 2D Hanning window application: element-wise multiply of the snapshot with the pre-computed 2D windowing coefficients, with appropriate Q-bit shifting.
Azimuth FFT: The FFT uses Decimation-In-Frequency (DIF). Radix-4 DIF stages are applied along the azimuth dimension; when the FFT size is not a power of 4 (i.e. when \(\log_2 N\) is odd, e.g. 32 or 128), Radix-2 DIF is applied in the last stage before the digit-reversal/transpose step. The digit-reversal/transpose then prepares data for the elevation FFT.
Elevation FFT: Radix-4 DIF (and Radix-2 DIF in the last stage when \(\log_2 N\) is odd) along the elevation dimension. Output is in VMEM; magnitude squared is computed to form the power spectrum.
Fixed-point: The snapshot input is in Q20 format (20 fractional bits). Windowing coefficients and Twiddle factors use Q30; By applying the appropriate right-shifts at each stage (via AGEN rounding and truncation), FFT intermediate and final outputs are kept in Q20 format; the per-stage scaling is chosen so that the effective Q-format is restored to Q20 after each butterfly or digit-reverse step, avoiding both overflow and loss of precision. Angle bins are stored in Q16.
FFT scaling to avoid overflow: To prevent overflow in fixed-point FFT stages, intermediate outputs are right-shifted after each butterfly stage. The shift depends on the radix: 2 bits for Radix-4 stages (equivalent to scaling by 1/4 per stage) and 1 bit for Radix-2 stages (equivalent to scaling by 1/2 per stage). Over the full FFT, these per-stage shifts accumulate so that the effective scaling is by the FFT size \(N\) (e.g. \(N = 4^{\text{stages}}\) for Radix-4 or \(N = 2^{\text{stages}}\) for Radix-2). This removes the need for an explicit division of the FFT output by \(N\) at the end while keeping the signal within the fixed-point range. The shifting is performed via the AGEN rounding capability: the output AGENs are configured with a round value (e.g. ADDITIONAL_SHIFT_QBITS = 2 for Radix-4 or ADDITIONAL_SHIFT_QBITS >> 1 for Radix-2), so that when the kernel stores FFT results with vstore(..., out), the AGEN applies the right-shift (round) during the store. No separate shift instruction is required in the inner loop.

Peak detection and interpolation#

Find max value: The elevation FFT stage computes the 2D power spectrum (magnitude squared) and returns the maximum value in that spectrum. The max value is found using the AGEN minmax capability: a dedicated AGEN (e.g. power_spectrum) is configured with minmax_opt = 3 and the same layout as the power-spectrum output buffer. As each power value is written via vstore(..., power_spectrum), the AGEN updates its internal max; after the loop, power_spectrum.max_val holds the maximum value over the entire 2D spectrum. That value is returned (e.g. by fft_batched_exec_elevation as max_val) and later passed to find_max_val_loc to locate the peak.
Peak detection using AGEN: The 2D index of the peak (peak azimuth index, peak elevation index) is found by scanning the power spectrum and vectorized compare/collation:
1. Scan and locate (find_max_val_loc): The kernel scans the power spectrum in a nested loop over rows and column blocks. For each block, it loads a double-width float vector (dvfloat_load(a_src)) from the AGEN, compares it to the known maximum value (dw_data == value), and demotes the result to a short mask. collate_nz_pos_vshort extracts the (x, y) coordinates of all lanes where the value equals the maximum. Row/column coordinates (h_pix_pos_x, h_pix_pos_y) are updated per block (stride in x, increment y at row boundaries). The first (or one) of the collated positions is written to peakAzimuthIndex and peakElevationIndex, giving the coarse 2D bin of the peak.
This scan avoids explicit index arithmetic in the inner loop and uses vectorized comparison and collation to find all locations that match the maximum value efficiently.
Gather neighbours: A 3×3 neighborhood around the peak is read (with circular or boundary handling as in the implementation) to get \(Y_{-1}, Y_0, Y_{+1}\) for both azimuth and elevation.
Quadratic interpolation: The formula \(p = (Y_{-1} - Y_{+1}) / (2(2Y_0 - Y_{-1} - Y_{+1}))\) is applied per dimension to get fractional bin offsets; interpolated indices are in Q16.
Angle conversion: 1D samplers (CupvaSampler) are used to map interpolated indices to angles using the angle bins LUT; peak power is converted to dB and written with the angles.

Angle estimation#

Azimuth and elevation: Both angles are derived from the 2D FFT bin indices (after interpolation) via the same mathematical relationship \(\theta = \arcsin(\text{normalized})\). The implementation uses pre-computed angle bins (one table for azimuth, one for elevation) and samplers for index-to-angle lookup.
Output: Per target, the kernel writes azimuth (degrees), elevation (degrees), and power (dB) to the target angles buffer, and writes the total number of targets to the target count buffer.

Performance#

The 2D Angle FFT operator is compute-bound and scales with the number of detections. Time is dominated by:

2D FFT and windowing: Per-snapshot 2D FFT (azimuth then elevation) and windowing, scaling with the number of detections.
Peak detection and interpolation: Per-snapshot power spectrum scan, 3×3 gather, quadratic interpolation, and angle/power output, also scaling with the number of detections.
SQDF transfers: Coefficients loaded once; snapshot and result tiles transferred per detection (or per batch of 8). Double-buffering overlaps transfer with compute where possible.

Execution Time is the average time required to execute the operator on a single VPU core. Note that each PVA contains two VPU cores, which can operate in parallel to process two streams simultaneously, or reduce execution time by approximately half by splitting the workload between the two cores.

Total Power represents the average total power consumed by the module when the operator is executed concurrently on both VPU cores. Idle power is approximately 7W when the PVA is not processing data.

For detailed information on interpreting the performance table below and understanding the benchmarking setup, see Performance Benchmark.

NofDetections	NofTargets	NofTx	NofRx	Precision	Execution Time	Submit Latency	Total Power
256	256	8	8	1	0.547ms	0.037ms	10.126W
1024	1024	8	8	1	1.958ms	0.037ms	10.124W
1024	1024	8	8	1	2.215ms	0.054ms	NoneW
4096	4096	8	8	1	7.599ms	0.043ms	10.126W
8192	8192	8	8	1	15.106ms	0.042ms	10.92W
8192	8192	8	8	1	16.921ms	0.059ms	NoneW
8192	8192	8	8	2	52.653ms	0.045ms	11.118W
8192	8192	8	8	0	7.895ms	0.042ms	9.428W