RadarSnapshotExtraction#

Overview#

Snapshot Extraction is a radar signal processing operator that performs Doppler Division Multiplexing (DDM) disambiguation and gathers antenna snapshots from the range-Doppler map at detected target locations. It is typically used after CFAR detection to prepare data for Direction of Arrival (DOA) estimation.

The operator takes a folded detection list (range index, folded Doppler index), resolves the true Doppler indices using DDM offsets, and extracts complex-valued snapshots containing all TX-RX channel combinations for each detected target. Optional calibration weights can be applied during extraction.

Algorithm Description#

Formal inputs and outputs#

Inputs
1. Detection count: Scalar number of detections to process.
2. Folded detection list: \([N, 2]\) — per detection: range index, folded Doppler index (within a Doppler fold).
3. DDM Doppler offsets: \([\text{numDopplerFolds}]\) float32; one offset per TX (or per fold), used to derive the DDM space and the DDMA weights.
4. NCI Rx: \([\text{numRangeBins}, \text{numDopplerBins}]\) — unfolded NCI combined range–Doppler map over RX (used for DDM disambiguation).
5. Range–Doppler map (RDM): \([\text{numRangeBins}, \text{numRx}, \text{numDopplerBins}]\) or \([\text{numRangeBins}, \text{numDopplerBins}, \text{numRx}]\) complex int32 (FT2D), per rangeDopplerFormat.
6. Calibration weights (optional): \([\text{numTx}, \text{numRx}]\) complex int32 — if present, used instead of DDMA weights.
Outputs
1. Detection list: \([N, 2]\) — range index and unfolded Doppler index (Doppler bin for TX1).
2. Snapshots: \([N, \text{numTx}, \text{numRx}]\) complex int32 — virtual antenna response per detection after weighting.

Processing per detection#

Each detection passes through three stages, described in the subsections that follow: DDM disambiguation, snapshot gathering, and weight correction.

DDM disambiguation#

In Doppler Division Multiplexing, different TX antennas transmit with different Doppler frequency offsets. The received signal at each folded Doppler bin is a superposition of contributions from multiple TX antennas. The operator resolves this ambiguity by:

1. Taking the folded detection coordinates (range, folded Doppler).
2. Applying TX-specific Doppler offsets to compute the true (unfolded) Doppler index for each TX antenna.
3. Writing the unfolded detection list for downstream use.

Snapshot gathering#

For each detection, the operator gathers a snapshot matrix of shape [numTx × numRx] from the range-Doppler map:

1. For each TX antenna, the appropriate Doppler bin is accessed using the DDM-resolved offset.
2. For each RX channel at that Doppler bin, the complex value at the detected range bin is extracted.
3. Optional calibration weights are applied element-wise to correct for antenna array imperfections.

The resulting snapshot contains the complex antenna response for each virtual channel (TX-RX pair), which encodes the spatial information needed for DOA estimation.
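As a rough illustration of the gathering step, the following Python sketch indexes a nested-list range-Doppler map for one detection. The function name `gather_snapshot`, the string layout tags, and the nested-list representation are assumptions made for this example; the operator itself works on fixed-point tiles in VMEM.

```python
def gather_snapshot(rdm, range_idx, best_doppler_idxs, layout="RANGE_RX_DOPPLER"):
    """Return a [numTx][numRx] snapshot for one detection.

    rdm is a nested list indexed [range][rx][doppler] or [range][doppler][rx],
    depending on `layout` (hypothetical tags mirroring the two RDM layouts).
    best_doppler_idxs holds one unfolded Doppler bin per TX antenna.
    """
    row = rdm[range_idx]  # all RX / Doppler samples at the detected range
    snapshot = []
    for doppler_idx in best_doppler_idxs:  # one Doppler bin per TX
        if layout == "RANGE_RX_DOPPLER":
            snapshot.append([rx_row[doppler_idx] for rx_row in row])
        elif layout == "RANGE_DOPPLER_RX":
            snapshot.append(list(row[doppler_idx]))
        else:
            raise ValueError(f"unknown layout: {layout}")
    return snapshot


# Tiny synthetic map: 2 range bins, 2 RX channels, 4 Doppler bins.
rdm_rx_doppler = [[[complex(100 * r + 10 * rx + d, 0) for d in range(4)]
                   for rx in range(2)] for r in range(2)]
example_snapshot = gather_snapshot(rdm_rx_doppler, 1, [1, 3])
```

Each row of the result corresponds to one TX antenna and each column to one RX channel, so the snapshot can be flattened into a virtual-array vector for DOA estimation.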

Weight correction#

Gathered samples are multiplied element-wise by weights. If calibration weights are provided (\([\text{numTx}, \text{numRx}]\) complex, Q28), those are used. Otherwise DDMA weights are applied: one complex weight per TX,

\[w_{\text{tx}} = \overline{\exp\bigl(-j \, 2\pi \, \texttt{ddmDopplerOffset}[\text{tx}]\bigr)}\]

in fixed-point (Q16), broadcast across RX for that TX. Output is truncated/rounded to the configured format.
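The DDMA weight formula can be sketched in floating point as follows. This is a minimal illustration, assuming the offset is expressed in cycles (so that it multiplies \(2\pi\) directly, as in the formula); the helper names `ddma_weight`, `to_q16`, and `apply_ddma_weights` are hypothetical, and the Q16 conversion uses Python's `round`, which may differ from the rounding mode of the device.

```python
import cmath
import math

Q16_ONE = 1 << 16  # 1.0 in Q16


def ddma_weight(ddm_doppler_offset):
    """w_tx = conj(exp(-j * 2*pi * offset)), offset assumed to be in cycles."""
    return cmath.exp(-2j * math.pi * ddm_doppler_offset).conjugate()


def to_q16(weight):
    """Quantize a complex weight to a (real, imag) pair of Q16 integers."""
    return (round(weight.real * Q16_ONE), round(weight.imag * Q16_ONE))


def apply_ddma_weights(snapshot, ddm_doppler_offsets):
    """Multiply row tx of a [numTx][numRx] snapshot by w_tx (broadcast over RX)."""
    return [[sample * ddma_weight(offset) for sample in row]
            for row, offset in zip(snapshot, ddm_doppler_offsets)]
```

Because the weight is the conjugate of the transmit phase ramp, multiplying by it removes the per-TX Doppler offset phase from the gathered samples.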

Notes#

The first and last range bins are not treated specially; all range indices present in the folded list are valid.
DDM disambiguation reuses the same NCI maximum-selection principle as standard Peak Detection.

Design Requirements#

Maximum number of detections: 8192 (VMEM limit for the folded detection list).
TX and RX counts: each at most 8; supported configurations include 4×4 and 8×8.
Maximum range bins: 512; Doppler bins: 4 to 512.
numDopplerFolds: 4 to 16, must divide numDopplerBins evenly; must be \(\geq\) number of TX antennas.
numDopplerFolds must match the DDM Doppler offsets tensor length.
RDM layout: PVA_DOPPLER_FFT_OUTPUT_LAYOUT_RANGE_RX_DOPPLER or PVA_DOPPLER_FFT_OUTPUT_LAYOUT_RANGE_DOPPLER_RX.
Calibration weights tensor is optional; when omitted, DDMA weights are derived from DDM Doppler offsets.
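The numeric limits above can be collected into a small host-side check. The sketch below is illustrative only: the function name and argument names are hypothetical and are not part of the operator API, and the lower bound of 1 on the TX and RX counts is an assumption.

```python
def check_snapshot_extraction_config(num_detections, num_tx, num_rx,
                                     num_range_bins, num_doppler_bins,
                                     num_doppler_folds, num_ddm_offsets):
    """Return a list of violated design requirements (empty if none)."""
    problems = []
    if not 0 <= num_detections <= 8192:
        problems.append("detections must not exceed 8192")
    if not (1 <= num_tx <= 8 and 1 <= num_rx <= 8):
        problems.append("TX and RX counts must each be at most 8")
    if not 1 <= num_range_bins <= 512:
        problems.append("range bins must not exceed 512")
    if not 4 <= num_doppler_bins <= 512:
        problems.append("Doppler bins must be in 4..512")
    if not 4 <= num_doppler_folds <= 16:
        problems.append("numDopplerFolds must be in 4..16")
    elif num_doppler_bins % num_doppler_folds != 0:
        problems.append("numDopplerFolds must divide numDopplerBins")
    if num_doppler_folds < num_tx:
        problems.append("numDopplerFolds must be >= number of TX antennas")
    if num_doppler_folds != num_ddm_offsets:
        problems.append("DDM Doppler offsets length must equal numDopplerFolds")
    return problems
```

For example, the 8×8 configuration with 512 range bins, 512 Doppler bins, and 16 folds from the performance table passes all checks.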

Implementation Details#

Dataflow variants#

Two VPU implementations share the same math:

SQDF variant: For each detection, load one full RDM row (one range: \([\text{numRx}, \text{numDopplerBins}]\) or \([\text{numDopplerBins}, \text{numRx}]\)) and one NCI row via SQDF; index in VMEM with bestDopplerIdxs. Suited when row loads are acceptable.
GSDF variant: Gather only the needed RDM elements via GSDF—either \(\text{numTx} \times \text{numRx}\) single-complex tiles (RANGE_RX_DOPPLER) or numTx tiles of numRx complexes (RANGE_DOPPLER_RX). NCI is still loaded per range via SQDF. Reduces bandwidth when detections are numerous and the RDM is large.

The host selects the variant (for example from RDM size and format).

Dataflow configuration#

One SQDF for coefficients: DDMA or calibration weights and DDM space \([\text{numTx}]\).
One SQDF for the folded detection list (DRAM → VMEM).
One SQDF for NCI Rx (one range row at a time).
RDM: one SQDF (full row per detection) or one GSDF gather by computed offset.
One SQDF for snapshot output (VMEM → DRAM); one SQDF for the unfolded detection list.

Input and output tiling#

Folded detection list: VMEM \([\text{maxDetections}, 2]\); loaded once.
NCI: One row per range (numDopplerBins samples); loaded per detection’s range, double-buffered with the next.
RDM (SQDF): One range row per detection; double-buffered.
RDM (GSDF): Offsets per detection from range and bestDopplerIdxs; tile count \(\text{numTx} \times \text{numRx}\) or numTx with appropriate tile width.
Snapshots: One \([\text{numTx}, \text{numRx}]\) complex per detection; SQDF out, ping-pong with next detection.
Detection list out: \([\text{maxDetections}, 2]\) written after all detections.

DDM disambiguation on device#

Using the NCI row at the detection range, the kernel sums NCI at DDM-related bins for each candidate fold. A host-prepared DDM space bitmask (ddmSpaceBitMask) selects which NCI lanes participate per TX. Vector paths use vbitunpackw, vmux / dvmux, and vsumr_s. After argmax over candidates, TX1 Doppler is unfolded and bestDopplerIdxs is filled.
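A scalar sketch of the fold selection may help to see what the vector code computes. It assumes that, for each candidate fold of TX1, the kernel sums the NCI values at the bins where the other TX antennas would appear and keeps the candidate with the largest sum. The function `select_fold` and its arguments are hypothetical, and the bitmask-based vector implementation is replaced here by an explicit loop.

```python
def select_fold(nci_row, folded_doppler_idx, folded_ddm_space, num_doppler_folds):
    """Return the candidate fold whose occupied DDM bins carry the most NCI energy.

    nci_row: NCI values for one range bin (length numDopplerBins).
    folded_ddm_space: fold offset of each TX relative to TX1 (assumed input).
    """
    num_doppler_bins = len(nci_row)
    fold_size = num_doppler_bins // num_doppler_folds
    best_fold, best_sum = 0, None
    for candidate in range(num_doppler_folds):
        total = 0
        for tx_fold in folded_ddm_space:
            fold = (candidate + tx_fold) % num_doppler_folds  # wrap around
            total += nci_row[folded_doppler_idx + fold * fold_size]
        if best_sum is None or total > best_sum:
            best_fold, best_sum = candidate, total
    return best_fold


# 16 Doppler bins, 4 folds, 3 TX in folds 0, 1, 2 relative to TX1.
# A target whose TX1 peak sits in fold 2 at folded index 3 appears at bins 11, 15, 3.
example_nci = [1] * 16
for peak_bin in (11, 15, 3):
    example_nci[peak_bin] = 10
example_fold = select_fold(example_nci, 3, [0, 1, 2], 4)
```

In this example only the correct candidate aligns all three TX peaks, so its sum exceeds every other candidate's sum.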

DDM space bitmask (`ddmSpaceBitMask`)#

Host (OpSnapshotExtraction.cpp):

From DDM offsets and numDopplerBins, compute per-TX DDMspace[tx].
foldedDdmSpace[tx] = DDMspace[tx] / foldSize with foldSize = numDopplerBins / numDopplerFolds.
Build bitMaskVec of length \(2 \times \text{numDopplerFolds}\) (1 = ignore, 0 = use); mark occupied folds; duplicate for periodicity.
Pack into a 32-bit ddmSpaceBitMask for vbitunpackw sliding windows: ddmSpaceBitMask |= bitMaskVec[1 + i] << i for \(i = 0 \ldots 2 \times \text{numDopplerFolds} - 2\).
Pass to VPU (scalar parameter / VMEM), loaded with other coefficients.
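The host steps above can be sketched in Python as follows. This is not the code in OpSnapshotExtraction.cpp: the function name is hypothetical, DDMspace is taken as an already computed input (its derivation from the offsets is not shown here), the division by foldSize is assumed to be an integer division, and occupied folds are assumed to be the ones marked 0 (use).

```python
def build_ddm_space_bitmask(ddm_space, num_doppler_bins, num_doppler_folds):
    """Return (bit_mask_vec, ddm_space_bit_mask) for the given per-TX DDM space.

    ddm_space: per-TX Doppler bin offset (assumed precomputed from the DDM offsets).
    """
    fold_size = num_doppler_bins // num_doppler_folds
    folded_ddm_space = [s // fold_size for s in ddm_space]

    # 1 = ignore, 0 = use; assumption: folds occupied by a TX are marked "use".
    bit_mask_vec = [1] * num_doppler_folds
    for fold in folded_ddm_space:
        bit_mask_vec[fold % num_doppler_folds] = 0
    bit_mask_vec = bit_mask_vec + bit_mask_vec  # duplicate for periodicity

    # Pack entries 1 .. 2*numDopplerFolds-1 into bits 0 .. 2*numDopplerFolds-2.
    ddm_space_bit_mask = 0
    for i in range(2 * num_doppler_folds - 1):
        ddm_space_bit_mask |= bit_mask_vec[1 + i] << i
    return bit_mask_vec, ddm_space_bit_mask


example_vec, example_mask = build_ddm_space_bitmask([0, 4, 8], 16, 4)
```

With the maximum of 16 folds the packed mask uses 31 bits, so it fits in the 32-bit parameter described above.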

Device (summary): gatherNciRxTileComputeDopplerIdx fills nciRxTileBufGather for each candidate fold; computeNciRxSumFindMax_8folds / _16folds applies the unpacked bitmask, accumulates masked sums, shifts the mask, then selects the winning fold. The unfolded TX1 index is dopplerIdx = dopplerFoldIdx + index * (numDopplerBins / numDopplerFolds), where dopplerFoldIdx is the folded Doppler index of the detection and index is the winning fold. The snapshot indices are bestDopplerIdxs[tx] = (DDMspace[tx] + dopplerIdx) % numDopplerBins.
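The two index formulas can be checked with a few lines of Python. The function name `unfold_doppler` is hypothetical, and `winning_fold` stands for the value the text calls index.

```python
def unfold_doppler(doppler_fold_idx, winning_fold, ddm_space,
                   num_doppler_bins, num_doppler_folds):
    """Return (dopplerIdx, bestDopplerIdxs) for one detection."""
    fold_size = num_doppler_bins // num_doppler_folds
    doppler_idx = doppler_fold_idx + winning_fold * fold_size  # unfolded TX1 bin
    best_doppler_idxs = [(s + doppler_idx) % num_doppler_bins for s in ddm_space]
    return doppler_idx, best_doppler_idxs


# 16 Doppler bins, 4 folds (foldSize = 4), three TX at DDM space 0, 4, 8.
example_idx, example_best = unfold_doppler(3, 2, [0, 4, 8], 16, 4)
```

The modulo wraps TX bins that fall past the end of the Doppler axis back to the start, which is why the third TX in this example lands on a low bin.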

RDM snapshot gather (device)#

SQDF: Index the loaded RDM row with (rxIdx, bestDopplerIdxs[tx]) or (bestDopplerIdxs[tx], rxIdx). GSDF: computeRdmGatherOffsets forms byte offsets for each (rangeIdx, bestDopplerIdxs[tx], rxIdx); gather into a contiguous complex buffer, then apply weights.
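For the GSDF path, the offset arithmetic might look like the sketch below. It is not the device routine computeRdmGatherOffsets: it assumes a densely packed, row-major RDM with no line pitch or padding and 8 bytes per sample (interleaved int32 real and imaginary parts), and the function name and layout tags are hypothetical.

```python
BYTES_PER_COMPLEX_INT32 = 8  # assumption: int32 real + int32 imaginary, interleaved


def rdm_gather_offsets(range_idx, best_doppler_idxs, num_rx, num_doppler_bins, layout):
    """Return the byte offset of each gather tile for one detection.

    RANGE_RX_DOPPLER: numTx * numRx tiles of one complex sample each.
    RANGE_DOPPLER_RX: numTx tiles, each covering numRx contiguous samples.
    """
    offsets = []
    for doppler_idx in best_doppler_idxs:  # one Doppler bin per TX
        if layout == "RANGE_RX_DOPPLER":
            for rx_idx in range(num_rx):
                element = (range_idx * num_rx + rx_idx) * num_doppler_bins + doppler_idx
                offsets.append(element * BYTES_PER_COMPLEX_INT32)
        elif layout == "RANGE_DOPPLER_RX":
            element = (range_idx * num_doppler_bins + doppler_idx) * num_rx
            offsets.append(element * BYTES_PER_COMPLEX_INT32)
        else:
            raise ValueError(f"unknown layout: {layout}")
    return offsets
```

The RANGE_DOPPLER_RX case needs only one tile per TX because the RX samples for a given Doppler bin are contiguous in memory.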

Weight correction (device)#

Pre-initialized AGEN paths feed weights, gathered RDM, and output. Element-wise complex multiply applies DDMA (Q16) or calibration (Q28) rules; without calibration, the first DDMA row is replicated across RX so one loop can cover all pairs. Results are written to the snapshot buffer and SQDF’d to DRAM.
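A fixed-point complex multiply of the kind described here can be sketched as follows. The helper `cmul_fixed` is hypothetical, and the round-half-up step before the shift is an assumption; the rounding and saturation behavior of the device is not specified in this section.

```python
def cmul_fixed(sample, weight, frac_bits):
    """Multiply an integer complex sample by a fixed-point complex weight.

    sample, weight: (real, imag) integer pairs; weight has `frac_bits`
    fractional bits (16 for DDMA weights, 28 for calibration weights).
    """
    a, b = sample
    c, d = weight
    half = 1 << (frac_bits - 1)  # assumed rounding offset (round half up)
    real = (a * c - b * d + half) >> frac_bits
    imag = (a * d + b * c + half) >> frac_bits
    return (real, imag)


Q16_J = (0, 1 << 16)    # the weight +j in Q16
Q28_ONE = (1 << 28, 0)  # the weight 1.0 in Q28
rotated = cmul_fixed((1000, -500), Q16_J, 16)
unchanged = cmul_fixed((1000, -500), Q28_ONE, 28)
```

Python integers do not overflow, so this sketch ignores the accumulator width that a real implementation would have to manage.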

Double-buffering#

Ping-pong NCI, RDM (and GSDF gather buffer), and snapshot output so the next detection’s loads overlap the current detection’s compute and DMA.

Performance#

Time splits roughly into:

DDM disambiguation: compute-bound; cost grows with the number of detections and with numDopplerFolds.
RDM transfer: SQDF loads one full row per detection; GSDF moves fewer samples but adds per-tile overhead.
Weight correction: compute-bound; cost grows with \(\text{numTx} \times \text{numRx}\).
Outputs: largely overlapped with compute via double-buffering.

Choose GSDF when RDM rows are large and each detection only needs a small subset of samples.

Execution Time is the average time required to execute the operator on a single VPU core. Note that each PVA contains two VPU cores, which can operate in parallel to process two streams simultaneously, or reduce execution time by approximately half by splitting the workload between the two cores.

Total Power represents the average total power consumed by the module when the operator is executed concurrently on both VPU cores. Idle power is approximately 7W when the PVA is not processing data.

For detailed information on interpreting the performance table below and understanding the benchmarking setup, see Performance Benchmark.

NumDetections	NumDopplerFolds	NumRangeBins	NumDopplerBins	NumRx	NumTx	RangeDopplerMapLayout	Calibrated	Execution Time	Submit Latency	Total Power
1024	16	256	256	8	8	RangeRxDoppler	OFF	1.825ms	0.073ms	11.149W
1024	8	256	256	4	4	RangeRxDoppler	OFF	2.053ms	0.060ms	8.736W
1024	8	512	512	4	4	RangeRxDoppler	OFF	2.114ms	0.062ms	8.837W
1024	16	512	512	8	8	RangeRxDoppler	OFF	2.398ms	0.063ms	12.156W
8192	16	256	256	8	8	RangeRxDoppler	OFF	14.454ms	0.154ms	11.942W
8192	8	256	256	4	4	RangeRxDoppler	OFF	15.710ms	0.135ms	8.736W
8192	8	512	512	4	4	RangeRxDoppler	OFF	16.195ms	0.162ms	9.634W
8192	16	512	512	8	8	RangeRxDoppler	OFF	19.072ms	0.153ms	13.351W
8192	8	256	256	4	4	RangeRxDoppler	Calibrated	15.726ms	0.136ms	8.736W
8192	16	512	512	8	8	RangeRxDoppler	Calibrated	19.055ms	0.149ms	13.249W
8192	16	512	512	8	8	RangeDopplerRx	OFF	10.021ms	0.166ms	9.837W
8192	8	256	256	4	4	RangeDopplerRx	OFF	8.520ms	0.093ms	9.236W
8192	8	512	512	4	4	RangeDopplerRx	OFF	8.542ms	0.123ms	9.338W
8192	16	256	256	8	8	RangeDopplerRx	OFF	9.957ms	0.165ms	9.236W
8192	16	512	512	8	8	RangeDopplerRx	Calibrated	10.058ms	0.159ms	9.439W
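To compare rows with different detection counts or layouts, the table entries can be reduced to a per-detection time and a layout ratio. The short Python sketch below copies four execution times from the table (8192 detections, 512×512 map, calibration off); the helper names are hypothetical. For these rows, RangeDopplerRx runs roughly 1.9 times faster than RangeRxDoppler.

```python
NUM_DETECTIONS = 8192

# Execution time in ms, keyed by (layout, TX count); values copied from the table
# for NumRangeBins = NumDopplerBins = 512 and Calibrated = OFF.
EXEC_MS = {
    ("RangeRxDoppler", 8): 19.072,
    ("RangeDopplerRx", 8): 10.021,
    ("RangeRxDoppler", 4): 16.195,
    ("RangeDopplerRx", 4): 8.542,
}


def per_detection_us(exec_ms, num_detections=NUM_DETECTIONS):
    """Average execution time per detection, in microseconds."""
    return exec_ms * 1000.0 / num_detections


def layout_speedup(num_tx):
    """Ratio of RangeRxDoppler time to RangeDopplerRx time for one array size."""
    return EXEC_MS[("RangeRxDoppler", num_tx)] / EXEC_MS[("RangeDopplerRx", num_tx)]
```

These figures are averages over the whole run on a single VPU core and include any fixed per-launch cost, so they are only an approximate per-detection cost.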