RadarSnapshotExtraction

Overview

Snapshot Extraction is a radar signal processing operator that performs Doppler Division Multiplexing (DDM) disambiguation and gathers antenna snapshots from the range-Doppler map at detected target locations. It is typically used after CFAR detection to prepare data for Direction of Arrival (DOA) estimation.

The operator takes a folded detection list (range index, folded Doppler index), resolves the true Doppler indices using DDM offsets, and extracts complex-valued snapshots containing all TX-RX channel combinations for each detected target. Optional calibration weights can be applied during extraction.

Algorithm Description

Formal inputs and outputs

  1. Inputs

    1. Detection count: Scalar number of detections to process.

    2. Folded detection list: \([N, 2]\) — per detection: range index, folded Doppler index (within a Doppler fold).

    3. DDM Doppler offsets: \([\text{numDopplerFolds}]\) float32, one offset per TX (or per fold), used to derive the DDM space and the DDMA weights.

    4. NCI Rx: \([\text{numRangeBins}, \text{numDopplerBins}]\), the unfolded range–Doppler map after non-coherent integration (NCI) across the RX channels (used for DDM disambiguation).

    5. Range–Doppler map (RDM): \([\text{numRangeBins}, \text{numRx}, \text{numDopplerBins}]\) or \([\text{numRangeBins}, \text{numDopplerBins}, \text{numRx}]\) complex int32 (FT2D), per rangeDopplerFormat.

    6. Calibration weights (optional): \([\text{numTx}, \text{numRx}]\) complex int32 — if present, used instead of DDMA weights.

  2. Outputs

    1. Detection list: \([N, 2]\) — range index and unfolded Doppler index (Doppler bin for TX1).

    2. Snapshots: \([N, \text{numTx}, \text{numRx}]\) complex int32 — virtual antenna response per detection after weighting.

Processing per detection

Each detection passes through three stages, described in the subsections that follow: DDM disambiguation, snapshot gathering, and weight correction.

DDM disambiguation

In Doppler Division Multiplexing, different TX antennas transmit with different Doppler frequency offsets. The received signal at each folded Doppler bin is a superposition of contributions from multiple TX antennas. The operator resolves this ambiguity by:

  1. Taking the folded detection coordinates (range, folded Doppler)

  2. Applying TX-specific Doppler offsets to compute the true (unfolded) Doppler index for each TX antenna

  3. Writing the unfolded detection list for downstream use

Snapshot gathering

For each detection, the operator gathers a snapshot matrix of shape [numTx × numRx] from the range-Doppler map:

  1. For each TX antenna, the appropriate Doppler bin is accessed using the DDM-resolved offset

  2. For each RX channel at that Doppler bin, the complex value at the detected range bin is extracted

  3. Optional calibration weights are applied element-wise to correct for antenna array imperfections

The resulting snapshot contains the complex antenna response for each virtual channel (TX-RX pair), which encodes the spatial information needed for DOA estimation.
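The gathering steps above can be sketched in plain Python. This is an illustrative model only, not the VPU kernel: the function name `gather_snapshot`, the nested-list RDM representation, and the layout strings are assumptions made for the example, and weighting is left out.

```python
def gather_snapshot(rdm, range_idx, best_doppler_idxs, num_rx,
                    layout="RANGE_RX_DOPPLER"):
    """Return a [numTx][numRx] snapshot for one detection.

    rdm is a nested list indexed [range][rx][doppler] for RANGE_RX_DOPPLER
    or [range][doppler][rx] for RANGE_DOPPLER_RX (hypothetical host model).
    best_doppler_idxs holds one DDM-resolved Doppler bin per TX.
    """
    snapshot = []
    for doppler_idx in best_doppler_idxs:      # one row per TX antenna
        if layout == "RANGE_RX_DOPPLER":
            row = [rdm[range_idx][rx][doppler_idx] for rx in range(num_rx)]
        elif layout == "RANGE_DOPPLER_RX":
            row = [rdm[range_idx][doppler_idx][rx] for rx in range(num_rx)]
        else:
            raise ValueError("unknown layout: " + layout)
        snapshot.append(row)
    return snapshot
```

Both layouts yield the same snapshot for the same underlying data; only the indexing order into the map changes.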

Weight correction

Gathered samples are multiplied element-wise by weights. If calibration weights are provided (\([\text{numTx}, \text{numRx}]\) complex, Q28), those are used. Otherwise DDMA weights are applied: one complex weight per TX,

\[w_{\text{tx}} = \overline{\exp\bigl(-j \, 2\pi \, \texttt{ddmDopplerOffset}[\text{tx}]\bigr)}\]

in fixed-point (Q16), broadcast across RX for that TX. Output is truncated/rounded to the configured format.
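As a rough illustration of the DDMA weight formula, the sketch below evaluates it in floating point and then quantizes. It assumes that ddmDopplerOffset is expressed as a fraction of one Doppler cycle (which is what the \(2\pi\) factor suggests) and that Q16 means 16 fractional bits; the helper names are hypothetical, and saturation and the operator's actual rounding mode are not modeled.

```python
import cmath
import math

def ddma_weight(offset):
    """conj(exp(-j*2*pi*offset)), which equals exp(+j*2*pi*offset)."""
    return cmath.exp(-2j * math.pi * offset).conjugate()

def to_fixed(w, frac_bits=16):
    """Quantize a complex weight to (re, im) integers.

    ASSUMPTION: Q16 is read as 16 fractional bits, round-to-nearest.
    """
    scale = 1 << frac_bits
    return (round(w.real * scale), round(w.imag * scale))

def ddma_weight_matrix(ddm_doppler_offsets, num_rx):
    """One weight per TX, broadcast across all RX channels of that TX."""
    return [[to_fixed(ddma_weight(o))] * num_rx for o in ddm_doppler_offsets]
```

For example, an offset of 0.25 cycles gives a weight of \(+j\), so the gathered sample for that TX is rotated by a quarter turn to undo the transmit phase ramp.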

Notes

  • The first and last range bins are not treated specially; all range indices present in the folded list are valid.

  • DDM disambiguation reuses the same NCI maximum-selection principle as standard Peak Detection.

Design Requirements

  • Maximum number of detections: 8192 (VMEM limit for the folded detection list).

  • TX and RX counts: each at most 8; supported configurations include 4×4 and 8×8.

  • Maximum range bins: 512; Doppler bins: 4 to 512.

  • numDopplerFolds: 4 to 16, must divide numDopplerBins evenly; must be \(\geq\) number of TX antennas.

  • numDopplerFolds must match the DDM Doppler offsets tensor length.

  • RDM layout: PVA_DOPPLER_FFT_OUTPUT_LAYOUT_RANGE_RX_DOPPLER or PVA_DOPPLER_FFT_OUTPUT_LAYOUT_RANGE_DOPPLER_RX.

  • Calibration weights tensor is optional; when omitted, DDMA weights are derived from DDM Doppler offsets.
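The numeric limits above can be collected into a small host-side check. The sketch below is not part of the operator API: `validate_config` and its parameter names are hypothetical, and the lower bounds of 0 detections and 1 for TX, RX, and range bins are assumptions, since the text only states upper limits for those.

```python
def validate_config(num_detections, num_tx, num_rx, num_range_bins,
                    num_doppler_bins, num_doppler_folds, num_ddm_offsets):
    """Return a list of violated constraints (empty list means valid)."""
    errors = []
    if not 0 <= num_detections <= 8192:
        errors.append("detections must be at most 8192")
    if not (1 <= num_tx <= 8 and 1 <= num_rx <= 8):
        errors.append("numTx and numRx must each be at most 8")
    if not 1 <= num_range_bins <= 512:
        errors.append("numRangeBins must be at most 512")
    if not 4 <= num_doppler_bins <= 512:
        errors.append("numDopplerBins must be in 4..512")
    if not 4 <= num_doppler_folds <= 16:
        errors.append("numDopplerFolds must be in 4..16")
    elif num_doppler_bins % num_doppler_folds != 0:
        errors.append("numDopplerFolds must divide numDopplerBins")
    if num_doppler_folds < num_tx:
        errors.append("numDopplerFolds must be >= numTx")
    if num_doppler_folds != num_ddm_offsets:
        errors.append("numDopplerFolds must match the DDM offsets length")
    return errors
```

With 256 Doppler bins, 16 folds pass the check, while 6 folds fail because 6 does not divide 256.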

Implementation Details

Dataflow variants

Two VPU implementations share the same math:

  • SQDF variant: For each detection, load one full RDM row (one range: \([\text{numRx}, \text{numDopplerBins}]\) or \([\text{numDopplerBins}, \text{numRx}]\)) and one NCI row via SQDF; index in VMEM with bestDopplerIdxs. Suited to cases where the cost of loading a full row per detection is acceptable.

  • GSDF variant: Gather only the needed RDM elements via GSDF—either \(\text{numTx} \times \text{numRx}\) single-complex tiles (RANGE_RX_DOPPLER) or numTx tiles of numRx complexes (RANGE_DOPPLER_RX). NCI is still loaded per range via SQDF. Reduces bandwidth when detections are numerous and the RDM is large.

The host selects the variant (for example from RDM size and format).

Dataflow configuration

  • One SQDF for coefficients: DDMA or calibration weights and DDM space \([\text{numTx}]\).

  • One SQDF for the folded detection list (DRAM → VMEM).

  • One SQDF for NCI Rx (one range row at a time).

  • RDM: one SQDF (full row per detection) or one GSDF gather by computed offset.

  • One SQDF for snapshot output (VMEM → DRAM); one SQDF for the unfolded detection list.

Input and output tiling

  • Folded detection list: VMEM \([\text{maxDetections}, 2]\); loaded once.

  • NCI: One row per range (numDopplerBins samples); loaded per detection’s range, double-buffered with the next.

  • RDM (SQDF): One range row per detection; double-buffered.

  • RDM (GSDF): Offsets per detection from range and bestDopplerIdxs; tile count \(\text{numTx} \times \text{numRx}\) or numTx with appropriate tile width.

  • Snapshots: One \([\text{numTx}, \text{numRx}]\) complex per detection; SQDF out, ping-pong with next detection.

  • Detection list out: \([\text{maxDetections}, 2]\) written after all detections.

DDM disambiguation on device

Using the NCI row at the detection range, the kernel sums NCI at DDM-related bins for each candidate fold. A host-prepared DDM space bitmask (ddmSpaceBitMask) selects which NCI lanes participate per TX. Vector paths use vbitunpackw, vmux / dvmux, and vsumr_s. After argmax over candidates, TX1 Doppler is unfolded and bestDopplerIdxs is filled.

DDM space bitmask (ddmSpaceBitMask)

Host (OpSnapshotExtraction.cpp):

  1. From DDM offsets and numDopplerBins, compute per-TX DDMspace[tx].

  2. foldedDdmSpace[tx] = DDMspace[tx] / foldSize with foldSize = numDopplerBins / numDopplerFolds.

  3. Build bitMaskVec of length \(2 \times \text{numDopplerFolds}\) (1 = ignore, 0 = use); mark occupied folds; duplicate for periodicity.

  4. Pack into a 32-bit ddmSpaceBitMask for vbitunpackw sliding windows: ddmSpaceBitMask |= bitMaskVec[1 + i] << i for \(i = 0, \ldots, 2 \cdot \text{numDopplerFolds} - 2\).

  5. Pass to VPU (scalar parameter / VMEM), loaded with other coefficients.
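Steps 2 to 4 of the host procedure can be sketched as follows. This is a guess at the structure, not the code in OpSnapshotExtraction.cpp: the function name is hypothetical, DDMspace is taken as an already computed input, and the choice that occupied folds receive 0 ("use") while empty folds keep 1 ("ignore") is an assumption, since the text does not say which value marks an occupied fold.

```python
def build_ddm_space_bitmask(ddm_space, num_doppler_bins, num_doppler_folds):
    """Pack the duplicated fold-occupancy vector into an integer mask."""
    fold_size = num_doppler_bins // num_doppler_folds
    folded_ddm_space = [s // fold_size for s in ddm_space]

    # 1 = ignore, 0 = use (from the text).
    # ASSUMPTION: folds occupied by a TX are the ones marked 0.
    bit_mask_vec = [1] * (2 * num_doppler_folds)
    for f in folded_ddm_space:
        f %= num_doppler_folds
        bit_mask_vec[f] = 0
        bit_mask_vec[f + num_doppler_folds] = 0   # duplicate for periodicity

    mask = 0
    for i in range(2 * num_doppler_folds - 1):    # i = 0 .. 2*folds - 2
        mask |= bit_mask_vec[1 + i] << i
    return mask
```

With 16 Doppler bins, 4 folds, and three TX antennas at DDMspace values 0, 4, and 8, only fold 3 is empty, so the packed mask has bits 2 and 6 set (value 68). At the maximum of 16 folds the loop writes 31 bits, which fits the 32-bit mask.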

Device (summary): gatherNciRxTileComputeDopplerIdx fills nciRxTileBufGather for each candidate fold; computeNciRxSumFindMax_8folds / _16folds applies the unpacked bitmask, accumulates masked sums, shifts the mask, then selects the winning fold. Unfolded TX1 index: dopplerIdx = dopplerFoldIdx + index * (numDopplerBins / numDopplerFolds); snapshot indices bestDopplerIdxs[tx] = (DDMspace[tx] + dopplerIdx) % numDopplerBins.
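The fold selection and the two index formulas above can be modeled in scalar Python. The sketch assumes that each candidate fold is scored by summing the NCI row at the bins the TX antennas would occupy under that hypothesis, which is one reading of the masked-sum description; the real kernel works on vectors with the bitmask, and the function names here are hypothetical.

```python
def pick_fold(nci_row, folded_doppler_idx, ddm_space,
              num_doppler_bins, num_doppler_folds):
    """Return the candidate fold index with the largest NCI sum (argmax)."""
    fold_size = num_doppler_bins // num_doppler_folds
    best_fold, best_sum = 0, None
    for cand in range(num_doppler_folds):
        base = folded_doppler_idx + cand * fold_size
        total = sum(nci_row[(s + base) % num_doppler_bins] for s in ddm_space)
        if best_sum is None or total > best_sum:   # first maximum wins ties
            best_fold, best_sum = cand, total
    return best_fold

def unfold_doppler(folded_doppler_idx, winning_fold, ddm_space,
                   num_doppler_bins, num_doppler_folds):
    """Apply the dopplerIdx and bestDopplerIdxs formulas from the text."""
    fold_size = num_doppler_bins // num_doppler_folds
    doppler_idx = folded_doppler_idx + winning_fold * fold_size
    best_doppler_idxs = [(s + doppler_idx) % num_doppler_bins
                         for s in ddm_space]
    return doppler_idx, best_doppler_idxs
```

For 256 Doppler bins and 8 folds (fold size 32), a folded index of 5 with winning fold 6 unfolds to bin 197; with DDMspace values 0, 32, 64, and 96 the per-TX bins are 197, 229, 5, and 37, the last two having wrapped modulo 256.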

RDM snapshot gather (device)

SQDF: Index the loaded RDM row with (rxIdx, bestDopplerIdxs[tx]) or (bestDopplerIdxs[tx], rxIdx). GSDF: computeRdmGatherOffsets forms byte offsets for each (rangeIdx, bestDopplerIdxs[tx], rxIdx); gather into a contiguous complex buffer, then apply weights.
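To make the GSDF offset arithmetic concrete, the sketch below computes byte offsets for one detection under both layouts. It assumes a densely packed, row-major RDM with 8 bytes per complex int32 sample (two 32-bit components) and no row padding; the function is a hypothetical stand-in, not the actual computeRdmGatherOffsets.

```python
BYTES_PER_COMPLEX_INT32 = 8   # assumed: 4-byte real + 4-byte imaginary

def rdm_gather_offsets(range_idx, best_doppler_idxs, num_rx,
                       num_doppler_bins, layout):
    """Byte offsets of the numTx x numRx samples needed for one detection."""
    offsets = []
    for doppler_idx in best_doppler_idxs:          # one Doppler bin per TX
        for rx in range(num_rx):
            if layout == "RANGE_RX_DOPPLER":
                elem = (range_idx * num_rx + rx) * num_doppler_bins + doppler_idx
            elif layout == "RANGE_DOPPLER_RX":
                elem = (range_idx * num_doppler_bins + doppler_idx) * num_rx + rx
            else:
                raise ValueError("unknown layout: " + layout)
            offsets.append(elem * BYTES_PER_COMPLEX_INT32)
    return offsets
```

In the RANGE_DOPPLER_RX layout the RX samples of one TX are adjacent in memory, which is why a single tile of numRx complexes per TX suffices there, whereas RANGE_RX_DOPPLER needs one single-complex tile per TX-RX pair.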

Weight correction (device)

Pre-initialized AGEN paths feed weights, gathered RDM, and output. Element-wise complex multiply applies DDMA (Q16) or calibration (Q28) rules; without calibration, the first DDMA row is replicated across RX so one loop can cover all pairs. Results are written to the snapshot buffer and transferred to DRAM via SQDF.

Double-buffering

Ping-pong NCI, RDM (and GSDF gather buffer), and snapshot output so the next detection’s loads overlap the current detection’s compute and DMA.

Performance

Time splits roughly into:

  1. DDM disambiguation: compute-bound; cost grows with the number of detections and with numDopplerFolds.

  2. RDM transfer — SQDF is one full row per detection; GSDF moves fewer samples but adds per-tile overhead.

  3. Weight correction: compute-bound; cost grows with \(\text{numTx} \times \text{numRx}\) per detection.

  4. Outputs — largely overlapped via double-buffering.

Choose GSDF when RDM rows are large and each detection only needs a small subset of samples.
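A back-of-the-envelope comparison of RDM payload per detection shows why. The sketch assumes 8 bytes per complex int32 sample and counts payload bytes only; it ignores the per-tile overhead of GSDF mentioned above, and the function name is illustrative.

```python
def rdm_bytes_per_detection(num_tx, num_rx, num_doppler_bins,
                            bytes_per_sample=8):
    """Return (sqdf_bytes, gsdf_bytes) of RDM payload for one detection."""
    sqdf_bytes = num_rx * num_doppler_bins * bytes_per_sample  # full range row
    gsdf_bytes = num_tx * num_rx * bytes_per_sample            # needed samples
    return sqdf_bytes, gsdf_bytes
```

For an 8×8 configuration with 512 Doppler bins, the full row is 32768 bytes against 512 bytes of gathered samples, a factor of 64; for 4×4 with 256 Doppler bins it is 8192 bytes against 128 bytes.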

Execution Time is the average time required to execute the operator on a single VPU core. Note that each PVA contains two VPU cores, which can operate in parallel to process two streams simultaneously, or reduce execution time by approximately half by splitting the workload between the two cores.

Total Power represents the average total power consumed by the module when the operator is executed concurrently on both VPU cores. Idle power is approximately 7W when the PVA is not processing data.

For detailed information on interpreting the performance table below and understanding the benchmarking setup, see Performance Benchmark.

| NumDetections | NumDopplerFolds | NumRangeBins | NumDopplerBins | NumRx | NumTx | RangeDopplerMapLayout | Calibrated | Execution Time | Submit Latency | Total Power |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1024 | 16 | 256 | 256 | 8 | 8 | RangeRxDoppler | OFF | 1.825ms | 0.073ms | 11.149W |
| 1024 | 8 | 256 | 256 | 4 | 4 | RangeRxDoppler | OFF | 2.053ms | 0.060ms | 8.736W |
| 1024 | 8 | 512 | 512 | 4 | 4 | RangeRxDoppler | OFF | 2.114ms | 0.062ms | 8.837W |
| 1024 | 16 | 512 | 512 | 8 | 8 | RangeRxDoppler | OFF | 2.398ms | 0.063ms | 12.156W |
| 8192 | 16 | 256 | 256 | 8 | 8 | RangeRxDoppler | OFF | 14.454ms | 0.154ms | 11.942W |
| 8192 | 8 | 256 | 256 | 4 | 4 | RangeRxDoppler | OFF | 15.710ms | 0.135ms | 8.736W |
| 8192 | 8 | 512 | 512 | 4 | 4 | RangeRxDoppler | OFF | 16.195ms | 0.162ms | 9.634W |
| 8192 | 16 | 512 | 512 | 8 | 8 | RangeRxDoppler | OFF | 19.072ms | 0.153ms | 13.351W |
| 8192 | 8 | 256 | 256 | 4 | 4 | RangeRxDoppler | Calibrated | 15.726ms | 0.136ms | 8.736W |
| 8192 | 16 | 512 | 512 | 8 | 8 | RangeRxDoppler | Calibrated | 19.055ms | 0.149ms | 13.249W |
| 8192 | 16 | 512 | 512 | 8 | 8 | RangeDopplerRx | OFF | 10.021ms | 0.166ms | 9.837W |
| 8192 | 8 | 256 | 256 | 4 | 4 | RangeDopplerRx | OFF | 8.520ms | 0.093ms | 9.236W |
| 8192 | 8 | 512 | 512 | 4 | 4 | RangeDopplerRx | OFF | 8.542ms | 0.123ms | 9.338W |
| 8192 | 16 | 256 | 256 | 8 | 8 | RangeDopplerRx | OFF | 9.957ms | 0.165ms | 9.236W |
| 8192 | 16 | 512 | 512 | 8 | 8 | RangeDopplerRx | Calibrated | 10.058ms | 0.159ms | 9.439W |