Target Processing#

Overview#

Target processing is a PVA (Programmable Vision Accelerator) radar operator that converts angle-estimation outputs (target index map and target angles) and detection list data into a full target list: velocity, range, azimuth, elevation, Cartesian coordinates (X, Y, Z), and optional power. In addition to these inputs, it uses DDM Doppler offsets and, optionally, NCI (Non-Coherent Integration) data. It converts range bins to physical range and Doppler bins to velocity (with optional RV decoupling, and optional quadratic interpolation on NCI for sub-bin range/Doppler), then converts spherical coordinates to Cartesian coordinates using a configurable coordinate convention.

Algorithm Description#

  1. Inputs

    • Target count: Scalar, number of targets (from DoA operators).

    • Detection list: [numDetections, 2] int32 — per detection: range bin index, Doppler bin index (unfolded).

    • Target index map: [numTargets] int32 — maps each target to a detection index (generated by DoA operators).

    • Target angles: [3, numTargets] float32 — per target: azimuth (degrees), elevation (degrees), power (dB) (generated by DoA operators).

    • DDM Doppler offsets: [numDopplerFolds] float32 — Doppler offset per fold for velocity calculation.

    • NCI (optional): [numRangeBins, numDopplerBins] uint32 — used only when quadratic interpolation is enabled for sub-bin range/Doppler.

  2. Outputs

    • Target list: [7 or 8, numTargets] float32 — per target: velocity (m/s), range (m), azimuth (deg), elevation (deg), X (m), Y (m), Z (m), and optionally power (dB).

  3. Batch processing

    • To exploit vectorization, the operator processes targets in batches of 64.

    • The operator loads the target index map and target angles per batch from DRAM (DoA operator’s outputs) into VMEM using SQDF; tile size matches batch size (e.g. 64).

    • The operator processes the targets in the batch.

    • The operator writes the target list per batch from VMEM to DRAM via SQDF.

  4. Processing per target (per batch)

    1. Index lookup: Using the target index map, the kernel obtains the detection index for each target. Via CupvaSampler APIs, it gathers the corresponding range bin index and Doppler bin index from the detection list (transposed into separate range and Doppler arrays).

    2. Range: Range bin index is converted to physical range (meters) using pre-derived constants (e.g. range resolution). If quadratic interpolation is enabled, the kernel loads a 3×3 NCI neighborhood around the (range, Doppler) bin via GSDF, computes sub-bin offsets for range and Doppler using a parabolic fit, and then converts the interpolated bin indices to range (and uses them in velocity/range formulas).

    3. Velocity: Doppler bin index (and DDM Doppler offset) is converted to Doppler frequency and then to velocity. When RV decoupling is enabled, the velocity is computed as \(v = f_{\text{fast}} \cdot C_{v,\text{fast}} - (f_{\text{slow}} + n/T_{\text{PRI}}) \cdot C_{v,\text{slow}}\) with a 5-hypothesis disambiguation loop: the kernel tests five velocity hypotheses (aliasing index \(n\)) and selects the one within the unambiguous velocity range \([v_{\text{amb,low}}, v_{\text{amb,high}}]\).

    4. Spherical to Cartesian: Using range (m), azimuth (deg), and elevation (deg), the kernel computes X, Y, Z with configurable longitudinal axis (e.g. Y-forward) and ground projection (e.g. scale by cos(elevation) for ground plane).

    5. Power (optional): If requested, power (dB) from target angles is copied to the target list.
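The RV-decoupling velocity step above can be sketched as a plain Python reference model. This is a minimal sketch and not the kernel code: the function name `select_velocity`, the argument names, and the hypothesis set n = -2..2 are assumptions made for illustration (the text only states that five hypotheses are tested). The first hypothesis that falls inside the unambiguous range is returned.

```python
def select_velocity(f_fast, f_slow, t_pri, c_v_fast, c_v_slow,
                    v_amb_low, v_amb_high,
                    hypotheses=(-2, -1, 0, 1, 2)):
    """Return (velocity, n) for the first aliasing index n whose velocity
    lies in [v_amb_low, v_amb_high], or (None, None) if none does.

    The five-value hypothesis set is an assumption for illustration.
    """
    for n in hypotheses:
        # v = f_fast * C_v,fast - (f_slow + n / T_PRI) * C_v,slow
        v = f_fast * c_v_fast - (f_slow + n / t_pri) * c_v_slow
        if v_amb_low <= v <= v_amb_high:
            return v, n
    return None, None
```

How the kernel behaves when no hypothesis falls inside the range is not described in this section; returning `(None, None)` here is only a placeholder for that case.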

Design Requirements#

  • Maximum number of targets (and detections) is 8192.

  • Target list height is 7 or 8 (8 when power is included).

  • DDM Doppler offsets width ≤ 32.

  • NCI is optional; when provided, it is padded (e.g. +1 in each dimension) and transferred to L2 with cyclic padding before GSDF gathers 3×3 tiles for interpolation.

  • Batch size for VPU processing is 64 targets per batch (BATCH_SIZE).
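As an illustration of how the 8192-target limit and the batch size of 64 interact, the following sketch splits a target count into per-batch (start, count) pairs. The helper name `batch_ranges` is hypothetical, and the short final batch shown here is an assumption; the section does not say how the kernel handles a target count that is not a multiple of 64.

```python
BATCH_SIZE = 64      # targets per VPU batch (from the requirements above)
MAX_TARGETS = 8192   # maximum number of targets (from the requirements above)


def batch_ranges(num_targets, batch_size=BATCH_SIZE):
    """Return (start, count) pairs that cover num_targets in batches."""
    if not 0 <= num_targets <= MAX_TARGETS:
        raise ValueError("num_targets must be in [0, MAX_TARGETS]")
    return [(start, min(batch_size, num_targets - start))
            for start in range(0, num_targets, batch_size)]
```

At the maximum target count this gives 8192 / 64 = 128 batches.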

Implementation Details#

Dataflow configuration#

  • SQDF: One combined SQDF loads target count, full detection list, and DDM Doppler offsets from DRAM to VMEM once at the start.

  • SQDF: Target index map and target angles are loaded per batch from DRAM (DoA operator’s outputs) into VMEM using SQDF; tile size matches batch size (e.g. 64).

  • SQDF: Target list output is written per batch from VMEM to DRAM via SQDF.

  • GSDF (optional): When quadratic interpolation is enabled, NCI data is first transferred from DRAM to L2 with cyclic padding via SQDF (nciPaddingHdl); then GSDF (nciHdl) gathers up to NUM_NCI_NODES (32) tiles of 3×3 NCI neighborhoods (NCI_NODE_WIDTH × NCI_NODE_HEIGHT = 3×3, NCI_NODE_SIZE = 17 for alignment) from the L2-padded buffer. External base for GSDF is set to the L2 pointer.

Detection list transpose#

The detection list [numDetections, 2] is stored in VMEM in row-major form. The kernel transposes it into two arrays: detectionListRangeIndex and detectionListDopplerIndex, so that range indices and Doppler indices can be read in contiguous vector chunks. AGEN configurations (cfgsLoadDetectionList, cfgsStoreDetectionListRangeIndex, cfgsStoreDetectionListDopplerIndex) are used to load the detection list and store the two columns. The transpose kernel (detection_list_transpose_kernel) loads double-width int32 vectors, splits them into range and Doppler lanes, and stores them via two AGENs.
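The effect of the transpose can be shown with a scalar reference model. This sketch only mirrors the data layout change (interleaved rows to two contiguous columns); it does not model AGENs or vector lanes, and the function name `transpose_detection_list` is hypothetical.

```python
def transpose_detection_list(flat, num_detections):
    """Split a row-major [numDetections, 2] list into two columns.

    flat is laid out as [range0, doppler0, range1, doppler1, ...].
    Returns (range_index, doppler_index) as two separate lists.
    """
    if len(flat) != 2 * num_detections:
        raise ValueError("flat must hold exactly 2 values per detection")
    range_index = list(flat[0::2])    # even positions: range bin indices
    doppler_index = list(flat[1::2])  # odd positions: Doppler bin indices
    return range_index, doppler_index
```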

CupvaSampler APIs for index lookup#

A range sampler and a velocity sampler (both CupvaSampler instances) are set up so that, for each target in the batch, the target index map entry (the detection index) is used as the index into the detectionListRangeIndex and detectionListDopplerIndex arrays.

The CupvaSampler APIs perform a 1D gather:

  • Input - transposed detection list column

  • Indices - target index map for the batch

  • Output - range bin index and Doppler bin index per target.

This yields rangeBinIndex and dopplerBinIndex for the batch, which are then used for range/velocity computation and (when enabled) NCI gather.
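The gather described above is equivalent to the following scalar sketch. It does not use or model the CupvaSampler API; `gather_1d` is a hypothetical helper, and the explicit bounds check is an assumption added for the example.

```python
def gather_1d(column, indices):
    """Return [column[i] for i in indices], rejecting out-of-range indices."""
    out = []
    for i in indices:
        if not 0 <= i < len(column):
            raise IndexError("detection index out of range")
        out.append(column[i])
    return out


# Example: three targets that point at detections 2, 0, and 2.
detection_list_range_index = [10, 11, 12]
detection_list_doppler_index = [100, 101, 102]
target_index_map = [2, 0, 2]
range_bin_index = gather_1d(detection_list_range_index, target_index_map)
doppler_bin_index = gather_1d(detection_list_doppler_index, target_index_map)
```

Several targets may map to the same detection, as in the example, so the gather output can contain repeated values.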

Range, velocity, and Cartesian computation#

  • Range: Range bin index (integer or, with NCI quadratic interpolation, sub-bin refined) maps to meters using host range resolution in standard mode. With RV decoupling, range uses fast/slow beat-frequency terms from the bin indices and host contR (FMCW coupling).

  • Velocity: Doppler bin index and DDM Doppler offset give Doppler frequency; conversion to velocity uses wavelength and PRI. When RV decoupling is enabled, the kernel uses pre-computed coefficients (tempVCoeff[5], velocityAmbHigh, velocityAmbLow) and the 5-hypothesis loop (calcVelocityRvDecoupling) to select the unambiguous velocity.

  • Cartesian: Spherical (range, azimuth, elevation) is converted to (X, Y, Z) using trigonometric functions (sin/cos), with longitudinalAxis (e.g. Y-forward) and groundProjection (e.g. ground plane = range × cos(elevation)) applied as configured.
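The Cartesian step can be sketched as below. The exact axis and sign conventions of the operator are not given in this section, so this sketch assumes that azimuth is measured from the longitudinal (forward) axis, that elevation is measured from the horizontal plane, and that disabling ground projection simply skips the cos(elevation) scaling. The function and parameter names are hypothetical.

```python
import math


def spherical_to_cartesian(range_m, azimuth_deg, elevation_deg,
                           longitudinal_axis="Y", ground_projection=True):
    """Convert (range, azimuth, elevation) to (X, Y, Z) in meters.

    Assumed convention (not confirmed by the source): azimuth is measured
    from the forward axis, elevation from the horizontal plane.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    # Ground-plane range: range scaled by cos(elevation) when enabled.
    planar = range_m * math.cos(el) if ground_projection else range_m
    forward = planar * math.cos(az)
    lateral = planar * math.sin(az)
    z = range_m * math.sin(el)
    if longitudinal_axis == "Y":   # Y-forward: X is lateral
        return lateral, forward, z
    if longitudinal_axis == "X":   # X-forward: Y is lateral
        return forward, lateral, z
    raise ValueError("longitudinal_axis must be 'X' or 'Y'")
```

Under these assumptions, a target at 10 m with zero azimuth and elevation lies on the forward axis, at roughly (0, 10, 0) for Y-forward.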

Quadratic interpolation (optional)#

Purpose. Detection outputs are sampled on a discrete range/Doppler grid. Quadratic interpolation uses a small local patch of the NCI (non-coherent integration) surface around each detection to estimate where the peak lies between bins. Those sub-bin corrections are applied to the range and Doppler indices before range (m) and velocity are computed, so both range accuracy and velocity accuracy are improved compared with using integer bin indices alone.

When enableQuadraticInterpolation is true and an NCI tensor is provided:

  • The host transfers NCI from DRAM to a padded L2 buffer (with cyclic padding in Doppler and optional padding in range) via SQDF so that 3×3 neighborhoods around any (range, Doppler) bin are well-defined.

  • The VPU opens GSDF with external base set to the L2 padded NCI buffer. For each target in the batch, folded Doppler index is computed from Doppler bin index; then GSDF gather offsets are computed for the 3×3 NCI tiles around (rangeBinIndex, dopplerFoldedIndex). Up to NUM_NCI_NODES (32) tiles are gathered per GSDF trigger.

  • The kernel uses quadratic interpolation on the 3×3 NCI values (parabolic fit \(p = (Y_{-1} - Y_{+1}) / (2(2Y_0 - Y_{-1} - Y_{+1}))\)) to obtain fractional offsets for range and Doppler, then adds them to the bin indices used in the range and velocity paths above.
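The parabolic fit can be sketched as follows, using the formula exactly as written above. The 3×3 patch layout (first index is the range offset, second index is the Doppler offset, center at `patch[1][1]`) and the zero-offset fallback for a flat neighborhood are assumptions, and the function names are hypothetical. With the formula as written, a larger Y_{+1} than Y_{-1} gives a negative p; which neighbor the kernel labels Y_{-1} and Y_{+1} is not stated here, so the sign convention should be checked against the kernel source.

```python
def parabolic_offset(y_m1, y_0, y_p1):
    """p = (Y_-1 - Y_+1) / (2 * (2*Y_0 - Y_-1 - Y_+1)), as in the text."""
    denom = 2.0 * (2.0 * y_0 - y_m1 - y_p1)
    if denom == 0.0:
        return 0.0  # flat neighborhood: no sub-bin correction (assumption)
    return (y_m1 - y_p1) / denom


def sub_bin_offsets(patch):
    """Return (range_offset, doppler_offset) from a 3x3 NCI patch.

    Assumed layout: patch[range offset][Doppler offset], center at [1][1].
    """
    range_off = parabolic_offset(patch[0][1], patch[1][1], patch[2][1])
    doppler_off = parabolic_offset(patch[1][0], patch[1][1], patch[1][2])
    return range_off, doppler_off
```

For a true local maximum at the center, the magnitude of each offset is at most 0.5 bin.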

AGEN Configurations#

  • AGEN: Separate AGEN configurations are used to load target index map, range bin index, Doppler bin index, and target angles, and to store velocity, range, azimuth, elevation, X, Y, Z, and optionally power in the target list buffer. Batched vector loads/stores (e.g. 16 elements per vector) are used for throughput.

Double-buffering#

Target index map, target angles, and target list use ping-pong buffers (RDF_DOUBLE). While the current batch is being processed, the next batch’s target index map and target angles are loaded via SQDF. After processing, the current batch’s target list is written via SQDF; the next batch’s load is overlapped with the current batch’s store to improve throughput.
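The ping-pong ordering can be illustrated by listing the order in which operations are issued. This is a sequential model of the schedule only: it does not model actual DMA overlap, and the function name `ping_pong_schedule` and the tuple format are hypothetical.

```python
def ping_pong_schedule(num_batches):
    """Return the issue order as (operation, batch, buffer) tuples.

    Buffer index alternates 0/1 per batch. The load for batch b+1 is
    issued before batch b is computed, so it can overlap with that work.
    """
    steps = []
    if num_batches == 0:
        return steps
    steps.append(("load", 0, 0))  # prime the first buffer
    for b in range(num_batches):
        if b + 1 < num_batches:
            steps.append(("load", b + 1, (b + 1) % 2))  # prefetch next batch
        steps.append(("compute", b, b % 2))
        steps.append(("store", b, b % 2))
    return steps
```

For two batches the order is: load 0, load 1, compute 0, store 0, compute 1, store 1.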

Performance#

  • Target processing is compute-bound and scales with the number of targets.

  • The most time-consuming operations are:

    1. Detection list transpose (performed once)

    2. Per-batch sampler lookups and range/velocity/Cartesian computation

    3. Optional NCI GSDF and quadratic interpolation (when enabled)

    4. SQDF transfers for target index map, target angles, and target list

  • Batch size of 64 is chosen to balance VMEM usage and enable efficient vectorization.

  • Double-buffering is used to overlap data transfer with computation for increased throughput.

Execution Time is the average time required to execute the operator on a single VPU core. Note that each PVA contains two VPU cores, which can operate in parallel to process two streams simultaneously, or reduce execution time by approximately half by splitting the workload between the two cores.

Total Power represents the average total power consumed by the module when the operator is executed concurrently on both VPU cores. Idle power is approximately 7W when the PVA is not processing data.

For detailed information on interpreting the performance table below and understanding the benchmarking setup, see Performance Benchmark.