RadarNCI (Non-Coherent Integration)#

Overview#

Non-Coherent Integration (NCI) [1] [2] is a fundamental radar signal processing technique that improves signal-to-noise ratio (SNR) by combining multiple radar returns across RX channels and Doppler dimensions. The algorithm performs power-based accumulation of range-Doppler maps, which is particularly effective when phase coherence cannot be maintained across all measurements. In automotive radar applications, NCI enables robust target detection in noisy environments by exploiting the diversity provided by multiple RX antennas and multiple Doppler measurements.

Algorithm Description#

The NCI implementation uses the following processing scheme:

The NCI processing input consists of:

1. FT2D, the 2D FFT output representing the range-Doppler map with complex values for each RX antenna.

The algorithm produces three outputs:

1. nciRx, unfolded range-Doppler map representing the NCI combined magnitude over RX antennas.
2. nciFinal, folded range-Doppler map accumulated over Doppler folds.
3. noiseEstimate, estimated noise level computed per range bin (optional, controlled by PVARadarNCIParams::noiseEstimationEnabled).

Supported input layouts (configured via PVARadarNCIParams::inputLayout), where Nb is the number of range bins, Nr is the number of Doppler bins (ramps/chirps), and NofRx is the number of RX antennas:

1. HCW (PVA_DOPPLER_FFT_OUTPUT_LAYOUT_RANGE_RX_DOPPLER): shape [Nb][NofRx][Nr].
2. HWC (PVA_DOPPLER_FFT_OUTPUT_LAYOUT_RANGE_DOPPLER_RX): shape [Nb][Nr][NofRx].

The processing flow is as follows:

for each tile in the input data:
   // Stage 1: Compute magnitude and accumulate across RX channels
   // For PVA_DOPPLER_FFT_OUTPUT_LAYOUT_RANGE_RX_DOPPLER layout: RX channels are processed in pairs (compute_nciRx_first / compute_nciRx)
   // For PVA_DOPPLER_FFT_OUTPUT_LAYOUT_RANGE_DOPPLER_RX layout: All RX channels accumulated in a single fused kernel
   //                 (compute_nciRx_RangeDopplerRx) using predicated stores
   for each pair of RX channels:
      magnitude = sqrt(real² + imag²)
      nciRx += magnitude

   // Average across RX channels
   nciRx = nciRx >> log2(numRxChannels)

   // Once all Doppler bins are accumulated for a row of range bins,
   // proceed to Doppler fold accumulation
   // Stage 2: Accumulate across Doppler folds
   for each group of 4 Doppler folds:
      nciFinal += sum of 4 consecutive Doppler fold segments from nciRx

   // Average across Doppler folds
   nciFinal = nciFinal >> log2(numDopplerFolds)

   // Stage 3: Compute noise estimate
   noiseEstimate = min_nciFinal * threshold

Notes:

The RX channels are processed in pairs and the Doppler folds are processed in groups of 4 so that both counts remain configurable. If a deployment fixes the number of RX channels and the number of Doppler folds, the kernels can be specialized for those values by removing the pair and group processing along with much of the conditional logic.
The noise estimation is performed along with the NCI computation using the agen minmax hardware feature in the compute_nciFinal kernel.
Input tiles may not have all Doppler bins for a channel, so the NCI accumulation for Doppler folds and noise estimation are done after all Doppler bins are accumulated, which may be after one or more input tiles are processed.
For the PVA_DOPPLER_FFT_OUTPUT_LAYOUT_RANGE_DOPPLER_RX layout, the input tile is loaded using transpose mode (TRANS_MODE_2) and the kernel uses a temporary buffer (nciRxTemp) with a subsequent copy step (copy_nciRxTemp_to_nciRx) to reorganize data into the [Nr][Nb] output layout before the Doppler fold stage.
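The processing flow above can be sketched as a small host-side reference model. This is an illustrative sketch, not the PVA kernel code: the function name `nci_reference`, the nested-list data representation, the [Nb][Nr] orientation of the returned maps, and the use of `math.isqrt` as an exact stand-in for the float square root followed by truncation are all assumptions made for readability. It covers only the [Nb][NofRx][Nr] layout and ignores tiling.

```python
import math


def nci_reference(ft2d, num_folds, threshold):
    """Hypothetical reference model for input shaped [Nb][NofRx][Nr].

    ft2d holds (real, imag) integer pairs. Returns (nci_rx, nci_final, noise),
    where nci_rx is [Nb][Nr], nci_final is [Nb][Nr / num_folds], and noise
    holds one value per range bin.
    """
    num_rx = len(ft2d[0])
    num_doppler = len(ft2d[0][0])
    bins_per_fold = num_doppler // num_folds
    # log2 of a power-of-two count, used for shift-based averaging.
    rx_shift = num_rx.bit_length() - 1
    fold_shift = num_folds.bit_length() - 1

    nci_rx, nci_final, noise = [], [], []
    for range_row in ft2d:
        # Stage 1: sum truncated magnitudes over RX channels, then average.
        row_rx = []
        for d in range(num_doppler):
            acc = 0
            for rx in range(num_rx):
                re, im = range_row[rx][d]
                acc += math.isqrt(re * re + im * im)
            row_rx.append(acc >> rx_shift)

        # Stage 2: add the contiguous fold segments, then average.
        row_final = []
        for b in range(bins_per_fold):
            acc = sum(row_rx[f * bins_per_fold + b] for f in range(num_folds))
            row_final.append(acc >> fold_shift)

        # Stage 3: scale the per-range-bin minimum by the threshold.
        nci_rx.append(row_rx)
        nci_final.append(row_final)
        noise.append(min(row_final) * threshold)
    return nci_rx, nci_final, noise
```

The model accepts any power-of-two channel and fold counts, so it can be exercised with sizes far smaller than the design requirements allow, which is convenient when comparing against device output on a handful of range bins.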

Design Requirements#

The number of Doppler samples divided by the repeat fold must be at least 16, and the total samples must be divisible by 8.
The number of range bins must be a multiple of the tile height.
The number of RX channels is limited to 2, 4, 8, 16, 32 and 64.
The combination of number of range bins, RX channels and number of Doppler bins must not exceed the tile dimensions.
The number of Doppler bins must be divisible by 8.
There should be at least 4 Doppler folds, and the number of Doppler folds must be a power of 2.
There should be at least 32 Doppler bins in a Doppler fold, and the number of Doppler bins in a Doppler fold must be divisible by 32.
The number of Doppler bins should be a multiple of the number of RX channels.
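The numeric requirements above lend themselves to a pre-flight check on the host. The sketch below is hypothetical: `check_nci_config` and its parameter names are not part of the operator API, the tile height is passed in because its value is not stated here, and the tile-capacity rules are left out because they depend on buffer sizes. It also assumes the Doppler bins divide evenly among the folds.

```python
def check_nci_config(num_range_bins, num_rx, num_doppler_bins, num_folds, tile_height):
    """Return a list of violated requirements (empty when all checks pass)."""
    errors = []

    if num_rx not in (2, 4, 8, 16, 32, 64):
        errors.append("RX channel count must be one of 2, 4, 8, 16, 32, 64")

    if num_doppler_bins % 8 != 0:
        errors.append("Doppler bin count must be divisible by 8")

    # At least 4 folds, and a power of two (exactly one bit set).
    if num_folds < 4 or (num_folds & (num_folds - 1)) != 0:
        errors.append("Doppler fold count must be a power of 2 and at least 4")

    # Each fold must hold a multiple of 32 bins, with a minimum of 32.
    if num_folds > 0:
        bins_per_fold, remainder = divmod(num_doppler_bins, num_folds)
        if remainder != 0 or bins_per_fold < 32 or bins_per_fold % 32 != 0:
            errors.append("each Doppler fold must hold a multiple of 32 bins")

    if num_rx > 0 and num_doppler_bins % num_rx != 0:
        errors.append("Doppler bin count must be a multiple of the RX channel count")

    if tile_height > 0 and num_range_bins % tile_height != 0:
        errors.append("range bin count must be a multiple of the tile height")

    return errors
```

Returning every violation instead of stopping at the first one makes it easier to correct a configuration in a single pass.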

Implementation Details#

Input Layout#

The operator supports two input tensor layouts:

HCW (PVA_DOPPLER_FFT_OUTPUT_LAYOUT_RANGE_RX_DOPPLER): [Nb][NofRx][Nr]. Each range bin contains NofRx contiguous channel blocks, each with Nr interleaved real/imaginary Doppler samples.
HWC (PVA_DOPPLER_FFT_OUTPUT_LAYOUT_RANGE_DOPPLER_RX): [Nb][Nr][NofRx]. Each range bin contains Nr Doppler sample blocks, each with NofRx interleaved real/imaginary channel values.

The layout is specified by the inputLayout field in PVARadarNCIParams.
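To make the difference between the two layouts concrete, the following sketch computes where a given complex sample sits in a flat buffer. It assumes a densely packed tensor with no row padding and the real component stored before the imaginary one; neither assumption is confirmed by this page. The function name and the string layout tags are hypothetical.

```python
def sample_offset(layout, r, rx, d, num_rx, num_doppler):
    """Return the element index of the real part of sample (r, rx, d).

    Under the interleaving assumption, the imaginary part follows at index + 1.
    """
    if layout == "RANGE_RX_DOPPLER":
        # [Nb][NofRx][Nr]: Doppler is the fastest-varying dimension.
        complex_index = (r * num_rx + rx) * num_doppler + d
    elif layout == "RANGE_DOPPLER_RX":
        # [Nb][Nr][NofRx]: the RX channel is the fastest-varying dimension.
        complex_index = (r * num_doppler + d) * num_rx + rx
    else:
        raise ValueError(f"unknown layout: {layout}")
    return 2 * complex_index


# Same logical sample, different memory positions in the two layouts.
offset_a = sample_offset("RANGE_RX_DOPPLER", 1, 2, 3, num_rx=4, num_doppler=512)
offset_b = sample_offset("RANGE_DOPPLER_RX", 1, 2, 3, num_rx=4, num_doppler=512)
```

In the first layout, consecutive Doppler samples of one channel are adjacent in memory. In the second, the channels of one Doppler bin are adjacent, which is consistent with that layout accumulating all RX channels in a single fused kernel.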

Input Tiling#

The tile dimensions are calculated based on the maximum input tile size constraint (48 KB) and the data dimensions. The tiling constraints are:

The tile height (NrPerTile) is maximized first, then the tile width (NsPerTile) is set to maximize the number of Doppler samples per tile.
Each tile fits within the VMEM constraints (max input: 48 KB, max nciRx output: 32 KB, max nciFinal: 16 KB).
Double-buffered ping-pong is used for both input and output tiles to overlap DMA with compute.
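A candidate tile shape can be checked against the stated input and nciRx limits with a few lines of arithmetic. This is a rough sketch under assumptions: the element widths are passed in as parameters because they are not given here, the limits are applied to a single buffer (whether they cover both ping-pong buffers is not stated), and the nciFinal limit is omitted because its tile shape depends on the fold configuration. The function and parameter names are hypothetical.

```python
MAX_INPUT_BYTES = 48 * 1024    # stated maximum input tile size
MAX_NCI_RX_BYTES = 32 * 1024   # stated maximum nciRx output size


def tile_fits(range_bins_per_tile, doppler_per_tile, num_rx,
              bytes_per_component, bytes_per_nci_value):
    """Check one input tile and its nciRx output against the stated limits."""
    # Each complex input sample has a real and an imaginary component.
    input_bytes = (range_bins_per_tile * doppler_per_tile * num_rx
                   * 2 * bytes_per_component)
    # nciRx holds one magnitude per (range bin, Doppler bin) after RX integration.
    nci_rx_bytes = range_bins_per_tile * doppler_per_tile * bytes_per_nci_value
    return input_bytes <= MAX_INPUT_BYTES and nci_rx_bytes <= MAX_NCI_RX_BYTES
```

For example, with 4 RX channels, 512 Doppler samples per tile, and an assumed 2 bytes per component, 4 range bins give a 32 KB input tile that fits, while 8 range bins give 64 KB and do not.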

NCI Accumulation#

The NCI accumulation is performed in multiple stages:

Stage 1 - RX Channel Integration (nciRx):

For the PVA_DOPPLER_FFT_OUTPUT_LAYOUT_RANGE_RX_DOPPLER layout, the algorithm processes RX channels in pairs. For each pair, it computes the magnitude sqrt(real² + imag²) using floating-point conversion (vintx_vfp), fused multiply-add (vmaddf), square root (vfsqrt), and truncation back to integer. The first pair initializes the accumulation buffer (compute_nciRx_first), while subsequent pairs read-modify-write (compute_nciRx). After all channels are accumulated, the result is averaged by right-shifting by log2(numRxChannels) using the agen truncate mode.

For the PVA_DOPPLER_FFT_OUTPUT_LAYOUT_RANGE_DOPPLER_RX layout, a single fused kernel (compute_nciRx_RangeDopplerRx) processes all RX channels using predicated stores. It uses mod_inc to track the RX channel pair index and stores the accumulated result only after all pairs for a given Doppler/Range position have been summed. The output is written to a temporary transposed buffer and subsequently copied with shift-based averaging to the final nciRx output.
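The pairwise scheme used for the PVA_DOPPLER_FFT_OUTPUT_LAYOUT_RANGE_RX_DOPPLER layout can be emulated in scalar code. The sketch below mirrors the described steps (float conversion, multiply-add, square root, truncation, then shift-based averaging) for one range bin. It is an illustration only: the function names are hypothetical, Python floats are double precision and do not reproduce the VPU's arithmetic bit for bit, and an even, power-of-two channel count is assumed.

```python
import math


def magnitude(re, im):
    # Convert to float, form re^2 + im^2, take the root, truncate to integer.
    re_f, im_f = float(re), float(im)
    return int(math.sqrt(re_f * re_f + im_f * im_f))


def accumulate_rx_pairs(channels):
    """channels: NofRx lists, each holding (real, imag) pairs per Doppler bin."""
    num_rx = len(channels)
    acc = None
    for first in range(0, num_rx, 2):
        pair_sum = [magnitude(*a) + magnitude(*b)
                    for a, b in zip(channels[first], channels[first + 1])]
        if acc is None:
            # The first pair initializes the accumulation buffer.
            acc = pair_sum
        else:
            # Subsequent pairs read, add, and write back.
            acc = [x + y for x, y in zip(acc, pair_sum)]
    # Average by shifting right by log2(num_rx).
    shift = num_rx.bit_length() - 1
    return [value >> shift for value in acc]
```

Separating the first pair from the rest avoids a pass that clears the accumulation buffer, which is presumably why the device code has two kernels for this layout.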

Stage 2 - Doppler Fold Accumulation: The unfolded Doppler data is reorganized into folded segments. Groups of 4 Doppler folds are processed together using SIMD operations. This is done to keep the number of Doppler folds configurable. The accumulation across folds improves SNR for targets with ambiguous Doppler. The final result is averaged by right-shifting by log2(numDopplerFolds).

Stage 3 - Noise Estimation: Concurrent with the NCI processing, noise statistics are computed for each range bin. This provides a reference for subsequent detection algorithms. The agen minmax collection is used to get the minimum values of the NCI accumulation buffer per range bin, which is then used to estimate the noise level by scaling it by a threshold.

The overall device-side processing flow is illustrated in the following figure:

Figure 1: NCI’s Device Side Processing Flow

Performance#

The NCI algorithm’s performance characteristics can be decomposed as follows:

The NciRx kernel is compute-bound (dominated by vfsqrt) and scales with the tile dimensions and number of RX channels.
The NciFinal kernel is memory-bound.
The noise estimation adds minimal overhead as it operates inline with the NciFinal processing using agen minmax hardware.
For the PVA_DOPPLER_FFT_OUTPUT_LAYOUT_RANGE_DOPPLER_RX layout, an additional copy step (copy_nciRxTemp_to_nciRx) is required to reorganize from the transposed temporary buffer.

The implementation is optimized for typical automotive radar configurations with:

2-8 RX channels
512 or 1024 Doppler bins (64 or 128 folded bins)
Range dimensions that are multiples of the tile height

Execution Time is the average time required to execute the operator on a single VPU core. Note that each PVA contains two VPU cores, which can operate in parallel to process two streams simultaneously, or reduce execution time by approximately half by splitting the workload between the two cores.

Total Power represents the average total power consumed by the module when the operator is executed concurrently on both VPU cores. Idle power is approximately 7W when the PVA is not processing data.

For detailed information on interpreting the performance table below and understanding the benchmarking setup, see Performance Benchmark.

RangeBinCount	RxAntennaCount	DopplerBinCount	DopplerFolds	Layout	NoiseEstimate	Execution Time	Submit Latency	Total Power
256	4	512	8	1	on	0.267ms	0.027ms	14.44W
256	4	512	8	2	on	0.294ms	0.026ms	14.343W
256	8	512	16	1	on	0.603ms	0.027ms	14.444W
256	8	512	16	2	on	0.625ms	0.027ms	14.339W
256	16	512	16	1	on	1.167ms	0.028ms	14.343W
512	2	512	8	1	on	0.278ms	0.029ms	14.544W
512	2	512	8	2	on	0.324ms	0.030ms	14.341W
512	4	512	8	1	on	0.516ms	0.029ms	15.144W
512	4	512	8	1	off	0.516ms	0.028ms	15.041W
512	4	512	8	2	on	0.572ms	0.030ms	14.944W
512	4	1024	8	1	on	0.997ms	0.032ms	15.743W
512	4	1024	8	2	on	1.109ms	0.033ms	15.146W
512	8	512	16	1	on	1.092ms	0.030ms	14.944W
512	8	512	16	1	off	1.091ms	0.028ms	15.041W
512	8	512	16	2	on	1.130ms	0.030ms	14.942W
512	8	512	16	2	off	1.129ms	0.028ms	14.944W
512	8	1024	16	1	on	2.026ms	0.033ms	15.244W
512	8	1024	16	2	on	2.093ms	0.033ms	15.146W
512	16	1024	32	1	on	4.414ms	0.034ms	15.043W
512	16	1024	32	2	on	4.074ms	0.035ms	15.248W

Reference#

[1] M. A. Richards, “Fundamentals of Radar Signal Processing”, 2nd Edition, McGraw-Hill, New York, 2014.

[2] M. A. Richards, “Notes on Noncoherent Integration Gain”, Georgia Tech Research Institute, 2014.