RadarNCI (Non-Coherent Integration)#
Overview#
Non-Coherent Integration (NCI) [1] [2] is a fundamental radar signal processing technique used to improve signal-to-noise ratio (SNR) by combining multiple radar returns across Rx channels and Doppler dimensions. The algorithm performs power-based accumulation of range-Doppler maps, which is particularly effective when phase coherence cannot be maintained across all measurements. NCI is an important step in automotive radar applications where it enables robust target detection in noisy environments by leveraging diversity from multiple Rx antennas and multiple Doppler measurements.
Algorithm Description#
The NCI implementation uses the following processing scheme:
The NCI processing input consists of:
FT2D, 2D FFT output representing the range-Doppler map with complex values for each RX antenna
The algorithm produces three outputs:
nciRx, unfolded range-Doppler map representing the NCI combined magnitude over RX antennas.
nciFinal, folded range-Doppler map accumulated over Doppler folds.
noiseEstimate, estimated noise level computed per range bin (optional, controlled by PVARadarNCIParams::noiseEstimationEnabled).
Supported input layouts (configured via PVARadarNCIParams::inputLayout), where Nb is the number of range bins, Nr is the number of Doppler bins (ramps/chirps), and NofRx is the number of RX antennas:
HCW (PVA_DOPPLER_FFT_OUTPUT_LAYOUT_RANGE_RX_DOPPLER): shape [Nb][NofRx][Nr].
HCW (PVA_DOPPLER_FFT_OUTPUT_LAYOUT_RANGE_DOPPLER_RX): shape [Nb][Nr][NofRx].
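To make the difference between the two layouts concrete, the following minimal sketch computes the flat offset of one complex sample in each layout. The helper name `complex_offset` and the string constants are hypothetical and not part of the operator API. The sketch assumes a densely packed buffer with no row padding, in which each complex sample is stored as a real component followed by an imaginary component.

```python
# Hypothetical helper, not part of the operator API.
# Assumption: dense packing, each complex sample stored as (real, imag).

RANGE_RX_DOPPLER = "RANGE_RX_DOPPLER"   # shape [Nb][NofRx][Nr]
RANGE_DOPPLER_RX = "RANGE_DOPPLER_RX"   # shape [Nb][Nr][NofRx]


def complex_offset(layout, b, rx, d, NofRx, Nr):
    """Return the flat index of the real component of sample (b, rx, d).

    b  : range bin index
    rx : RX antenna index
    d  : Doppler bin index
    """
    if layout == RANGE_RX_DOPPLER:
        # Doppler index varies fastest within a channel block.
        sample = (b * NofRx + rx) * Nr + d
    elif layout == RANGE_DOPPLER_RX:
        # RX index varies fastest within a Doppler block.
        sample = (b * Nr + d) * NofRx + rx
    else:
        raise ValueError(f"unknown layout: {layout}")
    return 2 * sample  # two components per complex sample
```

In the first layout, stepping to the next RX channel skips a whole block of Nr Doppler samples, while in the second layout the RX channels for one Doppler bin sit next to each other.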
The processing flow is as follows:
    for each tile in the input data:
        // Stage 1: Compute magnitude and accumulate across RX channels
        // For PVA_DOPPLER_FFT_OUTPUT_LAYOUT_RANGE_RX_DOPPLER layout: RX channels are processed in pairs (compute_nciRx_first / compute_nciRx)
        // For PVA_DOPPLER_FFT_OUTPUT_LAYOUT_RANGE_DOPPLER_RX layout: All RX channels accumulated in a single fused kernel
        // (compute_nciRx_RangeDopplerRx) using predicated stores
        for each pair of RX channels:
            magnitude = sqrt(real² + imag²)
            nciRx += magnitude
        // Average across RX channels
        nciRx = nciRx >> log2(numRxChannels)
        // Once all Doppler bins are accumulated for a row of range bins,
        // proceed to Doppler fold accumulation
        // Stage 2: Accumulate across Doppler folds
        for each group of 4 Doppler folds:
            nciFinal += sum of 4 consecutive Doppler fold segments from nciRx
        // Average across Doppler folds
        nciFinal = nciFinal >> log2(numDopplerFolds)
        // Stage 3: Compute noise estimate
        noiseEstimate = min_nciFinal * threshold
Notes:
The RX channels are processed in pairs and the Doppler folds in groups of 4 so that both counts remain configurable. If a deployment fixes the number of RX channels and the number of Doppler folds, the kernels can be specialized for those values, removing the pair and group processing along with much of the conditional logic.
The noise estimation is performed along with the NCI computation using the agen minmax hardware feature in the compute_nciFinal kernel.
Input tiles may not have all Doppler bins for a channel, so the NCI accumulation for Doppler folds and noise estimation are done after all Doppler bins are accumulated, which may be after one or more input tiles are processed.
For the PVA_DOPPLER_FFT_OUTPUT_LAYOUT_RANGE_DOPPLER_RX layout, the input tile is loaded using transpose mode (TRANS_MODE_2) and the kernel uses a temporary buffer (nciRxTemp) with a subsequent copy step (copy_nciRxTemp_to_nciRx) to reorganize data into the [Nr][Nb] output layout before the Doppler fold stage.
Design Requirements#
The number of Doppler samples divided by the repeat fold must be at least 16, and the total samples must be divisible by 8.
The number of range bins must be a multiple of the tile height.
The combination of number of range bins, RX channels and number of samples must not exceed the tile dimensions.
The number of RX channels is limited to 2, 4, 8, 16, 32 and 64.
The combination of number of range bins, RX channels and number of Doppler bins must not exceed the tile dimensions.
The number of Doppler bins must be divisible by 8.
There should be at least 4 Doppler folds, and the number of Doppler folds must be a power of 2.
There should be at least 32 Doppler bins in a Doppler fold, and the number of Doppler bins in a Doppler fold must be divisible by 32.
The number of Doppler bins should be a multiple of the number of RX channels.
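The requirements above that depend only on the problem dimensions can be checked before submitting work. The following is a minimal sketch of such a check. The function `check_nci_config` and its parameter names are hypothetical, the tile height is passed in as an assumption because its actual value is chosen by the implementation, and the tile-dimension limits are not checked because they are not stated numerically in this list.

```python
# Hypothetical pre-flight check, not part of the operator API.

def check_nci_config(num_range_bins, num_rx, num_doppler_bins,
                     num_doppler_folds, tile_height):
    """Return a list of violated requirements (an empty list means OK)."""
    problems = []

    if num_rx not in (2, 4, 8, 16, 32, 64):
        problems.append("RX channel count must be 2, 4, 8, 16, 32 or 64")

    if tile_height <= 0 or num_range_bins % tile_height != 0:
        problems.append("range bins must be a multiple of the tile height")

    if num_doppler_bins % 8 != 0:
        problems.append("Doppler bins must be divisible by 8")

    is_pow2 = num_doppler_folds > 0 and \
        (num_doppler_folds & (num_doppler_folds - 1)) == 0
    if num_doppler_folds < 4 or not is_pow2:
        problems.append("Doppler folds must be a power of 2 and at least 4")

    if num_doppler_folds > 0:
        bins_per_fold, rem = divmod(num_doppler_bins, num_doppler_folds)
        if rem != 0 or bins_per_fold < 32 or bins_per_fold % 32 != 0:
            problems.append(
                "each Doppler fold needs a multiple of 32 bins (at least 32)")

    if num_doppler_bins % num_rx != 0:
        problems.append("Doppler bins must be a multiple of the RX count")

    return problems
```

For example, `check_nci_config(512, 4, 512, 8, 16)` returns an empty list under the assumed tile height of 16, while raising the fold count to 32 leaves only 16 bins per fold and is reported as a violation.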
Implementation Details#
Input Layout#
The operator supports two input tensor layouts:
HCW (PVA_DOPPLER_FFT_OUTPUT_LAYOUT_RANGE_RX_DOPPLER): [Nb][NofRx][Nr]. Each range bin contains NofRx contiguous channel blocks, each with Nr interleaved real/imaginary Doppler samples.
HCW (PVA_DOPPLER_FFT_OUTPUT_LAYOUT_RANGE_DOPPLER_RX): [Nb][Nr][NofRx]. Each range bin contains Nr Doppler sample blocks, each with NofRx interleaved real/imaginary channel values.
The layout is specified by the inputLayout field in PVARadarNCIParams.
Input Tiling#
The tile dimensions are calculated based on the maximum input tile size constraint (48 KB) and the data dimensions. The tiling constraints are:
The tile height (NrPerTile) is maximized first, then the tile width (NsPerTile) is set to maximize the number of Doppler samples per tile.
Each tile fits within the VMEM constraints (max input: 48 KB, max nciRx output: 32 KB, max nciFinal: 16 KB).
Double-buffered ping-pong is used for both input and output tiles to overlap DMA with compute.
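The three buffer limits above can be expressed as a simple budget check for a candidate tile. The sketch below is illustrative only: the function `tile_fits`, its parameter names, and the default element sizes (4 bytes per real or imaginary component and 4 bytes per output value) are assumptions, since the element widths are not stated here. It also assumes the nciFinal buffer holds one folded row per range bin in the tile.

```python
# Hypothetical budget check. Element sizes are assumptions, not documented values.

MAX_INPUT_BYTES = 48 * 1024      # max input tile
MAX_NCI_RX_BYTES = 32 * 1024     # max nciRx output tile
MAX_NCI_FINAL_BYTES = 16 * 1024  # max nciFinal buffer


def tile_fits(rows_per_tile, num_rx, samples_per_tile, bins_per_fold,
              bytes_per_component=4, bytes_per_output=4):
    """Return True if a tile of the given shape fits all three budgets."""
    # Input: complex samples (real + imaginary) for every RX channel.
    input_bytes = (rows_per_tile * num_rx * samples_per_tile
                   * 2 * bytes_per_component)
    # nciRx: one magnitude per (range bin, Doppler sample) after RX integration.
    nci_rx_bytes = rows_per_tile * samples_per_tile * bytes_per_output
    # nciFinal: one value per (range bin, folded Doppler bin).
    nci_final_bytes = rows_per_tile * bins_per_fold * bytes_per_output

    return (input_bytes <= MAX_INPUT_BYTES
            and nci_rx_bytes <= MAX_NCI_RX_BYTES
            and nci_final_bytes <= MAX_NCI_FINAL_BYTES)
```

Under these assumed sizes, a tile of 3 range bins, 4 RX channels and 512 Doppler samples uses exactly 48 KB of input, so a fourth row would no longer fit.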
NCI Accumulation#
The NCI accumulation is performed in multiple stages:
Stage 1 - RX Channel Integration (nciRx):
For the PVA_DOPPLER_FFT_OUTPUT_LAYOUT_RANGE_RX_DOPPLER layout, the algorithm processes RX channels in pairs. For each pair, it computes the magnitude sqrt(real² + imag²) using floating-point conversion (vintx_vfp), fused multiply-add (vmaddf), square root (vfsqrt), and truncation back to integer. The first pair initializes the accumulation buffer (compute_nciRx_first), while subsequent pairs read-modify-write (compute_nciRx). After all channels are accumulated, the result is averaged by right-shifting by log2(numRxChannels) using the agen truncate mode.
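The pairwise accumulation described above can be modelled in a few lines of scalar Python. This is a behavioural sketch, not the vectorized kernel: the function names are hypothetical, the square root is taken in floating point and truncated toward zero to mirror the conversion steps, and the RX count is assumed to be a power of two so that the average reduces to a right shift.

```python
import math


def magnitude(re, im):
    """sqrt(re^2 + im^2) computed in floating point, truncated to an integer."""
    return int(math.sqrt(float(re) * re + float(im) * im))


def nci_rx_pairwise(channels):
    """Average magnitude across RX channels, processing channels in pairs.

    channels: list with one entry per RX channel, each a list of (re, im)
              tuples of equal length (hypothetical input format).
    """
    num_rx = len(channels)
    assert num_rx >= 2 and (num_rx & (num_rx - 1)) == 0, "power-of-two RX count"
    shift = num_rx.bit_length() - 1  # log2(num_rx)
    n = len(channels[0])

    acc = None
    for p in range(0, num_rx, 2):
        pair_sum = [magnitude(*channels[p][i]) + magnitude(*channels[p + 1][i])
                    for i in range(n)]
        if acc is None:
            acc = pair_sum                        # first pair initializes
        else:
            acc = [a + s for a, s in zip(acc, pair_sum)]  # read-modify-write

    return [a >> shift for a in acc]              # average by right shift
```

With four channels holding (3, 4), (6, 8), (0, 5) and (5, 12) at one position, the magnitudes 5, 10, 5 and 13 sum to 33, and the shift by 2 yields 8. The right shift discards the remainder rather than rounding.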
For the PVA_DOPPLER_FFT_OUTPUT_LAYOUT_RANGE_DOPPLER_RX layout, a single fused kernel (compute_nciRx_RangeDopplerRx) processes all RX channels using predicated stores. It uses mod_inc to track the RX channel pair index and stores the accumulated result only after all pairs for a given Doppler/Range position have been summed. The output is written to a temporary transposed buffer and subsequently copied with shift-based averaging to the final nciRx output.
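The idea of storing only after the last RX pair can be sketched with a wrapping counter. In the scalar model below, the modulo increment stands in for mod_inc and the `if` stands in for the store predicate. The function name and the flat input format are hypothetical, the transposition into the temporary buffer is not modelled, and the shift-based averaging is applied in a separate final pass to mirror the copy step.

```python
import math


def nci_rx_fused(samples, num_rx):
    """Accumulate magnitudes over all RX channels in a single pass.

    samples: flat list of (re, im) tuples with the RX index varying fastest,
             i.e. num_rx consecutive entries per Doppler/range position.
    """
    num_pairs = num_rx // 2
    shift = num_rx.bit_length() - 1   # log2(num_rx), power-of-two RX count

    temp = []        # stands in for the temporary buffer
    acc = 0
    pair_idx = 0
    for k in range(0, len(samples), 2):
        (re0, im0), (re1, im1) = samples[k], samples[k + 1]
        acc += int(math.sqrt(re0 * re0 + im0 * im0))
        acc += int(math.sqrt(re1 * re1 + im1 * im1))
        pair_idx = (pair_idx + 1) % num_pairs   # wrapping pair counter
        if pair_idx == 0:                        # last pair done: store
            temp.append(acc)
            acc = 0

    # Copy step with shift-based averaging.
    return [v >> shift for v in temp]
```

Because the counter wraps to zero exactly once per position, each output element is written once, after all pairs for that position have been summed.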
Stage 2 - Doppler Fold Accumulation: The unfolded Doppler data is reorganized into folded segments. Groups of 4 Doppler folds are processed together using SIMD operations. This is done to keep the number of Doppler folds configurable. The accumulation across folds improves SNR for targets with ambiguous Doppler. The final result is averaged by right-shifting by log2(numDopplerFolds).
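As an illustration of the fold accumulation, the sketch below folds one unfolded row of nciRx into a shorter row, adding four fold segments per pass. The function name is hypothetical, and the sketch assumes that fold f occupies the contiguous segment starting at index f * binsPerFold and that the fold count is a power of two that is at least 4.

```python
def fold_doppler(nci_rx_row, num_folds):
    """Fold one unfolded Doppler row into len(row) // num_folds bins.

    nci_rx_row: list of magnitudes for one range bin (length Nr).
    num_folds : number of Doppler folds (power of two, at least 4).
    """
    assert num_folds >= 4 and (num_folds & (num_folds - 1)) == 0
    assert len(nci_rx_row) % num_folds == 0
    bins_per_fold = len(nci_rx_row) // num_folds
    shift = num_folds.bit_length() - 1   # log2(num_folds)

    acc = [0] * bins_per_fold
    for g in range(0, num_folds, 4):     # one group of 4 folds per pass
        for i in range(bins_per_fold):
            acc[i] += sum(nci_rx_row[(g + f) * bins_per_fold + i]
                          for f in range(4))

    return [a >> shift for a in acc]     # average across folds
```

For a row `[0, 1, 2, 3, 4, 5, 6, 7]` with 4 folds, bin 0 accumulates 0 + 2 + 4 + 6 = 12 and bin 1 accumulates 1 + 3 + 5 + 7 = 16, giving `[3, 4]` after the shift by 2.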
Stage 3 - Noise Estimation: Concurrent with the NCI processing, noise statistics are computed for each range bin. This provides a reference for subsequent detection algorithms. The agen minmax collection yields the minimum value of the NCI accumulation buffer for each range bin; the noise level is then estimated by scaling that minimum by a threshold.
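In scalar form, the noise estimate reduces to a per-row minimum followed by a scale. The sketch below assumes the threshold is a plain multiplicative factor; the actual numeric format of the threshold is not specified here, and the function name is hypothetical.

```python
def noise_estimate(nci_final, threshold):
    """Per-range-bin noise estimate: minimum folded value times a threshold.

    nci_final: list of rows, one row of folded Doppler bins per range bin.
    threshold: multiplicative scale factor (assumed format).
    """
    return [min(row) * threshold for row in nci_final]
```

Taking the minimum over the folded Doppler bins treats the quietest bin in each range row as the noise floor, which is then lifted by the threshold to form the reference level.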
The overall device-side processing flow is illustrated in the following figure:
Figure 1: NCI’s Device Side Processing Flow#
Performance#
The NCI algorithm’s performance characteristics can be decomposed as follows:
The NciRx kernel is compute-bound (dominated by vfsqrt) and scales with the tile dimensions and number of RX channels.
The NciFinal kernel is memory-bound.
The noise estimation adds minimal overhead as it operates inline with the NciFinal processing using agen minmax hardware.
For the PVA_DOPPLER_FFT_OUTPUT_LAYOUT_RANGE_DOPPLER_RX layout, an additional copy step (copy_nciRxTemp_to_nciRx) is required to reorganize from the transposed temporary buffer.
The implementation is optimized for typical automotive radar configurations with:
2-8 RX channels
512 or 1024 Doppler bins (64 or 128 folded bins)
Range dimensions that are multiples of the tile height
Execution Time is the average time required to execute the operator on a single VPU core.
Note that each PVA contains two VPU cores, which can operate in parallel to process two streams simultaneously, or reduce execution time by approximately half by splitting the workload between the two cores.
Total Power represents the average total power consumed by the module when the operator is executed concurrently on both VPU cores.
Idle power is approximately 7W when the PVA is not processing data.
For detailed information on interpreting the performance table below and understanding the benchmarking setup, see Performance Benchmark.
| RangeBinCount | RxAntennaCount | DopplerBinCount | DopplerFolds | Layout | NoiseEstimate | Execution Time | Submit Latency | Total Power |
|---|---|---|---|---|---|---|---|---|
| 256 | 4 | 512 | 8 | 1 | on | 0.267ms | 0.027ms | 14.44W |
| 256 | 4 | 512 | 8 | 2 | on | 0.294ms | 0.026ms | 14.343W |
| 256 | 8 | 512 | 16 | 1 | on | 0.603ms | 0.027ms | 14.444W |
| 256 | 8 | 512 | 16 | 2 | on | 0.625ms | 0.027ms | 14.339W |
| 256 | 16 | 512 | 16 | 1 | on | 1.167ms | 0.028ms | 14.343W |
| 512 | 2 | 512 | 8 | 1 | on | 0.278ms | 0.029ms | 14.544W |
| 512 | 2 | 512 | 8 | 2 | on | 0.324ms | 0.030ms | 14.341W |
| 512 | 4 | 512 | 8 | 1 | on | 0.516ms | 0.029ms | 15.144W |
| 512 | 4 | 512 | 8 | 1 | off | 0.516ms | 0.028ms | 15.041W |
| 512 | 4 | 512 | 8 | 2 | on | 0.572ms | 0.030ms | 14.944W |
| 512 | 4 | 1024 | 8 | 1 | on | 0.997ms | 0.032ms | 15.743W |
| 512 | 4 | 1024 | 8 | 2 | on | 1.109ms | 0.033ms | 15.146W |
| 512 | 8 | 512 | 16 | 1 | on | 1.092ms | 0.030ms | 14.944W |
| 512 | 8 | 512 | 16 | 1 | off | 1.091ms | 0.028ms | 15.041W |
| 512 | 8 | 512 | 16 | 2 | on | 1.130ms | 0.030ms | 14.942W |
| 512 | 8 | 512 | 16 | 2 | off | 1.129ms | 0.028ms | 14.944W |
| 512 | 8 | 1024 | 16 | 1 | on | 2.026ms | 0.033ms | 15.244W |
| 512 | 8 | 1024 | 16 | 2 | on | 2.093ms | 0.033ms | 15.146W |
| 512 | 16 | 1024 | 32 | 1 | on | 4.414ms | 0.034ms | 15.043W |
| 512 | 16 | 1024 | 32 | 2 | on | 4.074ms | 0.035ms | 15.248W |
Reference#
[1] M. A. Richards, “Fundamentals of Radar Signal Processing”, 2nd Edition, McGraw-Hill, New York, 2014.
[2] M. A. Richards, “Notes on Noncoherent Integration Gain”, Georgia Tech Research Institute, 2014.