RadarDopplerFFT

Overview

Doppler FFT (Fast Fourier Transform) is a critical signal processing step in radar systems that follows Range FFT in the radar processing pipeline. It transforms range-processed data from the slow time domain into the frequency domain to extract target velocity (Doppler) information. By performing a 1D FFT along the slow time (chirp) dimension for each antenna channel and range bin, the module converts range profiles into range-Doppler maps, where each frequency bin corresponds to a specific radial velocity. The output of Doppler FFT enables velocity estimation and is used as input for subsequent steps such as peak detection and angle estimation.

Windowing is essential for minimizing spectral leakage in Doppler FFT processing, as it is in Range FFT. This module supports multiple window types (e.g., Hanning, Hamming), each with its own predefined coefficients. Users can select or configure the window function to balance velocity resolution against sidelobe suppression for their application.
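As an illustration of the windowing step, the sketch below builds Hann and Hamming coefficients, quantizes them to fixed point, and applies them to complex integer samples along slow time. The function names and the 15-bit coefficient format (`qbits=15`) are assumptions for illustration only; the actual coefficient format used by the module is not stated here.

```python
import math

def hann_window(n):
    """Symmetric Hann window of length n (n >= 2)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]

def hamming_window(n):
    """Symmetric Hamming window of length n (n >= 2)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]

def quantize_window(window, qbits=15):
    """Convert float coefficients to integers with qbits fractional bits (assumed format)."""
    scale = (1 << qbits) - 1
    return [int(round(w * scale)) for w in window]

def apply_window(samples, coeffs, qbits=15):
    """Scale (re, im) integer pairs by the window, using a rounding right shift."""
    half = 1 << (qbits - 1)
    return [((re * c + half) >> qbits, (im * c + half) >> qbits)
            for (re, im), c in zip(samples, coeffs)]

# One slow-time sequence of 5 chirps for a single channel and range bin.
coeffs = quantize_window(hann_window(5))
windowed = apply_window([(1000, -1000)] * 5, coeffs)
```

The Hann window tapers to zero at both ends, so the first and last chirps contribute nothing, while the Hamming window keeps a small pedestal (0.08) there.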

Algorithm Description

The Doppler FFT operator performs a 1D FFT along the slow time (chirp) dimension for each antenna channel and range bin. The input is a 3D tensor of shape (chirps x channels x range bins) from the Range FFT output. If transpose output mode is disabled, the output keeps the same 3D layout, with the chirp dimension replaced by Doppler (velocity) bins, giving a range-Doppler map of shape (Doppler bins x channels x range bins). If transpose output mode is enabled, the range bins and Doppler bins dimensions are swapped, giving a range-Doppler map of shape (range bins x channels x Doppler bins). For performance, the implementation uses a mixed radix-4 and radix-2 FFT algorithm. The supported complex FFT sizes are 256 and 512.
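The input and output layouts can be illustrated with a small floating-point reference model. This is a sketch only: it uses a naive DFT rather than the mixed-radix fixed-point kernels, and the names `dft` and `doppler_fft` are hypothetical, not part of the operator's API.

```python
import cmath

def dft(x):
    """Naive O(n^2) DFT of a complex sequence (reference only)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def doppler_fft(cube, transpose_output=False):
    """cube[chirp][channel][range_bin] -> out[doppler][channel][range_bin],
    or out[range_bin][channel][doppler] when transpose_output is True."""
    n_chirps, n_ch, n_rng = len(cube), len(cube[0]), len(cube[0][0])
    if transpose_output:
        out = [[[0j] * n_chirps for _ in range(n_ch)] for _ in range(n_rng)]
    else:
        out = [[[0j] * n_rng for _ in range(n_ch)] for _ in range(n_chirps)]
    for c in range(n_ch):
        for r in range(n_rng):
            # Gather the slow-time sequence for this channel and range bin.
            spectrum = dft([cube[t][c][r] for t in range(n_chirps)])
            for k, v in enumerate(spectrum):
                if transpose_output:
                    out[r][c][k] = v
                else:
                    out[k][c][r] = v
    return out
```

A target placed in one range bin with a constant phase step between chirps shows up as a single peak at the matching Doppler index, at `out[k][c][r]` in the default layout and at `out[r][c][k]` in the transposed layout.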

Implementation Details

Parameters

  • Input tensor with shape (chirps x channels x range bins) and data type complex int32.

  • Output tensor with data type complex int32 and shape (Doppler bins x channels x range bins) if transpose output mode is disabled, or (range bins x channels x Doppler bins) if transpose output mode is enabled.

Dataflow Configuration

  • Use 1 SQDF (SequenceDataflow) to transfer the twiddle factors and the coefficients of the window function from DRAM to VMEM.

  • Use 1 input and 1 output SQDFs to split the whole input/output tensor into tiles and transfer them between DRAM and VMEM one by one.

    • The width of the input tile equals 8 batches per tile and the height equals the number of chirps.

    • If transpose output mode is disabled, the width of the output tile equals 8 batches per tile and the height equals the number of Doppler bins.

    • If transpose output mode is enabled, the width of the output tile equals the number of Doppler bins and the height equals 8 batches per tile.
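The tile dimensions described above can be summarized in a short helper. This is a hedged sketch: it assumes that one batch is one (channel, range bin) pair, that the number of Doppler bins equals the number of chirps, and that a partial last tile is counted as a full tile. None of these details are confirmed here, and `tile_plan` is a hypothetical name.

```python
import math

BATCHES_PER_TILE = 8  # tile width (or height, when transposed) stated above

def tile_plan(chirps, channels, range_bins, transpose_output=False):
    """Return the tile count and (width, height) of the input and output tiles."""
    # Assumption: one batch is the slow-time sequence of one (channel, range bin) pair.
    batches = channels * range_bins
    n_tiles = math.ceil(batches / BATCHES_PER_TILE)
    doppler_bins = chirps  # assumption: FFT size equals the chirp count
    input_tile = (BATCHES_PER_TILE, chirps)
    if transpose_output:
        output_tile = (doppler_bins, BATCHES_PER_TILE)
    else:
        output_tile = (BATCHES_PER_TILE, doppler_bins)
    return {"tiles": n_tiles, "input_tile_wh": input_tile, "output_tile_wh": output_tile}

plan = tile_plan(chirps=512, channels=4, range_bins=257, transpose_output=True)
```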

VMEM Buffer Allocation

  • 1 buffer with data type int32 to store the twiddle factors and 1 buffer with data type int32 to store the coefficients of the window function.

  • 1 input buffer with data type int32 and double buffering to store each tile read from the input tensor.

  • 1 output buffer with data type int32 and double buffering to store each tile written to the output tensor.

  • 1 temporary buffer with data type int32 to store the data for transpose operation.

Workflow Implementation

The kernels of the Doppler FFT operator are implemented with high-performance fixed-point vectorized instructions on PVA and run the following steps:

  1. The windowing function.

  2. \(\lceil \log_4 N \rceil - 1\) stages of the fft_batched_radix4 function with radix-4 operation.

  3. The digit_reverse_transpose function, which performs the last stage as a radix-4 operation. If \(N\) is not a power of 4, it performs a radix-2 operation instead.

    • If transpose output mode is enabled, the digit_reverse_transpose function will transpose the data layout in the output tile.
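The stage count in the steps above can be checked with a small helper that lists the radix of each stage for a given FFT size \(N\). The name `stage_plan` is hypothetical; the sketch only reproduces the arithmetic, with the last list entry standing for the stage executed by digit_reverse_transpose.

```python
def stage_plan(n):
    """Return the radix of each FFT stage for size n (a power of two, n >= 4)."""
    if n < 4 or n & (n - 1):
        raise ValueError("n must be a power of two and at least 4")
    log2n = n.bit_length() - 1
    ceil_log4 = (log2n + 1) // 2               # ceil(log_4(n)) for a power of two
    radix4_stages = ceil_log4 - 1              # stages run by fft_batched_radix4
    last_radix = 4 if log2n % 2 == 0 else 2    # stage run by digit_reverse_transpose
    return [4] * radix4_stages + [last_radix]

plan_256 = stage_plan(256)   # [4, 4, 4, 4]
plan_512 = stage_plan(512)   # [4, 4, 4, 4, 2]
```

For both supported sizes the product of the radices equals \(N\): 256 is a power of 4 and ends with a radix-4 stage, while 512 is not and ends with a radix-2 stage.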

Agen Configuration

The agen configurations are set according to the radix-4 or radix-2 operation in each stage. In the last radix stage, they implement digit-reversal addressing with zero overhead. Overflow can occur while the butterfly calculation accumulates its inputs in a radix-4 or radix-2 operation. To prevent it, the rounding parameter of the store agen is set to 2 bits for a radix-4 stage and 1 bit for a radix-2 stage. The rounding parameter must also account for the quantization bits (qbits) of the twiddle factors when they are used in the current stage. Because rounding is applied when the output data is stored at the end of each stage, the accumulation in subsequent stages does not overflow.
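The effect of the per-stage rounding can be sketched with plain integers. A radix-4 butterfly sums four inputs, so its output can grow by 2 bits; a radix-2 butterfly sums two inputs and can grow by 1 bit. The helper names, the round-half-up behavior, and the additive treatment of the twiddle qbits are assumptions for illustration; the hardware rounding mode is not described here.

```python
INT32_MAX = 2**31 - 1

def rounding_shift(value, bits):
    """Arithmetic right shift with round-half-up (assumed rounding mode)."""
    if bits == 0:
        return value
    return (value + (1 << (bits - 1))) >> bits

def store_shift(radix, twiddle_qbits=0):
    """Bits dropped on store: butterfly growth plus twiddle qbits, if twiddles were applied."""
    growth = {2: 1, 4: 2}[radix]
    return growth + twiddle_qbits

# Worst case for a radix-4 butterfly: four full-scale inputs added together.
worst_sum = 4 * INT32_MAX
stored = rounding_shift(worst_sum, store_shift(4))
```

After the 2-bit rounding shift the worst-case sum fits in int32 again, so the next stage starts from values within range.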

VPU Function Implementation

The fft_batched_radix4 function uses the following instructions to implement the radix-4 operation:

  • vaddsub4x2_0 and vaddsub4x2_1 are used to accelerate the butterfly calculation in the radix-4 operation.

  • dvcmulw_t16 is used to accelerate the complex multiplication with the twiddle factors.

To achieve the best performance, we manually allocate the vector registers and unroll the loop body in the fft_batched_radix4 function.
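For reference, the radix-4 butterfly itself needs only additions, subtractions, and a swap of real and imaginary parts for the multiplication by \(-j\). The scalar sketch below shows that arithmetic in floating point. It does not model the vaddsub4x2_0, vaddsub4x2_1, or dvcmulw_t16 instructions, whose exact semantics are not described here, and `radix4_butterfly` is a hypothetical name.

```python
import cmath

def radix4_butterfly(a, b, c, d):
    """4-point DFT of (a, b, c, d) using add/sub pairs and one swap for -j."""
    s0, s1 = a + c, a - c                  # first add/sub pair
    s2, s3 = b + d, b - d                  # second add/sub pair
    mj_s3 = complex(s3.imag, -s3.real)     # -j * s3 without a multiplier
    return (s0 + s2, s1 + mj_s3, s0 - s2, s1 - mj_s3)

def dft4(x):
    """Naive 4-point DFT used to check the butterfly."""
    return tuple(sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / 4) for t in range(4))
                 for k in range(4))

inputs = (1 + 2j, -3 + 0.5j, 4 - 1j, 0.25 + 7j)
fast = radix4_butterfly(*inputs)
slow = dft4(inputs)
```

In a full decimation-in-time stage, the twiddle factors would multiply three of the four inputs before this butterfly; that complex multiplication is the part the text attributes to dvcmulw_t16.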

Performance

Execution Time is the average time required to execute the operator on a single VPU core. Note that each PVA contains two VPU cores, which can operate in parallel to process two streams simultaneously, or reduce execution time by approximately half by splitting the workload between the two cores.

Total Power represents the average total power consumed by the module when the operator is executed concurrently on both VPU cores. Idle power is approximately 7W when the PVA is not processing data.

For detailed information on interpreting the performance table below and understanding the benchmarking setup, see Performance Benchmark.

| RangeBinCount | RxAntennaCount | ChirpCount | InputOutputDataType | Execution Time | Submit Latency | Total Power |
|---|---|---|---|---|---|---|
| 257 | 1 | 512 | 2S32 | 0.129ms | 0.021ms | 16.132W |
| 257 | 4 | 512 | 2S32 | 0.447ms | 0.025ms | 16.132W |
| 257 | 8 | 512 | 2S32 | 0.903ms | 0.026ms | 16.132W |
| 513 | 1 | 512 | 2S32 | 0.231ms | 0.024ms | 16.613W |
| 513 | 4 | 512 | 2S32 | 0.874ms | 0.027ms | 16.132W |
| 513 | 8 | 512 | 2S32 | 1.750ms | 0.028ms | 16.132W |
