RadarRangeFFT#

Overview#

Range FFT (Fast Fourier Transform) is a fundamental signal processing step in modern radar systems. It transforms time-domain ADC samples collected from each radar pulse (or chirp) into the frequency domain, enabling the extraction of target range (distance) information. By performing a 1D FFT along the fast time (sample) dimension for each antenna channel and chirp, the module converts raw sampled data into a range profile, where each frequency bin corresponds to a specific distance. The output of Range FFT is used as input for subsequent steps such as Doppler FFT (velocity estimation), CFAR detection, and angle estimation.

Windowing is crucial for minimizing spectral leakage in FFT processing. This module supports multiple window types (e.g., Hanning, Hamming), each with its own set of predefined coefficients. Users can select or configure the window function according to their application needs.
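
As a point of reference, the NumPy sketch below (not the PVA implementation) shows how a window and a real FFT turn one chirp of ADC samples into a range profile; the sample count and window choice here are illustrative assumptions.

```python
import numpy as np

# Illustrative parameters: 512 real ADC samples for a single chirp.
num_samples = 512
adc_samples = np.random.randn(num_samples)         # stand-in for one chirp of real ADC data

# Apply a Hanning window along fast time to reduce spectral leakage,
# then take the real FFT to obtain the range profile.
window = np.hanning(num_samples)
range_profile = np.fft.rfft(adc_samples * window)  # num_samples // 2 + 1 range bins

print(range_profile.shape)                         # (257,)
```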

Algorithm Description#

The Range FFT operator performs a 1D FFT along the fast time (sample) dimension for each antenna channel and chirp. The input is a 3D tensor of shape (chirps x channels x samples). For the convenience of subsequent steps such as the Doppler FFT, the output is transposed to a 3D tensor of shape (chirps x channels x range bins). To achieve the best performance, the implementation uses a mixed radix-4/radix-2 FFT algorithm. The supported real FFT sizes are 512 and 1024.
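
A floating-point reference of the operator's shape contract, written as a hedged NumPy sketch (the tensor sizes below are arbitrary examples, and the fixed-point PVA implementation differs internally):

```python
import numpy as np

def range_fft_reference(x):
    """Floating-point reference: x has shape (chirps, channels, samples).

    Returns a complex tensor of shape (chirps, channels, samples // 2 + 1),
    i.e. one range bin per non-negative frequency.
    """
    return np.fft.rfft(x, axis=-1)

x = np.random.randn(128, 4, 512)   # chirps x channels x samples
y = range_fft_reference(x)
print(y.shape)                     # (128, 4, 257)
```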

Implementation Details#

Parameters#

  • Input tensor with shape (chirps x channels x samples) and data type int32. The input data must be left-shifted by (32 - raw data bits) bits before entering the Range FFT operator, so that the shifted-in low bits serve as fraction bits during FFT processing (see the sketch after this list).

  • Output tensor with shape (chirps x channels x range bins) and data type complex int32.
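
A minimal sketch of the required input scaling, assuming hypothetical 16-bit raw ADC data (the actual raw bit depth depends on the radar front end):

```python
import numpy as np

# Hypothetical example: 16-bit raw ADC samples widened to int32 for the operator.
raw_bits = 16
adc_raw = np.random.randint(-2**15, 2**15, size=(128, 4, 512), dtype=np.int16)

# Left-shift by (32 - raw data bits); the shifted-in low bits act as
# fraction bits during the fixed-point FFT stages.
x_q = adc_raw.astype(np.int32) << (32 - raw_bits)
```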

Dataflow Configuration#

  • Use 1 SQDF (SequenceDataflow) to transfer the twiddle factors and the coefficients of the window function from DRAM to VMEM.

  • Use 1 input and 1 output RDF (RasterDataflow) to split the whole input/output tensor into tiles and transfer them between DRAM and VMEM one by one.

    • The width of the input tile equals the number of samples and the height equals 8 batches per tile (see the tiling sketch after this list).

    • The width of the output tile equals the number of range bins and the height equals 8 batches per tile.
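
The sketch below illustrates the resulting tile count, under the assumption that one batch corresponds to one (chirp, channel) row of samples; the tensor sizes are arbitrary examples.

```python
import math

chirps, channels, samples = 128, 4, 512
batches = chirps * channels                      # one batch per (chirp, channel) row
tile_width, tile_height = samples, 8             # 8 batches per tile
num_tiles = math.ceil(batches / tile_height)     # tiles streamed DRAM <-> VMEM one by one
print(num_tiles)                                 # 64
```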

VMEM Buffer Allocation#

  • 1 buffer with data type int32 to store the twiddle factors and 1 buffer with data type int32 to store the coefficients of the window function.

  • 1 input buffer with data type int32 and double buffering to store the data of each tile read from the input tensor.

  • 1 output buffer with data type int32 and double buffering to store the data of each tile written to the output tensor.

  • 1 temporary buffer with data type int32 to store the data for the transpose operation.

Workflow Implementation#

Since the input data of the Range FFT is real-valued, we can leverage the symmetry of the FFT to reduce the computation by converting the real FFT of size \(N\) into a complex FFT of size \(N/2\).
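
The following NumPy sketch shows the standard form of this trick, as a hedged reference rather than the fixed-point PVA code: even samples become real parts, odd samples become imaginary parts, one complex FFT of size \(N/2\) is computed, and a split step recovers the \(N/2+1\) unique bins.

```python
import numpy as np

def rfft_via_half_size_cfft(x):
    """FFT of a real signal of length N using one complex FFT of length N/2."""
    N = len(x)
    z = x[0::2] + 1j * x[1::2]                          # pack: N real -> N/2 complex samples
    Z = np.fft.fft(z)                                   # complex FFT of size N/2

    k = np.arange(N // 2 + 1)
    Zk = np.concatenate([Z, Z[:1]])[k]                  # Z[k], with Z[N/2] := Z[0]
    Zc = np.conj(np.concatenate([Z[:1], Z[::-1]])[k])   # conj(Z[N/2 - k])
    w = np.exp(-2j * np.pi * k / N)
    return 0.5 * (Zk + Zc) - 0.5j * w * (Zk - Zc)       # split step -> N/2 + 1 bins

x = np.random.randn(512)
assert np.allclose(rfft_via_half_size_cfft(x), np.fft.rfft(x))
```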

The kernels of the Range FFT operator are implemented with high-performance, fixed-point vectorized instructions on the PVA, using the following steps:

  1. The transpose_windowing function transposes the real-valued input data into complex-valued data that follows the batch FFT data layout requirements, applying the window function in the same pass.

  2. \(\lceil \log_4(N/2) \rceil - 1\) stages of the fft_real_batched_radix4 function performing the radix-4 operation.

  3. The digit_reverse function, which performs the final radix-4 operation. If \(N/2\) is not a power of 4, the digit_reverse function executes a radix-2 operation instead (see the digit-reversal sketch after these steps).

  4. The post_process_transpose function generates the \(N/2+1\) complex-valued output data and transposes the output data layout to follow the batch FFT data layout requirements of the subsequent Doppler FFT.
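
For example, with \(N = 512\) the complex FFT size is \(N/2 = 256 = 4^4\), giving 3 radix-4 stages plus a radix-4 digit-reverse stage; with \(N = 1024\), \(N/2 = 512 = 2 \cdot 4^4\), giving 4 radix-4 stages plus a radix-2 digit-reverse stage. The sketch below shows one common form of the mixed-radix digit-reversal permutation that the final stage's addressing realizes; the exact ordering used on the PVA may differ, so treat this as an illustration.

```python
import numpy as np

def digit_reverse_indices(n, radices):
    """Digit-reversed index permutation for a mixed-radix FFT of size n.

    `radices` lists the radix used at each stage, e.g. [4, 4, 4, 4] for n = 256
    or [4, 4, 4, 4, 2] for n = 512 (radix-2 only when n is not a power of 4).
    """
    assert int(np.prod(radices)) == n
    perm = np.empty(n, dtype=np.int64)
    for i in range(n):
        rev, rem = 0, i
        for r in radices:            # peel digits off in stage order,
            rev = rev * r + rem % r  # push them back in reversed order
            rem //= r
        perm[i] = rev
    return perm

print(digit_reverse_indices(16, [4, 4]))
# [ 0  4  8 12  1  5  9 13  2  6 10 14  3  7 11 15]
```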

Agen Configuration#

The agen configurations are set according to the radix-4/radix-2 operation in each stage. The agen configurations in the last radix stage are set to implement the digit-reversal addressing with zero overhead. One issue is that overflow may occur while accumulating the butterfly results of the radix-4/radix-2 operation. The solution is to set the rounding parameter of the store agen in each stage to 2 bits for radix-4 and 1 bit for radix-2. The rounding parameter must also account for the quantization bits (qbits) of the twiddle factors when they are used in the current stage. When the output data is stored at the end of each stage, this rounding prevents accumulation overflow in the subsequent stages.
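
A plausible model of this per-stage rounding, as a hedged sketch (round-half-up is assumed here; the actual behavior of the store agen is defined by the PVA hardware):

```python
import numpy as np

def rounded_store(acc, radix_bits, twiddle_qbits=0):
    """Model of the rounding applied when storing a stage's butterfly results.

    radix_bits is 2 for a radix-4 stage and 1 for a radix-2 stage; twiddle_qbits
    is added in stages that multiply by quantized twiddle factors. Shifting the
    accumulated values back down keeps later butterfly additions within int32.
    """
    shift = radix_bits + twiddle_qbits
    acc = acc.astype(np.int64)                        # headroom for the rounding add
    return ((acc + (1 << (shift - 1))) >> shift).astype(np.int32)
```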

VPU Function Implementation#

To pack the real-valued input data into complex-valued data with zero overhead, the transpose_windowing function uses the dvint_load_transp2_di instruction to load the data from the corresponding locations.

The fft_real_batched_radix4 function uses the following instructions to implement the radix-4 operation:

  • vaddsub4x2_0 and vaddsub4x2_1 are used to accelerate the butterfly calculation in the radix-4 operation.

  • dvcmulw_t16 is used to accelerate the complex multiplication with the twiddle factors.

To achieve the best performance, we manually allocate the vector registers and unroll the loop body in the fft_real_batched_radix4 function.
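
For reference, the scalar form of one radix-4 DIT butterfly that these instructions vectorize, written as a hedged NumPy sketch (the exact instruction semantics are defined by the PVA ISA, and the twiddle placement assumed here is the standard decimation-in-time form):

```python
import numpy as np

def radix4_dit_butterfly(x0, x1, x2, x3, w1, w2, w3):
    """One radix-4 DIT butterfly: twiddle multiplies followed by the add/sub network.

    x0..x3 are complex inputs from the previous stage; w1..w3 are the twiddle
    factors W_N^k, W_N^2k, W_N^3k for this butterfly.
    """
    a0, a1, a2, a3 = x0, x1 * w1, x2 * w2, x3 * w3   # complex multiplies (dvcmulw_t16 role)
    t0, t1 = a0 + a2, a0 - a2                        # add/subtract pairs (vaddsub4x2 role)
    t2, t3 = a1 + a3, a1 - a3
    return t0 + t2, t1 - 1j * t3, t0 - t2, t1 + 1j * t3

# Cross-check against a 4-point DFT (all twiddles equal 1 for a standalone 4-point FFT).
x = np.random.randn(4) + 1j * np.random.randn(4)
y = radix4_dit_butterfly(x[0], x[1], x[2], x[3], 1.0, 1.0, 1.0)
assert np.allclose(y, np.fft.fft(x))
```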

Performance#

Execution Time is the average time required to execute the operator on a single VPU core. Note that each PVA contains two VPU cores, which can operate in parallel to process two streams simultaneously, or reduce execution time by approximately half by splitting the workload between the two cores.

Total Power represents the average total power consumed by the module when the operator is executed concurrently on both VPU cores. Idle power is approximately 7W when the PVA is not processing data.

For detailed information on interpreting the performance table below and understanding the benchmarking setup, see Performance Benchmark.

| SampleCount | RxAntennaCount | ChirpCount | InputDataType | OutputDataType | Execution Time | Submit Latency | Total Power |
|-------------|----------------|------------|---------------|----------------|----------------|----------------|-------------|
| 512         | 1              | 512        | S32           | 2S32           | 0.128 ms       | 0.022 ms       | 16.759 W    |
| 512         | 4              | 512        | S32           | 2S32           | 0.461 ms       | 0.024 ms       | 17.254 W    |
| 512         | 8              | 512        | S32           | 2S32           | 0.906 ms       | 0.026 ms       | 17.254 W    |
| 1024        | 1              | 512        | S32           | 2S32           | 0.242 ms       | 0.024 ms       | 17.524 W    |
| 1024        | 4              | 512        | S32           | 2S32           | 0.914 ms       | 0.028 ms       | 17.907 W    |
| 1024        | 8              | 512        | S32           | 2S32           | 1.811 ms       | 0.030 ms       | 17.535 W    |

References#