Radar Processing Pipeline#

Radar Pipeline Architecture Overview#

The radar PVA pipeline processes raw radar data (ADC samples) and generates a point cloud. It consists of five stages. The Range FFT, Doppler FFT, and Peak Detection stages are implemented in fixed-point arithmetic. The NCI and DOA stages take a hybrid approach: they use fixed-point arithmetic for computational efficiency and floating-point operations where an operation has no native hardware support.

  1. Range FFT Processing -> Converts time-domain samples to range-frequency domain using windowed FFT. The PVA operator used for this stage is pvaRangeFFT.

  2. Doppler FFT Processing -> Processes the range data to extract velocity of targets. The PVA operator used for this stage is pvaDopplerFFT.

  3. NCI Processing -> Performs noise estimation and non-coherent integration to improve signal-to-noise ratio. The PVA operator used is pvaNci.

  4. Peak Detection -> Performs local maximum detection in both range and doppler dimensions, followed by transmitter-specific peak identification and snapshot data extraction. The PVA operator is pvaPeakDetection.

  5. DOA Processing -> Calculates the direction of arrival of the radar signals. It uses input tensors including peak count, calibration vector, and range-doppler snapshots to calculate target information such as azimuth, elevation, range, velocity and 3D points. The output is a 2D tensor with shape [7, DOA_MAX_TARGET_COUNT]. The PVA operator is pvaDOA.
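To make the windowing step of stage 1 concrete, the following is a minimal sketch of how Hanning window coefficients could be pre-computed as signed fixed-point integers. The helper name `makeHannWindowQ`, the symmetric window form, and the choice of fractional bits are assumptions made for illustration; they are not taken from the PVA operator or the sample source.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of the PVA SDK): builds a symmetric Hann
// (Hanning) window and quantizes it to signed integers with `qBits`
// fractional bits, so that 1.0 maps to 2^qBits.
std::vector<int32_t> makeHannWindowQ(std::size_t length, int qBits)
{
    assert(length >= 2 && qBits > 0 && qBits < 31);
    const double kPi   = 3.14159265358979323846;
    const double scale = static_cast<double>(int64_t{1} << qBits);

    std::vector<int32_t> window(length);
    for (std::size_t n = 0; n < length; ++n)
    {
        // w[n] = 0.5 - 0.5 * cos(2*pi*n / (N - 1))
        const double phase = 2.0 * kPi * static_cast<double>(n) / static_cast<double>(length - 1);
        const double w     = 0.5 - 0.5 * std::cos(phase);
        window[n]          = static_cast<int32_t>(std::lround(w * scale));
    }
    return window;
}
```

For a 512-sample chirp, `makeHannWindowQ(512, 20)` would yield 512 coefficients that taper to zero at both ends. Whether the real window tensor uses this scaling is not stated here, so treat the Q-format as a placeholder.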

Radar fixed-point pipeline sample application#

The Radar fixed-point pipeline sample application is a comprehensive reference implementation for processing radar data through multiple computational stages using NVIDIA’s PVA (Programmable Vision Accelerator) platform.

Key Characteristics:

  • Fixed-Point Arithmetic: The pipeline is predominantly fixed-point (S32, 2S32, U32), with floating-point (F32) used where needed, such as the DOA output, to keep computation deterministic, reduce memory bandwidth, and suit PVA hardware acceleration

  • Multi-Stage Processing: Five sequential processing stages that progressively refine raw radar data into target detection information

  • Hardware Acceleration: Leverages NVIDIA PVA for parallel processing of computationally intensive operations like FFT, non-coherent integration, and peak detection

  • Reference Validation: Each stage includes CPU-based reference implementations for accuracy validation and regression testing

  • Batch Processing: Supports processing multiple radar data files in sequence with efficient resource reuse

  • Configurable Parameters: Flexible configuration for different radar configurations (transmit/receive channels, range bins, sampling rates)

Pipeline Data Flow:

The pipeline processes radar data through the following transformation sequence:

  Raw ADC Data -> Range FFT -> Doppler FFT -> NCI Processing -> Peak Detection -> DOA Processing -> Target Information

Technical Implementation:

The implementation follows a modular, object-oriented design pattern with clear separation of concerns:

  • Resource Management: RAII-based tensor and operator lifecycle management with automatic cleanup

  • Memory Optimization: Uses PVA-optimized memory allocators for efficient data placement and access patterns. In production code, users can leverage NvSci and CUDA interop APIs to eliminate memory copies required for input and output data in the end-to-end pipeline.

  • Synchronization: PVA SDK based synchronization primitives for coordinating PVA and CPU operations

  • Error Handling: Comprehensive error checking with graceful degradation and resource cleanup

  • Validation Framework: Built-in reference implementations and tolerance-based output validation
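The RAII-based lifecycle management mentioned above can be sketched with a small owning wrapper. Everything in this block is hypothetical: `ScopedHandle`, `fakeCreate`, and `fakeDestroy` are stand-ins invented for illustration, and the real sample classes may be organized differently.

```cpp
#include <cassert>

// Hypothetical RAII wrapper: owns a handle and calls Destroy exactly once.
template <typename Handle, void (*Destroy)(Handle)>
class ScopedHandle
{
public:
    ScopedHandle() = default;
    explicit ScopedHandle(Handle h) : handle_(h), owns_(true) {}
    ~ScopedHandle() { reset(); }

    // Non-copyable so two wrappers never destroy the same handle.
    ScopedHandle(const ScopedHandle&)            = delete;
    ScopedHandle& operator=(const ScopedHandle&) = delete;

    // Movable: ownership transfers and the source is left empty.
    ScopedHandle(ScopedHandle&& other) noexcept : handle_(other.handle_), owns_(other.owns_)
    {
        other.owns_ = false;
    }
    ScopedHandle& operator=(ScopedHandle&& other) noexcept
    {
        if (this != &other)
        {
            reset();
            handle_     = other.handle_;
            owns_       = other.owns_;
            other.owns_ = false;
        }
        return *this;
    }

    Handle get() const { return handle_; }

    void reset()
    {
        if (owns_)
        {
            Destroy(handle_);
            owns_ = false;
        }
    }

private:
    Handle handle_{};
    bool   owns_ = false;
};

// Stand-in resource API used only to demonstrate the wrapper.
inline int& liveHandleCount()
{
    static int count = 0;
    return count;
}
inline int  fakeCreate() { ++liveHandleCount(); return 42; }
inline void fakeDestroy(int) { --liveHandleCount(); }
```

A `ScopedHandle<int, fakeDestroy> h(fakeCreate());` declared inside a loop body releases its resource at the end of each iteration. This is the same pattern the per-file tensor object relies on.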

Supported Radar Configurations:

  • Multi-Channel Processing: PVA_RADAR_RX_ANTENNA_COUNT (4) receive channels and PVA_RADAR_TX_ANTENNA_COUNT (4) transmit channels

  • Configurable Resolution: Adjustable range bins (Nb), sample count (sampleCount), and Doppler bins (chirpCount)

  • Doppler Fold Processing: Support for non-coherent integration across multiple pulse repetition intervals (PVA_RADAR_DOPPLER_FOLD_COUNT)

  • Peak Capacity: Handles up to PVA_RADAR_MAX_TARGET_COUNT (8192) simultaneous target detections

  • Target Properties: Outputs PVA_RADAR_NUM_TARGET_DETECTION_PROPERTIES (7) detection properties per target
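As a rough illustration of what non-coherent integration with a Doppler fold means numerically, the sketch below sums squared magnitudes across receive channels and then folds the Doppler axis. The struct, the function names, the `[rx][doppler]` input layout, and in particular the fold rule (summing bins that lie one folded length apart) are assumptions for illustration. The actual pvaNci operator may combine bins differently and applies its own fixed-point scaling.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Complex fixed-point sample, mirroring the 2S32 (real + imaginary) idea.
struct ComplexS32
{
    int32_t re;
    int32_t im;
};

// Sums |x|^2 over receive channels for a single range bin.
// `cells` is laid out as [rx][doppler]; the result has one entry per Doppler bin.
// Assumes inputs are small enough that the 64-bit accumulators cannot overflow.
std::vector<uint64_t> integrateAcrossRx(const std::vector<ComplexS32>& cells,
                                        std::size_t rxCount, std::size_t dopplerCount)
{
    assert(cells.size() == rxCount * dopplerCount);
    std::vector<uint64_t> power(dopplerCount, 0);
    for (std::size_t rx = 0; rx < rxCount; ++rx)
    {
        for (std::size_t d = 0; d < dopplerCount; ++d)
        {
            const ComplexS32& x  = cells[rx * dopplerCount + d];
            const int64_t     re = x.re;
            const int64_t     im = x.im;
            power[d] += static_cast<uint64_t>(re * re + im * im);
        }
    }
    return power;
}

// Folds the Doppler axis by `foldCount`.
// ASSUMPTION: bins that are (size / foldCount) apart are summed together.
std::vector<uint64_t> foldDoppler(const std::vector<uint64_t>& power, std::size_t foldCount)
{
    assert(foldCount > 0 && power.size() % foldCount == 0);
    const std::size_t     foldedCount = power.size() / foldCount;
    std::vector<uint64_t> folded(foldedCount, 0);
    for (std::size_t f = 0; f < foldCount; ++f)
    {
        for (std::size_t d = 0; d < foldedCount; ++d)
        {
            folded[d] += power[f * foldedCount + d];
        }
    }
    return folded;
}
```

With 512 Doppler bins and a fold count of 8, the folded output has 64 bins per range bin, which matches the `chirpCount/dopplerFoldCount` dimension used elsewhere in this document.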

Performance Features:

  • Pipeline Mapping: Efficiently map the pipeline stages to the PVA hardware resources to maximize throughput

  • Deterministic Timing: Fixed-point operations provide consistent execution times for real-time applications

Design Details#

The radar pipeline consists of the following main components:

  • RadarPipelineTensorRequirements: Holds tensor requirements without allocation (used for operator creation)

  • RadarPipelineTensors: Manages all tensor allocations for processing a single radar data file

  • RadarPipelineOperators: Manages PVA operator handles (created once, reused across multiple files)

  • RadarPipelinePVAWorkloadParams: Handles CUPVA synchronization objects and allocator

Fixed-Point Data Types and Precision:

The pipeline uses carefully selected fixed-point data types optimized for radar signal processing:

  • NVCV_DATA_TYPE_S32: 32-bit signed integers for input radar samples and window coefficients

  • NVCV_DATA_TYPE_2S32: Complex 32-bit signed integers (real + imaginary) for FFT operations

  • NVCV_DATA_TYPE_U32: 32-bit unsigned integers for NCI outputs and peak detection results

  • NVCV_DATA_TYPE_F32: 32-bit floating-point for final DOA calculations requiring high precision

Quantization and Q-Format:

The implementation uses Q-format fixed-point arithmetic with configurable precision:

  • Range FFT: Requires QBits = 20 for input tensor to maintain high precision for subsequent pipeline stages, e.g. Peak Detection, NCI and DOA processing

  • Peak Detection: Integer arithmetic for robust threshold comparison and peak identification

  • Doppler Fold Count: PVA_RADAR_DOPPLER_FOLD_COUNT defines non-coherent integration factor
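For readers less familiar with Q-format, here is a minimal sketch of converting to and from a signed 32-bit value with 20 fractional bits, and of multiplying two such values. The helper names `toQ`, `fromQ`, and `mulQ` are hypothetical, and the sketch assumes the conventional interpretation in which a stored integer `q` represents `q / 2^QBits`. With QBits = 20 in an S32 container, that gives a resolution of 2^-20 (about 9.5e-7) and a representable magnitude just below 2^11 = 2048.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Converts a real value to Q-format (round to nearest).
int32_t toQ(double value, int qBits)
{
    assert(qBits > 0 && qBits < 31);
    const double scale  = static_cast<double>(int64_t{1} << qBits);
    const double scaled = value * scale;
    assert(scaled >= -2147483648.0 && scaled <= 2147483647.0); // must fit in S32
    return static_cast<int32_t>(std::llround(scaled));
}

// Converts a Q-format integer back to a real value.
double fromQ(int32_t q, int qBits)
{
    assert(qBits > 0 && qBits < 31);
    return static_cast<double>(q) / static_cast<double>(int64_t{1} << qBits);
}

// Multiplies two Q-format values using a 64-bit intermediate, adding half an
// LSB before shifting so the result is rounded rather than truncated.
// Relies on arithmetic right shift for negative values (guaranteed since
// C++20, and the behavior of mainstream compilers before that).
int32_t mulQ(int32_t a, int32_t b, int qBits)
{
    assert(qBits > 0 && qBits < 31);
    const int64_t product = static_cast<int64_t>(a) * static_cast<int64_t>(b);
    const int64_t rounded = (product + (int64_t{1} << (qBits - 1))) >> qBits;
    return static_cast<int32_t>(rounded);
}
```

For example, `toQ(1.5, 20)` stores the value as the integer 1572864, and `mulQ` of the Q20 encodings of 1.5 and 2.0 gives the Q20 encoding of 3.0.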

Memory Layout and Tensor Organization:

Tensors are organized with optimal memory layouts for PVA processing:

  • HCW Layout: Height-Channel-Width organization for efficient parallel processing

  • Transpose: Uses transpose operations to optimize memory access patterns

  • Stride Optimization: Memory strides aligned to PVA requirements for maximum throughput

  • Buffer Reuse: Input tensors for subsequent stages point to output tensors of previous stages (e.g., inDopplerFFTTensorHandle = outRangeFFTTensorHandle)
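The HCW layout and the transpose step can be illustrated with plain index arithmetic. The sketch below assumes densely packed tensors with no row padding, which may not hold for real PVA tensors whose strides are aligned. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Linear offset of element (h, c, w) in a packed [H][C][W] tensor.
inline std::size_t hcwOffset(std::size_t h, std::size_t c, std::size_t w,
                             std::size_t C, std::size_t W)
{
    return (h * C + c) * W + w;
}

// Swaps the outer and inner axes: [H][C][W] -> [W][C][H].
// This mirrors the shape change between the Range FFT output
// [chirpCount][rxAntennaCount][224] and the transposed Doppler FFT output
// [224][rxAntennaCount][chirpCount].
template <typename T>
std::vector<T> transposeHW(const std::vector<T>& src, std::size_t H, std::size_t C, std::size_t W)
{
    assert(src.size() == H * C * W);
    std::vector<T> dst(src.size());
    for (std::size_t h = 0; h < H; ++h)
    {
        for (std::size_t c = 0; c < C; ++c)
        {
            for (std::size_t w = 0; w < W; ++w)
            {
                // In the destination, W is the outermost axis and H the innermost.
                dst[hcwOffset(w, c, h, C, H)] = src[hcwOffset(h, c, w, C, W)];
            }
        }
    }
    return dst;
}
```

After the transpose, all chirps for a given range bin and receive channel are contiguous in memory, which suits a stage that iterates along the Doppler axis.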

Execution Model and Workload Management:

The pipeline implements a sophisticated execution model optimized for batch processing with operator reuse:

Phase 1: Operator Creation (Once per Configuration)
  • PVA operators created once using RadarPipelineTensorRequirements

  • Tensor requirements calculated without actual memory allocation

  • Operators shared across all data files for efficiency

  • Geometric parameters (GP) initialized with calibration data

Phase 2: Per-File Processing
  • RadarPipelineTensors created for each file with actual memory allocation

  • RAII (Resource Acquisition Is Initialization) ensures automatic cleanup

  • Input data loaded into tensors

  • All pipeline stages submitted sequentially to PVA

  • PVA SDK based synchronization ensures proper execution ordering

Phase 3: Reference Validation
  • CPU-based reference implementations execute after PVA processing

  • Tolerance-based comparison validates numerical accuracy
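A tolerance-based comparison of the kind used in Phase 3 can be sketched as follows. The function name and the use of an absolute (rather than relative) tolerance are assumptions; the sample's own validation code is not reproduced here.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Returns true when every element of `actual` is within `tolerance` of the
// matching element of `expected` (absolute difference). On failure, the index
// of the first mismatch is written to `firstMismatch` if it is non-null.
// A tolerance of 0.0f demands an exact match. NaN never compares as equal.
bool withinTolerance(const std::vector<float>& actual, const std::vector<float>& expected,
                     float tolerance, std::size_t* firstMismatch = nullptr)
{
    assert(actual.size() == expected.size());
    for (std::size_t i = 0; i < actual.size(); ++i)
    {
        if (!(std::fabs(actual[i] - expected[i]) <= tolerance))
        {
            if (firstMismatch != nullptr)
            {
                *firstMismatch = i;
            }
            return false;
        }
    }
    return true;
}
```

Fixed-point stages can be checked with a tolerance of `0.0f`, while a floating-point stage would use a small positive tolerance such as `2e-4f`.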

Multi-File Batch Processing:

The implementation supports efficient processing of multiple radar data files with operator reuse:

// Create operators once (shared across all files)
RadarPipelineOperators operators{};
RadarPipelineTensorRequirements tensorReqs(sampleCount, rxAntennaCount,
                                           chirpCount, dopplerFoldCount, txAntennaCount);
radar_create_pva_workloads(operators, tensorReqs, txAntennaCount, dopplerFoldCount);

// Process each file
for (const auto &file : files) {
    // Tensors created per file (RAII cleanup)
    RadarPipelineTensors tensors(allocatorHandle, sampleCount, rxAntennaCount,
                                 chirpCount, dopplerFoldCount, txAntennaCount);

    // Submit workloads and execute
    radar_submit_pva_workloads(operators, tensors, pvaWorkloadParams);
    radar_execute_ref_workloads(tensors, txAntennaCount, rxAntennaCount, dopplerFoldCount);
}  // Tensors automatically destroyed here

Pipeline Stages#

  1. Range FFT Processing

    • Purpose: Converts time-domain radar samples to range-frequency domain for target range estimation

    • Algorithm: Windowed FFT with Hanning window (PVA_BATCH_FFT_WINDOW_HANNING) to minimize spectral leakage

    • Input: Raw radar ADC samples [sampleCount=512][rxAntennaCount][chirpCount=512] as 32-bit signed integers

    • Output: Complex range FFT data [chirpCount=512][rxAntennaCount][NbNci=224] as complex 32-bit signed integers

    • Window Function: Pre-computed Hanning window coefficients applied to reduce sidelobe artifacts

    • Validation: Reference implementation comparison with configurable tolerance (0.0f for exact match)

  2. Doppler FFT Processing

    • Purpose: Extracts target velocity information through coherent processing across pulse repetition intervals

    • Algorithm: Windowed FFT with output transpose (transposeOutput = 1) for optimal memory layout

    • Input: Range FFT output [chirpCount][rxAntennaCount][224] as complex 32-bit signed integers

    • Output: Doppler FFT data [224][rxAntennaCount][chirpCount] as complex 32-bit signed integers (transposed)

    • Window Function: Hanning window applied along the chirp (Doppler) dimension to suppress Doppler sidelobes in velocity processing

    • Memory Optimization: Transpose operation optimizes subsequent NCI processing access patterns

  3. NCI (Non-Coherent Integration) Processing

    • Purpose: Performs noise estimation and non-coherent integration to improve signal-to-noise ratio

    • Algorithm: Multi-output non-coherent integration with configurable doppler fold processing

    • Input: Doppler FFT output [224][rxAntennaCount][chirpCount] with 20-bit Q-format precision

    • Processing Parameters:

      • dopplerFoldCount = PVA_RADAR_DOPPLER_FOLD_COUNT: Non-coherent integration factor for SNR improvement

    • Outputs:

      • NCI RX: [Nb=224][chirpCount=512] - Receive-channel non-coherent integration results

      • NCI Final: [Nb=224][chirpCount/dopplerFoldCount=64] - Final non-coherent integration across doppler fold

      • Noise Estimate: [Nb=224] - Range-dependent noise floor estimation

  4. Peak Detection

    • Purpose: Identifies potential targets by detecting peaks above noise threshold

    • Algorithm: Configurable threshold-based peak detection with spatial clustering and Doppler disambiguation

    • Input: All NCI outputs (original input, RX, final, noise estimate)

    • Processing Parameters:

      • txAntennaCount: Number of transmit channels for beamforming considerations

      • dopplerFoldCount: Non-coherent integration factor (Doppler fold count) used in the detection threshold calculation

    • Outputs:

      • Peak Count: [1] - Number of detected peaks (up to PVA_RADAR_MAX_TARGET_COUNT = 8192)

      • Peak Indices: [PVA_RADAR_PEAKDET_NUM_PEAK_INDICES][8192] - Range, Doppler, and receive channel indices for each peak

      • Peak Snap: [8192][16] - Local neighborhood data around each peak for sub-bin interpolation

    • Capacity: Supports up to 8192 simultaneous target detections

  5. DOA (Direction of Arrival) Processing

    • Purpose: Calculates target position, velocity, and angular information using array processing

    • Algorithm: FFT-based digital beamforming and dual-aperture interferometry with geometric parameter (GP) calibration

    • Input Components:

      • Peak detection outputs (count, indices, snap data)

      • Calibration vector [1][PVA_RADAR_NUM_TOTAL_ANTENNA_ELEMENTS] - Complex antenna array calibration coefficients

      • NCI final data for amplitude/phase reference

      • Geometric parameters (PVARadarGP GP) for coordinate transformation

    • Calibration: populateCalibVector(calibVector, 30) - 30-element calibration for array processing

    • Output: DOA results [PVA_RADAR_NUM_TARGET_DETECTION_PROPERTIES][PVA_RADAR_MAX_TARGET_COUNT] containing:

      • Velocity: Target radial velocity (m/s)

      • Range: Target distance (m)

      • Azimuth: Horizontal angle (degrees)

      • Elevation: Vertical angle (degrees)

      • X, Y, Z: Cartesian coordinates (m)

    • Validation: Tolerance-based comparison (2e-4f) between PVA and reference implementations
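The DOA output carries both spherical quantities (range, azimuth, elevation) and Cartesian coordinates. The relationship between them can be sketched as below. The axis convention is an assumption made for illustration (X along boresight, Y to the side, Z up, azimuth measured in the horizontal plane from X, elevation measured from the horizontal plane). The convention used by pvaDOA and its geometric parameters may differ.

```cpp
#include <cassert>
#include <cmath>

struct Point3D
{
    double x;
    double y;
    double z;
};

// Converts range (m), azimuth (degrees), and elevation (degrees) to Cartesian
// coordinates under the ASSUMED convention described above.
Point3D sphericalToCartesian(double rangeM, double azimuthDeg, double elevationDeg)
{
    assert(rangeM >= 0.0);
    const double kDegToRad  = 3.14159265358979323846 / 180.0;
    const double az         = azimuthDeg * kDegToRad;
    const double el         = elevationDeg * kDegToRad;
    const double horizontal = rangeM * std::cos(el); // projection onto the X-Y plane
    return Point3D{horizontal * std::cos(az), horizontal * std::sin(az), rangeM * std::sin(el)};
}
```

Under this convention, a target at 10 m with zero azimuth and elevation lies on the X axis, and increasing the elevation moves it toward positive Z.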

Classes and Components#

RadarPipelineTensorRequirements#

A C++ class that holds tensor requirements without actual memory allocation. Used for operator creation.

Constructor Parameters:

  • sampleCount (int32_t): Number of samples per chirp

  • rxAntennaCount (int32_t): Number of receive antenna channels

  • chirpCount (int32_t): Number of chirps per frame

  • dopplerFoldCount (int32_t): Doppler fold factor for non-coherent integration

  • txAntennaCount (int32_t): Number of transmit antenna channels

Key Components:

  • Tensor Requirements: Requirements for all pipeline tensors (shape, layout, data type)

  • NCI Output Requirements Vector: outNciTensorReqs - Pointer array for NCI output requirements

  • DOA Input Requirements Vector: inDOATensorReqs - Pointer array for DOA input requirements

  • Geometric Parameters: PVARadarGP GP - Radar system geometric calibration

RadarPipelineTensors#

A C++ class that manages all tensor allocations for processing a single radar data file. Uses RAII for automatic cleanup.

Constructor Parameters:

  • allocatorHandle (NVCVAllocatorHandle): PVA memory allocator

  • sampleCount (int32_t): Number of samples per chirp

  • rxAntennaCount (int32_t): Number of receive antenna channels

  • chirpCount (int32_t): Number of chirps per frame

  • dopplerFoldCount (int32_t): Doppler fold factor for non-coherent integration

  • txAntennaCount (int32_t): Number of transmit antenna channels

Key Methods:

void initRadarPipelineTensors();
void cleanupRadarPipelineTensors();
void createRadarPipelineTensors(NVCVAllocatorHandle allocatorHandle,
                                const int32_t sampleCount,
                                const int32_t rxAntennaCount,
                                const int32_t chirpCount,
                                const int32_t dopplerFoldCount,
                                const int32_t txAntennaCount);

Tensor Categories:

  • Range FFT Tensors: Input, window, output, and reference tensors

  • Doppler FFT Tensors: Input, window, output, and reference tensors

  • NCI Tensors: Input and multiple output tensors for different NCI stages

  • Peak Detection Tensors: Output tensors for peak count, indices, and snap data

  • DOA Tensors: Input calibration vectors and output result tensors

  • Geometric Parameters: PVARadarGP GP - Shared with requirements object

RadarPipelineOperators#

A C++ class that manages PVA operator handles. Created once and reused across multiple files.

Key Components:

  • Operator Handles: For each processing stage (Range FFT, Doppler FFT, NCI, Peak Detection, DOA)

  • Lifetime: Created once per radar configuration, destroyed at end of application

Key Methods:

void initRadarPipelineOperators();
void cleanupRadarPipelineOperators();

RadarPipelinePVAWorkloadParams#

A C++ class that handles CUPVA synchronization objects and allocator.

Key Components:

  • Allocator Handle: NVCVAllocatorHandle - PVA memory allocator for tensor allocation

  • CUPVA Objects: Synchronization objects (sync, fence, stream) for PVA execution

Key Methods:

void initRadarPipelinePVAWorkloadParams();
void cleanupRadarPipelinePVAWorkloadParams();

API Functions#

Workload Management Functions#

radar_create_pva_workloads()#

Creates all PVA operators for the radar pipeline. Called once per configuration.

Parameters:

  • operators (RadarPipelineOperators&): Operator management object (output)

  • tensorReqs (RadarPipelineTensorRequirements&): Tensor requirements object

  • NofTx (int32_t): Number of transmit channels

  • repeatFold (int32_t): Doppler fold factor

Returns: 0 on success, non-zero error code on failure

radar_submit_pva_workloads()#

Submits all PVA workloads for execution.

Parameters:

  • operators (RadarPipelineOperators&): Operator management object

  • tensors (RadarPipelineTensors&): Tensor management object for current file

  • pvaWorkloadParams (RadarPipelinePVAWorkloadParams&): PVA workload parameters

Returns: 0 on success, non-zero error code on failure

radar_execute_ref_workloads()#

Executes reference implementations and validates PVA outputs.

Parameters:

  • tensors (RadarPipelineTensors&): Tensor management object

  • NofTx (int32_t): Number of transmit channels

  • NofRx (int32_t): Number of receive channels

  • repeatFold (int32_t): Doppler fold factor

Returns: 0 on success, non-zero error code on failure

Processing Stage Functions#

range_fft_processing()#

Processes Range FFT stage with validation against reference implementation.

Parameters:

  • tensors (RadarPipelineTensors&): Tensor management object

Returns: 0 on success, non-zero error code on failure

doppler_fft_processing()#

Processes Doppler FFT stage with validation against reference implementation.

Parameters:

  • tensors (RadarPipelineTensors&): Tensor management object

Returns: 0 on success, non-zero error code on failure

nci_processing()#

Processes NCI stage with configurable tolerance values.

Parameters:

  • tensors (RadarPipelineTensors&): Tensor management object

Returns: 0 on success, non-zero error code on failure

peak_detection()#

Performs peak detection and extracts peak information.

Parameters:

  • tensors (RadarPipelineTensors&): Tensor management object

  • NofTx (int32_t): Number of transmit channels

  • NofRx (int32_t): Number of receive channels

  • repeatFold (int32_t): Doppler fold factor

  • peakCount (int32_t&): Output parameter for number of detected peaks

Returns: 0 on success, non-zero error code on failure

doa_processing()#

Calculates direction of arrival and target information.

Parameters:

  • tensors (RadarPipelineTensors&): Tensor management object

  • peakCount (int32_t): Number of detected peaks from peak detection stage

Returns: 0 on success, non-zero error code on failure

Utility Functions#

readFilesWithPrefix()#

Reads files with a specific prefix from a directory.

Parameters:

  • directoryPath (const std::string&): Directory to search

  • prefix (const std::string&): File prefix to match

Returns: Vector of matching file paths

Configuration and Constants#

Default Pipeline Parameters#

const int32_t sampleCount      = 512;  // Number of samples per chirp
const int32_t chirpCount       = 512;  // Number of chirps per frame
const int32_t rxAntennaCount   = PVA_RADAR_RX_ANTENNA_COUNT;  // Receive channels (4)
const int32_t txAntennaCount   = PVA_RADAR_TX_ANTENNA_COUNT;  // Transmit channels (4)
const int32_t dopplerFoldCount = PVA_RADAR_DOPPLER_FOLD_COUNT;  // Doppler fold factor (8)

Tensor Dimensions#

  • Range FFT: [chirpCount][rxAntennaCount][224]

  • Doppler FFT: [224][rxAntennaCount][chirpCount]

  • NCI RX: [224][chirpCount]

  • NCI Final: [224][chirpCount/dopplerFoldCount]

  • Noise Estimate: [224]

  • Peak Indices: [PVA_RADAR_PEAKDET_NUM_PEAK_INDICES][8192]

  • Peak Snap: [8192][16]

  • DOA Output: [PVA_RADAR_NUM_TARGET_DETECTION_PROPERTIES][8192]
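The dimensions above translate directly into element counts and buffer sizes. The sketch below works through two of them, assuming densely packed tensors (no stride padding) and the element sizes implied by the data type names (8 bytes for 2S32, 4 bytes for U32). The struct and function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Element sizes in bytes for the data types named in this document.
constexpr std::size_t kBytes2S32 = 2 * sizeof(int32_t); // complex: real + imaginary
constexpr std::size_t kBytesU32  = sizeof(uint32_t);

// Range-bin count after the Range FFT, taken from the dimensions listed above.
constexpr std::size_t kRangeBins = 224;

struct RadarDims
{
    std::size_t chirpCount;
    std::size_t rxAntennaCount;
    std::size_t dopplerFoldCount;
};

// Range FFT output: [chirpCount][rxAntennaCount][224]
inline std::size_t rangeFftElements(const RadarDims& d)
{
    return d.chirpCount * d.rxAntennaCount * kRangeBins;
}

// NCI Final output: [224][chirpCount / dopplerFoldCount]
inline std::size_t nciFinalElements(const RadarDims& d)
{
    assert(d.dopplerFoldCount > 0 && d.chirpCount % d.dopplerFoldCount == 0);
    return kRangeBins * (d.chirpCount / d.dopplerFoldCount);
}
```

With the default parameters (512 chirps, 4 receive channels, fold count 8), the Range FFT output holds 458,752 complex elements, or 3,670,016 bytes when packed. The NCI Final output holds 224 x 64 = 14,336 elements.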

Usage Example#

Basic Pipeline Execution#

#include <cstdint>  // int32_t
#include <cstdio>   // printf
#include <string>
#include <vector>

#include "radar_pipeline_pva.hpp"
#include "radar_pipeline_reference.hpp"
#include "radar_pipeline_tensors.hpp"
#include "range_fft_ref.h"
#include "utils/radar_file_parser.h"

int main(int argc, char** argv) {
    // Pipeline parameters
    const int32_t sampleCount      = 512;
    const int32_t chirpCount       = 512;
    const int32_t rxAntennaCount   = PVA_RADAR_RX_ANTENNA_COUNT;
    const int32_t txAntennaCount   = PVA_RADAR_TX_ANTENNA_COUNT;
    const int32_t dopplerFoldCount = PVA_RADAR_DOPPLER_FOLD_COUNT;

    // Initialize PVA workload parameters
    RadarPipelinePVAWorkloadParams pvaWorkloadParams{};

    // Create operators once (shared across all files)
    RadarPipelineOperators operators{};
    RadarPipelineTensorRequirements tensorReqs(sampleCount, rxAntennaCount,
                                               chirpCount, dopplerFoldCount, txAntennaCount);

    // Create PVA operators
    printf("Creating PVA operators\n");
    int err = radar_create_pva_workloads(operators, tensorReqs,
                                         txAntennaCount, dopplerFoldCount);
    if (err != 0) {
        printf("Failed to create PVA workloads: %d\n", err);
        return err;
    }

    // Read input data files
    std::string assetsDir = "/path/to/assets/radar/";
    std::vector<std::string> files = readFilesWithPrefix(assetsDir, "data_");

    // Process each file
    for (const auto &file : files) {
        printf("Processing file %s\n", file.c_str());

        // Read raw data
        auto raw_data = rsps_RawFileRead<int32_t>(file);

        // Create tensors for this file (RAII cleanup)
        RadarPipelineTensors tensors(pvaWorkloadParams.allocatorHandle,
                                     sampleCount, rxAntennaCount, chirpCount,
                                     dopplerFoldCount, txAntennaCount);

        // Load input data
        load_range_fft_input(tensors.inRangeFFTTensorHandle, raw_data.second);

        // Submit PVA workloads
        printf("Submitting PVA workloads\n");
        err = radar_submit_pva_workloads(operators, tensors, pvaWorkloadParams);
        if (err != 0) {
            printf("Failed to submit PVA workloads: %d\n", err);
            break;
        }

        // Execute and validate
        printf("Executing reference workloads\n");
        err = radar_execute_ref_workloads(tensors, txAntennaCount,
                                          rxAntennaCount, dopplerFoldCount);
        if (err != 0) {
            printf("Reference workload execution failed: %d\n", err);
            break;
        }

        printf("Successfully processed file\n");
    }  // Tensors automatically destroyed here

    return err;
}

Command Line Usage#

# Basic usage
./radar_pipeline_test

# Show help
./radar_pipeline_test --help
./radar_pipeline_test -h

# Specify the assets directory
./radar_pipeline_test -a /path/to/assets

Error Handling#

The pipeline uses comprehensive error checking throughout all stages:

  • NVCV_CHECK_ERROR_GOTO: For NVCV API calls

  • CUPVA_CHECK_ERROR_GOTO: For CUPVA API calls

  • Return code validation: All functions return 0 on success, non-zero on error

  • Resource cleanup: Automatic cleanup in destructors and error paths
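The check-and-goto pattern behind macros of this kind can be sketched generically. The macro below, `SKETCH_CHECK_GOTO`, is a hypothetical stand-in written for this illustration. It is not the definition of `NVCV_CHECK_ERROR_GOTO` or `CUPVA_CHECK_ERROR_GOTO`, whose exact signatures are not shown in this document.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical macro: evaluates `call`, stores the result in `status`, and
// jumps to `label` when the result is non-zero.
#define SKETCH_CHECK_GOTO(call, status, label)                                        \
    do                                                                                \
    {                                                                                 \
        (status) = (call);                                                            \
        if ((status) != 0)                                                            \
        {                                                                             \
            std::fprintf(stderr, "%s failed with code %d\n", #call, (status));        \
            goto label;                                                               \
        }                                                                             \
    } while (0)

// Stand-in stages used only to demonstrate the control flow.
static int stageOk() { return 0; }
static int stageFail() { return 7; }

// Runs two stages in order. The cleanup block executes on both the success
// and the failure path, and the first non-zero status is returned.
int runTwoStages(int (*first)(), int (*second)(), bool& cleanedUp)
{
    int status = 0;
    cleanedUp  = false;

    SKETCH_CHECK_GOTO(first(), status, cleanup);
    SKETCH_CHECK_GOTO(second(), status, cleanup);

cleanup:
    cleanedUp = true; // release resources here
    return status;
}
```

Declaring `status` before the first check keeps the `goto` from jumping over any variable initialization, which C++ would otherwise reject.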

Common error scenarios:

  • Asset directory not found

  • Input data file missing or corrupted

  • PVA operator creation failure

  • Tensor allocation failure

  • Output validation mismatch

Performance Considerations#

  • Memory Management: Uses PVA allocator for optimal memory placement

  • Batch Processing: Multiple files can be processed in sequence within a single run

  • Resource Reuse: Operators are created once and reused across files; tensors are allocated per file and released automatically

Dependencies#

  • NVCV: NVIDIA Computer Vision library

  • PVA: Programmable Vision Accelerator

  • PVA-SDK: PVA SDK library

  • Standard C++: For file I/O and container operations

  • Radar Operators: Radar operators library

  • Radar File Parser: Radar file parser library

  • Radar Range FFT Reference: Radar range FFT reference implementation

  • Radar Doppler FFT Reference: Radar Doppler FFT reference implementation

  • Radar NCI Reference: Radar NCI reference implementation

  • Radar Peak Detection Reference: Radar Peak Detection reference implementation

  • Radar DOA Reference: Radar DOA reference implementation

Performance#

The performance of the radar pipeline is primarily determined by the size of the input tensor and by the number of detected peaks, which is determined at run time by the preceding pipeline stages.

Execution Time is the average time required to execute the operator on a single VPU core. Note that each PVA contains two VPU cores, which can operate in parallel to process two streams simultaneously, or reduce execution time by approximately half by splitting the workload between the two cores.

Total Power represents the average total power consumed by the module when the operator is executed concurrently on both VPU cores. Idle power is approximately 7 W when the PVA is not processing data.

For detailed information on interpreting the performance table below and understanding the benchmarking setup, see Performance Benchmark.

SampleCount  RxAntennaCount  ChirpCount  DataFile  InputDataType  OutputDataType  Execution Time  Submit Latency  Total Power
512          4               512         data_1    S32            F32             1.456 ms        0.287 ms        14.754 W