---
description: >-
  Comprehensive overview of the automatic speech recognition pipeline
  architecture and workflow in NeMo Curator
categories:
  - concepts-architecture
tags:
  - asr-pipeline
  - speech-recognition
  - architecture
  - workflow
  - nemo-toolkit
personas:
  - data-scientist-focused
  - mle-focused
difficulty: beginner
content_type: concept
modality: audio-only
---

# ASR Pipeline Architecture

This guide describes NeMo Curator's Automatic Speech Recognition (ASR) pipeline architecture, from audio input processing through transcription generation and quality assessment.

## Pipeline Overview

The ASR pipeline in NeMo Curator processes speech in three stages (input, processing, and assessment):

```mermaid
graph TD
    A[Audio Files] --> B[AudioBatch Creation]
    B --> C[ASR Model Loading]
    C --> D[Batch Inference]
    D --> E[Transcription Output]
    E --> F[Quality Assessment]
    F --> G[Filtering & Export]

    subgraph "Input Stage"
        A
        B
    end

    subgraph "Processing Stage"
        C
        D
        E
    end

    subgraph "Assessment Stage"
        F
        G
    end
```

## Core Components

### 1. Audio Input Management

**AudioBatch Structure**: The foundation for audio processing

* Contains audio file paths and associated metadata
* Validates file existence and accessibility automatically
* Supports efficient batch processing for scalability

**Input Validation**: Ensures data integrity before processing

* File path existence checks using `AudioBatch.validate()` and `validate_item()`
* Optional metadata validation added by downstream stages (such as duration and format checks)

### 2. ASR Model Integration

**NeMo Framework Integration**: Uses pretrained NeMo ASR models

* Models are downloaded and cached automatically
* Inference is GPU-accelerated when hardware is available
* Multilingual and domain-specific model variants are supported

**Model Management**: Efficient resource usage

* Lazy loading of models to conserve system memory
* Automatic GPU or CPU device selection based on available resources
* Model-level batching is handled within the NeMo framework
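
The lazy-loading and device-selection behavior described above can be sketched in plain Python. This is an illustration of the pattern only, not NeMo Curator's implementation: `LazyModelStage`, `fake_loader`, and the `gpus` attribute are hypothetical names, and the device is chosen from the configured GPU count rather than by probing hardware.

```python
class LazyModelStage:
    """Hypothetical stage that defers model loading until setup() is called."""

    def __init__(self, model_name, loader, gpus=0.0):
        self.model_name = model_name
        self._loader = loader  # callable that builds the model (assumed, for illustration)
        self.gpus = gpus
        self.model = None      # nothing is loaded at construction time

    def device(self):
        # Pick a device from the configured resources, not from a hardware probe.
        return "cuda" if self.gpus > 0 else "cpu"

    def setup(self):
        # Load once; repeated calls reuse the already-loaded model.
        if self.model is None:
            self.model = self._loader(self.model_name, self.device())
        return self.model


load_calls = []


def fake_loader(name, device):
    """Stand-in for a real model download and load."""
    load_calls.append((name, device))
    return {"name": name, "device": device}


stage = LazyModelStage("example-model", fake_loader, gpus=0.0)
assert stage.model is None  # construction alone loads nothing
stage.setup()
stage.setup()               # second call does not reload
print(load_calls)
```

Deferring the load to `setup()` means a stage object can be created and passed around cheaply, with memory consumed only where the stage actually runs.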

### 3. Inference Processing

**Batch Processing**: Transcribes multiple audio files per model call

* The audio files in a task are passed together in a single call to the NeMo ASR model
* The batch size controls how tasks are grouped for processing; set it with `.with_(batch_size=..., resources=Resources(...))`
* Internal batching and optimization are handled by the NeMo framework

**Output Generation**: Structured transcription results

* Clean predicted text extraction from NeMo model outputs
* Complete metadata preservation throughout the processing pipeline
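
As a rough sketch of the two ideas above (grouping files per model call, and attaching predictions without losing metadata), the following plain-Python example uses a stand-in transcriber. `chunked`, `fake_transcribe`, and `transcribe_entries` are hypothetical helpers; only the `audio_filepath` and `pred_text` key names come from this guide.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def fake_transcribe(paths):
    """Stand-in for a model call: returns one string per input path."""
    return [f"transcript of {p}" for p in paths]


def transcribe_entries(entries, batch_size, filepath_key="audio_filepath", pred_text_key="pred_text"):
    """Return copies of the entries with a prediction added; original fields are kept."""
    results = []
    for batch in chunked(entries, batch_size):
        predictions = fake_transcribe([e[filepath_key] for e in batch])
        for entry, text in zip(batch, predictions):
            results.append({**entry, pred_text_key: text})
    return results


entries = [
    {"audio_filepath": "a.wav", "text": "first", "speaker": "s1"},
    {"audio_filepath": "b.wav", "text": "second", "speaker": "s2"},
    {"audio_filepath": "c.wav", "text": "third", "speaker": "s1"},
]
output = transcribe_entries(entries, batch_size=2)
print(output[0])
```

Each output entry is a new dict, so the input entries are left unchanged and every original field (here `text` and `speaker`) travels alongside the prediction.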

## Processing Stages

### Stage 1: Data Loading

```python
from nemo_curator.stages.audio.datasets.fleurs.create_initial_manifest import CreateInitialManifestFleursStage
from nemo_curator.stages.text.io.reader import JsonlReader

# Data loading from datasets (e.g., FLEURS)
fleurs_stage = CreateInitialManifestFleursStage(
    lang="en_us",              # Language code
    split="dev",               # Data split
    raw_data_dir="/path/to/data"
)

# Or load from custom manifest files
manifest_reader = JsonlReader(
    input_file_path="/path/to/manifest.jsonl"
)

# Stages automatically create AudioBatch objects from loaded data
```

### Stage 2: ASR Model Setup

```python
from nemo_curator.stages.audio.inference.asr_nemo import InferenceAsrNemoStage

# Model initialization
asr_stage = InferenceAsrNemoStage(
    model_name="nvidia/stt_en_fastconformer_hybrid_large_pc"
)

# GPU/CPU device selection (based on configured resources)
device = asr_stage.check_cuda()

# Model loading
asr_stage.setup()  # Downloads and loads model
```

### Stage 3: Transcription Generation

```python
# `asr_stage` is the stage configured in Stage 2;
# `audio_batch` is an AudioBatch produced by the data loading stage.
# The ASR stage extracts file paths and calls transcribe() internally.
processed_batch = asr_stage.process(audio_batch)

# Output: AudioBatch with added "pred_text" field
# Each item now contains both original data and predictions
```

### Stage 4: Quality Assessment Integration

```python
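from nemo_curator.stages.audio.common import GetAudioDurationStage
from nemo_curator.stages.audio.metrics.get_wer import GetPairwiseWerStage

# WER calculation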
wer_stage = GetPairwiseWerStage(
    text_key="text",
    pred_text_key="pred_text",
    wer_key="wer"
)

# Duration analysis
duration_stage = GetAudioDurationStage(
    audio_filepath_key="audio_filepath",
    duration_key="duration"
)
```
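
To make the `wer` field concrete, here is a minimal sketch of word error rate as word-level edit distance divided by the reference length. It is not the code used by `GetPairwiseWerStage`; `word_error_rate` is a hypothetical helper, and the stage may normalize text or scale the value differently (for example, as a percentage).

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the edit distance between ref[:i-1] and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1      # reference word missing from hypothesis
            insertion = current[j - 1] + 1  # extra word in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(round(wer, 4))
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.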

## Data Flow Architecture

### Input Data Flow

1. **Audio Files** → File system
2. **Manifest Files** → JSONL format with metadata
3. **AudioBatch Objects** → Validated, structured data containers
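
For illustration, a JSONL manifest holds one JSON object per line. The sketch below parses two made-up lines with the standard library; the `audio_filepath` and `text` keys match those used elsewhere in this guide, while the paths, transcripts, and the `parse_manifest` helper are hypothetical.

```python
import json

# Two example manifest lines; the values are made up for illustration.
manifest_text = (
    '{"audio_filepath": "/data/audio/utt_0001.wav", "text": "hello world"}\n'
    '{"audio_filepath": "/data/audio/utt_0002.wav", "text": "good morning"}\n'
)


def parse_manifest(text):
    """Parse JSONL text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


entries = parse_manifest(manifest_text)
print(entries[1]["text"])
```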

### Processing Data Flow

1. **Model Loading** → NeMo ASR model initialization
2. **Batch Creation** → Group audio files for efficient processing
3. **Inference** → Transcription generation on GPU or CPU
4. **Result Aggregation** → Combine transcriptions with metadata

### Output Data Flow

1. **Transcription Results** → Predicted text for each audio file
2. **Quality Metrics** → WER, CER, duration, and custom scores
3. **Filtered Datasets** → High-quality audio-text pairs
4. **Export Formats** → JSONL manifests for training workflows
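
The last two steps (filtering on a quality metric and exporting a JSONL manifest) can be sketched with the standard library alone. This is not a NeMo Curator stage: `keep_below` and `write_jsonl` are hypothetical helpers, the sample entries are invented, and the 0.5 threshold assumes WER stored as a fraction and is arbitrary.

```python
import json
import os
import tempfile


def keep_below(entries, key, threshold):
    """Keep entries whose metric is present and no greater than the threshold."""
    return [e for e in entries if key in e and e[key] <= threshold]


def write_jsonl(entries, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for entry in entries:
            handle.write(json.dumps(entry, ensure_ascii=False) + "\n")


scored = [
    {"audio_filepath": "a.wav", "text": "hello world", "pred_text": "hello world", "wer": 0.0},
    {"audio_filepath": "b.wav", "text": "good morning", "pred_text": "good warning", "wer": 0.5},
    {"audio_filepath": "c.wav", "text": "thank you", "pred_text": "tank shoe", "wer": 1.0},
]
kept = keep_below(scored, key="wer", threshold=0.5)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "filtered_manifest.jsonl")
    write_jsonl(kept, out_path)
    # Read the file back to confirm the export round-trips.
    with open(out_path, encoding="utf-8") as handle:
        written = [json.loads(line) for line in handle]

print(len(written))
```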

## Performance Characteristics

### Scalability Factors

**Model Selection Impact**:

* Larger models provide better accuracy but require more processing time
* NeMo models support streaming capabilities, though this stage performs offline transcription
* Language-specific models improve accuracy for target languages

**Hardware Usage**:

* GPU acceleration typically outperforms CPU processing for larger workloads
* Memory requirements scale proportionally with model size and audio input lengths
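
As a back-of-the-envelope illustration of how input length drives memory, the sketch below estimates the size of a waveform batch padded to its longest clip. It assumes 16 kHz mono audio stored as 32-bit floats, which is an assumption made for this example rather than a documented setting; `padded_batch_bytes` is a hypothetical helper, and model weights and activations, which usually dominate, are not modeled.

```python
def padded_batch_bytes(durations_s, sample_rate=16_000, bytes_per_sample=4):
    """Bytes needed for a mono waveform batch padded to its longest clip."""
    if not durations_s:
        return 0
    longest_samples = int(max(durations_s) * sample_rate)
    return len(durations_s) * longest_samples * bytes_per_sample


short_batch = padded_batch_bytes([5.0] * 16)   # sixteen 5-second clips
long_batch = padded_batch_bytes([60.0] * 16)   # sixteen 60-second clips
print(short_batch, long_batch)
```

Because every clip is padded to the longest one, a single long recording in a batch raises the cost for all of its neighbors.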

### Optimization Strategies

**Memory Management**:

```python
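from nemo_curator.stages.audio.inference.asr_nemo import InferenceAsrNemoStage
from nemo_curator.stages.resources import Resources

# Optimize for memory-constrained environments
asr_stage = InferenceAsrNemoStage(
    model_name="nvidia/stt_en_fastconformer_hybrid_small"  # Smaller model
).with_(
    resources=Resources(gpus=0.5)  # Request a fractional GPU; allocation is handled by the executor/backend
)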
```

**Resource Configuration**:

```python
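from nemo_curator.stages.audio.inference.asr_nemo import InferenceAsrNemoStage
from nemo_curator.stages.resources import Resources

# Configure resources for processing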
asr_stage = InferenceAsrNemoStage(
    model_name="nvidia/stt_en_fastconformer_hybrid_large_pc"
).with_(
    resources=Resources(gpus=1.0)  # Dedicated GPU
)
```

## Error Handling and Recovery

### Audio Processing Errors

```python
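from nemo_curator.tasks import AudioBatch

# Validate and filter invalid file paths.
# `audio_data` is a list of manifest entries (dicts) that each carry an "audio_filepath" key.
audio_batch = AudioBatch(data=audio_data, filepath_key="audio_filepath")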

# Filter out entries that do not exist on disk
valid_samples = [item for item in audio_batch.data if audio_batch.validate_item(item)]
```

### Pipeline Recovery

For guidance on resumable processing and recovery at the executor and backend level, refer to [Resumable Processing](/reference/infra/resumable-processing).

## Integration Points

### Text Processing Integration

The ASR pipeline feeds into text processing workflows: `AudioToDocumentStage` converts each `AudioBatch` into a `DocumentBatch` that text stages can consume:

```python
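from nemo_curator.stages.audio.inference.asr_nemo import InferenceAsrNemoStage
from nemo_curator.stages.audio.io.convert import AudioToDocumentStage

# Audio → Text pipeline
audio_to_text = [
    InferenceAsrNemoStage(model_name="nvidia/stt_en_fastconformer_hybrid_large_pc"),  # Audio → Transcriptions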
    AudioToDocumentStage(),   # AudioBatch → DocumentBatch
    # Continue with text processing stages...
]
```
