
ASR Pipeline Architecture


This guide provides a comprehensive overview of NeMo Curator’s Automatic Speech Recognition (ASR) pipeline architecture, from audio input processing through transcription generation and quality assessment.

Pipeline Overview

The ASR pipeline in NeMo Curator processes speech systematically through the components and stages described below.

Core Components

1. Audio Input Management

AudioBatch Structure: The foundation for audio processing

  • Contains audio file paths and associated metadata
  • Validates file existence and accessibility automatically
  • Supports efficient batch processing for scalability

Input Validation: Ensures data integrity before processing

  • File path existence checks using AudioBatch.validate() and validate_item()
  • Optional metadata validation added by downstream stages (such as duration and format checks)
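To make the existence check concrete, here is a minimal pure-Python sketch of what a per-item file-path validation conceptually does. This is an illustration only, not the actual `AudioBatch.validate_item()` implementation; the `validate_item` function and field names here are stand-ins.

```python
import os
import tempfile

def validate_item(entry: dict, filepath_key: str = "audio_filepath") -> bool:
    """Return True if the entry's audio file exists on disk (illustrative check)."""
    path = entry.get(filepath_key)
    return path is not None and os.path.isfile(path)

# Example: one real file, one missing file
with tempfile.NamedTemporaryFile(suffix=".wav") as f:
    entries = [
        {"audio_filepath": f.name, "text": "hello world"},
        {"audio_filepath": "/missing/file.wav", "text": "gone"},
    ]
    valid = [e for e in entries if validate_item(e)]
    print(len(valid))  # 1
```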

2. ASR Model Integration

NeMo Framework Integration: Leverages state-of-the-art ASR models

  • Automatic model downloading and caching for convenience
  • GPU-accelerated inference when hardware is available
  • Support for multilingual and domain-specific model variants

Model Management: Efficient resource usage

  • Lazy loading of models to conserve system memory
  • Automatic GPU or CPU device selection based on available resources
  • Model-level batching handled within NeMo framework
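Lazy loading is a general pattern rather than anything specific to NeMo Curator's internals; the sketch below (with a hypothetical `LazyAsrStage` class and a string standing in for the real model object) shows the idea of deferring an expensive load until first use.

```python
class LazyAsrStage:
    """Illustrative lazy-loading pattern: the expensive model is created
    on first access, not at construction time."""

    def __init__(self, model_name: str):
        self.model_name = model_name
        self._model = None  # nothing loaded yet

    @property
    def model(self):
        if self._model is None:
            # Placeholder for an expensive download/load step
            self._model = f"loaded:{self.model_name}"
        return self._model

stage = LazyAsrStage("nvidia/stt_en_fastconformer_hybrid_large_pc")
assert stage._model is None  # constructing the stage loads nothing
_ = stage.model              # first access triggers the load
assert stage._model is not None
```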

3. Inference Processing

Batch Processing: Groups audio files for efficient inference

  • Audio files in a batch are transcribed in a single call to the NeMo ASR model
  • Batch size and resources are configured with .with_(batch_size=..., resources=Resources(...))
  • Internal batching and optimization are handled by the NeMo framework
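The task-level grouping described above amounts to splitting a list of files into fixed-size chunks. A minimal sketch (the `batched` helper here is illustrative, not a NeMo Curator API):

```python
from typing import Iterator

def batched(items: list, batch_size: int) -> Iterator[list]:
    """Yield successive fixed-size groups (the last batch may be smaller)."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

files = [f"clip_{i}.wav" for i in range(7)]
batches = list(batched(files, batch_size=3))
print([len(b) for b in batches])  # [3, 3, 1]
```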

Output Generation: Structured transcription results

  • Clean predicted text extraction from NeMo model outputs
  • Complete metadata preservation throughout the processing pipeline

Processing Stages

Stage 1: Data Loading

from nemo_curator.stages.audio.datasets.fleurs.create_initial_manifest import CreateInitialManifestFleursStage
from nemo_curator.stages.text.io.reader import JsonlReader

# Data loading from datasets (e.g., FLEURS)
fleurs_stage = CreateInitialManifestFleursStage(
    lang="en_us",       # Language code
    split="dev",        # Data split
    raw_data_dir="/path/to/data",
)

# Or load from custom manifest files
manifest_reader = JsonlReader(
    input_file_path="/path/to/manifest.jsonl",
)

# Stages automatically create AudioBatch objects from loaded data

Stage 2: ASR Model Setup

# Model initialization
asr_stage = InferenceAsrNemoStage(
    model_name="nvidia/stt_en_fastconformer_hybrid_large_pc"
)

# GPU/CPU device selection (based on configured resources)
device = asr_stage.check_cuda()

# Model loading
asr_stage.setup()  # Downloads and loads model

Stage 3: Transcription Generation

# ASR stage processes AudioBatch objects automatically
# The stage extracts file paths and calls transcribe() internally
processed_batch = asr_stage.process(audio_batch)

# Output: AudioBatch with added "pred_text" field
# Each item now contains both original data and predictions

Stage 4: Quality Assessment Integration

# WER calculation
wer_stage = GetPairwiseWerStage(
    text_key="text",
    pred_text_key="pred_text",
    wer_key="wer",
)

# Duration analysis
duration_stage = GetAudioDurationStage(
    audio_filepath_key="audio_filepath",
    duration_key="duration",
)

Data Flow Architecture

Input Data Flow

  1. Audio Files → File system
  2. Manifest Files → JSONL format with metadata
  3. AudioBatch Objects → Validated, structured data containers

Processing Data Flow

  1. Model Loading → NeMo ASR model initialization
  2. Batch Creation → Group audio files for efficient processing
  3. GPU Processing → Transcription generation
  4. Result Aggregation → Combine transcriptions with metadata

Output Data Flow

  1. Transcription Results → Predicted text for each audio file
  2. Quality Metrics → WER, CER, duration, and custom scores
  3. Filtered Datasets → High-quality audio-text pairs
  4. Export Formats → JSONL manifests for training workflows
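NeMo Curator computes WER through `GetPairwiseWerStage`; as background on what that metric measures, here is a minimal pure-Python word-level WER (edit distance between word sequences, divided by reference length). This sketch is illustrative and not the stage's actual implementation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level WER: edit distance between word sequences over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance via dynamic programming
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("the cat sat", "the cat sat"))  # 0.0
print(word_error_rate("the cat sat", "the bat sat"))  # ≈ 0.33 (1 substitution / 3 words)
```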

Performance Characteristics

Scalability Factors

Model Selection Impact:

  • Larger models provide better accuracy but require more processing time
  • NeMo models support streaming capabilities, though this stage performs offline transcription
  • Language-specific models improve accuracy for target languages

Hardware Usage:

  • GPU acceleration typically outperforms CPU processing for larger workloads
  • Memory requirements scale proportionally with model size and audio input lengths

Optimization Strategies

Memory Management:

# Optimize for memory-constrained environments
asr_stage = InferenceAsrNemoStage(
    model_name="nvidia/stt_en_fastconformer_hybrid_small"  # Smaller model
).with_(
    resources=Resources(gpus=0.5)  # Request fractional GPU using executor/backends
)

Resource Configuration:

# Configure resources for processing
asr_stage = InferenceAsrNemoStage(
    model_name="nvidia/stt_en_fastconformer_hybrid_large_pc"
).with_(
    resources=Resources(gpus=1.0)  # Dedicated GPU
)

Error Handling and Recovery

Audio Processing Errors

# Validate and filter invalid file paths
audio_batch = AudioBatch(data=audio_data, filepath_key="audio_filepath")

# Filter out entries that do not exist on disk
valid_samples = [item for item in audio_batch.data if audio_batch.validate_item(item)]

Pipeline Recovery

For guidance on resumable processing and recovery at the executor and backend level, refer to Resumable Processing.

Integration Points

Text Processing Integration

The ASR pipeline integrates directly with text processing workflows:

# Audio → Text pipeline
audio_to_text = [
    InferenceAsrNemoStage(),  # Audio → Transcriptions
    AudioToDocumentStage(),   # AudioBatch → DocumentBatch
    # Continue with text processing stages...
]