ASR Pipeline Architecture#
This guide provides a comprehensive overview of NeMo Curator’s automatic speech recognition (ASR) pipeline architecture, covering audio input processing through transcription generation and quality assessment.
Pipeline Overview#
The ASR pipeline in NeMo Curator follows a systematic approach to speech processing:
```mermaid
graph TD
    A[Audio Files] --> B[AudioBatch Creation]
    B --> C[ASR Model Loading]
    C --> D[Batch Inference]
    D --> E[Transcription Output]
    E --> F[Quality Assessment]
    F --> G[Filtering & Export]

    subgraph "Input Stage"
        A
        B
    end

    subgraph "Processing Stage"
        C
        D
        E
    end

    subgraph "Assessment Stage"
        F
        G
    end
```
Core Components#
1. Audio Input Management#
AudioBatch Structure: The foundation for audio processing
Contains audio file paths and associated metadata
Validates file existence and accessibility automatically
Supports efficient batch processing for scalability
Input Validation: Ensures data integrity before processing
File path existence checks using AudioBatch.validate() and validate_item()
Optional metadata validation added by downstream stages (such as duration and format checks)
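As a rough sketch of what the existence check does (assuming validation reduces to a per-item file check; the real validate_item() may also verify keys and types):

```python
import os
import tempfile

def validate_item(item: dict, filepath_key: str = "audio_filepath") -> bool:
    # Hypothetical stand-in for AudioBatch.validate_item(): the referenced
    # audio file must exist on disk before the item is allowed through.
    path = item.get(filepath_key)
    return path is not None and os.path.isfile(path)

# One real file and one missing file for demonstration
with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
    real_path = tmp.name

items = [
    {"audio_filepath": real_path, "text": "hello world"},
    {"audio_filepath": "/nonexistent/clip.wav", "text": "missing"},
]
valid = [item for item in items if validate_item(item)]
print(len(valid))  # 1

os.unlink(real_path)
```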
2. ASR Model Integration#
NeMo Framework Integration: Leverages state-of-the-art ASR models
Automatic model downloading and caching for convenience
GPU-accelerated inference when hardware is available
Support for multilingual and domain-specific model variants
Model Management: Efficient resource usage
Lazy loading of models to conserve system memory
Automatic GPU or CPU device selection based on available resources
Model-level batching handled within NeMo framework
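The lazy-loading idea can be sketched in plain Python (illustrative only; LazyModelHolder is not a NeMo Curator class):

```python
class LazyModelHolder:
    """Illustrative lazy loader: the expensive model load happens on first use."""

    def __init__(self, model_name: str):
        self.model_name = model_name
        self._model = None  # nothing loaded at construction time

    @property
    def model(self):
        if self._model is None:
            # Stand-in for the expensive download + load step
            self._model = f"loaded:{self.model_name}"
        return self._model

holder = LazyModelHolder("nvidia/stt_en_fastconformer_hybrid_large_pc")
print(holder._model is None)  # True: construction is cheap
print(holder.model)           # first access triggers the load
```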
3. Inference Processing#
Batch Processing: Groups audio files for efficient joint processing
Audio files are processed together in a single call to the NeMo ASR model
Batch size configuration controls task grouping using .with_(batch_size=..., resources=Resources(...))
Internal batching and optimization are handled by the NeMo framework
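Conceptually, task-level batching just partitions the input list into fixed-size groups before each model call (a sketch, not the executor's actual scheduling logic):

```python
def make_batches(items: list, batch_size: int) -> list:
    # Fixed-size grouping; the final batch may be smaller
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

paths = [f"/data/clip_{i:04d}.wav" for i in range(10)]
batches = make_batches(paths, batch_size=4)
print([len(b) for b in batches])  # [4, 4, 2]
```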
Output Generation: Structured transcription results
Clean predicted text extraction from NeMo model outputs
Complete metadata preservation throughout the processing pipeline
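Concretely, a transcribed item might look like the following (the pred_text field name follows the keys used elsewhere in this guide; paths and values are made up):

```python
import json

# Manifest row before ASR
item = {"audio_filepath": "/data/clip_0001.wav", "text": "hello world", "duration": 1.4}

# After inference the stage adds the prediction while preserving existing metadata
item_with_pred = {**item, "pred_text": "hello world"}

print(json.dumps(item_with_pred, indent=2))
```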
Processing Stages#
Stage 1: Data Loading#
```python
from nemo_curator.stages.audio.datasets.fleurs.create_initial_manifest import CreateInitialManifestFleursStage
from nemo_curator.stages.text.io.reader import JsonlReader

# Data loading from datasets (e.g., FLEURS)
fleurs_stage = CreateInitialManifestFleursStage(
    lang="en_us",   # Language code
    split="dev",    # Data split
    raw_data_dir="/path/to/data",
)

# Or load from custom manifest files
manifest_reader = JsonlReader(
    input_file_path="/path/to/manifest.jsonl"
)

# Stages automatically create AudioBatch objects from loaded data
```
Stage 2: ASR Model Setup#
```python
# Model initialization
asr_stage = InferenceAsrNemoStage(
    model_name="nvidia/stt_en_fastconformer_hybrid_large_pc"
)

# GPU/CPU device selection (based on configured resources)
device = asr_stage.check_cuda()

# Model loading
asr_stage.setup()  # Downloads and loads model
```
Stage 3: Transcription Generation#
```python
# ASR stage processes AudioBatch objects automatically
# The stage extracts file paths and calls transcribe() internally
processed_batch = asr_stage.process(audio_batch)

# Output: AudioBatch with added "pred_text" field
# Each item now contains both original data and predictions
```
Stage 4: Quality Assessment Integration#
```python
# WER calculation
wer_stage = GetPairwiseWerStage(
    text_key="text",
    pred_text_key="pred_text",
    wer_key="wer",
)

# Duration analysis
duration_stage = GetAudioDurationStage(
    audio_filepath_key="audio_filepath",
    duration_key="duration",
)
```
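For intuition, pairwise WER is the word-level edit distance between reference and hypothesis divided by the reference word count; a minimal sketch (GetPairwiseWerStage's exact normalization and scaling may differ):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("the cat sat", "the cat sat"))  # 0.0
print(word_error_rate("the cat sat", "the bat sat"))  # one substitution out of three words
```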
Data Flow Architecture#
Input Data Flow#
Audio Files → File system
Manifest Files → JSONL format with metadata
AudioBatch Objects → Validated, structured data containers
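A manifest stores one JSON object per line; reading one takes a few lines of standard-library Python (the audio_filepath/text field names follow the convention used in this guide; paths are illustrative):

```python
import io
import json

manifest_jsonl = (
    '{"audio_filepath": "/data/clip_0001.wav", "text": "hello world"}\n'
    '{"audio_filepath": "/data/clip_0002.wav", "text": "good morning"}\n'
)

# Parse one JSON object per line, as a JSONL reader would
rows = [json.loads(line) for line in io.StringIO(manifest_jsonl)]
print(len(rows))                   # 2
print(rows[0]["audio_filepath"])   # /data/clip_0001.wav
```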
Processing Data Flow#
Model Loading → NeMo ASR model initialization
Batch Creation → Group audio files for efficient processing
GPU Processing → Transcription generation
Result Aggregation → Combine transcriptions with metadata
Output Data Flow#
Transcription Results → Predicted text for each audio file
Quality Metrics → WER, CER, duration, and custom scores
Filtered Datasets → High-quality audio-text pairs
Export Formats → JSONL manifests for training workflows
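A typical quality filter simply keeps items at or below a WER threshold (the threshold and records here are hypothetical; in practice this is performed by a filtering stage over the computed wer field):

```python
# Hypothetical transcribed records with computed WER values
records = [
    {"audio_filepath": "/data/a.wav", "wer": 3.1},
    {"audio_filepath": "/data/b.wav", "wer": 41.7},
    {"audio_filepath": "/data/c.wav", "wer": 0.0},
]

MAX_WER = 10.0  # illustrative threshold
kept = [r for r in records if r["wer"] <= MAX_WER]
print([r["audio_filepath"] for r in kept])  # ['/data/a.wav', '/data/c.wav']
```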
Performance Characteristics#
Scalability Factors#
Model Selection Impact:
Larger models provide better accuracy but require more processing time
NeMo models support streaming capabilities, though this stage performs offline transcription
Language-specific models improve accuracy for target languages
Hardware Usage:
GPU acceleration typically outperforms CPU processing for larger workloads
Memory requirements scale proportionally with model size and audio input lengths
Optimization Strategies#
Memory Management:
```python
# Optimize for memory-constrained environments
asr_stage = InferenceAsrNemoStage(
    model_name="nvidia/stt_en_fastconformer_hybrid_small"  # Smaller model
).with_(
    resources=Resources(gpus=0.5)  # Request fractional GPU via executor/backends
)
```
Resource Configuration:
```python
# Configure resources for processing
asr_stage = InferenceAsrNemoStage(
    model_name="nvidia/stt_en_fastconformer_hybrid_large_pc"
).with_(
    resources=Resources(gpus=1.0)  # Dedicated GPU
)
```
Error Handling and Recovery#
Audio Processing Errors#
```python
# Validate and filter invalid file paths
audio_batch = AudioBatch(data=audio_data, filepath_key="audio_filepath")

# Filter out entries that do not exist on disk
valid_samples = [item for item in audio_batch.data if audio_batch.validate_item(item)]
```
Pipeline Recovery#
For guidance on resumable processing and recovery at the executor and backend level, refer to Resumable Processing.
Integration Points#
Text Processing Integration#
The ASR pipeline seamlessly integrates with text processing workflows:
```python
# Audio → Text pipeline
audio_to_text = [
    InferenceAsrNemoStage(),  # Audio → Transcriptions
    AudioToDocumentStage(),   # AudioBatch → DocumentBatch
    # Continue with text processing stages...
]
```