ASR Inference

Perform automatic speech recognition (ASR) on audio files using NeMo Framework models. The ASR inference stage transcribes audio into text, enabling downstream quality assessment and text processing workflows.

How it Works

The InferenceAsrNemoStage processes AudioBatch objects by:

  1. Input Validation: Verifies required attributes and data structure
  2. Model Loading: Downloads and initializes NeMo ASR models on GPU or CPU
  3. Batch Processing: Groups audio files for efficient inference
  4. Transcription: Generates text predictions for each audio file
  5. Output Creation: Returns AudioBatch with original data plus predicted transcriptions

Basic Usage

Simple ASR Inference

```python
from nemo_curator.stages.audio.inference.asr_nemo import InferenceAsrNemoStage
from nemo_curator.stages.resources import Resources

# Create ASR inference stage
asr_stage = InferenceAsrNemoStage(
    model_name="nvidia/stt_en_fastconformer_hybrid_large_pc",
    filepath_key="audio_filepath",
    pred_text_key="pred_text",
)

# Configure for GPU processing
asr_stage = asr_stage.with_(
    resources=Resources(gpus=1.0),
    batch_size=16,
)
```

Multilingual ASR

```python
# Use language-specific models
language_models = {
    "en_us": "nvidia/stt_en_fastconformer_hybrid_large_pc",
    "es_419": "nvidia/stt_es_fastconformer_hybrid_large_pc",
    "hy_am": "nvidia/stt_hy_fastconformer_hybrid_large_pc",
}

# Create stage for Armenian
armenian_asr = InferenceAsrNemoStage(
    model_name=language_models["hy_am"],
)
```

Configuration Options

Model Selection

NeMo Framework provides ready-to-use ASR models for several languages and domains:

```python
# Domain-specific models
models = {
    "general": "nvidia/stt_en_fastconformer_hybrid_large_pc",
    "telephony": "nvidia/stt_en_fastconformer_telephony_large",
    "streaming": "nvidia/stt_en_fastconformer_streaming_large",
}
```

Resource Configuration

```python
from nemo_curator.stages.resources import Resources

# GPU configuration
asr_stage = asr_stage.with_(
    resources=Resources(
        gpus=1.0,  # Number of GPUs (multi-GPU aware stages)
        cpus=4.0,  # CPU cores
    )
)

# Alternatively, request fractional single-GPU memory (do not combine with gpus):
# asr_stage = asr_stage.with_(resources=Resources(cpus=4.0, gpu_memory_gb=16.0))
```

Batch Processing

```python
# Optimize batch size based on GPU memory
asr_stage = asr_stage.with_(
    batch_size=32  # Larger batches improve GPU utilization
)
```

batch_size controls how many tasks the executor groups into a single call. The ASR stage does not define process_batch(); batching across tasks happens in the executor.

Within a single AudioBatch, process() transcribes all of the file paths together.
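The two levels of batching can be illustrated with a minimal sketch. The batched helper below is hypothetical, illustrative code, not part of NeMo Curator:

```python
from typing import Iterator


def batched(tasks: list, batch_size: int) -> Iterator[list]:
    """Group tasks the way an executor would before each stage call."""
    for start in range(0, len(tasks), batch_size):
        yield tasks[start : start + batch_size]


# Eight AudioBatch tasks with batch_size=3 yield groups of 3, 3, and 2
tasks = [f"audio_batch_{i}" for i in range(8)]
groups = list(batched(tasks, batch_size=3))
print([len(g) for g in groups])  # [3, 3, 2]
```

Each inner group is what the executor hands to the stage in one call; the per-file loop inside process() is a separate, second level of iteration.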

Input Requirements

AudioBatch Format

Data loading stages create input AudioBatch objects that must contain:

```python
# AudioBatch data structure (created automatically by loading stages)
# Each item in the batch contains:
{
    "audio_filepath": "/path/to/audio1.wav",
    # Optional: existing metadata
    "duration": 5.2,
    "language": "en",
}
```
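Loading stages typically derive these items from a directory of audio files. A standard-library sketch of the equivalent bookkeeping (the build_entries helper is hypothetical, not a NeMo Curator API):

```python
from pathlib import Path


def build_entries(audio_dir: str) -> list[dict]:
    """Collect one manifest-style dict per .wav file under audio_dir."""
    return [
        {"audio_filepath": str(path)}
        for path in sorted(Path(audio_dir).glob("*.wav"))
    ]
```

Optional metadata such as duration or language can be added to each dict before the batch reaches the ASR stage.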

Audio File Requirements

  • Supported Formats: Determined by the selected NeMo ASR model; refer to the NeMo ASR documentation.
  • Sample Rates: Typically 16 kHz; refer to the model card for details.
  • Channels: Mono or stereo; channel handling (for example, down-mixing) depends on the model.
  • Duration: Long files can require manual chunking before inference.
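If a corpus contains long recordings, one option is to split each WAV into fixed-length chunks before inference. A sketch using only the standard-library wave module (the chunk length and file naming are arbitrary choices, not NeMo Curator conventions):

```python
import wave
from pathlib import Path


def chunk_wav(src: str, out_dir: str, chunk_seconds: float = 30.0) -> list[str]:
    """Split a WAV file into consecutive chunks of at most chunk_seconds each."""
    out_paths = []
    with wave.open(src, "rb") as reader:
        params = reader.getparams()
        frames_per_chunk = int(params.framerate * chunk_seconds)
        index = 0
        while True:
            frames = reader.readframes(frames_per_chunk)
            if not frames:
                break
            out_path = Path(out_dir) / f"{Path(src).stem}_chunk{index:03d}.wav"
            with wave.open(str(out_path), "wb") as writer:
                writer.setnchannels(params.nchannels)
                writer.setsampwidth(params.sampwidth)
                writer.setframerate(params.framerate)
                writer.writeframes(frames)
            out_paths.append(str(out_path))
            index += 1
    return out_paths
```

Chunking at silence boundaries (for example, with a voice-activity detector) usually yields better transcripts than fixed-length cuts, but the fixed-length version above keeps the sketch dependency-free.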

Output Structure

The ASR stage adds predicted transcriptions to each audio sample:

```python
# Output AudioBatch data structure
{
    "audio_filepath": "/path/to/audio1.wav",
    "pred_text": "this is the predicted transcription",
    "duration": 5.2,   # Preserved from input
    "language": "en",  # Preserved from input
}
```
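Downstream tools commonly persist such records as a JSON-lines manifest, one object per line. A minimal standard-library sketch (the write_manifest helper is illustrative, not a NeMo Curator API):

```python
import json


def write_manifest(records: list[dict], path: str) -> None:
    """Write one JSON object per line, the common ASR manifest layout."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```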

Error Handling

Model Loading Errors

```python
try:
    asr_stage.setup()
except RuntimeError as e:
    print(f"Failed to load ASR model: {e}")
    # Fall back to CPU or a different model
```

Processing Errors

Processing behavior:

  • Input structure validation: The stage uses validate_input() to check required attributes/columns and raises ValueError if they are missing.
  • Model loading failures: setup() raises RuntimeError if model download or initialization fails.
  • No automatic retries or auto-tuning: The stage does not perform automatic batch size reduction or network retries.
  • Missing files: AudioBatch.validate() can log file-existence warnings when code creates tasks; the stage does not auto-skip files.
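The required-key check behaves like the following simplified stand-in (a sketch of the validate_input() contract, not the actual implementation):

```python
def validate_item(
    item: dict, required_keys: tuple[str, ...] = ("audio_filepath",)
) -> None:
    """Raise ValueError if a batch item lacks a required key."""
    missing = [key for key in required_keys if key not in item]
    if missing:
        raise ValueError(f"Batch item is missing required keys: {missing}")
```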

Performance Optimization

GPU Memory Management

```python
# For large models or limited GPU memory
asr_stage = InferenceAsrNemoStage(
    model_name="nvidia/stt_en_fastconformer_hybrid_large_pc"
).with_(
    batch_size=8,  # Reduce batch size
    resources=Resources(gpus=0.5),  # Share GPU resources
)
```

Distributed Processing

```python
from nemo_curator.backends.xenna import XennaExecutor

# Configure executor (refer to Pipeline Execution Backends)
executor = XennaExecutor(
    config={
        "execution_mode": "streaming",
        "logging_interval": 60,
        "ignore_failures": False,
    }
)
```

Integration Examples

Complete Audio-to-Text Pipeline

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.audio.io.convert import AudioToDocumentStage

pipeline = Pipeline(name="audio_to_text")

# ASR inference
pipeline.add_stage(asr_stage)

# Convert to text format
pipeline.add_stage(AudioToDocumentStage())

# Continue with text processing...
```