ASR Inference

Perform automatic speech recognition (ASR) on audio files using NeMo Framework models. The ASR inference stage transcribes audio into text, enabling downstream quality assessment and text processing workflows.

How it Works

The InferenceAsrNemoStage processes AudioBatch objects by:

  1. Input Validation: Verifies required attributes and data structure
  2. Model Loading: Downloads and initializes NeMo ASR models on GPU or CPU
  3. Batch Processing: Groups audio files for efficient inference
  4. Transcription: Generates text predictions for each audio file
  5. Output Creation: Returns AudioBatch with original data plus predicted transcriptions

Basic Usage

Simple ASR Inference

```python
from nemo_curator.stages.audio.inference.asr_nemo import InferenceAsrNemoStage
from nemo_curator.stages.resources import Resources

# Create ASR inference stage
asr_stage = InferenceAsrNemoStage(
    model_name="nvidia/stt_en_fastconformer_hybrid_large_pc",
    filepath_key="audio_filepath",
    pred_text_key="pred_text",
)

# Configure for GPU processing
asr_stage = asr_stage.with_(
    resources=Resources(gpus=1.0),
    batch_size=16,
)
```

Multilingual ASR

```python
# Use language-specific models
language_models = {
    "en_us": "nvidia/stt_en_fastconformer_hybrid_large_pc",
    "es_419": "nvidia/stt_es_fastconformer_hybrid_large_pc",
    "hy_am": "nvidia/stt_hy_fastconformer_hybrid_large_pc",
}

# Create stage for Armenian
armenian_asr = InferenceAsrNemoStage(
    model_name=language_models["hy_am"],
)
```

Configuration Options

Model Selection

NeMo Framework provides ready-to-use ASR models for several languages and domains:

```python
# Domain-specific models
models = {
    "general": "nvidia/stt_en_fastconformer_hybrid_large_pc",
    "telephony": "nvidia/stt_en_fastconformer_telephony_large",
    "streaming": "nvidia/stt_en_fastconformer_streaming_large",
}
```

Resource Configuration

```python
from nemo_curator.stages.resources import Resources

# GPU configuration
asr_stage = asr_stage.with_(
    resources=Resources(
        gpus=1.0,  # Number of GPUs (multi-GPU aware stages)
        cpus=4.0,  # CPU cores
    )
)

# Alternatively, request fractional single-GPU memory (do not combine with gpus):
# asr_stage = asr_stage.with_(resources=Resources(cpus=4.0, gpu_memory_gb=16.0))
```

Batch Processing

```python
# Optimize batch size based on GPU memory
asr_stage = asr_stage.with_(
    batch_size=32  # Larger batches improve GPU utilization
)
```

batch_size controls how many tasks the executor groups into a single call. The ASR stage does not define process_batch(); batching across tasks happens in the executor.

Within a single AudioBatch, process() transcribes all of the file paths together.
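The two levels of batching can be illustrated with a minimal sketch. The batched helper below is hypothetical, illustrative code, not part of NeMo Curator:

```python
from typing import Iterator


def batched(tasks: list, batch_size: int) -> Iterator[list]:
    """Group tasks the way an executor would before each stage call."""
    for start in range(0, len(tasks), batch_size):
        yield tasks[start : start + batch_size]


# Eight AudioBatch tasks with batch_size=3 yield groups of 3, 3, and 2
tasks = [f"audio_batch_{i}" for i in range(8)]
groups = list(batched(tasks, batch_size=3))
print([len(g) for g in groups])  # [3, 3, 2]
```

Each inner group is what the executor hands to the stage in one call; the per-file loop inside process() is a separate, second level of iteration.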

Input Requirements

AudioBatch Format

Data loading stages create input AudioBatch objects that must contain:

```python
# AudioBatch data structure (created automatically by loading stages)
# Each item in the batch contains:
{
    "audio_filepath": "/path/to/audio1.wav",
    # Optional: existing metadata
    "duration": 5.2,
    "language": "en",
}
```
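Loading stages typically derive these items from a directory of audio files. A standard-library sketch of the equivalent bookkeeping (the build_entries helper is hypothetical, not a NeMo Curator API):

```python
from pathlib import Path


def build_entries(audio_dir: str) -> list[dict]:
    """Collect one manifest-style dict per .wav file under audio_dir."""
    return [
        {"audio_filepath": str(path)}
        for path in sorted(Path(audio_dir).glob("*.wav"))
    ]
```

Optional metadata such as duration or language can be added to each dict before the batch reaches the ASR stage.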

Audio File Requirements

  • Supported Formats: Determined by the selected NeMo ASR model; refer to the NeMo ASR documentation.
  • Sample Rates: Typically 16 kHz; refer to the model card for details.
  • Channels: Mono or stereo; channel handling (for example, down-mixing) depends on the model.
  • Duration: Long files can require manual chunking before inference.
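If a corpus contains long recordings, one option is to split each WAV into fixed-length chunks before inference. A sketch using only the standard-library wave module (the chunk length and file naming are arbitrary choices, not NeMo Curator conventions):

```python
import wave
from pathlib import Path


def chunk_wav(src: str, out_dir: str, chunk_seconds: float = 30.0) -> list[str]:
    """Split a WAV file into consecutive chunks of at most chunk_seconds each."""
    out_paths = []
    with wave.open(src, "rb") as reader:
        params = reader.getparams()
        frames_per_chunk = int(params.framerate * chunk_seconds)
        index = 0
        while True:
            frames = reader.readframes(frames_per_chunk)
            if not frames:
                break
            out_path = Path(out_dir) / f"{Path(src).stem}_chunk{index:03d}.wav"
            with wave.open(str(out_path), "wb") as writer:
                writer.setnchannels(params.nchannels)
                writer.setsampwidth(params.sampwidth)
                writer.setframerate(params.framerate)
                writer.writeframes(frames)
            out_paths.append(str(out_path))
            index += 1
    return out_paths
```

Chunking at silence boundaries (for example, with a voice-activity detector) usually yields better transcripts than fixed-length cuts, but the fixed-length version above keeps the sketch dependency-free.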

Output Structure

The ASR stage adds predicted transcriptions to each audio sample:

```python
# Output AudioBatch data structure
{
    "audio_filepath": "/path/to/audio1.wav",
    "pred_text": "this is the predicted transcription",
    "duration": 5.2,   # Preserved from input
    "language": "en",  # Preserved from input
}
```
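Downstream tools commonly persist such records as a JSON-lines manifest, one object per line. A minimal standard-library sketch (the write_manifest helper is illustrative, not a NeMo Curator API):

```python
import json


def write_manifest(records: list[dict], path: str) -> None:
    """Write one JSON object per line, the common ASR manifest layout."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```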

Error Handling

Model Loading Errors

```python
try:
    asr_stage.setup()
except RuntimeError as e:
    print(f"Failed to load ASR model: {e}")
    # Fall back to CPU or a different model
```

Processing Errors

Processing behavior:

  • Input structure validation: The stage uses validate_input() to check required attributes/columns and raises ValueError if they are missing.
  • Model loading failures: setup() raises RuntimeError if model download or initialization fails.
  • No automatic retries or auto-tuning: The stage does not perform automatic batch size reduction or network retries.
  • Missing files: AudioBatch.validate() can log file-existence warnings when code creates tasks; the stage does not auto-skip files.
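The required-key check behaves like the following simplified stand-in (a sketch of the validate_input() contract, not the actual implementation):

```python
def validate_item(
    item: dict, required_keys: tuple[str, ...] = ("audio_filepath",)
) -> None:
    """Raise ValueError if a batch item lacks a required key."""
    missing = [key for key in required_keys if key not in item]
    if missing:
        raise ValueError(f"Batch item is missing required keys: {missing}")
```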

Performance Optimization

GPU Memory Management

```python
# For large models or limited GPU memory
asr_stage = InferenceAsrNemoStage(
    model_name="nvidia/stt_en_fastconformer_hybrid_large_pc"
).with_(
    batch_size=8,  # Reduce batch size
    resources=Resources(gpus=0.5),  # Share GPU resources
)
```

Distributed Processing

```python
from nemo_curator.backends.xenna import XennaExecutor

# Configure executor (refer to Pipeline Execution Backends)
executor = XennaExecutor(
    config={
        "execution_mode": "streaming",
        "logging_interval": 60,
        "ignore_failures": False,
    }
)
```

Integration Examples

Complete Audio-to-Text Pipeline

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.audio.io.convert import AudioToDocumentStage

pipeline = Pipeline(name="audio_to_text")

# ASR inference
pipeline.add_stage(asr_stage)

# Convert to text format
pipeline.add_stage(AudioToDocumentStage())

# Continue with text processing...
```