> For clean Markdown of any page, append .md to the page URL.
> For a complete documentation index, see https://docs.nvidia.com/nemo/curator/llms.txt.
> For full documentation content, see https://docs.nvidia.com/nemo/curator/llms-full.txt.
> For AI client integration (Claude Code, Cursor, etc.), connect to the MCP server at https://docs.nvidia.com/nemo/curator/_mcp/server.

> Perform automatic speech recognition using NeMo Framework models with GPU acceleration and batch processing

# ASR Inference

Perform automatic speech recognition (ASR) on audio files using NeMo Framework models. The ASR inference stage transcribes audio into text, enabling downstream quality assessment and text processing workflows.

## How it Works

The `InferenceAsrNemoStage` processes `AudioTask` objects by:

1. **Input Validation**: Verifies required attributes and data structure
2. **Model Loading**: Downloads and initializes NeMo ASR models on GPU or CPU
3. **Batch Processing**: Groups audio files for efficient inference
4. **Transcription**: Generates text predictions for each audio file
5. **Output Creation**: Returns `AudioTask` with original data plus predicted transcriptions

## Basic Usage

### Simple ASR Inference

```python
from nemo_curator.stages.audio.inference.asr_nemo import InferenceAsrNemoStage
from nemo_curator.stages.resources import Resources

# Create ASR inference stage
asr_stage = InferenceAsrNemoStage(
    model_name="nvidia/stt_en_fastconformer_hybrid_large_pc",
    filepath_key="audio_filepath",
    pred_text_key="pred_text"
)

# Configure for GPU processing
asr_stage = asr_stage.with_(
    resources=Resources(gpus=1.0),
    batch_size=16
)
```

### Multilingual ASR

```python
# Use language-specific models
language_models = {
    "en_us": "nvidia/stt_en_fastconformer_hybrid_large_pc",
    "es_419": "nvidia/stt_es_fastconformer_hybrid_large_pc",
    "hy_am": "nvidia/stt_hy_fastconformer_hybrid_large_pc",
}

# Create stage for Armenian
armenian_asr = InferenceAsrNemoStage(
    model_name=language_models["hy_am"]
)
```

## Configuration Options

### Model Selection

NeMo Framework provides ready-to-use ASR models for several languages and domains:

```python
# Domain-specific models
models = {
    "general": "nvidia/stt_en_fastconformer_hybrid_large_pc",
    "telephony": "nvidia/stt_en_fastconformer_telephony_large",
    "streaming": "nvidia/stt_en_fastconformer_streaming_large",
}
```

### Resource Configuration

```python
from nemo_curator.stages.resources import Resources

# GPU configuration
asr_stage = asr_stage.with_(
    resources=Resources(
        gpus=1.0,           # Number of GPUs (multi-GPU aware stages)
        cpus=4.0            # CPU cores
    )
)

# Alternatively, request fractional single-GPU memory (do not combine with gpus):
# asr_stage = asr_stage.with_(resources=Resources(cpus=4.0, gpu_memory_gb=16.0))
```

### Batch Processing

```python
# Optimize batch size based on GPU memory
asr_stage = asr_stage.with_(
    batch_size=32  # Larger batches improve GPU utilization
)
```

`batch_size` controls the number of tasks the executor groups per call. The ASR stage defines `process_batch()` as its canonical method — the executor groups tasks by `batch_size` before calling it.

Within a single `AudioTask`, `process_batch()` transcribes the audio file path.

## Input Requirements

### AudioTask Format

Data loading stages create input `AudioTask` objects that must contain:

```python
# AudioTask's data object structure (created automatically by loading stages)
# Each item in the batch contains:
{
    "audio_filepath": "/path/to/audio1.wav",
    # Optional: existing metadata
    "duration": 5.2,
    "language": "en"
}
```

### Audio File Requirements

* **Supported Formats**: Determined by the selected NeMo ASR model; refer to the NeMo ASR documentation.
* **Sample Rates**: Typically 16 kHz; refer to the model card for details.
* **Channels**: Mono or stereo; channel handling (for example, down-mixing) depends on the model.
* **Duration**: Long files can require manual chunking before inference.

## Output Structure

The ASR stage adds predicted transcriptions to each audio sample:

```python
# Output AudioTask's data object structure
{
    "audio_filepath": "/path/to/audio1.wav",
    "pred_text": "this is the predicted transcription",
    "duration": 5.2,  # Preserved from input
    "language": "en"  # Preserved from input
}
```

## Error Handling

### Model Loading Errors

```python
try:
    asr_stage.setup()
except RuntimeError as e:
    print(f"Failed to load ASR model: {e}")
    # Fallback to CPU or different model
```

### Processing Errors

Processing behavior:

* **Input structure validation**: The stage uses `validate_input()` to check required attributes/columns and raises `ValueError` if they are missing.
* **Model loading failures**: `setup()` raises `RuntimeError` if model download or initialization fails.
* **No automatic retries or auto-tuning**: The stage does not perform automatic batch size reduction or network retries.
* **Missing files**: `AudioTask.validate()` can log file-existence warnings when code creates tasks; the stage does not auto-skip files.

## Performance Optimization

### GPU Memory Management

```python
# For large models or limited GPU memory
asr_stage = InferenceAsrNemoStage(
    model_name="nvidia/stt_en_fastconformer_hybrid_large_pc"
).with_(
    batch_size=8,  # Reduce batch size
    resources=Resources(gpus=0.5)  # Share GPU resources
)
```

### Distributed Processing

```python
from nemo_curator.backends.xenna import XennaExecutor

# Configure executor (refer to Pipeline Execution Backends)
executor = XennaExecutor(
    config={
        "execution_mode": "streaming",
        "logging_interval": 60,
        "ignore_failures": False
    }
)
```

## Integration Examples

### Complete Audio-to-Text Pipeline

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.audio.io.convert import AudioToDocumentStage

pipeline = Pipeline(name="audio_to_text")

# ASR inference
pipeline.add_stage(asr_stage)

# Convert to text format
pipeline.add_stage(AudioToDocumentStage())

# Continue with text processing...
```