Perform automatic speech recognition (ASR) on audio files using NeMo Framework models. The ASR inference stage transcribes audio into text, enabling downstream quality assessment and text processing workflows.
The InferenceAsrNemoStage processes AudioTask objects by:
Returning the AudioTask with its original data plus predicted transcriptions.
NeMo Framework provides ready-to-use ASR models for several languages and domains:
batch_size controls how many tasks the executor groups into each call. The ASR stage implements process_batch() as its canonical processing method, and the executor groups tasks into batches of batch_size before calling it.
Within each AudioTask, process_batch() transcribes the audio file referenced by the task's file path.
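The batching contract can be sketched in plain Python. Everything here is hypothetical: AudioTask is a minimal dataclass stand-in, FakeAsrStage and run_in_batches() are illustrative names, fake_transcribe() replaces a real NeMo model, and the audio_filepath and pred_text keys are assumed field names, not confirmed NeMo Curator ones. The sketch only shows how grouping by batch_size determines how many times process_batch() is called.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class AudioTask:
    # Hypothetical stand-in: one audio file path plus arbitrary metadata.
    data: Dict[str, Any] = field(default_factory=dict)


def fake_transcribe(path: str) -> str:
    # Hypothetical stand-in for a real ASR model call.
    return f"transcript of {path}"


class FakeAsrStage:
    """Hypothetical stage exposing process_batch() as its canonical method."""

    def __init__(self, transcribe: Callable[[str], str]):
        self.transcribe = transcribe
        self.calls = 0  # how many times process_batch() has been invoked

    def process_batch(self, tasks: List[AudioTask]) -> List[AudioTask]:
        self.calls += 1
        # Each task is transcribed from its own audio file path;
        # original fields are kept and a prediction is added.
        for task in tasks:
            task.data["pred_text"] = self.transcribe(task.data["audio_filepath"])
        return tasks


def run_in_batches(stage: FakeAsrStage, tasks: List[AudioTask], batch_size: int) -> List[AudioTask]:
    # Split tasks into chunks of at most batch_size; call process_batch() once per chunk.
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    results: List[AudioTask] = []
    for start in range(0, len(tasks), batch_size):
        results.extend(stage.process_batch(tasks[start:start + batch_size]))
    return results


tasks = [AudioTask({"audio_filepath": f"clip_{i}.wav"}) for i in range(5)]
stage = FakeAsrStage(fake_transcribe)
results = run_in_batches(stage, tasks, batch_size=2)
print(stage.calls)  # 3
```

With five tasks and batch_size=2, process_batch() runs three times (2 + 2 + 1 tasks); a larger batch_size means fewer calls, each carrying more audio.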
Data loading stages create input AudioTask objects that must contain:
The ASR stage adds predicted transcriptions to each audio sample:
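A minimal sketch of the output shape, assuming each audio sample is a plain dict and that the prediction is stored under a hypothetical pred_text key (the actual field name is not stated here). Copying each sample keeps its original fields intact while attaching the transcription.

```python
from typing import Any, Dict, List


def add_predictions(
    samples: List[Dict[str, Any]],
    predictions: List[str],
    pred_key: str = "pred_text",  # hypothetical output key name
) -> List[Dict[str, Any]]:
    """Return copies of the samples with one predicted transcription attached to each."""
    if len(samples) != len(predictions):
        raise ValueError("expected exactly one prediction per sample")
    # Copy each sample so its original fields are preserved, then add the prediction.
    return [{**sample, pred_key: text} for sample, text in zip(samples, predictions)]


samples = [{"audio_filepath": "a.wav", "duration": 1.5}]
enriched = add_predictions(samples, ["hello world"])
print(enriched)  # [{'audio_filepath': 'a.wav', 'duration': 1.5, 'pred_text': 'hello world'}]
```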
Processing behavior:
validate_input() checks required attributes/columns and raises ValueError if any are missing.
setup() raises RuntimeError if model download or initialization fails.
AudioTask.validate() can log file-existence warnings when tasks are created; the stage does not automatically skip missing files.