nemo_curator.stages.audio.tagging.inference.nemo_asr_align
NeMo ASR Aligner Stage.
Contains BaseASRProcessorStage (shared config and segment preparation) and NeMoASRAlignerStage (forced alignment via NeMo FastConformer).
These stages are tagging-pipeline-specific because they operate on tagging manifest keys like split_filepaths, split_metadata, and segments.
Module Contents
Classes
API
Bases: ProcessingStage[AudioTask, AudioTask]
Base class for ASR stages with shared config and segment preparation.
Provides common fields and _prepare_segment_batch_with_metadata for segment-only inference. Subclasses must implement setup() and process().
Parameters:
Minimum length of audio segments to process (seconds).
Maximum length of audio segments to process (seconds).
Number of workers for data loading.
Max entries/paths per batch when chunking.
If True, process segments only; otherwise process full audio files or meta-entries.
Key for predicted text in manifest.
Key for word alignments in manifest (same as SDP alignment_key).
Whether to compute word-level timestamps.
Key for segments list in manifest.
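The batch limit described in the parameters above can be pictured as plain list chunking. The following is a minimal sketch, not the stage's actual code: the helper name `chunk_entries` and the argument name `max_batch_size` are hypothetical and chosen only for illustration.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def chunk_entries(entries: Sequence[T], max_batch_size: int) -> Iterator[list[T]]:
    """Yield consecutive chunks holding at most max_batch_size items each.

    Hypothetical helper: the real stage may chunk differently.
    """
    if max_batch_size <= 0:
        raise ValueError("max_batch_size must be positive")
    for start in range(0, len(entries), max_batch_size):
        yield list(entries[start:start + max_batch_size])


# Five paths split into batches of at most two entries.
paths = [f"audio_{i}.wav" for i in range(5)]
batches = list(chunk_entries(paths, max_batch_size=2))
```

The last batch may be smaller than the limit; no padding is applied in this sketch.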
Derive device from resources configuration.
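One plausible way to derive a device from a resources configuration is to check whether any GPU is requested. This is a sketch under stated assumptions: `ResourcesSketch` is a stand-in dataclass written for this example, and its `cpus` and `gpus` fields are assumed rather than taken from the library's own resources class.

```python
from dataclasses import dataclass


@dataclass
class ResourcesSketch:
    """Stand-in for a resources configuration (field names are assumptions)."""

    cpus: float = 1.0
    gpus: float = 0.0


def derive_device(resources: ResourcesSketch) -> str:
    """Return "cuda" when any GPU share is requested, otherwise "cpu"."""
    return "cuda" if resources.gpus > 0 else "cpu"


cpu_device = derive_device(ResourcesSketch())
gpu_device = derive_device(ResourcesSketch(gpus=1.0))
```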
Prepare segment metadata for a batch.
Collects segment metadata with indices for later processing. Mirrors generic_sdp BaseASRProcessor._prepare_segment_batch_with_metadata.
Parameters:
List of metadata dicts, each with a segments list.
If True, load audio and cut segments (numpy); if False, only collect resampled_audio_filepath from segments.
Key for the segments list in each metadata dict.
Returns: list[dict]
List of segment metadata dicts with metadata_idx, segment_idx, and
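The index bookkeeping described for segment preparation can be sketched as a flattening loop that records where each segment came from. Only `metadata_idx`, `segment_idx`, and the `segments` key come from the text above. The function name `collect_segments`, the `start`/`end` keys (in seconds), the inclusive length bounds, and the `segment` output field are assumptions made for illustration, and audio loading is omitted.

```python
from typing import Any


def collect_segments(
    metadata_batch: list[dict[str, Any]],
    segments_key: str = "segments",
    min_len: float = 0.0,
    max_len: float = float("inf"),
) -> list[dict[str, Any]]:
    """Flatten per-entry segment lists, remembering each segment's origin."""
    collected: list[dict[str, Any]] = []
    for metadata_idx, metadata in enumerate(metadata_batch):
        for segment_idx, segment in enumerate(metadata.get(segments_key, [])):
            # Assumed keys: "start" and "end", both in seconds.
            duration = segment["end"] - segment["start"]
            if not (min_len <= duration <= max_len):
                continue  # Skip segments outside the configured length range.
            collected.append(
                {
                    "metadata_idx": metadata_idx,
                    "segment_idx": segment_idx,
                    "segment": segment,
                }
            )
    return collected


batch = [
    {"segments": [{"start": 0.0, "end": 1.5}, {"start": 1.5, "end": 1.6}]},
    {"segments": [{"start": 2.0, "end": 6.0}]},
]
prepared = collect_segments(batch, min_len=0.5, max_len=10.0)
```

Keeping both indices lets per-segment predictions be written back to the right entry and segment after batched inference.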
Bases: BaseASRProcessorStage
Stage that aligns text and audio using NeMo ASR models.
Uses a pre-trained ASR model to transcribe audio files and generate word-level alignments with timestamps. Supports both CTC and RNNT decoders and can process either full audio files or just specific segments.
Parameters:
Name of the pretrained model to use. Defaults to “nvidia/parakeet-tdt_ctc-1.1b”.
Path to a local model file. If provided, overrides model_name.
Whether the model’s encoder is a FastConformer.
Type of decoder (‘ctc’ or ‘rnnt’). Defaults to “rnnt”.
Batch size for transcribing. Defaults to 32.
Type of timestamp (‘word’ or ‘char’).
Whether to disable word confidence score computation.
Validate config.
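A config check of this kind might look like the sketch below. The allowed values (‘ctc’/‘rnnt’, ‘word’/‘char’) and the defaults "rnnt" and 32 come from the parameter descriptions above. The class name `AlignerConfigSketch` and the field names `decoder_type`, `timestamp_type`, and `batch_size` are hypothetical, and the real stage may validate more or different conditions.

```python
from dataclasses import dataclass


@dataclass
class AlignerConfigSketch:
    """Illustrative config holder; field names are assumptions."""

    timestamp_type: str
    decoder_type: str = "rnnt"
    batch_size: int = 32

    def validate(self) -> None:
        """Raise ValueError when a field holds an unsupported value."""
        if self.decoder_type not in {"ctc", "rnnt"}:
            raise ValueError(f"Unsupported decoder type: {self.decoder_type!r}")
        if self.timestamp_type not in {"word", "char"}:
            raise ValueError(f"Unsupported timestamp type: {self.timestamp_type!r}")
        if self.batch_size <= 0:
            raise ValueError("batch_size must be positive")


config = AlignerConfigSketch(timestamp_type="word")
config.validate()  # Passes silently for a valid configuration.
```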
Extract word alignments and text from model hypotheses.
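To illustrate the extraction step, the sketch below pulls text and word timings out of a hypothesis-like structure. It uses plain dicts as a stand-in for model hypotheses, which are objects in practice. The layout assumed here (a `text` field and a `timestamp` mapping with a `word` list whose items carry `word`, `start`, and `end`) and the function name `extract_word_alignments` are assumptions for this example, not the stage's confirmed format.

```python
from typing import Any


def extract_word_alignments(
    hypothesis: dict[str, Any],
) -> tuple[str, list[dict[str, Any]]]:
    """Return (text, word alignments) from a hypothesis-like dict."""
    text = hypothesis.get("text", "")
    alignments = [
        {"word": item["word"], "start": item["start"], "end": item["end"]}
        for item in hypothesis.get("timestamp", {}).get("word", [])
    ]
    return text, alignments


# Stand-in hypothesis with timings in seconds (made-up illustrative values).
hyp = {
    "text": "hello world",
    "timestamp": {
        "word": [
            {"word": "hello", "start": 0.0, "end": 0.4},
            {"word": "world", "start": 0.48, "end": 0.96},
        ]
    },
}
text, words = extract_word_alignments(hyp)
```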
Process a batch of AudioTasks for ASR alignment.
Process entries as full audio (or meta-entries with split_filepaths).
Process entries in segment-only mode (infer per segment).
Load model to device and configure decoding (called per replica).
Download model weights without loading into memory (called once per node).