nemo_curator.stages.audio.tagging.inference.nemo_asr_align

NeMo ASR Aligner Stage.

Contains BaseASRProcessorStage (shared config and segment preparation) and NeMoASRAlignerStage (forced alignment via NeMo FastConformer).

These stages are tagging-pipeline-specific because they operate on tagging manifest keys like split_filepaths, split_metadata, and segments.

Module Contents

Classes

Name | Description
BaseASRProcessorStage | Base class for ASR stages with shared config and segment preparation.
NeMoASRAlignerStage | Stage that aligns text and audio using NeMo ASR models.

API

class nemo_curator.stages.audio.tagging.inference.nemo_asr_align.BaseASRProcessorStage(
min_len: float = 1.0,
max_len: float = 40.0,
batch_size: int = 32,
num_workers: int = 10,
split_batch_size: int = 5000,
infer_segment_only: bool = False,
text_key: str = 'text',
words_key: str = 'words',
compute_timestamps: bool = True,
segments_key: str = 'segments',
name: str = 'BaseASRProcessor',
resources: nemo_curator.stages.resources.Resources = (lambda: Resources(gpus=1))()
)
Dataclass

Bases: ProcessingStage[AudioTask, AudioTask]

Base class for ASR stages with shared config and segment preparation.

Provides common fields and _prepare_segment_batch_with_metadata for segment-only inference. Subclasses must implement setup() and process().

Parameters:

min_len
float. Defaults to 1.0

Minimum length of audio segments to process (seconds).

max_len
float. Defaults to 40.0

Maximum length of audio segments to process (seconds).

num_workers
int. Defaults to 10

Number of workers for data loading.

split_batch_size
int. Defaults to 5000

Maximum number of entries or paths per batch when chunking.

infer_segment_only
bool. Defaults to False

If True, run inference on segments only; otherwise process full audio files or meta-entries.

text_key
str. Defaults to 'text'

Key for predicted text in manifest.

words_key
str. Defaults to 'words'

Key for word alignments in manifest (same as SDP alignment_key).

compute_timestamps
bool. Defaults to True

Whether to compute word-level timestamps.

segments_key
str. Defaults to 'segments'

Key for segments list in manifest.
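
As a rough illustration of how min_len, max_len, and split_batch_size might interact, the sketch below filters segments by duration and then splits the survivors into chunks. The helper names filter_by_duration and chunk, the 'start'/'end' keys, and the inclusive bounds are assumptions made for this example; they are not taken from the module.

```python
from collections.abc import Iterator


def filter_by_duration(
    segments: list[dict], min_len: float = 1.0, max_len: float = 40.0
) -> list[dict]:
    """Keep segments whose duration (seconds) lies within [min_len, max_len].

    Assumes each segment dict carries hypothetical 'start' and 'end' keys.
    """
    return [s for s in segments if min_len <= s["end"] - s["start"] <= max_len]


def chunk(items: list, size: int = 5000) -> Iterator[list]:
    """Yield consecutive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i : i + size]


segments = [
    {"start": 0.0, "end": 0.5},    # 0.5 s: shorter than min_len, dropped
    {"start": 1.0, "end": 6.0},    # 5.0 s: kept
    {"start": 10.0, "end": 60.0},  # 50.0 s: longer than max_len, dropped
    {"start": 60.0, "end": 62.0},  # 2.0 s: kept
]
kept = filter_by_duration(segments)
batches = list(chunk(kept, size=1))  # tiny chunk size just for the demo
```

With the defaults above, two of the four segments survive the filter, and a chunk size of 1 yields one batch per surviving segment.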

_device
str

Derive device from resources configuration.

batch_size: int = 32
compute_timestamps: bool = True
infer_segment_only: bool = False
max_len: float = 40.0
min_len: float = 1.0
name: str = 'BaseASRProcessor'
num_workers: int = 10
resources: Resources = field(default_factory=(lambda: Resources(gpus=1)))
segments_key: str = 'segments'
split_batch_size: int = 5000
text_key: str = 'text'
words_key: str = 'words'

nemo_curator.stages.audio.tagging.inference.nemo_asr_align.BaseASRProcessorStage._prepare_segment_batch_with_metadata(
metadata_batch: list[dict],
cut_audio_segments: bool = False,
segments_key: str = 'segments'
) -> list[dict]

Prepare segment metadata for a batch.

Collects segment metadata with indices for later processing. Mirrors generic_sdp BaseASRProcessor._prepare_segment_batch_with_metadata.

Parameters:

metadata_batch
list[dict]

List of metadata dicts, each with a segments list.

cut_audio_segments
bool. Defaults to False

If True, load audio and cut segments (numpy); if False, only collect resampled_audio_filepath from segments.

segments_key
str. Defaults to 'segments'

Key for the segments list in each metadata dict.

Returns: list[dict]

List of segment metadata dicts with metadata_idx, segment_idx, and
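
The index-collection idea described above can be sketched in plain Python. This is not the module's implementation: the function name collect_segment_metadata is hypothetical, only the cut_audio_segments=False path is modelled, and carrying resampled_audio_filepath as the third field is an assumption based on the parameter description.

```python
def collect_segment_metadata(
    metadata_batch: list[dict], segments_key: str = "segments"
) -> list[dict]:
    """Flatten nested segments into one list, remembering where each came from."""
    collected = []
    for metadata_idx, metadata in enumerate(metadata_batch):
        for segment_idx, segment in enumerate(metadata.get(segments_key, [])):
            collected.append(
                {
                    "metadata_idx": metadata_idx,  # position in the batch
                    "segment_idx": segment_idx,    # position in the segments list
                    # Assumed third field: the path collected when audio is not cut.
                    "resampled_audio_filepath": segment.get("resampled_audio_filepath"),
                }
            )
    return collected


batch = [
    {"segments": [{"resampled_audio_filepath": "a_0.wav"},
                  {"resampled_audio_filepath": "a_1.wav"}]},
    {"segments": []},  # entries without segments contribute nothing
    {"segments": [{"resampled_audio_filepath": "c_0.wav"}]},
]
flat = collect_segment_metadata(batch)

# The stored indices let per-segment results be written back to the right place.
for item in flat:
    target = batch[item["metadata_idx"]]["segments"][item["segment_idx"]]
    target["text"] = f"transcript for {item['resampled_audio_filepath']}"
```

Keeping both indices means inference can run over one flat list while results still land on the correct segment of the correct entry.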

class nemo_curator.stages.audio.tagging.inference.nemo_asr_align.NeMoASRAlignerStage(
min_len: float = 1.0,
max_len: float = 40.0,
batch_size: int = 100,
num_workers: int = 10,
split_batch_size: int = 5000,
infer_segment_only: bool = False,
text_key: str = 'text',
words_key: str = 'words',
compute_timestamps: bool = True,
segments_key: str = 'segments',
name: str = 'NeMoASRAligner',
resources: nemo_curator.stages.resources.Resources = (lambda: Resources(gpus=1))(),
model_name: str = 'nvidia/parakeet-tdt_ctc-1.1b',
model_path: str | None = None,
is_fastconformer: bool = True,
decoder_type: str = 'rnnt',
transcribe_batch_size: int = 32,
timestamp_type: str = 'word',
disable_word_confidence: bool = False,
_asr_model: typing.Any = None,
_override_cfg: typing.Any = None
)
Dataclass

Bases: BaseASRProcessorStage

Stage that aligns text and audio using NeMo ASR models.

Uses a pre-trained ASR model to transcribe audio files and generate word-level alignments with timestamps. Supports both CTC and RNNT decoders and can process either full audio files or just specific segments.

Parameters:

model_name
str. Defaults to 'nvidia/parakeet-tdt_ctc-1.1b'

Name of the pretrained model to use.

model_path
str | None. Defaults to None

Path to a local model file. If provided, overrides model_name.

is_fastconformer
bool. Defaults to True

Whether the model's encoder is FastConformer.

decoder_type
str. Defaults to 'rnnt'

Type of decoder ('ctc' or 'rnnt').

transcribe_batch_size
int. Defaults to 32

Batch size for transcribing.

timestamp_type
str. Defaults to 'word'

Type of timestamp ('word' or 'char').

disable_word_confidence
bool. Defaults to False

Whether to disable word confidence score computation.

_asr_model: Any = field(default=None, repr=False)
_override_cfg: Any = field(default=None, repr=False)
batch_size: int = 100
compute_timestamps: bool = True
decoder_type: str = 'rnnt'
disable_word_confidence: bool = False
infer_segment_only: bool = False
is_fastconformer: bool = True
max_len: float = 40.0
min_len: float = 1.0
model_name: str = 'nvidia/parakeet-tdt_ctc-1.1b'
model_path: str | None = None
name: str = 'NeMoASRAligner'
num_workers: int = 10
segments_key: str = 'segments'
text_key: str = 'text'
timestamp_type: str = 'word'
transcribe_batch_size: int = 32
words_key: str = 'words'

nemo_curator.stages.audio.tagging.inference.nemo_asr_align.NeMoASRAlignerStage.__post_init__() -> None

Validate config.
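
The exact checks performed here are not documented. As a minimal sketch of what dataclass validation in __post_init__ could look like, the stand-in below rejects values outside the documented choices for decoder_type and timestamp_type. The class AlignerConfigSketch and the specific checks are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class AlignerConfigSketch:
    """Hypothetical stand-in mirroring three documented fields."""

    decoder_type: str = "rnnt"
    timestamp_type: str = "word"
    transcribe_batch_size: int = 32

    def __post_init__(self) -> None:
        # Assumed checks: restrict each field to its documented choices.
        if self.decoder_type not in ("ctc", "rnnt"):
            raise ValueError(f"decoder_type must be 'ctc' or 'rnnt', got {self.decoder_type!r}")
        if self.timestamp_type not in ("word", "char"):
            raise ValueError(f"timestamp_type must be 'word' or 'char', got {self.timestamp_type!r}")
        if self.transcribe_batch_size < 1:
            raise ValueError("transcribe_batch_size must be a positive integer")


cfg = AlignerConfigSketch(decoder_type="ctc")
```

Because __post_init__ runs right after the generated __init__, a bad value fails at construction time rather than later, during model setup.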

nemo_curator.stages.audio.tagging.inference.nemo_asr_align.NeMoASRAlignerStage.get_alignments_text(
hypotheses: typing.Any
) -> tuple[list, str]

Extract word alignments and text from model hypotheses.
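
To make the tuple[list, str] return shape concrete, here is a sketch that works on a stand-in hypothesis object. The attribute layout (a text string plus a timestamp dict with a 'word' list of start/end entries) and the function name extract_alignments_and_text are assumptions; the real hypothesis type and the keys written to the manifest may differ.

```python
from types import SimpleNamespace


def extract_alignments_and_text(hypothesis) -> tuple[list, str]:
    """Return (word alignments, text) from a hypothesis-like object.

    Assumes `hypothesis.timestamp["word"]` is a list of dicts with
    'word', 'start', and 'end' keys, with times in seconds.
    """
    entries = hypothesis.timestamp.get("word", [])
    alignments = [
        {"word": e["word"], "start": e["start"], "end": e["end"]} for e in entries
    ]
    return alignments, hypothesis.text


# Stand-in object; a real model hypothesis is not constructed here.
hyp = SimpleNamespace(
    text="hello world",
    timestamp={
        "word": [
            {"word": "hello", "start": 0.0, "end": 0.4},
            {"word": "world", "start": 0.5, "end": 0.9},
        ]
    },
)
alignments, text = extract_alignments_and_text(hyp)
```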

nemo_curator.stages.audio.tagging.inference.nemo_asr_align.NeMoASRAlignerStage.inputs() -> tuple[list[str], list[str]]
nemo_curator.stages.audio.tagging.inference.nemo_asr_align.NeMoASRAlignerStage.load_model() -> None
nemo_curator.stages.audio.tagging.inference.nemo_asr_align.NeMoASRAlignerStage.outputs() -> tuple[list[str], list[str]]
nemo_curator.stages.audio.tagging.inference.nemo_asr_align.NeMoASRAlignerStage.process(
task: nemo_curator.tasks.AudioTask
) -> nemo_curator.tasks.AudioTask
nemo_curator.stages.audio.tagging.inference.nemo_asr_align.NeMoASRAlignerStage.process_batch(
tasks: list[nemo_curator.tasks.AudioTask]
) -> list[nemo_curator.tasks.AudioTask]

Process a batch of AudioTasks for ASR alignment.

nemo_curator.stages.audio.tagging.inference.nemo_asr_align.NeMoASRAlignerStage.process_full_audio(
tasks: list[nemo_curator.tasks.AudioTask]
) -> list[nemo_curator.tasks.AudioTask]

Process entries as full audio (or meta-entries with split_filepaths).

nemo_curator.stages.audio.tagging.inference.nemo_asr_align.NeMoASRAlignerStage.process_segments(
tasks: list[nemo_curator.tasks.AudioTask]
) -> list[nemo_curator.tasks.AudioTask]

Process entries in segment-only mode (infer per segment).
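
Given the infer_segment_only flag and the two processing methods above, one plausible reading is that process_batch routes to one of them based on the flag. The sketch below shows that dispatch with a hypothetical DispatchSketch class and string placeholders instead of AudioTask objects; the routing is an assumption, not confirmed behaviour.

```python
from dataclasses import dataclass


@dataclass
class DispatchSketch:
    """Hypothetical stand-in showing flag-based routing only."""

    infer_segment_only: bool = False

    def process_segments(self, tasks: list[str]) -> list[str]:
        return [f"segments:{t}" for t in tasks]

    def process_full_audio(self, tasks: list[str]) -> list[str]:
        return [f"full:{t}" for t in tasks]

    def process_batch(self, tasks: list[str]) -> list[str]:
        if not tasks:
            return []
        if self.infer_segment_only:
            return self.process_segments(tasks)
        return self.process_full_audio(tasks)


full = DispatchSketch().process_batch(["a", "b"])
per_segment = DispatchSketch(infer_segment_only=True).process_batch(["a", "b"])
```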

nemo_curator.stages.audio.tagging.inference.nemo_asr_align.NeMoASRAlignerStage.setup(
_: nemo_curator.backends.base.WorkerMetadata | None = None
) -> None

Load model to device and configure decoding (called per replica).

nemo_curator.stages.audio.tagging.inference.nemo_asr_align.NeMoASRAlignerStage.setup_on_node(
_node_info: nemo_curator.backends.base.NodeInfo | None = None,
_worker_metadata: nemo_curator.backends.base.WorkerMetadata | None = None
) -> None

Download model weights without loading into memory (called once per node).
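
The split between setup_on_node (once per node) and setup (once per replica) can be pictured with a small mock. Nothing below touches a real model or the Curator backends: the class TwoPhaseStageSketch, the dict standing in for a node-local cache, and the byte string standing in for weights are all illustrative assumptions.

```python
class TwoPhaseStageSketch:
    """Mock stage separating a per-node download from a per-replica load."""

    def __init__(self, node_cache: dict) -> None:
        self.node_cache = node_cache  # stands in for a node-local disk cache
        self.model = None

    def setup_on_node(self) -> None:
        # Fetch weights only if this node does not already have them.
        self.node_cache.setdefault("weights", b"fake-weights")

    def setup(self) -> None:
        # Each replica loads its own model instance from the shared cache.
        self.model = ("loaded", self.node_cache["weights"])


node_cache: dict = {}
TwoPhaseStageSketch(node_cache).setup_on_node()  # runs once for the node

replicas = [TwoPhaseStageSketch(node_cache) for _ in range(3)]
for replica in replicas:
    replica.setup()  # runs once per replica
```

The point of the split is that the expensive fetch happens once, while every replica still ends up with its own loaded model.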