nemo_curator.stages.audio.tagging.split
Audio Splitting and Joining Stages.
Bases: ProcessingStage[AudioTask, AudioTask]
Stage for joining metadata of previously split audio files.
Combines the metadata (transcripts and alignments) of audio files that were previously split by SplitLongAudioStage. Adjusts timestamps and concatenates transcripts to recreate the original audio’s metadata.
Join metadata from split audio files.
Process entries and join split audio metadata.
This stage collects all entries and processes meta-entries to join split audio files back together.
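The joining step described above (shift each split's timestamps by the total duration of the splits before it, then concatenate the transcripts) can be sketched in plain Python. This is only an illustration and not the stage's actual implementation: the function name `join_split_metadata` and the field names `text`, `duration`, `alignment`, `word`, `start`, and `end` are assumptions made for the example.

```python
from typing import Any


def join_split_metadata(splits: list[dict[str, Any]]) -> dict[str, Any]:
    """Join per-split transcripts and word alignments into one entry.

    All field names here are hypothetical and chosen only for this sketch.
    Splits must be given in their original order.
    """
    texts: list[str] = []
    alignment: list[dict[str, Any]] = []
    offset = 0.0  # start time of the current split within the original audio

    for split in splits:
        if split["text"]:
            texts.append(split["text"])
        for word in split["alignment"]:
            # Timestamps inside a split are relative to that split, so
            # shift them by the duration of everything that came before.
            alignment.append(
                {
                    "word": word["word"],
                    "start": round(word["start"] + offset, 3),
                    "end": round(word["end"] + offset, 3),
                }
            )
        offset += split["duration"]

    return {
        "text": " ".join(texts),
        "duration": round(offset, 3),
        "alignment": alignment,
    }
```

The running `offset` is the only state needed, which is why the splits have to be processed in order.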
Bases: CompositeStage[AudioTask, AudioTask]
Composite stage: Split long audio -> ASR align -> Join results.
Decomposes into three sequential stages that always run together: split long audio, ASR alignment, and join results.
Parameters:
suggested_max_len: Target max length for audio segments (seconds).
Minimum length for any split segment (also used by ASR).
Maximum length of audio segments for ASR processing (seconds).
Pretrained NeMo ASR model name.
Local model file path (overrides model_name if set).
Whether the model encoder is FastConformer.
Decoder type: "ctc" or "rnnt".
Entries per processing chunk in ASR.
Batch size passed to the ASR model’s transcribe call.
Max entries/paths per batch when chunking segments.
Data-loading workers for ASR inference.
If True, run ASR only on individual segments rather than full audio / meta-entries.
Whether to compute word-level timestamps.
Timestamp granularity ("word" or "char").
Output key for predicted text.
Output key for word-level alignments.
Whether to disable word confidence scores.
Key for the segments list in each manifest entry.
Bases: ProcessingStage[AudioTask, AudioTask]
Stage that splits long audio files into smaller segments.
Processes audio files that exceed a specified maximum length by splitting them at natural pauses to maintain speech coherence.
Parameters:
Target maximum length for audio segments in seconds
Minimum length for any split segment
Build per-split metadata dicts from filepaths and durations.
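As a rough illustration of what building per-split metadata from file paths and durations might look like, the sketch below pairs each path with its duration and records where the split starts in the original audio. The function name `build_split_metadata` and the keys `audio_filepath`, `duration`, `split_index`, and `offset` are assumptions for this example, not the method's confirmed output schema.

```python
from typing import Any


def build_split_metadata(
    filepaths: list[str], durations: list[float]
) -> list[dict[str, Any]]:
    """Build one metadata dict per split (hypothetical keys)."""
    if len(filepaths) != len(durations):
        raise ValueError("filepaths and durations must have the same length")

    metadata: list[dict[str, Any]] = []
    offset = 0.0  # start of this split within the original audio, in seconds
    for index, (path, duration) in enumerate(zip(filepaths, durations)):
        metadata.append(
            {
                "audio_filepath": path,
                "duration": duration,
                "split_index": index,
                "offset": offset,
            }
        )
        offset += duration
    return metadata
```

Storing the offset with each split is one way to make the later join step straightforward, since the original timeline can be restored without re-reading the audio.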
Core splitting logic, separated to keep statement count within limits.
Get the split points for the audio file based on segments.
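One plausible way to choose split points from speech segments is a greedy pass over the pauses between them: keep extending the current chunk, and when the next pause would push it past the target length, cut at the last acceptable pause. The sketch below assumes segments are sorted `(start, end)` tuples in seconds and uses the midpoint of each pause as a candidate. The function name, its signature, and the greedy strategy itself are assumptions for illustration, not the stage's confirmed algorithm.

```python
def get_split_points(
    segments: list[tuple[float, float]],
    duration: float,
    suggested_max_len: float,
    min_len: float,
) -> list[float]:
    """Greedily pick split points at pauses between speech segments.

    A chunk may still exceed suggested_max_len when no pause is available,
    which is why the limit is only a target in this sketch.
    """
    # Candidate split points: the midpoint of every gap between segments.
    candidates = [
        (prev_end + next_start) / 2
        for (_, prev_end), (next_start, _) in zip(segments, segments[1:])
        if next_start > prev_end
    ]

    split_points: list[float] = []
    chunk_start = 0.0
    last_candidate = None  # latest pause that yields valid chunk lengths

    # The audio end acts as a sentinel so the final chunk is checked too.
    for point in [*candidates, duration]:
        if point - chunk_start > suggested_max_len and last_candidate is not None:
            split_points.append(last_candidate)
            chunk_start = last_candidate
            last_candidate = None
        is_pause = point < duration
        # Only keep pauses that leave at least min_len on both sides.
        if is_pause and point - chunk_start >= min_len and duration - point >= min_len:
            last_candidate = point

    return split_points
```

For example, with segments `[(0.0, 4.0), (5.0, 9.0), (10.0, 14.0), (16.0, 19.0)]`, a 20-second file, a target of 10 seconds, and a 2-second minimum, the function returns `[9.5, 15.0]`, giving chunks of 9.5, 5.5, and 5.0 seconds.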
Process entry to split long audio files.