nemo_curator.stages.audio.io.extract_segments
Audio segment extraction stage.
Extracts audio segments from original source files based on manifest entries produced by NeMo Curator audio pipelines. Auto-detects the pipeline combo from the manifest schema and applies the appropriate extraction strategy:
Combo 2 (no VAD / VAD only):
Extracts each segment by original_start_ms / original_end_ms.
Output: {original_filename}_segment_{NNN}.{format}
Combo 3 (speaker diarization):
Extracts each speaking interval from diar_segments per speaker.
Output: {original_filename}_speaker_{X}_segment_{NNN}.{format}
Combo 4 (VAD + speaker):
Extracts each speaker-segment by timestamps.
Output: {original_filename}_speaker_{X}_segment_{NNN}.{format}
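The naming patterns above can be illustrated with a small helper. This is a sketch, not the stage's actual code: `segment_filename` is a hypothetical name, and it assumes `NNN` is a zero-padded three-digit index and that `original_filename` is the source file's stem. The source does not say whether numbering starts at 0 or 1.

```python
from pathlib import Path
from typing import Optional


def segment_filename(original_file: str, index: int, fmt: str = "wav",
                     speaker: Optional[str] = None) -> str:
    """Build an output name following the documented patterns.

    Assumption: NNN is rendered as a zero-padded, three-digit index.
    """
    stem = Path(original_file).stem  # file name without directory or extension
    if speaker is None:
        # Combo 2 pattern: {original_filename}_segment_{NNN}.{format}
        return f"{stem}_segment_{index:03d}.{fmt}"
    # Combo 3 and 4 pattern: {original_filename}_speaker_{X}_segment_{NNN}.{format}
    return f"{stem}_speaker_{speaker}_segment_{index:03d}.{fmt}"


print(segment_filename("/data/call_01.wav", 0))
print(segment_filename("/data/call_01.wav", 2, fmt="flac", speaker="1"))
```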
Bases: ProcessingStage[AudioTask, AudioTask]
Extract audio segments from original files based on manifest entries.
Receives AudioTask objects whose data dicts are manifest
entries (produced by TimestampMapperStage). For each entry the
stage reads the audio slice from the original file and writes it as
a standalone segment file.
The pipeline combo is auto-detected from the first entry in each
batch. Entries are grouped by original_file so each source is
opened only once per batch.
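A minimal sketch of the group-by-file step, assuming each manifest entry is a plain dict with an `original_file` key and an `original_start_ms` key. The function name `group_by_original_file` is hypothetical and not part of the documented API.

```python
from collections import defaultdict
from typing import Any


def group_by_original_file(entries: list[dict[str, Any]]) -> dict[str, list[dict[str, Any]]]:
    """Group manifest entries by source file so each file is opened once."""
    groups: dict[str, list[dict[str, Any]]] = defaultdict(list)
    for entry in entries:
        groups[entry["original_file"]].append(entry)
    # Sort each group by start time so segments are processed in order.
    for group in groups.values():
        group.sort(key=lambda e: e["original_start_ms"])
    return dict(groups)
```

With the entries grouped this way, a caller can loop over the groups, open each source file once, and extract all of its segments before moving on to the next file.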
This is an IO stage: process() raises NotImplementedError
and all work is done in process_batch(), following the same
pattern as AudioToDocumentStage and ALMManifestWriterStage.
Parameters:
Directory where extracted segment files are written.
Audio format — wav, flac, or ogg.
Combo 2: extract by original_start_ms / original_end_ms.
Group-by-file -> read -> write -> metadata loop.
Combo 3: extract each diar_segment per speaker.
Combo 4: extract speaker-segments by timestamps.
Load a manifest file (or directory of JSONL files) and extract all segments.
This is a convenience method for standalone use outside
of a pipeline. It handles manifest loading, combo detection,
CSV metadata, and summary JSON, and is equivalent to the old
extract_segments() function.
Extract quality/filter score fields from a manifest entry.
Returns all keys that are not structural CSV columns (timestamps, duration, speaker info), with float values rounded for readability. Since TimestampMapper already whitelist-filters the manifest output, anything remaining is a quality score or user-defined field.
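The filtering described above might look like the following sketch. The set of structural keys is an assumption made for illustration (the real column list is not given here), and `extract_score_fields` and the `ndigits` parameter are hypothetical.

```python
from typing import Any

# Assumed structural columns; the actual list used by the stage may differ.
STRUCTURAL_KEYS = {
    "original_file", "original_start_ms", "original_end_ms",
    "duration", "speaker_id", "diar_segments",
}


def extract_score_fields(entry: dict[str, Any], ndigits: int = 4) -> dict[str, Any]:
    """Return every non-structural field, rounding floats for readability."""
    scores: dict[str, Any] = {}
    for key, value in entry.items():
        if key in STRUCTURAL_KEYS:
            continue
        scores[key] = round(value, ndigits) if isinstance(value, float) else value
    return scores
```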
Return (speaker_id, speaker_num) from a manifest entry.
Read a slice of audio from a file.
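To show the millisecond-to-frame arithmetic behind a slice read, here is a sketch using only the standard-library `wave` module. It is not the stage's implementation: it handles WAV only, whereas the stage also supports flac and ogg, and `read_wav_slice` is a hypothetical name.

```python
import wave


def read_wav_slice(path: str, start_ms: int, end_ms: int) -> tuple[bytes, int]:
    """Read raw PCM bytes between two millisecond offsets of a WAV file.

    Returns (pcm_bytes, sample_rate). Offsets past the end are clamped.
    """
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        total = wav.getnframes()
        # Convert milliseconds to frame indices and clamp to the file length.
        start_frame = min(max(0, int(start_ms * rate / 1000)), total)
        end_frame = min(max(start_frame, int(end_ms * rate / 1000)), total)
        wav.setpos(start_frame)
        return wav.readframes(end_frame - start_frame), rate
```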
Write metadata.csv from collected metadata rows.
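One way to write such a file is with `csv.DictWriter`, as sketched below. The function name `write_metadata_csv` is hypothetical, and building the header from the union of all row keys is a design choice made here to cope with rows that carry different score fields; the stage itself may use a fixed column order.

```python
import csv
from pathlib import Path
from typing import Any


def write_metadata_csv(rows: list[dict[str, Any]], output_dir: str) -> Path:
    """Write metadata.csv into output_dir and return its path."""
    path = Path(output_dir) / "metadata.csv"
    # Union of all keys, keeping first-seen order, so no field is dropped.
    fieldnames = list(dict.fromkeys(key for row in rows for key in row))
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return path
```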
Detect which pipeline combo produced the manifest.
Returns 2, 3, or 4. Since TimestampMapper always emits
original_start_ms/original_end_ms, combos 1 and 2 are
indistinguishable and both use timestamp-based extraction.
Returns: int
2: segments by timestamps (combos 1 and 2)
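The detection logic could be sketched as follows. This is a guess at the schema checks, not the actual implementation: `diar_segments` is named in this page, while using a `speaker_id` key to recognise combo 4 is an assumption, and `detect_combo` is a hypothetical name.

```python
from typing import Any


def detect_combo(entry: dict[str, Any]) -> int:
    """Guess the pipeline combo (2, 3, or 4) from one manifest entry."""
    if entry.get("diar_segments"):
        return 3  # per-speaker diarization intervals are present
    if "speaker_id" in entry:  # assumed marker for VAD + speaker output
        return 4
    return 2  # plain timestamp-based extraction (combos 1 and 2)
```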
Extract segments from original audio files based on manifest.
Extract segments by original_start_ms / original_end_ms, sorted by start time.
Extract individual speaking intervals from diar_segments per speaker.
Extract speaker-segments using original_start_ms / original_end_ms.
Load a single manifest.jsonl file and return list of entries.
Load entries from a single jsonl file or a directory of jsonl files.
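The two loaders can be approximated with the standard library alone. In this sketch the names `load_jsonl` and `load_manifest` are hypothetical, and reading directory files in sorted `*.jsonl` order is an assumption chosen to give deterministic results.

```python
import json
from pathlib import Path
from typing import Any, Union


def load_jsonl(path: Union[str, Path]) -> list[dict[str, Any]]:
    """Load one JSONL file, skipping blank lines."""
    entries = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                entries.append(json.loads(line))
    return entries


def load_manifest(path: Union[str, Path]) -> list[dict[str, Any]]:
    """Load entries from a single JSONL file or a directory of them."""
    p = Path(path)
    if p.is_dir():
        entries = []
        for file in sorted(p.glob("*.jsonl")):  # assumed: sorted for determinism
            entries.extend(load_jsonl(file))
        return entries
    return load_jsonl(p)
```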