
Copyright © 2026, NVIDIA Corporation.


nemo_curator.stages.audio.io.extract_segments

Audio segment extraction stage.

Extracts audio segments from original source files based on manifest entries produced by NeMo Curator audio pipelines. Auto-detects the pipeline combo from the manifest schema and applies the appropriate extraction strategy:

Combo 2 (no VAD / VAD only): Extracts each segment by original_start_ms / original_end_ms. Output: {original_filename}_segment_{NNN}.{format}

Combo 3 (speaker diarization): Extracts each speaking interval from diar_segments per speaker. Output: {original_filename}_speaker_{X}_segment_{NNN}.{format}

Combo 4 (VAD + speaker): Extracts each speaker-segment by timestamps. Output: {original_filename}_speaker_{X}_segment_{NNN}.{format}
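
The output naming patterns above can be illustrated with a small sketch. The helper name segment_filename is hypothetical, and both the three-digit zero padding for NNN and the zero-based index are assumptions; the source only shows the placeholder patterns.

```python
from __future__ import annotations


def segment_filename(
    original_filename: str,
    index: int,
    fmt: str = "wav",
    speaker: str | None = None,
) -> str:
    """Build an output name following the documented placeholder patterns."""
    # Combos 3 and 4 insert a speaker label; combo 2 does not.
    stem = original_filename
    if speaker is not None:
        stem = f"{stem}_speaker_{speaker}"
    # Assumption: NNN is a zero-padded, three-digit segment index.
    return f"{stem}_segment_{index:03d}.{fmt}"


print(segment_filename("call_01", 0))               # call_01_segment_000.wav
print(segment_filename("call_01", 2, "flac", "1"))  # call_01_speaker_1_segment_002.flac
```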

Module Contents

Classes

Name | Description
SegmentExtractionStage | Extract audio segments from original files based on manifest entries.

Functions

Name | Description
_base_metadata | -
_extract_scores | Extract quality/filter score fields from a manifest entry.
_get_speaker_label | Return (speaker_id, speaker_num) from a manifest entry.
_intervals_from_diar_segments | -
_intervals_from_timestamps | -
_read_segment | Read a slice of audio from a file.
_write_metadata_csv | Write metadata.csv from collected metadata rows.
detect_combo | Detect which pipeline combo produced the manifest.
extract_segments | Extract segments from original audio files based on manifest.
extract_segments_by_timestamps | Extract segments by original_start_ms / original_end_ms, sorted by start time.
extract_speaker_diar_segments | Extract individual speaking intervals from diar_segments per speaker.
extract_speaker_segments_by_timestamps | Extract speaker-segments using original_start_ms / original_end_ms.
load_manifest | Load a single manifest.jsonl file and return list of entries.
load_manifests | Load entries from a single jsonl file or a directory of jsonl files.

Data

DEFAULT_OUTPUT_FORMAT

Interval

SOUNDFILE_FORMATS

_CSV_STRUCTURAL_KEYS

API

class nemo_curator.stages.audio.io.extract_segments.SegmentExtractionStage(
name: str = 'SegmentExtraction',
output_dir: str = '',
output_format: str = DEFAULT_OUTPUT_FORMAT,
batch_size: int = 64,
resources: nemo_curator.stages.resources.Resources = (lambda: Resources(cpus=1.0...
)
Dataclass

Bases: ProcessingStage[AudioTask, AudioTask]

Extract audio segments from original files based on manifest entries.

Receives AudioTask objects whose data dicts are manifest entries (produced by TimestampMapperStage). For each entry the stage reads the audio slice from the original file and writes it as a standalone segment file.

The pipeline combo is auto-detected from the first entry in each batch. Entries are grouped by original_file so each source is opened only once per batch.

This is an IO stage: process() raises NotImplementedError and all work is done in process_batch(), following the same pattern as AudioToDocumentStage and ALMManifestWriterStage.
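
The batch-only pattern and the per-file grouping described above might look roughly like the sketch below. BatchOnlyStage and group_by_source are hypothetical stand-ins that operate on plain dicts; they are not the NeMo Curator classes, and the real stage works on AudioTask objects.

```python
from collections import defaultdict


class BatchOnlyStage:
    """Hypothetical stand-in for an IO stage that only supports batches."""

    def process(self, task: dict) -> dict:
        # Per-task processing is intentionally unsupported in this pattern.
        raise NotImplementedError("use process_batch() instead")

    @staticmethod
    def group_by_source(entries: list[dict]) -> dict[str, list[dict]]:
        # Group manifest entries by source file so that each file
        # needs to be opened only once per batch.
        groups: dict[str, list[dict]] = defaultdict(list)
        for entry in entries:
            groups[entry["original_file"]].append(entry)
        return dict(groups)

    def process_batch(self, tasks: list[dict]) -> list[dict]:
        for source, entries in self.group_by_source(tasks).items():
            # A real stage would open `source` here and write one
            # segment file per entry; this sketch only visits the groups.
            _ = (source, len(entries))
        return tasks
```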

Parameters:

output_dir: str. Defaults to ''.
Directory where extracted segment files are written.

output_format: str. Defaults to DEFAULT_OUTPUT_FORMAT.
Audio format: wav, flac, or ogg.

batch_size: int = 64
name: str = 'SegmentExtraction'
output_dir: str = ''
output_format: str = DEFAULT_OUTPUT_FORMAT
resources: Resources

nemo_curator.stages.audio.io.extract_segments.SegmentExtractionStage.__post_init__() -> None
nemo_curator.stages.audio.io.extract_segments.SegmentExtractionStage._extract_by_timestamps(
entries: list[dict]
) -> tuple[int, float, dict[str, int], list[dict]]

Combo 2: extract by original_start_ms / original_end_ms.

nemo_curator.stages.audio.io.extract_segments.SegmentExtractionStage._extract_file_segments(
entries: list[dict],
sort_key: collections.abc.Callable[[dict], typing.Any],
get_intervals: collections.abc.Callable[[dict], list[nemo_curator.stages.audio.io.extract_segments.Interval]],
make_filename: collections.abc.Callable[[str, dict, int], str]
) -> tuple[int, float, dict[str, int], list[dict]]

Group-by-file -> read -> write -> metadata loop.

nemo_curator.stages.audio.io.extract_segments.SegmentExtractionStage._extract_speaker_diar(
entries: list[dict]
) -> tuple[int, float, dict[str, int], list[dict]]

Combo 3: extract each diar_segment per speaker.

nemo_curator.stages.audio.io.extract_segments.SegmentExtractionStage._extract_speaker_timestamps(
entries: list[dict]
) -> tuple[int, float, dict[str, int], list[dict]]

Combo 4: extract speaker-segments by timestamps.

nemo_curator.stages.audio.io.extract_segments.SegmentExtractionStage.extract_from_manifest(
input_path: str
) -> None

Load a manifest file (or directory of JSONL files) and extract all segments.

This is a convenience method for standalone usage outside of a pipeline. It handles manifest loading, combo detection, CSV metadata, and summary JSON, and is equivalent to the old extract_segments() function.
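
For standalone use, the call sequence might look like the following sketch. It relies only on the constructor parameters and method signature documented on this page. The wrapper name run_extraction is hypothetical, and any paths passed to it are the caller's own.

```python
def run_extraction(manifest_path: str, output_dir: str, output_format: str = "wav") -> None:
    """Extract every segment listed in a manifest file or directory of JSONL files."""
    # Imported lazily so this sketch can be loaded without NeMo Curator installed.
    from nemo_curator.stages.audio.io.extract_segments import SegmentExtractionStage

    stage = SegmentExtractionStage(output_dir=output_dir, output_format=output_format)
    # Returns None; results are written under output_dir.
    stage.extract_from_manifest(manifest_path)
```

A call such as run_extraction("manifests/", "segments/", "flac") would then process every JSONL file in the (placeholder) manifests/ directory.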

nemo_curator.stages.audio.io.extract_segments.SegmentExtractionStage.inputs() -> tuple[list[str], list[str]]
nemo_curator.stages.audio.io.extract_segments.SegmentExtractionStage.num_workers() -> int | None
nemo_curator.stages.audio.io.extract_segments.SegmentExtractionStage.outputs() -> tuple[list[str], list[str]]
nemo_curator.stages.audio.io.extract_segments.SegmentExtractionStage.process(
task: nemo_curator.tasks.AudioTask
) -> nemo_curator.tasks.AudioTask
nemo_curator.stages.audio.io.extract_segments.SegmentExtractionStage.process_batch(
tasks: list[nemo_curator.tasks.AudioTask]
) -> list[nemo_curator.tasks.AudioTask]
nemo_curator.stages.audio.io.extract_segments.SegmentExtractionStage.xenna_stage_spec() -> dict[str, typing.Any]
nemo_curator.stages.audio.io.extract_segments._base_metadata(
filename: str,
original_file: str,
entry: dict,
seg_idx: int,
start_ms: int,
end_ms: int,
dur: float
) -> dict
nemo_curator.stages.audio.io.extract_segments._extract_scores(
entry: dict
) -> dict

Extract quality/filter score fields from a manifest entry.

Returns all keys that are not structural CSV columns (timestamps, duration, speaker info), with float values rounded for readability. Since TimestampMapper already whitelist-filters the manifest output, anything remaining is a quality score or user-defined field.
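
The filtering described above can be sketched as follows. This is not the library's implementation: the key set lists only the four structural keys visible in the truncated _CSV_STRUCTURAL_KEYS value on this page, and the rounding precision of four digits is an assumption.

```python
# Only the keys visible on this page; the documented set is truncated.
STRUCTURAL_KEYS = frozenset(
    {"filename", "original_file", "original_start_ms", "original_end_ms"}
)


def extract_scores_sketch(entry: dict, ndigits: int = 4) -> dict:
    """Keep non-structural fields, rounding floats for readability."""
    scores = {}
    for key, value in entry.items():
        if key in STRUCTURAL_KEYS:
            continue  # structural column, not a score
        # Assumption: floats are rounded to `ndigits`; other types pass through.
        scores[key] = round(value, ndigits) if isinstance(value, float) else value
    return scores


entry = {
    "original_file": "a.wav",
    "original_start_ms": 0,
    "original_end_ms": 1500,
    "snr": 23.456789,
    "lang": "en",
}
print(extract_scores_sketch(entry))  # {'snr': 23.4568, 'lang': 'en'}
```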

nemo_curator.stages.audio.io.extract_segments._get_speaker_label(
entry: dict
) -> tuple[str, str]

Return (speaker_id, speaker_num) from a manifest entry.

nemo_curator.stages.audio.io.extract_segments._intervals_from_diar_segments(
entry: dict
) -> list[nemo_curator.stages.audio.io.extract_segments.Interval]
nemo_curator.stages.audio.io.extract_segments._intervals_from_timestamps(
entry: dict
) -> list[nemo_curator.stages.audio.io.extract_segments.Interval]
nemo_curator.stages.audio.io.extract_segments._read_segment(
filepath: str,
start_ms: int,
end_ms: int,
sample_rate: int
) -> numpy.ndarray

Read a slice of audio from a file.
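
Reading a slice by millisecond offsets comes down to converting the offsets to frame indices. The sketch below shows that conversion using the standard-library wave module. It differs from the documented function in two ways: it takes the sample rate from the file header instead of a sample_rate argument, and it returns raw PCM bytes rather than a numpy.ndarray.

```python
import os
import tempfile
import wave


def read_segment_frames(filepath: str, start_ms: int, end_ms: int) -> bytes:
    """Return the raw PCM bytes between start_ms and end_ms of a WAV file."""
    with wave.open(filepath, "rb") as wav:
        rate = wav.getframerate()
        total = wav.getnframes()
        # Convert millisecond offsets to frame indices, clamped to the file length.
        start_frame = min(start_ms * rate // 1000, total)
        end_frame = min(end_ms * rate // 1000, total)
        wav.setpos(start_frame)
        return wav.readframes(max(end_frame - start_frame, 0))


# Demo: one second of 16 kHz, 16-bit mono silence.
path = os.path.join(tempfile.mkdtemp(), "silence.wav")
with wave.open(path, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)
    out.setframerate(16000)
    out.writeframes(b"\x00\x00" * 16000)

chunk = read_segment_frames(path, 250, 750)
print(len(chunk) // 2)  # 8000 frames (0.5 s at 16 kHz)
```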

nemo_curator.stages.audio.io.extract_segments._write_metadata_csv(
output_dir: str,
metadata_rows: list[dict]
) -> str

Write metadata.csv from collected metadata rows.
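
A minimal sketch of writing such a file with the standard-library csv module is shown below. The column order (union of keys in first-seen order) and the empty-string fill for missing fields are assumptions, not documented behavior.

```python
import csv
import os
import tempfile


def write_metadata_csv_sketch(output_dir: str, metadata_rows: list[dict]) -> str:
    """Write metadata.csv into output_dir and return its path."""
    os.makedirs(output_dir, exist_ok=True)
    csv_path = os.path.join(output_dir, "metadata.csv")
    # Assumption: columns are the union of all row keys, in first-seen order.
    fieldnames = list(dict.fromkeys(key for row in metadata_rows for key in row))
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(metadata_rows)
    return csv_path


rows = [
    {"filename": "a_segment_000.wav", "original_file": "a.wav"},
    {"filename": "a_segment_001.wav", "original_file": "a.wav", "snr": 21.5},
]
csv_path = write_metadata_csv_sketch(tempfile.mkdtemp(), rows)
with open(csv_path, newline="", encoding="utf-8") as handle:
    loaded = list(csv.DictReader(handle))
print(loaded[0]["snr"] == "", loaded[1]["snr"])  # True 21.5
```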

nemo_curator.stages.audio.io.extract_segments.detect_combo(
entries: list
) -> int

Detect which pipeline combo produced the manifest.

Returns 2, 3, or 4. Since TimestampMapper always emits original_start_ms/original_end_ms, combos 1 and 2 are indistinguishable and both use timestamp-based extraction.

Returns: int

2 for segments by timestamps (combos 1 and 2), 3 for speaker diarization segments, or 4 for speaker-segments by timestamps.
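
Schema-based detection of this kind could be sketched as below. The actual detection rules are not shown on this page, so the specific checks are assumptions: the presence of a diar_segments key is taken to mean combo 3, a speaker_id key to mean combo 4, and anything else falls back to timestamp-based extraction.

```python
def detect_combo_sketch(entries: list[dict]) -> int:
    """Guess the pipeline combo (2, 3, or 4) from the first manifest entry."""
    if not entries:
        # Assumption: an empty manifest is treated as an error.
        raise ValueError("manifest contains no entries")
    first = entries[0]
    if "diar_segments" in first:
        return 3  # speaker diarization intervals are present
    if "speaker_id" in first:
        return 4  # speaker label plus original_start_ms / original_end_ms
    return 2  # timestamps only (covers combos 1 and 2)


print(detect_combo_sketch([{"original_start_ms": 0, "original_end_ms": 900}]))  # 2
```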

nemo_curator.stages.audio.io.extract_segments.extract_segments(
input_path: str,
output_dir: str,
output_format: str = DEFAULT_OUTPUT_FORMAT
) -> None

Extract segments from original audio files based on manifest.

nemo_curator.stages.audio.io.extract_segments.extract_segments_by_timestamps(
entries: list,
output_dir: str,
output_format: str
) -> tuple[int, float, dict[str, int], list[dict]]

Extract segments by original_start_ms / original_end_ms, sorted by start time.

nemo_curator.stages.audio.io.extract_segments.extract_speaker_diar_segments(
entries: list,
output_dir: str,
output_format: str
) -> tuple[int, float, dict[str, int], list[dict]]

Extract individual speaking intervals from diar_segments per speaker.

nemo_curator.stages.audio.io.extract_segments.extract_speaker_segments_by_timestamps(
entries: list,
output_dir: str,
output_format: str
) -> tuple[int, float, dict[str, int], list[dict]]

Extract speaker-segments using original_start_ms / original_end_ms.

nemo_curator.stages.audio.io.extract_segments.load_manifest(
manifest_path: str
) -> list

Load a single manifest.jsonl file and return list of entries.

nemo_curator.stages.audio.io.extract_segments.load_manifests(
input_path: str,
output_dir: str
) -> list

Load entries from a single jsonl file or a directory of jsonl files.
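
Loading from either a file or a directory can be sketched with the standard library as follows. The helper names are hypothetical, the *.jsonl glob and sorted file order are assumptions, and the sketch omits the output_dir argument of the documented function because its role is not described here.

```python
import json
import tempfile
from pathlib import Path


def load_jsonl(path: Path) -> list[dict]:
    """Parse one JSON object per non-empty line."""
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def load_entries(input_path: str) -> list[dict]:
    """Load entries from a single JSONL file or every *.jsonl file in a directory."""
    root = Path(input_path)
    # Assumption: directory contents are read in sorted filename order.
    files = sorted(root.glob("*.jsonl")) if root.is_dir() else [root]
    entries: list[dict] = []
    for file in files:
        entries.extend(load_jsonl(file))
    return entries


demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "a.jsonl").write_text('{"original_file": "a.wav"}\n', encoding="utf-8")
(demo_dir / "b.jsonl").write_text('{"original_file": "b.wav"}\n\n', encoding="utf-8")
print(len(load_entries(str(demo_dir))))  # 2
```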

nemo_curator.stages.audio.io.extract_segments.DEFAULT_OUTPUT_FORMAT = 'wav'
nemo_curator.stages.audio.io.extract_segments.Interval = tuple[int, int, float]
nemo_curator.stages.audio.io.extract_segments.SOUNDFILE_FORMATS = {'wav': 'PCM_16', 'flac': 'PCM_16', 'ogg': 'VORBIS'}
nemo_curator.stages.audio.io.extract_segments._CSV_STRUCTURAL_KEYS = frozenset({'filename', 'original_file', 'original_start_ms', 'original_end_ms', ...
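
The data values above suggest how an output format might be validated and how an Interval could be built. The sketch below copies the documented constants. The helper names are hypothetical, the case-insensitive lookup is an assumption, and so is the reading of the third Interval element as a duration in seconds.

```python
DEFAULT_OUTPUT_FORMAT = "wav"
SOUNDFILE_FORMATS = {"wav": "PCM_16", "flac": "PCM_16", "ogg": "VORBIS"}
Interval = tuple[int, int, float]


def resolve_subtype(output_format: str = DEFAULT_OUTPUT_FORMAT) -> str:
    """Map an output format to its encoding subtype, rejecting unknown formats."""
    fmt = output_format.lower()  # assumption: lookup is case-insensitive
    if fmt not in SOUNDFILE_FORMATS:
        supported = ", ".join(sorted(SOUNDFILE_FORMATS))
        raise ValueError(f"unsupported format {output_format!r}; expected one of: {supported}")
    return SOUNDFILE_FORMATS[fmt]


def make_interval(start_ms: int, end_ms: int) -> Interval:
    # Assumption: the float is the segment duration in seconds.
    return (start_ms, end_ms, (end_ms - start_ms) / 1000.0)


print(resolve_subtype("ogg"))     # VORBIS
print(make_interval(1500, 4000))  # (1500, 4000, 2.5)
```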