Overview | NeMo Curator

Curate training data for audio language models by extracting fixed-duration windows from diarized audio segments. The ALM stages read JSONL manifests, build candidate windows that meet quality constraints, remove overlapping windows, and write the filtered results.

How it Works

The ALM pipeline processes audio manifests through a four-stage chain:

ALMManifestReader reads JSONL manifests line by line, producing one AudioTask per entry.
ALMDataBuilderStage constructs candidate windows from consecutive segments, applying sample rate, bandwidth, speaker count, and duration constraints.
ALMDataOverlapStage removes windows that share too much audio content, keeping the windows closest to the target duration.
ALMManifestWriterStage writes the filtered results as JSONL.
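
The first and last steps of the chain amount to JSONL input and output. A minimal sketch of that idea in plain Python follows; it does not use NeMo Curator, and the helper names `read_manifest` and `write_manifest` are hypothetical, chosen only for illustration.

```python
import json
from pathlib import Path
from typing import Iterable, Iterator


def read_manifest(path: Path) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSONL manifest."""
    with path.open("r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines instead of failing on them
                yield json.loads(line)


def write_manifest(entries: Iterable[dict], path: Path) -> int:
    """Write entries as JSONL (one JSON object per line); return the count."""
    count = 0
    with path.open("w", encoding="utf-8") as handle:
        for entry in entries:
            handle.write(json.dumps(entry, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Reading lazily, one line at a time, keeps memory use flat even for large manifests, which is presumably why the reader produces one task per entry.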

All stages run on CPU and support both Xenna and Ray Data backends.

ALM Stages

ALM Data Builder

Construct candidate training windows from diarized audio segments with quality filtering.
Tags: windowing, speaker-count, bandwidth
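
To make the windowing idea concrete, here is a rough sketch in plain Python of building candidate windows from consecutive diarized segments. It is not the stage's actual implementation. The `Segment` class and `build_windows` function are hypothetical, the tolerance is assumed to be a fraction of the target duration, segments are assumed not to overlap in time, and the sample rate and bandwidth checks are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    start: float  # seconds
    end: float    # seconds
    speaker: str


def build_windows(segments, target=120.0, tolerance=0.1,
                  min_speakers=2, max_speakers=5):
    """Return (start, end, speakers) windows built from consecutive segments."""
    # Assumption: tolerance is fractional, so 0.1 allows target +/- 10%.
    low, high = target * (1 - tolerance), target * (1 + tolerance)
    ordered = sorted(segments, key=lambda s: s.start)
    windows = []
    for i, first in enumerate(ordered):
        speakers = set()
        for seg in ordered[i:]:
            duration = seg.end - first.start
            if duration > high:
                break  # adding more segments only lengthens the window
            speakers.add(seg.speaker)
            if duration >= low and min_speakers <= len(speakers) <= max_speakers:
                windows.append((first.start, seg.end, frozenset(speakers)))
    return windows
```

With five back-to-back 30-second segments that alternate between two speakers, this sketch yields two candidates, one covering 0 to 120 seconds and one covering 30 to 150 seconds. Those two overlap heavily, which is the situation the overlap filtering stage is meant to resolve.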

ALM Overlap Filtering

Remove redundant overlapping windows based on configurable thresholds.
Tags: deduplication, overlap-ratio, target-duration
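
One way to picture overlap filtering is a greedy pass that ranks windows by how close they are to the target duration and drops any window that overlaps an already kept one by more than the threshold. The sketch below is illustrative only and is not the stage's actual algorithm. The functions `overlap_ratio` and `filter_overlaps` are hypothetical, and measuring overlap against the shorter of the two windows is an assumption.

```python
def overlap_ratio(a, b):
    """Shared duration of two (start, end) windows, as a fraction of the shorter one."""
    shared = min(a[1], b[1]) - max(a[0], b[0])
    if shared <= 0:
        return 0.0  # disjoint or merely touching
    return shared / min(a[1] - a[0], b[1] - b[0])


def filter_overlaps(windows, overlap_percentage=50, target_duration=120.0):
    """Greedily keep windows, preferring those closest to the target duration."""
    threshold = overlap_percentage / 100.0
    # Rank so that windows nearest the target duration are considered first.
    ranked = sorted(windows, key=lambda w: abs((w[1] - w[0]) - target_duration))
    kept = []
    for window in ranked:
        if all(overlap_ratio(window, other) <= threshold for other in kept):
            kept.append(window)
    return sorted(kept)  # restore chronological order
```

For example, given windows `(0, 120)`, `(30, 150)`, and `(130, 245)` with a 50% threshold, the second window shares 90 of its 120 seconds with the first and is dropped, while the third overlaps nothing that was kept and survives.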

Quick Example

from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.audio.alm import (
    ALMManifestReader,
    ALMDataBuilderStage,
    ALMDataOverlapStage,
    ALMManifestWriterStage,
)

pipeline = Pipeline(name="alm_curation")

# Read input manifests
pipeline.add_stage(ALMManifestReader(manifest_path="/data/manifests/"))

# Build 120-second training windows
pipeline.add_stage(
    ALMDataBuilderStage(
        target_window_duration=120.0,
        tolerance=0.1,
        min_speakers=2,
        max_speakers=5,
    )
)

# Remove windows with more than 50% overlap
pipeline.add_stage(
    ALMDataOverlapStage(
        overlap_percentage=50,
        target_duration=120.0,
    )
)

# Write results
pipeline.add_stage(ALMManifestWriterStage(output_path="/data/output/alm.jsonl"))

Related Topics

ALM Pipeline Concepts: Architectural overview of the ALM pipeline
ALM Tutorial: Step-by-step guide with sample data
Manifests and Ingest: General manifest format concepts
