
ALM Data Curation


Curate training data for audio language models by extracting fixed-duration windows from diarized audio segments. The ALM stages read JSONL manifests, build candidate windows that meet quality constraints, remove overlapping windows, and write the filtered results.
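An input manifest is a JSONL file with one diarized recording per line. The sketch below shows what such an entry might look like; the field names (`audio_filepath`, `duration`, `segments`, `speaker`) are illustrative assumptions, not the authoritative ALM schema.

```python
import json

# Hypothetical diarized manifest entry: one recording with
# per-segment timestamps and speaker labels (field names assumed).
entry = {
    "audio_filepath": "/data/audio/meeting_001.wav",
    "duration": 3600.0,
    "segments": [
        {"start": 0.0, "end": 4.2, "speaker": "spk_0", "text": "Good morning, everyone."},
        {"start": 4.5, "end": 9.8, "speaker": "spk_1", "text": "Thanks for joining."},
    ],
}

# Each manifest line is one JSON object, serialized without newlines
line = json.dumps(entry)
print(line)
```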

How it Works

The ALM pipeline processes audio manifests through a four-stage chain:

  1. ALMManifestReader reads JSONL manifests line-by-line, producing one AudioTask per entry
  2. ALMDataBuilderStage constructs candidate windows from consecutive segments, applying sample rate, bandwidth, speaker count, and duration constraints
  3. ALMDataOverlapStage removes windows that share too much audio content, keeping windows closest to the target duration
  4. ALMManifestWriterStage writes filtered results as JSONL
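The candidate-window construction in step 2 can be pictured as a greedy pass over consecutive segments: grow a window until its span reaches the target duration, then keep it only if the duration tolerance and speaker-count constraints hold. This is a minimal illustrative sketch under assumed segment and window shapes, not `ALMDataBuilderStage`'s actual implementation.

```python
def build_windows(segments, target=120.0, tolerance=0.1,
                  min_speakers=2, max_speakers=5):
    """Greedy sketch: from each starting segment, extend the window
    until its span reaches the target duration, then apply constraints."""
    windows = []
    for i in range(len(segments)):
        j = i
        # Extend until the span reaches the lower duration bound
        while (j < len(segments)
               and segments[j]["end"] - segments[i]["start"] < target * (1 - tolerance)):
            j += 1
        if j == len(segments):
            break  # remaining audio is too short for another window
        span = segments[j]["end"] - segments[i]["start"]
        speakers = {s["speaker"] for s in segments[i:j + 1]}
        # Keep the window only if duration and speaker count are in range
        if span <= target * (1 + tolerance) and min_speakers <= len(speakers) <= max_speakers:
            windows.append({"start": segments[i]["start"],
                            "end": segments[j]["end"],
                            "duration": span,
                            "num_speakers": len(speakers)})
    return windows

# Six 30-second segments alternating between two speakers
segments = [{"start": 30.0 * k, "end": 30.0 * k + 30.0, "speaker": f"spk_{k % 2}"}
            for k in range(6)]
windows = build_windows(segments)
```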

All stages run on CPU and support both Xenna and Ray Data backends.
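The overlap filtering in step 3 can be sketched as a pairwise comparison of window time spans: rank windows by how close they are to the target duration, then drop any window whose temporal overlap with an already-kept window exceeds the threshold. This is an illustrative simplification under assumed window fields, not `ALMDataOverlapStage`'s actual code.

```python
def overlap_pct(a, b):
    """Temporal overlap between two windows, as a percentage of the shorter one."""
    inter = max(0.0, min(a["end"], b["end"]) - max(a["start"], b["start"]))
    shorter = min(a["end"] - a["start"], b["end"] - b["start"])
    return 100.0 * inter / shorter if shorter > 0 else 0.0

def dedup(windows, overlap_percentage=50, target_duration=120.0):
    """Greedy filter: prefer windows closest to the target duration,
    then drop later windows that overlap a kept one beyond the threshold."""
    ranked = sorted(windows,
                    key=lambda w: abs((w["end"] - w["start"]) - target_duration))
    kept = []
    for w in ranked:
        if all(overlap_pct(w, k) <= overlap_percentage for k in kept):
            kept.append(w)
    return kept

windows = [
    {"start": 0.0, "end": 120.0},
    {"start": 30.0, "end": 150.0},   # overlaps the first window by 75%
    {"start": 200.0, "end": 320.0},  # disjoint from both
]
kept = dedup(windows)
```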

ALM Stages

Quick Example

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.audio.alm import (
    ALMManifestReader,
    ALMDataBuilderStage,
    ALMDataOverlapStage,
    ALMManifestWriterStage,
)

pipeline = Pipeline(name="alm_curation")

# Read input manifests
pipeline.add_stage(ALMManifestReader(manifest_path="/data/manifests/"))

# Build 120-second training windows
pipeline.add_stage(
    ALMDataBuilderStage(
        target_window_duration=120.0,
        tolerance=0.1,
        min_speakers=2,
        max_speakers=5,
    )
)

# Remove windows with more than 50% overlap
pipeline.add_stage(
    ALMDataOverlapStage(
        overlap_percentage=50,
        target_duration=120.0,
    )
)

# Write results
pipeline.add_stage(ALMManifestWriterStage(output_path="/data/output/alm.jsonl"))
```