ALM Data Curation
Curate training data for audio language models by extracting fixed-duration windows from diarized audio segments. The ALM stages read JSONL manifests, build candidate windows that meet quality constraints, remove overlapping windows, and write the filtered results.
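As a concrete illustration, a diarized manifest entry might look like the following; the field names here are hypothetical, and the exact schema depends on how your diarization output was written:

```json
{"audio_filepath": "audio/session_001.wav", "sample_rate": 16000, "segments": [{"start": 0.0, "end": 4.2, "speaker": "spk0"}, {"start": 4.2, "end": 9.8, "speaker": "spk1"}]}
```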
How it Works
The ALM pipeline processes audio manifests through a four-stage chain:
- ALMManifestReader reads JSONL manifests line-by-line, producing one AudioTask per entry
- ALMDataBuilderStage constructs candidate windows from consecutive segments, applying sample rate, bandwidth, speaker count, and duration constraints
- ALMDataOverlapStage removes windows that share too much audio content, keeping windows closest to the target duration
- ALMManifestWriterStage writes filtered results as JSONL
All stages run on CPU and support both Xenna and Ray Data backends.
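The overlap-removal step can be sketched in plain Python. This is not the stage's actual implementation, only a minimal illustration of the idea stated above: rank windows by how close their duration is to the target, then greedily drop any window that shares too much audio with one already kept. The `Window` class, `max_overlap` threshold, and ranking rule are all assumptions for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Window:
    # A candidate training window as a time span (seconds) in the audio.
    start: float
    end: float

    @property
    def duration(self) -> float:
        return self.end - self.start


def overlap_ratio(a: Window, b: Window) -> float:
    """Fraction of the shorter window covered by the intersection."""
    inter = min(a.end, b.end) - max(a.start, b.start)
    if inter <= 0:
        return 0.0
    return inter / min(a.duration, b.duration)


def filter_overlaps(windows, target_duration, max_overlap=0.5):
    """Greedy filter: windows closest to the target duration win ties;
    later windows that overlap a kept window too much are dropped."""
    ranked = sorted(windows, key=lambda w: abs(w.duration - target_duration))
    kept = []
    for w in ranked:
        if all(overlap_ratio(w, k) <= max_overlap for k in kept):
            kept.append(w)
    return kept
```

With a 10-second target, two near-duplicate windows collapse to one while a disjoint window survives, which matches the "keep windows closest to the target duration" behavior described above.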
ALM Stages
ALM Data Builder
Construct candidate training windows from diarized audio segments, applying quality filters for windowing, speaker count, and bandwidth
ALM Overlap Filtering
Remove redundant overlapping windows using configurable overlap-ratio and target-duration thresholds
Quick Example
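The real pipeline composes the four stage classes named above; their constructor signatures are not shown here, so the following is a self-contained sketch of the same read / build / write flow in plain Python. The manifest field names (`audio_filepath`, `segments`, `start`, `end`) and the duration bounds are assumptions for illustration only.

```python
import json


def read_manifest(lines):
    """Stage 1 (reader): one task dict per non-empty JSONL line."""
    return [json.loads(line) for line in lines if line.strip()]


def build_windows(task, min_dur=5.0, max_dur=20.0):
    """Stage 2 (builder): merge consecutive segments into windows
    whose duration stays within [min_dur, max_dur]."""
    windows, cur = [], None
    for seg in task["segments"]:
        if cur is None:
            cur = dict(seg)
        elif seg["end"] - cur["start"] <= max_dur:
            cur["end"] = seg["end"]  # extend the current window
        else:
            windows.append(cur)
            cur = dict(seg)
    if cur is not None:
        windows.append(cur)
    return [w for w in windows if w["end"] - w["start"] >= min_dur]


def write_manifest(windows):
    """Stage 4 (writer): one JSON object per output line."""
    return "\n".join(json.dumps(w) for w in windows)
```

The overlap-removal stage (stage 3) would sit between `build_windows` and `write_manifest`; it is omitted here to keep the sketch short.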
Related Topics
- ALM Pipeline Concepts: Architectural overview of the ALM pipeline
- ALM Tutorial: Step-by-step guide with sample data
- Manifests and Ingest: General manifest format concepts