Curate training data for audio language models by extracting fixed-duration windows from diarized audio segments. The ALM stages read JSONL manifests, build candidate windows that meet quality constraints, remove overlapping windows, and write the filtered results.
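The ALM pipeline processes audio manifests through a four-stage chain: read the JSONL manifest (one AudioTask per entry), build candidate windows, remove overlapping windows, and write the filtered results.
All stages run on CPU and support both Xenna and Ray Data backends.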
Construct candidate training windows from diarized audio segments with quality filtering. Tags: windowing, speaker-count, bandwidth
Remove redundant overlapping windows based on configurable thresholds. Tags: deduplication, overlap-ratio, target-duration
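The two middle stages can be illustrated with a minimal, self-contained Python sketch. It is not the library's implementation or API: the Segment and Window classes, the function names, the parameters (target_duration, tolerance, min_speakers, max_speakers, max_overlap), and their default values are all hypothetical, chosen only to show one plausible way to grow windows from diarized segments and then drop windows that overlap too much. Bandwidth checks are omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One diarized segment (hypothetical shape, times in seconds)."""
    start: float
    end: float
    speaker: str


@dataclass(frozen=True)
class Window:
    """A candidate training window (hypothetical shape)."""
    start: float
    end: float
    speakers: frozenset


def build_windows(segments, target_duration=120.0, tolerance=0.1,
                  min_speakers=2, max_speakers=5):
    """Grow one window from each segment start until it reaches the target.

    A window is accepted when its duration falls within
    target_duration * (1 +/- tolerance) and its speaker count is in range.
    All thresholds here are illustrative assumptions.
    """
    segments = sorted(segments, key=lambda s: s.start)
    lo = target_duration * (1 - tolerance)
    hi = target_duration * (1 + tolerance)
    windows = []
    for i, first in enumerate(segments):
        speakers = set()
        for seg in segments[i:]:
            duration = seg.end - first.start
            if duration > hi:
                break  # adding this segment would overshoot the target
            speakers.add(seg.speaker)
            if duration >= lo and min_speakers <= len(speakers) <= max_speakers:
                windows.append(Window(first.start, seg.end, frozenset(speakers)))
                break
    return windows


def overlap_ratio(a, b):
    """Shared time divided by the shorter window's duration."""
    shared = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    shorter = min(a.end - a.start, b.end - b.start)
    return shared / shorter if shorter > 0 else 0.0


def remove_overlaps(windows, max_overlap=0.5, target_duration=120.0):
    """Keep windows greedily, preferring durations closest to the target."""
    ranked = sorted(windows, key=lambda w: abs((w.end - w.start) - target_duration))
    kept = []
    for w in ranked:
        if all(overlap_ratio(w, k) <= max_overlap for k in kept):
            kept.append(w)
    return sorted(kept, key=lambda w: w.start)


if __name__ == "__main__":
    # Six 10-second segments alternating between two speakers.
    segs = [Segment(float(t), float(t + 10), "A" if (t // 10) % 2 == 0 else "B")
            for t in range(0, 60, 10)]
    candidates = build_windows(segs, target_duration=30.0)
    for w in remove_overlaps(candidates, max_overlap=0.5, target_duration=30.0):
        print(w.start, w.end, sorted(w.speakers))
```

With the sample data, the builder proposes four 30-second candidates starting at 0, 10, 20, and 30 seconds, and the overlap filter keeps the ones spanning 0 to 30 and 20 to 50 seconds. The greedy ranking by distance from the target duration is one possible tie-breaking policy; the actual stages may rank or threshold windows differently.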