Learn how to curate training data for audio language models using NVIDIA NeMo Curator’s ALM pipeline. This tutorial walks you through reading diarized audio manifests, constructing fixed-duration training windows, filtering overlapping windows, and writing the results.
This tutorial demonstrates the ALM data curation workflow:
What you will learn:
The complete working code for this tutorial is located at:
Accessing the code:
The ALM pipeline runs entirely on CPU. No GPU is required.
Each line of the input JSONL manifest must contain the following fields:
Required fields:
audio_filepath: Path to the source audio file
audio_sample_rate: Sample rate in Hz (entries below min_sample_rate are skipped)
segments: Array of diarized speech segments, each with start, end, speaker, and metrics.bandwidth
Sample input data is available at tests/fixtures/audio/alm/sample_input.jsonl in the repository.
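To make the required fields concrete, the following is a minimal sketch of a manifest entry and a pre-flight check for it. The helper validate_entry, the example path, and all values are hypothetical and are not part of NeMo Curator; the units of start, end, and metrics.bandwidth are assumed here and should be checked against the sample fixture.

```python
import json

# Field names taken from the list above; everything else is illustrative.
REQUIRED_TOP_LEVEL = ("audio_filepath", "audio_sample_rate", "segments")
REQUIRED_SEGMENT = ("start", "end", "speaker")


def validate_entry(entry):
    """Return a list of problems found in one manifest entry (empty if none)."""
    problems = [f"missing field: {key}" for key in REQUIRED_TOP_LEVEL if key not in entry]
    for i, seg in enumerate(entry.get("segments", [])):
        for key in REQUIRED_SEGMENT:
            if key not in seg:
                problems.append(f"segment {i}: missing field: {key}")
        # bandwidth is nested under the segment's metrics object
        metrics = seg.get("metrics")
        if not isinstance(metrics, dict) or "bandwidth" not in metrics:
            problems.append(f"segment {i}: missing field: metrics.bandwidth")
    return problems


# Made-up entry: the path, speaker labels, and numbers are placeholders.
example_entry = {
    "audio_filepath": "/data/audio/example.wav",
    "audio_sample_rate": 16000,
    "segments": [
        {"start": 0.0, "end": 4.2, "speaker": "speaker_0", "metrics": {"bandwidth": 8000}},
        {"start": 4.5, "end": 9.1, "speaker": "speaker_1", "metrics": {"bandwidth": 8000}},
    ],
}

manifest_line = json.dumps(example_entry)  # JSONL: one JSON object per line
print(validate_entry(json.loads(manifest_line)))  # prints []
```

Running a check like this over each line before starting the pipeline can surface malformed entries early instead of leaving them to show up as low yield.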
The ALM pipeline is defined in pipeline.yaml with four stages:
The following table describes the key parameters for each stage:
ALMDataBuilderStage parameters:
ALMDataOverlapStage parameters:
Run the pipeline using the Hydra-based runner:
Override individual stage parameters from the command line:
Test the pipeline with the included sample data:
Run this command from the repository root so that the fixture path matches the one used in the in-repo tutorials/audio/alm/README.md:
Expected output with sample data (five input entries):
After the pipeline completes, the output JSONL file contains one line per input entry. The example below highlights the most common fields; real output also includes the pre-filter candidate windows list and additional duration and diagnostic counters (dur_lost_bw, dur_lost_sr, audio_sample_rate, manifest_filepath) that are omitted here for brevity.
Key output fields:
windows: All candidate windows produced by ALMDataBuilderStage before overlap filtering (preserved so you can compare pre- and post-filter results)
filtered_windows: Windows that passed both quality and overlap filtering
speaker_durations: Top five speakers by duration within each window, zero-padded to length five
filtered_dur: Total duration of all filtered windows for this entry
filtered_dur_list: Duration of each individual filtered window
total_dur_window: Total duration of all input windows before filtering
stats: Breakdown of why segments were excluded (bandwidth, sample rate, speaker count, window constraints)
truncation_events: Number of segments that were truncated to fit within the maximum window duration
The stats dictionary helps diagnose low pipeline yield:
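As a rough way to inspect yield across a whole output file, the sketch below sums the duration fields and the stats counters over all entries. It is not part of NeMo Curator: summarize_output is a hypothetical helper, and it assumes that stats is a flat dictionary of numeric counters and that the duration fields are plain numbers. The key names inside stats are not assumed; whatever keys appear are summed.

```python
import json
from collections import Counter


def summarize_output(lines):
    """Aggregate yield figures over an iterable of output JSONL lines."""
    total_filtered = 0.0
    total_window = 0.0
    n_filtered_windows = 0
    stats = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        entry = json.loads(line)
        total_filtered += entry.get("filtered_dur", 0.0)
        total_window += entry.get("total_dur_window", 0.0)
        n_filtered_windows += len(entry.get("filtered_windows", []))
        # Assumption: stats maps reason names to numeric counters.
        for key, value in entry.get("stats", {}).items():
            if isinstance(value, (int, float)):
                stats[key] += value
    return {
        "filtered_dur": total_filtered,
        "total_dur_window": total_window,
        "filtered_windows": n_filtered_windows,
        "stats": dict(stats),
    }


# Usage with a file (the output path is a placeholder):
#     with open("output.jsonl", encoding="utf-8") as f:
#         print(summarize_output(f))
```

Comparing the summed filtered_dur with total_dur_window shows how much duration the overlap filter removed, and the largest aggregated stats counters point to the exclusion reason that dominates.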
Pass a list of paths or a directory:
The ALMManifestReader discovers all .jsonl and .json files in the directory and its subdirectories.
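To preview which files a directory input would cover, the discovery rule described above can be approximated with the standard library. This is a sketch of the stated behavior, not the ALMManifestReader implementation; discover_manifests is a hypothetical helper, and details such as ordering or symlink handling may differ in the real reader.

```python
from pathlib import Path

MANIFEST_SUFFIXES = {".jsonl", ".json"}


def discover_manifests(root):
    """List .jsonl and .json files under root, including subdirectories."""
    root = Path(root)
    if root.is_file():
        return [root]  # a single manifest path is passed through unchanged
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file() and path.suffix in MANIFEST_SUFFIXES
    )


# Usage (the directory name is a placeholder):
#     for manifest in discover_manifests("manifests/"):
#         print(manifest)
```

Because .json files are matched as well, it is worth checking that the directory holds no unrelated JSON files, such as configuration files, before pointing the pipeline at it.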
After completing this tutorial, explore: