ALM Pipeline Tutorial
Learn how to curate training data for audio language models using NVIDIA NeMo Curator’s ALM pipeline. This tutorial walks you through reading diarized audio manifests, constructing fixed-duration training windows, filtering overlapping windows, and writing the results.
Overview
This tutorial demonstrates the ALM data curation workflow:
- Read Manifests: Stream JSONL manifests with diarized audio metadata
- Build Windows: Construct candidate training windows from consecutive segments
- Filter Overlaps: Remove redundant windows that share too much audio content
- Write Results: Export filtered windows as JSONL for downstream training
What you will learn:
- How to configure and run the four-stage ALM pipeline
- How to tune window duration, speaker count, and quality thresholds
- How to select between the Xenna and Ray Data backends
- How to interpret pipeline output and loss statistics
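To make the Build Windows stage more concrete, here is a minimal sketch of constructing candidate windows from consecutive diarized segments. It is not NeMo Curator's implementation: the function name build_windows and the parameters min_dur, max_dur, min_speakers, and max_speakers are hypothetical, chosen only to illustrate the duration and speaker-count constraints mentioned above.

```python
def build_windows(segments, min_dur, max_dur, min_speakers=2, max_speakers=5):
    """Return one candidate window per starting segment (illustrative only).

    All parameter names here are assumptions for this sketch, not the
    actual ALMDataBuilderStage options.
    """
    windows = []
    for i, first in enumerate(segments):
        chosen = []
        for seg in segments[i:]:
            # Stop before the window would exceed the maximum duration.
            if seg["end"] - first["start"] > max_dur:
                break
            chosen.append(seg)
        if not chosen:
            continue
        duration = chosen[-1]["end"] - first["start"]
        speakers = {seg["speaker"] for seg in chosen}
        # Keep the window only if it is long enough and the speaker
        # count falls inside the allowed range.
        if duration >= min_dur and min_speakers <= len(speakers) <= max_speakers:
            windows.append({
                "start": first["start"],
                "end": chosen[-1]["end"],
                "speakers": sorted(speakers),
            })
    return windows


# Made-up segments: two speakers alternating in 10-second turns.
example_segments = [
    {"start": 0.0, "end": 10.0, "speaker": "A"},
    {"start": 10.0, "end": 20.0, "speaker": "B"},
    {"start": 20.0, "end": 30.0, "speaker": "A"},
    {"start": 30.0, "end": 40.0, "speaker": "B"},
]
candidates = build_windows(example_segments, min_dur=20.0, max_dur=25.0)
print([(w["start"], w["end"]) for w in candidates])
# prints [(0.0, 20.0), (10.0, 30.0), (20.0, 40.0)]
```

Because every segment can start a window, neighboring candidates overlap heavily, which is why a separate overlap-filtering stage follows the builder.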
Working Example Location
The complete working code for this tutorial is located at:
Accessing the code:
Prerequisites
- NeMo Curator installed with audio extras (refer to the Installation Guide)
- Python 3.10 or later
- Input data in JSONL format with diarized segments (refer to the Input Format section)
- Basic familiarity with Hydra configuration
The ALM pipeline runs entirely on CPU. No GPU is required.
Input Format
Each line of the input JSONL manifest must contain the following fields:
Required fields:
- audio_filepath: Path to the source audio file
- audio_sample_rate: Sample rate in Hz (entries below min_sample_rate are skipped)
- segments: Array of diarized speech segments, each with start, end, speaker, and metrics.bandwidth
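As a quick sanity check before running the pipeline, the required fields can be verified with a few lines of standard-library Python. This is a minimal sketch: the helper validate_entry is hypothetical, and the example values (file path, sample rate, bandwidth) are illustrative and not taken from the sample fixture.

```python
import json

REQUIRED_TOP = ("audio_filepath", "audio_sample_rate", "segments")
REQUIRED_SEGMENT = ("start", "end", "speaker", "metrics")


def validate_entry(entry: dict) -> list[str]:
    """Return a list of problems found in one manifest entry (empty if none)."""
    problems = [f"missing field: {key}" for key in REQUIRED_TOP if key not in entry]
    for i, seg in enumerate(entry.get("segments", [])):
        for key in REQUIRED_SEGMENT:
            if key not in seg:
                problems.append(f"segment {i}: missing {key}")
        if "bandwidth" not in seg.get("metrics", {}):
            problems.append(f"segment {i}: missing metrics.bandwidth")
        if "start" in seg and "end" in seg and seg["end"] <= seg["start"]:
            problems.append(f"segment {i}: end must be greater than start")
    return problems


# Illustrative values only; they are not taken from the sample fixture.
example_line = json.dumps({
    "audio_filepath": "/data/audio/example.wav",
    "audio_sample_rate": 16000,
    "segments": [
        {"start": 0.0, "end": 4.2, "speaker": "spk_0", "metrics": {"bandwidth": 8000}},
        {"start": 4.5, "end": 9.1, "speaker": "spk_1", "metrics": {"bandwidth": 8000}},
    ],
})

print(validate_entry(json.loads(example_line)))  # prints []
```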
Sample input data is available at tests/fixtures/audio/alm/sample_input.jsonl in the repository.
Step-by-Step Walkthrough
Step 1: Review the Pipeline Configuration
The ALM pipeline is defined in pipeline.yaml with four stages:
Step 2: Understand the Configuration Parameters
The following table describes the key parameters for each stage:
ALMDataBuilderStage parameters:
ALMDataOverlapStage parameters:
Step 3: Run the Pipeline
Run the pipeline using the Hydra-based runner:
Override individual stage parameters from the command line:
Step 4: Run with the Sample Data
Test the pipeline with the included sample data:
Run this command from the repository root so that the fixture path resolves the same way as in the in-repo tutorials/audio/alm/README.md:
Expected output with sample data (five input entries):
- 181 candidate windows from the builder stage
- 25 filtered windows after overlap filtering at 50% threshold
- Approximately 3,035 seconds of total filtered audio duration
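To illustrate why overlap filtering reduces the window count so sharply, the sketch below applies a simple greedy filter with a 50% threshold. It is an assumption-laden illustration, not the ALMDataOverlapStage algorithm: the overlap measure (shared time divided by the shorter window's duration), the greedy keep-first order, and the names overlap_ratio and filter_overlaps are all hypothetical.

```python
def overlap_ratio(a, b):
    """Shared duration divided by the shorter window's duration.

    Assumes both windows have a positive duration.
    """
    shared = min(a["end"], b["end"]) - max(a["start"], b["start"])
    shorter = min(a["end"] - a["start"], b["end"] - b["start"])
    return max(shared, 0.0) / shorter


def filter_overlaps(windows, threshold=0.5):
    """Greedily keep windows whose overlap with every kept window
    does not exceed the threshold (illustrative only)."""
    kept = []
    for win in sorted(windows, key=lambda w: w["start"]):
        if all(overlap_ratio(win, k) <= threshold for k in kept):
            kept.append(win)
    return kept


# Made-up windows: the second one shares 75% of its audio with the first.
example_windows = [
    {"start": 0.0, "end": 20.0},
    {"start": 5.0, "end": 25.0},
    {"start": 20.0, "end": 40.0},
]
survivors = filter_overlaps(example_windows, threshold=0.5)
print([(w["start"], w["end"]) for w in survivors])
# prints [(0.0, 20.0), (20.0, 40.0)]
```

Under these assumptions, a lower threshold keeps fewer windows with less shared audio, while a threshold of 1.0 keeps every candidate.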
Understanding the Results
After the pipeline completes, the output JSONL file contains one line per input entry. The example below highlights the most common fields; real output also includes the pre-filter candidate windows list and additional duration and diagnostic counters (dur_lost_bw, dur_lost_sr, audio_sample_rate, manifest_filepath) that are omitted here for brevity.
Key output fields:
- windows: All candidate windows produced by ALMDataBuilderStage before overlap filtering (preserved so you can compare pre- and post-filter results)
- filtered_windows: Windows that passed both quality and overlap filtering
- speaker_durations: Top five speakers by duration within each window, zero-padded to length five
- filtered_dur: Total duration of all filtered windows for this entry
- filtered_dur_list: Duration of each individual filtered window
- total_dur_window: Total duration of all input windows before filtering
- stats: Breakdown of why segments were excluded (bandwidth, sample rate, speaker count, window constraints)
- truncation_events: Number of segments that were truncated to fit within the maximum window duration
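The fields above can be aggregated across entries to get a quick picture of a run. The following is a minimal sketch using only the standard library; the helper summarize_output and the output path in the usage comment are hypothetical, and the code assumes windows and filtered_windows are lists and filtered_dur is a number of seconds, as described above.

```python
import json
from pathlib import Path


def summarize_output(path):
    """Aggregate per-entry counters from an ALM output JSONL file."""
    totals = {"entries": 0, "windows": 0, "filtered_windows": 0, "filtered_dur": 0.0}
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # tolerate blank lines
            entry = json.loads(line)
            totals["entries"] += 1
            totals["windows"] += len(entry.get("windows", []))
            totals["filtered_windows"] += len(entry.get("filtered_windows", []))
            totals["filtered_dur"] += entry.get("filtered_dur", 0.0)
    return totals


# Usage (the path is a placeholder for wherever your run wrote its output):
# print(summarize_output("alm_output.jsonl"))
```

Comparing the windows and filtered_windows totals shows how much the overlap stage removed. For the sample data, these totals should correspond to the candidate and filtered window counts listed under Step 4.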
Reading the Loss Statistics
The stats dictionary helps diagnose low pipeline yield:
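One way to read the dictionary is to rank the loss reasons by size so the dominant cause stands out. The sketch below assumes stats maps reason names to numeric counts; the key names in the example are placeholders invented for illustration, not the actual keys emitted by the pipeline.

```python
def rank_losses(stats):
    """Sort loss reasons from largest to smallest and attach each one's
    share of the total (illustrative only)."""
    total = sum(stats.values())
    ranked = sorted(stats.items(), key=lambda item: item[1], reverse=True)
    return [(name, count, count / total if total else 0.0) for name, count in ranked]


# Placeholder keys and values; check your own output for the real names.
example_stats = {"placeholder_bandwidth": 3, "placeholder_speaker_count": 1}
for name, count, share in rank_losses(example_stats):
    print(f"{name}: {count} ({share:.0%})")
```

If a single reason accounts for most of the loss, relaxing the matching threshold is usually the first thing to try.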
Customization Examples
Shorter Windows for Fine-Tuning
Permissive Filtering for Maximum Yield
Processing Multiple Manifest Files
Pass a list of paths or a directory:
The ALMManifestReader discovers all .jsonl and .json files in the directory and its subdirectories.
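To preview which files a directory input would pick up, the same discovery rule can be approximated with pathlib. This is a sketch of the behavior described above, not the ALMManifestReader code; the helper discover_manifests is hypothetical.

```python
from pathlib import Path

MANIFEST_SUFFIXES = {".jsonl", ".json"}


def discover_manifests(root):
    """Return manifest files under root, searching subdirectories too.

    A single file path is returned as a one-element list.
    """
    root = Path(root)
    if root.is_file():
        return [root]
    return sorted(
        path for path in root.rglob("*")
        if path.is_file() and path.suffix in MANIFEST_SUFFIXES
    )


# Usage (the directory name is a placeholder):
# for manifest in discover_manifests("manifests/"):
#     print(manifest)
```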
Next Steps
After completing this tutorial, explore:
- ALM Data Builder: Detailed reference for window construction
- ALM Overlap Filtering: Detailed reference for overlap filtering
- ALM Pipeline Concepts: Architectural overview
- Beginner Tutorial: FLEURS-based ASR pipeline for comparison
Related Topics
- Audio Curation Pipeline: Broader audio curation workflow
- Manifests and Ingest: Manifest format concepts
- Execution Backends: Xenna and Ray Data backend details