ALM Pipeline Tutorial

Learn how to curate training data for audio language models using NVIDIA NeMo Curator’s ALM pipeline. This tutorial walks you through reading diarized audio manifests, constructing fixed-duration training windows, filtering overlapping windows, and writing the results.

Overview

This tutorial demonstrates the ALM data curation workflow:

  1. Read Manifests: Stream JSONL manifests with diarized audio metadata
  2. Build Windows: Construct candidate training windows from consecutive segments
  3. Filter Overlaps: Remove redundant windows that share too much audio content
  4. Write Results: Export filtered windows as JSONL for downstream training

What you will learn:

  • How to configure and run the four-stage ALM pipeline
  • How to tune window duration, speaker count, and quality thresholds
  • How to select between the Xenna and Ray Data backends
  • How to interpret pipeline output and loss statistics

Working Example Location

The complete working code for this tutorial is located at:

<nemo_curator_repository>/tutorials/audio/alm/
├── README.md # Tutorial documentation
├── main.py # Hydra-based pipeline runner
└── pipeline.yaml # Pipeline configuration

Accessing the code:

$ git clone https://github.com/NVIDIA/NeMo-Curator.git
$ cd NeMo-Curator/tutorials/audio/alm/

Prerequisites

  • NeMo Curator installed with audio extras (refer to the Installation Guide)
  • Python 3.10 or later
  • Input data in JSONL format with diarized segments (refer to the input format section)
  • Basic familiarity with Hydra configuration

The ALM pipeline runs entirely on CPU. No GPU is required.

Input Format

Each line of the input JSONL manifest must contain the following fields:

{
  "audio_filepath": "/path/to/audio.wav",
  "audio_sample_rate": 16000,
  "segments": [
    {
      "start": 0.0,
      "end": 5.2,
      "speaker": "speaker_0",
      "text": "transcript text",
      "metrics": {"bandwidth": 8000}
    }
  ]
}

Required fields:

  • audio_filepath: Path to the source audio file
  • audio_sample_rate: Sample rate in Hz (entries below min_sample_rate are skipped)
  • segments: Array of diarized speech segments, each with start, end, speaker, and metrics.bandwidth

Sample input data is available at tests/fixtures/audio/alm/sample_input.jsonl in the repository.
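Before running the pipeline over a large corpus, it can help to sanity-check a manifest against the required fields above. The following stand-alone sketch is not part of NeMo Curator; `validate_entry` and the field sets are illustrative names derived from the format description:

```python
import json

# Fields the ALM pipeline expects, per the input-format description above.
REQUIRED_TOP = {"audio_filepath", "audio_sample_rate", "segments"}
REQUIRED_SEG = {"start", "end", "speaker"}

def validate_entry(line: str) -> list[str]:
    """Return a list of problems found in one JSONL manifest line."""
    entry = json.loads(line)
    problems = sorted(f"missing field: {f}" for f in REQUIRED_TOP - entry.keys())
    for i, seg in enumerate(entry.get("segments", [])):
        problems += sorted(f"segment {i} missing: {f}" for f in REQUIRED_SEG - seg.keys())
        if "bandwidth" not in seg.get("metrics", {}):
            problems.append(f"segment {i} missing: metrics.bandwidth")
    return problems

good = ('{"audio_filepath": "/path/to/audio.wav", "audio_sample_rate": 16000,'
        ' "segments": [{"start": 0.0, "end": 5.2, "speaker": "speaker_0",'
        ' "text": "transcript text", "metrics": {"bandwidth": 8000}}]}')
print(validate_entry(good))  # → []
```

Running the validator over each line of a manifest before curation surfaces malformed entries early instead of partway through a long pipeline run.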

Step-by-Step Walkthrough

Step 1: Review the Pipeline Configuration

The ALM pipeline is defined in pipeline.yaml with four stages:

stages:
  # Stage 0: Read JSONL manifests
  - _target_: nemo_curator.stages.audio.alm.ALMManifestReader
    manifest_path: ${manifest_path}
    files_per_partition: 1

  # Stage 1: Build candidate windows
  - _target_: nemo_curator.stages.audio.alm.ALMDataBuilderStage
    target_window_duration: 120.0
    tolerance: 0.1
    min_sample_rate: 16000
    min_bandwidth: 8000
    min_speakers: 2
    max_speakers: 5
    truncation: true
    drop_fields: "words"
    drop_fields_top_level: "words,segments"

  # Stage 2: Filter overlapping windows
  - _target_: nemo_curator.stages.audio.alm.ALMDataOverlapStage
    overlap_percentage: 50
    target_duration: 120.0

  # Stage 3: Write filtered output
  - _target_: nemo_curator.stages.audio.alm.ALMManifestWriterStage
    output_path: ${output_dir}/alm_output.jsonl
Step 2: Understand the Configuration Parameters

The following table describes the key parameters for each stage:

ALMDataBuilderStage parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| target_window_duration | float | 120.0 | Target window length in seconds |
| tolerance | float | 0.1 | Acceptable deviation from target (10% of the 120-second default means 108 to 132 seconds) |
| min_sample_rate | int | 16000 | Minimum sample rate in Hz |
| min_bandwidth | int | 8000 | Minimum bandwidth per segment in Hz |
| min_speakers | int | 2 | Minimum distinct speakers per window |
| max_speakers | int | 5 | Maximum distinct speakers per window |
| truncation | bool | True | Truncate segments exceeding maximum duration |
| drop_fields | str | "words" | Comma-separated segment-level fields to remove |
| drop_fields_top_level | str | "words,segments" | Comma-separated entry-level fields to remove |
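The accepted duration range follows directly from target_window_duration and tolerance. As a quick worked example (`window_bounds` is an illustrative helper, not part of the library):

```python
def window_bounds(target: float, tolerance: float) -> tuple[float, float]:
    """Accepted [min, max] window duration for a fractional tolerance."""
    return target * (1 - tolerance), target * (1 + tolerance)

print(window_bounds(120.0, 0.1))  # → (108.0, 132.0)
```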

ALMDataOverlapStage parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| overlap_percentage | int | 0 | Maximum allowed overlap between kept windows, as a percentage (0 keeps only non-overlapping windows, the most aggressive setting; 100 keeps all windows) |
| target_duration | float | 120.0 | Preferred window duration for tie-breaking |
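The exact algorithm inside ALMDataOverlapStage is not shown here, but the threshold semantics can be sketched as a greedy filter over (start, end) windows. `overlap_fraction` and `keep_windows` below are hypothetical helpers for illustration, not the stage's actual implementation:

```python
def overlap_fraction(a: tuple[float, float], b: tuple[float, float]) -> float:
    """Fraction of the shorter window covered by the two windows' time overlap."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    shorter = min(a[1] - a[0], b[1] - b[0])
    return inter / shorter if shorter > 0 else 0.0

def keep_windows(windows: list[tuple[float, float]],
                 overlap_percentage: float) -> list[tuple[float, float]]:
    """Greedy sketch: keep a window only if its overlap with every
    already-kept window stays at or below the threshold."""
    kept: list[tuple[float, float]] = []
    for w in windows:
        if all(overlap_fraction(w, k) * 100 <= overlap_percentage for k in kept):
            kept.append(w)
    return kept

# 60 s shared between two 120 s windows is exactly 50% overlap,
# so a 30% threshold drops the second window.
print(keep_windows([(0.0, 120.0), (60.0, 180.0)], 30))  # → [(0.0, 120.0)]
```

This makes the two extremes concrete: at overlap_percentage=0 any shared audio disqualifies a window, while at 100 every candidate passes.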

Step 3: Run the Pipeline

Run the pipeline using the Hydra-based runner:

$ # Using the default Xenna backend
$ python main.py \
>   --config-path . \
>   --config-name pipeline \
>   manifest_path=/path/to/manifests \
>   output_dir=./alm_output

$ # Using the Ray Data backend
$ python main.py \
>   --config-path . \
>   --config-name pipeline \
>   manifest_path=/path/to/manifests \
>   output_dir=./alm_output \
>   backend=ray_data

Override individual stage parameters from the command line:

$ # Shorter windows with stricter overlap filtering
$ python main.py \
>   --config-path . \
>   --config-name pipeline \
>   manifest_path=/path/to/manifests \
>   output_dir=./alm_output \
>   stages.1.target_window_duration=60 \
>   stages.2.overlap_percentage=30

Step 4: Run with the Sample Data

Test the pipeline with the included sample data:

Run this command from the repository root so that the fixture path resolves correctly; this matches the invocation used in the in-repo tutorials/audio/alm/README.md:

$ # From the NeMo-Curator repo root
$ python tutorials/audio/alm/main.py \
>   --config-path . \
>   --config-name pipeline \
>   manifest_path=tests/fixtures/audio/alm/sample_input.jsonl \
>   output_dir=./sample_output

Expected output with sample data (five input entries):

  • 181 candidate windows from the builder stage
  • 25 filtered windows after overlap filtering at 50% threshold
  • Approximately 3,035 seconds of total filtered audio duration
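To verify counts like these on your own run, a few lines of stand-alone Python can tally the output manifest. The `summarize_output` helper is illustrative, not a NeMo Curator API; it relies only on the filtered_windows and filtered_dur output fields documented in the next section:

```python
import json

def summarize_output(path: str) -> tuple[int, float]:
    """Count filtered windows and sum their durations across an output manifest."""
    n_windows, total_dur = 0, 0.0
    with open(path, encoding="utf-8") as f:
        for line in f:
            entry = json.loads(line)
            n_windows += len(entry.get("filtered_windows", []))
            total_dur += entry.get("filtered_dur", 0.0)
    return n_windows, total_dur

# e.g. summarize_output("sample_output/alm_output.jsonl")
```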

Understanding the Results

After the pipeline completes, the output JSONL file contains one line per input entry. The example below highlights the most common fields; real output also includes the pre-filter candidate windows list and additional duration and diagnostic counters (dur_lost_bw, dur_lost_sr, audio_sample_rate, manifest_filepath) that are omitted here for brevity.

{
  "audio_filepath": "/path/to/audio.wav",
  "windows": ["<all candidate windows from the builder stage>"],
  "filtered_windows": [
    {
      "segments": [
        {"start": 0.0, "end": 5.2, "speaker": "speaker_0"}
      ],
      "speaker_durations": [45.2, 38.1, 22.5, 14.2, 0.0]
    }
  ],
  "filtered_dur": 120.5,
  "filtered_dur_list": [120.5],
  "total_dur_window": 3250.0,
  "truncation_events": 3,
  "stats": {
    "total_segments": 150,
    "total_dur": 3600.0,
    "lost_bw": 5,
    "lost_sr": 0,
    "lost_spk": 12,
    "lost_win": 8,
    "lost_no_spkr": 2,
    "lost_next_seg_bm": 1
  }
}

Key output fields:

  • windows: All candidate windows produced by ALMDataBuilderStage before overlap filtering (preserved so you can compare pre- and post-filter results)
  • filtered_windows: Windows that passed both quality and overlap filtering
  • speaker_durations: Top five speakers by duration within each window, zero-padded to length five
  • filtered_dur: Total duration of all filtered windows for this entry
  • filtered_dur_list: Duration of each individual filtered window
  • total_dur_window: Total duration of all input windows before filtering
  • stats: Breakdown of why segments were excluded (bandwidth, sample rate, speaker count, window constraints)
  • truncation_events: Number of segments that were truncated to fit within the maximum window duration

Reading the Loss Statistics

The stats dictionary helps diagnose low pipeline yield:

| Statistic | Meaning | Tuning Action |
|---|---|---|
| lost_bw | Segments below minimum bandwidth | Lower min_bandwidth if audio quality is acceptable |
| lost_sr | Entries below minimum sample rate | Lower min_sample_rate or resample input audio |
| lost_spk | Windows outside speaker count range | Widen the min_speakers to max_speakers range |
| lost_win | Windows outside duration tolerance | Increase tolerance or adjust target_window_duration |
| lost_no_spkr | Window growth blocked by a segment without a speaker label (sub-category of lost_win) | Improve upstream diarization or filter out unlabeled segments before curation |
| lost_next_seg_bm | Window growth blocked by a low-bandwidth segment (sub-category of lost_win) | Lower min_bandwidth if the blocked segments are otherwise acceptable |
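To see which loss category dominates across a whole run rather than per entry, the stats dictionaries can be summed over the output manifest. The `aggregate_stats` helper below is illustrative, not a NeMo Curator API:

```python
import json
from collections import Counter

def aggregate_stats(path: str) -> Counter:
    """Sum the per-entry loss counters across an output manifest."""
    totals: Counter = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            totals.update(json.loads(line).get("stats", {}))
    return totals

# e.g. aggregate_stats("sample_output/alm_output.jsonl").most_common(3)
```

Sorting the result with `most_common` points directly at the parameter worth tuning first.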

Customization Examples

Shorter Windows for Fine-Tuning

stages:
  - _target_: nemo_curator.stages.audio.alm.ALMDataBuilderStage
    target_window_duration: 30.0
    tolerance: 0.2
    min_speakers: 2
    max_speakers: 3

Permissive Filtering for Maximum Yield

stages:
  - _target_: nemo_curator.stages.audio.alm.ALMDataBuilderStage
    min_bandwidth: 4000
    min_sample_rate: 8000
    min_speakers: 1
    max_speakers: 10

  - _target_: nemo_curator.stages.audio.alm.ALMDataOverlapStage
    overlap_percentage: 80

Processing Multiple Manifest Files

Pass a list of paths or a directory:

$ python main.py \
>   --config-path . \
>   --config-name pipeline \
>   manifest_path=/data/manifests/ \
>   output_dir=./alm_output

The ALMManifestReader discovers all .jsonl and .json files in the directory and its subdirectories.
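Assuming that discovery behavior, the equivalent collection logic looks roughly like the sketch below; `discover_manifests` is an illustrative stand-in, not the reader's actual implementation:

```python
from pathlib import Path

def discover_manifests(root: str) -> list[str]:
    """Recursively collect .jsonl and .json files under a directory."""
    return sorted(str(p) for p in Path(root).rglob("*")
                  if p.suffix in {".jsonl", ".json"})
```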

Next Steps

After completing this tutorial, explore: