Data Builder | NeMo Curator

ALMDataBuilderStage constructs candidate training windows from consecutive diarized audio segments. Each window must meet configurable constraints for duration, sample rate, bandwidth, and speaker count before it is included in the output.

How it Works

For each input manifest entry, the stage:

1. Checks the entry-level sample rate against min_sample_rate (skips the entire entry if below threshold).
2. Iterates over segments as potential window starting points.
3. For each starting segment, appends consecutive segments until the accumulated duration reaches the target.
4. Optionally truncates the final segment if the window exceeds the maximum duration.
5. Validates that the window contains the required number of distinct speakers.
6. Records loss statistics for segments that do not meet constraints.

The stage processes one AudioTask at a time and mutates the task data in place.
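
The window-building steps above can be sketched in plain Python. This is a minimal illustration, not the stage's actual implementation. It makes several assumptions: the segment keys (`start`, `end`, `speaker`, `bandwidth`) and the helper name `build_windows` are hypothetical, window duration is taken as the sum of segment durations, and a low-bandwidth or unlabeled segment is assumed to stop window growth. The entry-level sample rate check and the loss counters are omitted.

```python
from typing import Any

# Hypothetical segment layout: {"start": s, "end": s, "speaker": str, "bandwidth": Hz}
Segment = dict[str, Any]


def build_windows(
    segments: list[Segment],
    target: float = 120.0,
    tolerance: float = 0.1,
    min_bandwidth: int = 8000,
    min_speakers: int = 2,
    max_speakers: int = 5,
    truncation: bool = True,
) -> list[list[Segment]]:
    """Illustrative greedy windowing; not the stage's actual code."""
    min_dur = target * (1 - tolerance)
    max_dur = target * (1 + tolerance)
    windows: list[list[Segment]] = []

    for start in range(len(segments)):
        window: list[Segment] = []
        total = 0.0
        for seg in segments[start:]:
            # Assumption: an unlabeled or low-bandwidth segment blocks growth.
            if not seg.get("speaker") or seg.get("bandwidth", 0) < min_bandwidth:
                break
            window.append(dict(seg))  # copy so truncation never edits the input
            total += seg["end"] - seg["start"]
            if total >= target:
                break

        if window and total > max_dur and truncation:
            # Trim the final segment so the window ends at max_dur.
            window[-1]["end"] -= total - max_dur
            total = max_dur

        if not (min_dur <= total <= max_dur):
            continue  # would be counted as a duration loss
        speakers = {seg["speaker"] for seg in window}
        if min_speakers <= len(speakers) <= max_speakers:
            windows.append(window)
    return windows
```

Every segment is tried as a starting point, so the resulting windows can overlap heavily. This is why the output is described as candidate windows.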

Parameters

Parameter	Type	Default	Description
`target_window_duration`	float	120.0	Target window length in seconds
`tolerance`	float	0.1	Fractional deviation from target duration. A value of 0.1 produces a range of 108 to 132 seconds for a 120-second target.
`min_bandwidth`	int	8000	Minimum bandwidth in Hz per segment. Segments below this threshold are excluded.
`min_sample_rate`	int	16000	Minimum audio sample rate in Hz. Entries below this threshold are skipped entirely.
`min_speakers`	int	2	Minimum distinct speakers required per window
`max_speakers`	int	5	Maximum distinct speakers allowed per window
`truncation`	bool	True	Whether to truncate the final segment when a window exceeds the maximum duration
`drop_fields`	str	`"words"`	Comma-separated segment-level fields to remove from output
`drop_fields_top_level`	str	`"words,segments"`	Comma-separated entry-level fields to remove from output
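
As a quick check on how `tolerance` translates into an accepted duration range, the arithmetic can be written out as below. The helper name `duration_bounds` is hypothetical and only mirrors the formula implied by the table (target times one minus or plus the tolerance).

```python
def duration_bounds(
    target_window_duration: float = 120.0, tolerance: float = 0.1
) -> tuple[float, float]:
    """Return the (minimum, maximum) accepted window duration in seconds."""
    low = target_window_duration * (1 - tolerance)
    high = target_window_duration * (1 + tolerance)
    return low, high


low, high = duration_bounds()
print(round(low, 6), round(high, 6))  # 108.0 132.0
```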

Basic Usage

from nemo_curator.stages.audio.alm import ALMDataBuilderStage

# Default configuration: 120-second windows, 2-5 speakers
builder = ALMDataBuilderStage()

Advanced Configuration

Short Windows for Fine-Tuning

from nemo_curator.stages.audio.alm import ALMDataBuilderStage

builder = ALMDataBuilderStage(
    target_window_duration=30.0,
    tolerance=0.2,           # 24-36 seconds
    min_speakers=2,
    max_speakers=3,
)

Permissive Quality Thresholds

from nemo_curator.stages.audio.alm import ALMDataBuilderStage

builder = ALMDataBuilderStage(
    min_bandwidth=4000,      # Accept lower-quality audio
    min_sample_rate=8000,    # Accept telephone-quality audio
    min_speakers=1,          # Allow single-speaker windows
    max_speakers=10,
)

Preserving Segment Fields

By default, the stage drops the `words` field from each segment and the `words` and `segments` fields from the top level of each entry. To preserve all fields, pass empty strings:

from nemo_curator.stages.audio.alm import ALMDataBuilderStage

builder = ALMDataBuilderStage(
    drop_fields="",
    drop_fields_top_level="",
)
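
To make the comma-separated semantics concrete, here is a rough sketch of how such field lists might be parsed and applied. The function names and the entry layout (a `windows` list whose items hold `segments`) are assumptions for illustration; the stage's real handling may differ.

```python
from typing import Any


def parse_field_list(spec: str) -> set[str]:
    """Split a comma-separated string into field names; "" yields an empty set."""
    return {name.strip() for name in spec.split(",") if name.strip()}


def apply_drop_fields(
    entry: dict[str, Any],
    drop_fields: str = "words",
    drop_fields_top_level: str = "words,segments",
) -> dict[str, Any]:
    """Return a copy of the entry with the listed fields removed (illustrative)."""
    seg_drop = parse_field_list(drop_fields)
    top_drop = parse_field_list(drop_fields_top_level)

    # Remove entry-level fields first.
    out = {k: v for k, v in entry.items() if k not in top_drop}
    # Then remove segment-level fields inside each window.
    out["windows"] = [
        {
            **window,
            "segments": [
                {k: v for k, v in seg.items() if k not in seg_drop}
                for seg in window["segments"]
            ],
        }
        for window in entry.get("windows", [])
    ]
    return out
```

With empty strings for both parameters, both sets are empty and the entry is returned with every field intact.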

Output Fields

The stage adds the following fields to each AudioTask:

Field	Type	Description
`windows`	list	Candidate windows, each containing `segments` and `speaker_durations`
`stats`	dict	Loss statistics tracking why segments were excluded
`truncation_events`	int	Number of segments that were truncated

Speaker Durations

Each window includes a speaker_durations array containing the total speaking time of the top five speakers, sorted by duration in descending order. The array is zero-padded to length five when a window has fewer than five speakers.
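
The description above can be sketched as follows. This is an illustration only: the segment keys (`start`, `end`, `speaker`) and the function name are assumptions, not the stage's confirmed internals.

```python
from collections import defaultdict
from typing import Any


def speaker_durations(segments: list[dict[str, Any]], top_k: int = 5) -> list[float]:
    """Total speaking time per speaker, longest first, zero-padded to top_k."""
    totals: dict[str, float] = defaultdict(float)
    for seg in segments:
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    ranked = sorted(totals.values(), reverse=True)[:top_k]
    return ranked + [0.0] * (top_k - len(ranked))


window_segments = [
    {"start": 0, "end": 10, "speaker": "A"},
    {"start": 10, "end": 15, "speaker": "B"},
    {"start": 15, "end": 20, "speaker": "A"},
]
print(speaker_durations(window_segments))  # [15.0, 5.0, 0.0, 0.0, 0.0]
```

A fixed-length array keeps the field shape uniform across windows, which is convenient when the values are later loaded into tabular or tensor form.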

Loss Statistics

The stats dictionary contains the following counters:

Statistic	Description
`total_segments`	Total input segments processed
`total_dur`	Total input duration in seconds
`lost_bw`	Segments excluded for low bandwidth
`lost_sr`	Entries excluded for low sample rate
`lost_spk`	Windows excluded for speaker count outside range
`lost_win`	Windows excluded for duration outside tolerance
`lost_no_spkr`	Windows lost where growth was blocked by a segment without a speaker label (sub-category of `lost_win`)
`lost_next_seg_bm`	Windows lost where growth was blocked by a low-bandwidth segment (sub-category of `lost_win`)

Best Practices

Start with the default parameters and adjust based on the stats output
If lost_spk is high relative to total segments, widen the speaker count range
If lost_bw is high, verify that the input data has bandwidth metadata in segments[].metrics.bandwidth
Use drop_fields and drop_fields_top_level to reduce output file size when downstream stages do not need word-level or segment-level detail
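
One way to act on these recommendations is to turn the counters into ratios and flag the largest ones. The sketch below assumes the `lost_*` values are plain counts. The helper names and the 0.25 threshold are hypothetical choices, not part of the stage.

```python
def loss_ratios(stats: dict[str, float]) -> dict[str, float]:
    """Express the main loss counters as a fraction of total input segments."""
    total = stats.get("total_segments", 0)
    if not total:
        return {}
    keys = ("lost_bw", "lost_sr", "lost_spk", "lost_win")
    return {key: stats.get(key, 0) / total for key in keys}


def flag_high_losses(stats: dict[str, float], threshold: float = 0.25) -> list[str]:
    """Names of counters whose ratio exceeds the (arbitrary) threshold."""
    return [name for name, ratio in loss_ratios(stats).items() if ratio > threshold]


example_stats = {"total_segments": 100, "lost_bw": 5, "lost_spk": 40, "lost_win": 10}
print(flag_high_losses(example_stats))  # ['lost_spk']
```

In this example only `lost_spk` is flagged, which would suggest widening the `min_speakers` and `max_speakers` range before touching the other thresholds.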
