ALM Data Builder

ALMDataBuilderStage constructs candidate training windows from consecutive diarized audio segments. Each window must meet configurable constraints for duration, sample rate, bandwidth, and speaker count before it is included in the output.

How it Works

For each input manifest entry, the stage:

  1. Checks the entry-level sample rate against min_sample_rate (skips the entire entry if below threshold)
  2. Iterates over segments as potential window starting points
  3. For each starting segment, appends consecutive segments until the accumulated duration reaches the target
  4. Optionally truncates the final segment if the window exceeds the maximum duration
  5. Validates that the window contains the required number of distinct speakers
  6. Records loss statistics for segments that do not meet constraints

The stage processes one AudioTask at a time and mutates the task data in place.
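
The steps above can be sketched as a greedy loop over starting segments. This is a minimal illustration, not the stage's actual implementation: the function name build_windows and the segment keys start, end, and speaker are assumptions (only metrics.bandwidth is named on this page), window duration is assumed to be the sum of segment durations, and truncation is assumed to cut the final segment so that the window ends at the upper duration bound.

```python
from typing import Any

Segment = dict[str, Any]


def build_windows(
    segments: list[Segment],
    target: float = 120.0,
    tolerance: float = 0.1,
    min_bandwidth: int = 8000,
    min_speakers: int = 2,
    max_speakers: int = 5,
    truncation: bool = True,
) -> list[list[Segment]]:
    """Hypothetical greedy window builder mirroring the documented steps."""
    min_dur = target * (1 - tolerance)
    max_dur = target * (1 + tolerance)
    windows: list[list[Segment]] = []

    for start_idx in range(len(segments)):
        window: list[Segment] = []
        duration = 0.0
        for seg in segments[start_idx:]:
            # Growth stops at an unlabeled or low-bandwidth segment.
            if seg.get("speaker") is None:
                break
            if seg["metrics"]["bandwidth"] < min_bandwidth:
                break
            seg_dur = seg["end"] - seg["start"]
            if duration + seg_dur > max_dur:
                if truncation:
                    # Assumption: shorten a copy of the final segment so the
                    # window ends exactly at the upper bound.
                    cut = {**seg, "end": seg["start"] + (max_dur - duration)}
                    window.append(cut)
                    duration = max_dur
                break
            window.append(seg)
            duration += seg_dur
            if duration >= target:
                break

        # Keep the window only if duration and speaker count are in range.
        speakers = {s["speaker"] for s in window}
        in_duration = min_dur <= duration <= max_dur
        in_speakers = min_speakers <= len(speakers) <= max_speakers
        if in_duration and in_speakers:
            windows.append(window)
    return windows
```

The sketch omits the entry-level sample rate check and the loss counters, which the real stage records alongside the windows.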

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| target_window_duration | float | 120.0 | Target window length in seconds |
| tolerance | float | 0.1 | Fractional deviation from target duration. A value of 0.1 produces a range of 108 to 132 seconds for a 120-second target. |
| min_bandwidth | int | 8000 | Minimum bandwidth in Hz per segment. Segments below this threshold are excluded. |
| min_sample_rate | int | 16000 | Minimum audio sample rate in Hz. Entries below this threshold are skipped entirely. |
| min_speakers | int | 2 | Minimum distinct speakers required per window |
| max_speakers | int | 5 | Maximum distinct speakers allowed per window |
| truncation | bool | True | Whether to truncate the final segment when a window exceeds the maximum duration |
| drop_fields | str | "words" | Comma-separated segment-level fields to remove from output |
| drop_fields_top_level | str | "words,segments" | Comma-separated entry-level fields to remove from output |
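
The accepted duration range follows directly from target_window_duration and tolerance. The helper below is a small sketch of that arithmetic; duration_bounds is a hypothetical name for illustration and is not part of the library.

```python
def duration_bounds(
    target_window_duration: float = 120.0,
    tolerance: float = 0.1,
) -> tuple[float, float]:
    """Return the (lower, upper) window duration bounds in seconds."""
    lower = target_window_duration * (1.0 - tolerance)
    upper = target_window_duration * (1.0 + tolerance)
    return lower, upper


# Defaults, and the short-window configuration used for fine-tuning.
for target, tol in [(120.0, 0.1), (30.0, 0.2)]:
    low, high = duration_bounds(target, tol)
    print(f"target={target:g}s tolerance={tol:g}: {low:.0f} to {high:.0f} seconds")
```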

Basic Usage

from nemo_curator.stages.audio.alm import ALMDataBuilderStage

# Default configuration: 120-second windows, 2-5 speakers
builder = ALMDataBuilderStage()

Advanced Configuration

Short Windows for Fine-Tuning

builder = ALMDataBuilderStage(
    target_window_duration=30.0,
    tolerance=0.2,  # 24-36 seconds
    min_speakers=2,
    max_speakers=3,
)

Permissive Quality Thresholds

builder = ALMDataBuilderStage(
    min_bandwidth=4000,  # Accept lower-quality audio
    min_sample_rate=8000,  # Accept telephone-quality audio
    min_speakers=1,  # Allow single-speaker windows
    max_speakers=10,
)

Preserving Segment Fields

By default, the stage drops the words field from each segment and the words and segments fields from the top level of each entry. To preserve all fields, pass empty strings:

builder = ALMDataBuilderStage(
    drop_fields="",
    drop_fields_top_level="",
)
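
To make the comma-separated format concrete, the sketch below shows one plausible way such a string could be parsed and applied to a record. The helpers parse_fields and drop_fields_from and the sample entry are hypothetical and only illustrate the behavior described here. Unlike the stage, which mutates task data in place, this version returns a copy.

```python
from typing import Any


def parse_fields(spec: str) -> list[str]:
    """Split a comma-separated field spec, ignoring blanks and whitespace."""
    return [name.strip() for name in spec.split(",") if name.strip()]


def drop_fields_from(record: dict[str, Any], spec: str) -> dict[str, Any]:
    """Return a copy of record without the fields named in spec."""
    unwanted = set(parse_fields(spec))
    return {key: value for key, value in record.items() if key not in unwanted}


# Hypothetical entry, for illustration only.
entry = {"audio_filepath": "a.wav", "words": [], "segments": [], "windows": []}

print(sorted(drop_fields_from(entry, "words,segments")))  # default top-level spec
print(sorted(drop_fields_from(entry, "")))  # empty string keeps every field
```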

Output Fields

The stage adds the following fields to each AudioTask:

| Field | Type | Description |
| --- | --- | --- |
| windows | list | Candidate windows, each containing segments and speaker_durations |
| stats | dict | Loss statistics tracking why segments were excluded |
| truncation_events | int | Number of segments that were truncated |

Speaker Durations

Each window includes a speaker_durations array containing the total speaking time of the top five speakers, sorted by duration in descending order. The array is zero-padded to length five when a window has fewer than five speakers.
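
A minimal sketch of how such an array could be computed is shown below. It is not the stage's own code: the function name and the segment keys start, end, and speaker are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Any


def speaker_durations(segments: list[dict[str, Any]], top_k: int = 5) -> list[float]:
    """Total speaking time of the top_k speakers, descending, zero-padded."""
    totals: dict[str, float] = defaultdict(float)
    for seg in segments:
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    # Keep only the durations, largest first, limited to top_k speakers.
    ranked = sorted(totals.values(), reverse=True)[:top_k]
    # Pad with zeros so the result always has length top_k.
    return ranked + [0.0] * (top_k - len(ranked))


window_segments = [
    {"start": 0.0, "end": 10.0, "speaker": "A"},
    {"start": 10.0, "end": 15.0, "speaker": "B"},
    {"start": 15.0, "end": 20.0, "speaker": "A"},
]
print(speaker_durations(window_segments))
```

A fixed-length array keeps the field shape identical across windows, which simplifies downstream filtering on, for example, the share of the dominant speaker.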

Loss Statistics

The stats dictionary contains the following counters:

| Statistic | Description |
| --- | --- |
| total_segments | Total input segments processed |
| total_dur | Total input duration in seconds |
| lost_bw | Segments excluded for low bandwidth |
| lost_sr | Entries excluded for low sample rate |
| lost_spk | Windows excluded for speaker count outside range |
| lost_win | Windows excluded for duration outside tolerance |
| lost_no_spkr | Windows lost where growth was blocked by a segment without a speaker label (sub-category of lost_win) |
| lost_next_seg_bm | Windows lost where growth was blocked by a low-bandwidth segment (sub-category of lost_win) |

Best Practices

  • Start with the default parameters and adjust based on the stats output
  • If lost_spk is high relative to total segments, widen the speaker count range
  • If lost_bw is high, verify that the input data has bandwidth metadata in segments[].metrics.bandwidth
  • Use drop_fields and drop_fields_top_level to reduce output file size when downstream stages do not need word-level or segment-level detail
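
The checks above can be turned into a small helper that reads the stats dictionary. The sketch below assumes the lost_* counters are plain counts, as described under Loss Statistics. The function names and the 0.25 threshold are hypothetical and should be tuned per dataset.

```python
def loss_ratios(stats: dict[str, float]) -> dict[str, float]:
    """Express selected loss counters as a fraction of total_segments."""
    total = stats.get("total_segments", 0)
    if total <= 0:
        return {}
    keys = ("lost_bw", "lost_spk", "lost_win")
    return {key: stats.get(key, 0) / total for key in keys}


def suggest_adjustments(stats: dict[str, float], threshold: float = 0.25) -> list[str]:
    """Map high loss ratios to the adjustments listed in the best practices."""
    ratios = loss_ratios(stats)
    hints = []
    if ratios.get("lost_spk", 0.0) > threshold:
        hints.append("widen the min_speakers/max_speakers range")
    if ratios.get("lost_bw", 0.0) > threshold:
        hints.append("verify segments[].metrics.bandwidth metadata")
    return hints


# Hypothetical stats values, for illustration only.
example_stats = {"total_segments": 100, "lost_bw": 40, "lost_spk": 10, "lost_win": 5}
print(loss_ratios(example_stats))
print(suggest_adjustments(example_stats))
```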