ALMDataBuilderStage constructs candidate training windows from consecutive diarized audio segments. Each window must meet configurable constraints for duration, sample rate, bandwidth, and speaker count before it is included in the output.
For each input manifest entry, the stage:
Checks the sample rate against min_sample_rate (skips the entire entry if below threshold)
The stage processes one AudioTask at a time and mutates the task data in place.
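The per-window constraints described above can be illustrated with a minimal sketch. This is not the stage's actual implementation: the function name check_window, its keyword arguments, and the segment keys start, end, and speaker are assumptions made for illustration. Only the metrics.bandwidth location comes from this page. The entry-level sample rate check is not shown.

```python
def check_window(segments, *, min_duration, max_duration,
                 min_bandwidth, min_speakers, max_speakers):
    """Return None if the candidate window passes, else the name of the failed check.

    Hypothetical helper: `segments` is a list of consecutive diarized segments,
    each assumed to carry "start", "end", "speaker", and metrics.bandwidth.
    """
    # Duration is measured from the first segment's start to the last one's end.
    duration = segments[-1]["end"] - segments[0]["start"]
    if not (min_duration <= duration <= max_duration):
        return "duration"

    # Every segment must report a bandwidth at or above the threshold.
    for seg in segments:
        bandwidth = seg.get("metrics", {}).get("bandwidth")
        if bandwidth is None or bandwidth < min_bandwidth:
            return "bandwidth"

    # The number of distinct speakers must fall inside the allowed range.
    speakers = {seg["speaker"] for seg in segments}
    if not (min_speakers <= len(speakers) <= max_speakers):
        return "speakers"

    return None
```

In this sketch a segment with no bandwidth metadata fails the bandwidth check, which is one plausible reason a bandwidth-related counter could be high when the input lacks that field.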
By default, the stage drops words from each segment, and drops words and segments from the top level. To preserve all fields:
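The effect of dropping fields, and of disabling it, can be sketched in plain Python. This is an illustration under stated assumptions, not the stage's code or its configuration syntax: the helper drop_entry_fields, its parameters segment_fields and top_level_fields, and the windows key holding each window's segments are all hypothetical. Unlike the stage, the sketch returns a copy instead of mutating in place.

```python
import copy


def drop_entry_fields(entry,
                      segment_fields=("words",),
                      top_level_fields=("words", "segments")):
    """Return a copy of `entry` with the named fields removed.

    Hypothetical helper. Passing empty tuples for both arguments
    leaves every field in place.
    """
    result = copy.deepcopy(entry)

    # Remove per-segment fields from the segments inside each window
    # (the "windows" key is an assumption made for this sketch).
    for window in result.get("windows", []):
        for seg in window.get("segments", []):
            for name in segment_fields:
                seg.pop(name, None)

    # Remove fields from the top level of the entry.
    for name in top_level_fields:
        result.pop(name, None)

    return result
```

Calling drop_entry_fields(entry, segment_fields=(), top_level_fields=()) returns an unchanged copy, which corresponds to preserving all fields.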
The stage adds the following fields to each AudioTask:
Each window includes a speaker_durations array containing the total speaking time of the top five speakers, sorted by duration in descending order. The array is zero-padded to length five when a window has fewer than five speakers.
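As a minimal sketch of how such an array can be computed, the function below sums speaking time per speaker, keeps the five largest totals in descending order, and pads with zeros. The function name and the segment keys start, end, and speaker are assumptions for illustration, not the stage's actual schema or code.

```python
from collections import defaultdict


def speaker_durations(segments, top_k=5):
    """Total speaking time of the top `top_k` speakers, descending, zero-padded."""
    totals = defaultdict(float)
    for seg in segments:
        # Accumulate each segment's length under its speaker label.
        totals[seg["speaker"]] += seg["end"] - seg["start"]

    # Keep only the largest totals, longest first.
    top = sorted(totals.values(), reverse=True)[:top_k]

    # Pad with zeros so the result always has exactly `top_k` entries.
    return top + [0.0] * (top_k - len(top))


window = [
    {"speaker": "A", "start": 0.0, "end": 3.0},
    {"speaker": "B", "start": 3.0, "end": 4.0},
    {"speaker": "A", "start": 4.0, "end": 6.0},
]
print(speaker_durations(window))  # [5.0, 1.0, 0.0, 0.0, 0.0]
```

A fixed length keeps the field easy to load into a tabular or array format downstream, since every window yields the same number of values.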
The stats dictionary contains the following counters:
Check the stats output after a run.
If lost_spk is high relative to total segments, widen the speaker count range.
If lost_bw is high, verify that the input data has bandwidth metadata in segments[].metrics.bandwidth.
Use drop_fields to reduce output file size when downstream stages do not need word-level or segment-level detail.
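One way to judge whether a loss counter is high relative to total segments is to express each counter as a fraction. The sketch below is hypothetical: the helper loss_fractions is not part of the stage, it assumes loss counters share the lost_ prefix seen in lost_spk and lost_bw, and it takes the total segment count from the caller because the key holding that total is not given here.

```python
def loss_fractions(stats, total_segments):
    """Map each lost_* counter in `stats` to its share of `total_segments`."""
    if total_segments <= 0:
        # Nothing to compare against, so report no fractions.
        return {}
    return {
        name: count / total_segments
        for name, count in stats.items()
        if name.startswith("lost_")
    }
```

For example, a lost_spk of 30 out of 100 segments gives a fraction of 0.3. What counts as high depends on the dataset, so any cutoff is a judgment call.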