Three lightweight stages handle the common audio plumbing tasks: collapsing channels, joining segments after filtering, and projecting filtered timestamps back to the original input file. Together they form the scaffolding around the heavier filtering stages — mono conversion runs first, segment concatenation re-merges surviving segments after filtering, and timestamp mapping closes the loop by projecting final boundaries back to source-file positions.
MonoConversionStage
Converts multi-channel audio to mono and verifies that the input sample rate matches output_sample_rate. Place it as the first stage in any quality-filtering pipeline so downstream stages can assume a consistent waveform shape.
strict_sample_rate
Set output_sample_rate=48000 for full-band audio, 16000 for narrow-band / telephony, or match your downstream model’s training rate.
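The downmix-and-verify behavior described above can be illustrated with a small standalone sketch. It is not the stage's actual implementation: the function name `check_and_downmix` and its signature are hypothetical, audio is modeled as plain Python lists, and a sample-rate mismatch always raises (the strict behavior) rather than resampling.

```python
from typing import List, Sequence


def check_and_downmix(
    channels: Sequence[Sequence[float]],
    sample_rate: int,
    output_sample_rate: int,
) -> List[float]:
    """Hypothetical helper: verify the sample rate, then average channels to mono."""
    if not channels:
        raise ValueError("expected at least one channel")
    # Strict check: fail early instead of silently resampling.
    if sample_rate != output_sample_rate:
        raise ValueError(
            f"input rate {sample_rate} Hz does not match "
            f"output_sample_rate {output_sample_rate} Hz"
        )
    length = len(channels[0])
    if any(len(ch) != length for ch in channels):
        raise ValueError("all channels must have the same number of samples")
    # zip(*channels) yields one tuple per frame, holding one sample per channel.
    return [sum(frame) / len(channels) for frame in zip(*channels)]
```

For example, `check_and_downmix([[1.0, 0.0], [0.0, 1.0]], 16000, 16000)` averages the two channels frame by frame, while passing a 44100 Hz input with `output_sample_rate=16000` raises a `ValueError`.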
SegmentConcatenationStage
Concatenates a list of speech segments produced by an earlier VAD/filter stage back into a single waveform with configurable silence between segments. Emits a mappings field that records the original-file boundaries of each segment so TimestampMapperStage can resolve final timestamps later.
After concatenation, each output AudioTask carries a mappings field — a list of dicts with one entry per concatenated segment:
The mappings list is what TimestampMapperStage uses to project final filtered boundaries back to the original source file.
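As a rough illustration of how concatenation and the mappings list relate, the sketch below joins segments with a silence gap and records one dict per segment. The exact keys the stage writes are not shown in this section, so `original_start`, `original_end`, `concat_start`, and `concat_end` are hypothetical names, as is the function `concat_with_silence`; only `silence_duration_sec` comes from the text.

```python
from typing import Dict, List, Sequence, Tuple


def concat_with_silence(
    source: Sequence[float],
    segments: Sequence[Tuple[float, float]],
    sample_rate: int,
    silence_duration_sec: float,
) -> Tuple[List[float], List[Dict[str, float]]]:
    """Join (start_sec, end_sec) segments of `source`, separated by silence."""
    silence = [0.0] * int(round(silence_duration_sec * sample_rate))
    waveform: List[float] = []
    mappings: List[Dict[str, float]] = []
    for index, (start, end) in enumerate(segments):
        if index > 0:
            waveform.extend(silence)  # gap only between segments
        concat_start = len(waveform) / sample_rate
        first = int(round(start * sample_rate))
        last = int(round(end * sample_rate))
        waveform.extend(source[first:last])
        # Hypothetical mapping entry: where the segment came from and
        # where it now sits in the concatenated waveform (all in seconds).
        mappings.append(
            {
                "original_start": start,
                "original_end": end,
                "concat_start": concat_start,
                "concat_end": len(waveform) / sample_rate,
            }
        )
    return waveform, mappings
```

Keeping both the original and the concatenated boundaries in each entry is what makes the later reverse lookup a simple offset calculation.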
silence_duration_sec
TimestampMapperStage
Resolves segment positions in the concatenated waveform back to positions in the original source file. Place it at the end of the pipeline so downstream consumers see timestamps relative to the input audio, not the intermediate concatenation.
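The resolution step can be pictured as a lookup followed by an offset. The following sketch assumes each mapping entry is a dict with hypothetical keys `original_start`, `concat_start`, and `concat_end` (in seconds); the function `map_to_original` is illustrative and not part of the library.

```python
from typing import Dict, Optional, Sequence


def map_to_original(
    t_concat: float, mappings: Sequence[Dict[str, float]]
) -> Optional[float]:
    """Project a time in the concatenated waveform onto the source file.

    Returns None when the time falls inside inserted silence, which has
    no counterpart in the original audio.
    """
    for entry in mappings:
        if entry["concat_start"] <= t_concat <= entry["concat_end"]:
            # Same offset inside the segment, measured from its original start.
            return entry["original_start"] + (t_concat - entry["concat_start"])
    return None
```

With two mapped segments, one covering 0.0 to 1.0 s (originally starting at 1.0 s) and one covering 1.5 to 2.5 s (originally starting at 5.0 s), a concatenated time of 1.75 s resolves to 5.25 s in the source file.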
After a chain like Concat → SpeakerSep → VAD → UTMOS, each segment carries fields added by intermediate stages (speaker_id from speaker separation, utmos_mos from UTMOS, etc.). Without passthrough_keys, TimestampMapperStage only writes the resolved timestamps and drops everything else. List the fields you need preserved:
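The effect of passthrough_keys can be sketched with plain dicts. This is an illustration of the behavior described above, not the stage's code: `resolve_with_passthrough`, the `start`/`end` segment keys, and the mapping keys are hypothetical, while `speaker_id` and `utmos_mos` are the fields named in the text.

```python
from typing import Any, Dict, Iterable


def resolve_with_passthrough(
    segment: Dict[str, Any],
    mapping: Dict[str, float],
    passthrough_keys: Iterable[str] = (),
) -> Dict[str, Any]:
    """Resolve a segment's timestamps and copy only the requested extra fields."""
    offset = mapping["original_start"] - mapping["concat_start"]
    resolved: Dict[str, Any] = {
        "start": segment["start"] + offset,
        "end": segment["end"] + offset,
    }
    # Anything not listed here is dropped from the output.
    for key in passthrough_keys:
        if key in segment:
            resolved[key] = segment[key]
    return resolved


segment = {"start": 1.5, "end": 2.5, "speaker_id": "spk0", "utmos_mos": 4.1, "vad_score": 0.9}
mapping = {"original_start": 5.0, "concat_start": 1.5}

bare = resolve_with_passthrough(segment, mapping)
kept = resolve_with_passthrough(segment, mapping, ["speaker_id", "utmos_mos"])
```

Here `bare` contains only the resolved `start` and `end`, while `kept` also carries `speaker_id` and `utmos_mos`; the unlisted `vad_score` is dropped in both cases.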
A pipeline that uses all three stages together with VAD + UTMOS in between:
MonoConversionStage is mandatory at the start of any pipeline that uses VAD, UTMOS, SIGMOS, or speaker separation.
Keep strict_sample_rate=True until you have evidence it’s wrong: catching unexpected rates early is better than silently resampling and getting subtly worse results downstream.
TimestampMapperStage is the closing stage: list everything you want preserved in passthrough_keys. It’s easier than adding a downstream stage to merge them back.
SegmentConcatenationStage and TimestampMapperStage.
AudioDataFilterStage Composite: composes mono conversion, concatenation, and timestamp mapping into the standard pipeline automatically.