Preprocessing Stages
Three lightweight stages handle the common audio plumbing tasks: collapsing channels, joining segments after filtering, and projecting filtered timestamps back to the original input file. Together they form the scaffolding around the heavier filtering stages — mono conversion runs first, segment concatenation re-merges surviving segments after filtering, and timestamp mapping closes the loop by projecting final boundaries back to source-file positions.
Stage Roles
MonoConversionStage
Converts multi-channel audio to mono and verifies that the input sample rate matches output_sample_rate. Place it as the first stage in any quality-filtering pipeline so downstream stages can assume a consistent waveform shape.
Usage
Parameters
Choosing strict_sample_rate
Set output_sample_rate=48000 for full-band audio, 16000 for wideband speech or telephony-style data, or match your downstream model’s training rate.
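To make the channel collapse and the sample-rate check concrete, here is a minimal pure-Python sketch. It is not the MonoConversionStage implementation: the function name `to_mono`, the list-of-lists channel layout, and the non-strict behaviour (returning the audio at its input rate, with no resampling) are all assumptions made for illustration.

```python
from typing import List, Tuple


def to_mono(
    channels: List[List[float]],
    sample_rate: int,
    output_sample_rate: int = 48000,
    strict_sample_rate: bool = True,
) -> Tuple[List[float], int]:
    """Average equal-length channels into one and check the sample rate.

    Hypothetical helper, not the library API.
    """
    if not channels:
        raise ValueError("expected at least one channel")
    if strict_sample_rate and sample_rate != output_sample_rate:
        # Strict mode: fail early instead of continuing with a surprising rate.
        raise ValueError(
            f"input rate {sample_rate} Hz != output_sample_rate {output_sample_rate} Hz"
        )
    n = len(channels[0])
    if any(len(ch) != n for ch in channels):
        raise ValueError("all channels must have the same length")
    # Each output sample is the mean of that frame across channels.
    mono = [sum(frame) / len(channels) for frame in zip(*channels)]
    # Assumption: in non-strict mode this sketch keeps the input rate as-is;
    # resampling is out of scope here.
    return mono, sample_rate


stereo = [[0.5, 1.0, -1.0], [0.5, 0.0, 1.0]]
mono, rate = to_mono(stereo, sample_rate=48000)
```

With strict checking on, a 44100 Hz input would raise immediately in this sketch, which mirrors the "catch unexpected rates early" advice for this stage.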
SegmentConcatenationStage
Concatenates a list of speech segments produced by an earlier VAD/filter stage back into a single waveform with configurable silence between segments. Emits a mappings field that records the original-file boundaries of each segment so TimestampMapperStage can resolve final timestamps later.
Usage
Parameters
Output Mappings
After concatenation, each output AudioTask carries a mappings field — a list of dicts with one entry per concatenated segment:
The mappings list is what TimestampMapperStage uses to project final filtered boundaries back to the original source file.
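The bookkeeping behind concatenation can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the SegmentConcatenationStage code: the function name `concatenate_segments`, the use of sample indices rather than seconds, and the mapping keys (`original_start`, `original_end`, `concat_start`, `concat_end`) are hypothetical, since the real field names are not shown here.

```python
from typing import Dict, List, Tuple


def concatenate_segments(
    waveform: List[float],
    segments: List[Tuple[int, int]],
    sample_rate: int,
    silence_duration_sec: float = 0.5,
) -> Tuple[List[float], List[Dict[str, int]]]:
    """Join (start, end) sample ranges with silence and record where each came from."""
    gap = [0.0] * int(round(silence_duration_sec * sample_rate))
    out: List[float] = []
    mappings: List[Dict[str, int]] = []
    for i, (start, end) in enumerate(segments):
        if i > 0:
            out.extend(gap)  # silence only between segments, not at the edges
        concat_start = len(out)
        out.extend(waveform[start:end])
        mappings.append(
            {
                "original_start": start,      # position in the source waveform
                "original_end": end,
                "concat_start": concat_start,  # position in the joined waveform
                "concat_end": len(out),
            }
        )
    return out, mappings


source = [float(i) for i in range(20)]
concat, mappings = concatenate_segments(
    source, [(2, 5), (12, 16)], sample_rate=10, silence_duration_sec=0.2
)
```

Recording both coordinate systems per segment is what makes the later projection a simple offset lookup.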
Choosing silence_duration_sec
TimestampMapperStage
Resolves segment positions in the concatenated waveform back to positions in the original source file. Place it at the end of the pipeline so downstream consumers see timestamps relative to the input audio, not the intermediate concatenation.
Usage
Parameters
Why Pass-Through Keys Matter
After a chain like Concat → SpeakerSep → VAD → UTMOS, each segment carries fields added by intermediate stages (speaker_id from speaker separation, utmos_mos from UTMOS, etc.). Without passthrough_keys, TimestampMapperStage only writes the resolved timestamps and drops everything else. List the fields you need preserved:
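The effect of listing fields explicitly can be shown with a small stand-alone sketch. It assumes hypothetical mapping keys (`concat_start`, `concat_end`, `original_start`), sample-index positions, and a helper named `map_to_original`; it also assumes each final segment falls entirely inside one concatenated region. None of these are confirmed details of TimestampMapperStage.

```python
def map_to_original(segment, mappings, passthrough_keys=()):
    """Project one segment from concatenated positions to source positions.

    Hypothetical helper: only the resolved timestamps and the fields named
    in passthrough_keys survive; everything else on the segment is dropped.
    """
    for m in mappings:
        if m["concat_start"] <= segment["start"] and segment["end"] <= m["concat_end"]:
            offset = m["original_start"] - m["concat_start"]
            resolved = {
                "original_start": segment["start"] + offset,
                "original_end": segment["end"] + offset,
            }
            for key in passthrough_keys:
                if key in segment:
                    resolved[key] = segment[key]
            return resolved
    raise ValueError("segment does not fall inside any mapped region")


mappings = [
    {"original_start": 2, "concat_start": 0, "concat_end": 3},
    {"original_start": 12, "concat_start": 5, "concat_end": 9},
]
segment = {"start": 6, "end": 8, "speaker_id": "spk0", "utmos_mos": 4.1, "scratch": 1}

bare = map_to_original(segment, mappings)
kept = map_to_original(segment, mappings, passthrough_keys=("speaker_id", "utmos_mos"))
```

In this sketch `bare` holds only the two resolved timestamps, while `kept` also carries `speaker_id` and `utmos_mos`; the unlisted `scratch` field is dropped in both cases.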
Complete Preprocessing Example
A pipeline that uses all three stages together with VAD + UTMOS in between:
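As a rough end-to-end illustration of the data flow (mono, VAD, concatenation, filtering, mapping back), the toy pipeline below uses plain Python stand-ins. Everything in it is an assumption: the helper names, the energy-threshold `toy_vad`, the `quality` score that stands in for a real model such as UTMOS, and the mapping keys. It does not use the actual stage classes.

```python
SAMPLE_RATE = 10  # toy rate so the arithmetic is easy to follow


def to_mono(channels):
    """Average channels frame by frame."""
    return [sum(frame) / len(channels) for frame in zip(*channels)]


def toy_vad(wave, threshold=0.1):
    """Return (start, end) sample ranges where |x| exceeds the threshold."""
    segments, start = [], None
    for i, x in enumerate(wave):
        if abs(x) > threshold and start is None:
            start = i
        elif abs(x) <= threshold and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(wave)))
    return segments


def concatenate(wave, segments, gap_samples):
    """Join segments with silence and record source/concat positions."""
    out, mappings = [], []
    for i, (s, e) in enumerate(segments):
        if i > 0:
            out.extend([0.0] * gap_samples)
        mappings.append({"original_start": s, "concat_start": len(out)})
        out.extend(wave[s:e])
        mappings[-1]["concat_end"] = len(out)
    return out, mappings


def map_back(seg, mappings, passthrough_keys=()):
    """Shift a concat-space segment back into source-file positions."""
    for m in mappings:
        if m["concat_start"] <= seg["start"] and seg["end"] <= m["concat_end"]:
            off = m["original_start"] - m["concat_start"]
            out = {"original_start": seg["start"] + off, "original_end": seg["end"] + off}
            out.update({k: seg[k] for k in passthrough_keys if k in seg})
            return out
    raise ValueError("segment outside every mapped region")


left = [0.0, 0.0, 0.8, 0.6, 0.0, 0.0, 0.0, 0.4, 0.5, 0.7, 0.0, 0.0]
right = list(left)

mono = to_mono([left, right])                      # 1. collapse channels
speech = toy_vad(mono)                             # 2. find speech in the source
gap = int(round(0.2 * SAMPLE_RATE))                # 0.2 s of silence between segments
concat, mappings = concatenate(mono, speech, gap)  # 3. join surviving segments

final = []
for s, e in toy_vad(concat):                       # 4. re-segment the joined audio
    quality = sum(abs(x) for x in concat[s:e]) / (e - s)  # placeholder score, not UTMOS
    final.append(map_back({"start": s, "end": e, "quality": quality}, mappings, ("quality",)))
```

Each entry in `final` is expressed in source-file sample positions and carries the placeholder `quality` field because it was listed as a pass-through key.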
Best Practices
- Mono first, always: every downstream stage assumes a consistent waveform shape. MonoConversionStage is mandatory at the start of any pipeline that uses VAD, UTMOS, SIGMOS, or speaker separation.
- Use strict_sample_rate=True until you have evidence it’s wrong: catching unexpected rates early is better than silently resampling and getting subtly worse results downstream.
- Pass through fields explicitly: TimestampMapperStage is the closing stage, so list everything you want preserved in passthrough_keys. It’s easier than adding a downstream stage to merge them back.
- Skip concatenation if you want individual-segment manifests: if your downstream training pipeline reads one segment at a time, you don’t need to concatenate. Run VAD → quality filters → directly to writer; skip both SegmentConcatenationStage and TimestampMapperStage.
Related Topics
- VAD Segmentation — produces the segments concatenation re-merges.
- Speaker Separation — typical stage between concatenation and the per-speaker filters.
- AudioDataFilterStage Composite — composes mono conversion + concatenation + timestamp mapping into the standard pipeline automatically.