AudioDataFilterStage is a CompositeStage that decomposes into a configurable sequence of audio sub-stages for extracting clean single-speaker segments from raw audio files. Use it when you want the full quality-filtering pipeline driven by a single YAML config instead of wiring stages individually.
A CompositeStage is a stage that, at pipeline build time, expands into a sequence of underlying stages. AudioDataFilterStage expands into the audio quality-filtering chain — preprocessing, VAD, band, UTMOS, SIGMOS, concatenation, speaker separation, per-speaker filters, and timestamp mapping — using parameters loaded from a YAML config.
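The build-time expansion described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the `Stage`, `CompositeStage`, and `Pipeline` classes, the `decompose()` and `build()` methods, and the lowercase stage names are all hypothetical stand-ins.

```python
from dataclasses import dataclass, field


@dataclass
class Stage:
    """Hypothetical leaf stage: a name plus its parameters."""
    name: str
    params: dict = field(default_factory=dict)

    def decompose(self):
        # A leaf stage expands to just itself.
        return [self]


@dataclass
class CompositeStage:
    """Hypothetical composite: holds sub-stages and expands into them."""
    name: str
    children: list = field(default_factory=list)

    def decompose(self):
        # Expand recursively so nested composites are flattened too.
        flat = []
        for child in self.children:
            flat.extend(child.decompose())
        return flat


class Pipeline:
    """Hypothetical pipeline that flattens composites at build time."""

    def __init__(self):
        self._added = []

    def add_stage(self, stage):
        self._added.append(stage)
        return self

    def build(self):
        # Build time is where each composite is replaced by its sub-stages.
        return [leaf for stage in self._added for leaf in stage.decompose()]


audio_filter = CompositeStage(
    name="audio_data_filter",
    children=[Stage("mono_conversion"), Stage("vad_segmentation"), Stage("band_filter")],
)
pipeline = Pipeline().add_stage(audio_filter)
stage_names = [s.name for s in pipeline.build()]
print(stage_names)
```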
This serves two purposes:
pipeline.add_stage(...) calls.
When all sub-stages are enabled, the composite expands into:
MonoConversionStage: normalize channels and sample rate.
VADSegmentationStage: split into speech segments.
BandFilterStage: drop segments not matching the target bandwidth.
UTMOSFilterStage: drop segments below the MOS threshold.
SIGMOSFilterStage: drop segments failing any active SIGMOS dimension.
SegmentConcatenationStage: concatenate surviving segments with silence gaps.
SpeakerSeparationStage: diarize and fan out one task per speaker.
TimestampMapperStage: project final boundaries back to original-file timestamps.
Construct the stage from a YAML config file:
Or pass a config dict inline:
If neither config_path nor config is provided, the bundled default config is used.
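The three ways of supplying a config (file path, inline dict, bundled default) can be sketched as a small resolution function. This is an illustration under stated assumptions: `resolve_config` and `DEFAULT_CONFIG` are hypothetical names, the default's contents are a placeholder, and the choice that an inline dict takes precedence over a path when both are given is an assumption that the page does not confirm.

```python
import copy
from pathlib import Path

# Placeholder standing in for the bundled default config; contents are illustrative.
DEFAULT_CONFIG = {"speaker_separation": {"enable": True}}


def resolve_config(config_path=None, config=None):
    """Return the config dict a composite stage would be built from.

    Assumption: when both arguments are given, the inline dict wins.
    """
    if config is not None:
        # Copy so later edits by the caller do not leak into the stage.
        return copy.deepcopy(config)
    if config_path is not None:
        # PyYAML is imported lazily so the dict and default paths need no extra dependency.
        import yaml

        with Path(config_path).open("r", encoding="utf-8") as fh:
            return yaml.safe_load(fh)
    # Neither argument provided: fall back to the bundled default.
    return copy.deepcopy(DEFAULT_CONFIG)


print(resolve_config())
print(resolve_config(config={"speaker_separation": {"enable": False}}))
```

Returning copies keeps the shared default immutable, so one pipeline's overrides cannot affect another's.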
Each top-level key maps to one sub-stage; set enable: false to skip it. The default configuration shipped with the stage:
The parameters in each section mirror the corresponding sub-stage’s constructor arguments. See the per-stage pages linked at the bottom for parameter details.
The default UTMOS threshold in the YAML config is 3.4, while the standalone UTMOSFilterStage class default is 3.5. The composite uses the YAML value when constructed from the bundled config; tune as needed for your data.
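To make the mapping from config sections to sub-stages concrete, the sketch below instantiates stand-in stages from a config dict. Everything here is hypothetical: the `UTMOSFilter` and `BandFilter` dataclasses, the `STAGE_REGISTRY` mapping, the `utmos` and `band_filter` keys, and the `mos_threshold` and `band_value` parameter names are assumptions for illustration. Only the 3.4 and 3.5 values and the `enable` key come from this page. The sketch also assumes a missing `enable` key means enabled.

```python
from dataclasses import dataclass


@dataclass
class UTMOSFilter:
    """Stand-in sub-stage; `mos_threshold` is an assumed parameter name."""
    mos_threshold: float = 3.5  # standalone class default quoted above


@dataclass
class BandFilter:
    """Stand-in sub-stage; `band_value` is an assumed parameter name."""
    band_value: str = "full_band"


# Hypothetical registry from top-level config key to sub-stage class.
# Registry order, not config order, fixes the order of the chain.
STAGE_REGISTRY = {"band_filter": BandFilter, "utmos": UTMOSFilter}


def build_stages(config):
    """Instantiate enabled sub-stages, passing the remaining keys as constructor kwargs."""
    stages = []
    for key, stage_cls in STAGE_REGISTRY.items():
        section = dict(config.get(key, {}))
        if not section.pop("enable", True):
            continue  # enable: false skips the sub-stage entirely
        stages.append(stage_cls(**section))
    return stages


config = {
    "band_filter": {"enable": False},
    # A value set in the config overrides the class default.
    "utmos": {"enable": True, "mos_threshold": 3.4},
}
stages = build_stages(config)
print(stages)
```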
Each sub-stage that has an enable key accepts enable: false to skip it. Common partial pipelines:
Strict thresholds across all dimensions, no narrowband, no high-reverb:
Looser thresholds; preserve more data; rely on downstream training to filter further:
Skip speaker separation since each file is known to have one speaker:
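One way to express presets like these is as small override dicts merged onto a base config before it is passed to the stage. The sketch below is illustrative: `deep_merge` is a hypothetical helper, and the base config's `utmos` section and `mos_threshold` key are assumed names. Only `speaker_separation.enable` is a key named on this page.

```python
import copy


def deep_merge(base, override):
    """Return a new dict with `override` merged into `base`, recursing into nested dicts."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Illustrative base config; section and parameter names other than
# speaker_separation.enable are assumptions.
base_config = {
    "utmos": {"enable": True, "mos_threshold": 3.4},
    "speaker_separation": {"enable": True},
}

# Single-speaker preset: turn off speaker separation, keep everything else.
single_speaker = deep_merge(base_config, {"speaker_separation": {"enable": False}})
print(single_speaker)
```

Because the merge recurses into nested sections, an override only needs to name the keys it changes.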
A pipeline that uses AudioDataFilterStage as the entire processing chain:
For a complete end-to-end walkthrough including dataset download, see the ReadSpeech Tutorial.
Set speaker_separation.enable: false for a substantial speedup when every file is known to contain a single speaker.
The cpus / gpus keys per sub-stage control parallelism. On a 16-CPU / 4-GPU node, the defaults work well; increase them for larger nodes.
Set strict_sample_rate: false only when needed: auto-resampling can mask data-quality bugs (for example, unexpected 8 kHz audio in a 48 kHz dataset). Default to strict and disable only when heterogeneity is expected.
MonoConversionStage, SegmentConcatenationStage, TimestampMapperStage.
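The reasoning behind the strict_sample_rate guidance above can be illustrated with a small check. This is a sketch, not the stage's actual behavior: `check_sample_rate` is a hypothetical function, and it assumes that strict mode rejects a mismatched file with an error while non-strict mode silently resamples it.

```python
def check_sample_rate(actual_hz, target_hz, strict=True):
    """Decide how to handle a file whose sample rate may differ from the target.

    Returns "ok" or "resample"; raises ValueError on a mismatch in strict mode.
    """
    if actual_hz == target_hz:
        return "ok"
    if strict:
        # Surfacing the mismatch makes an unexpected 8 kHz file visible immediately.
        raise ValueError(f"expected {target_hz} Hz audio, got {actual_hz} Hz")
    # Non-strict mode hides the mismatch by resampling.
    return "resample"


print(check_sample_rate(48000, 48000))
print(check_sample_rate(8000, 48000, strict=False))
```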