AudioDataFilterStage Composite Pipeline
AudioDataFilterStage is a CompositeStage that decomposes into a configurable sequence of audio sub-stages for extracting clean single-speaker segments from raw audio files. Use it when you want the full quality-filtering pipeline driven by a single YAML config instead of wiring stages individually.
Understanding the Composite
What It Does
A CompositeStage is a stage that, at pipeline build time, expands into a sequence of underlying stages. AudioDataFilterStage expands into the audio quality-filtering chain — preprocessing, VAD, band, UTMOS, SIGMOS, concatenation, speaker separation, per-speaker filters, and timestamp mapping — using parameters loaded from a YAML config.
This serves two purposes:
- Single-config pipelines: tune the entire pipeline in one place instead of editing many pipeline.add_stage(...) calls.
- Resource declarations live with the stage: each sub-stage’s CPU/GPU allocation is set in the same YAML, alongside its functional parameters.
Default Pipeline Order
When all sub-stages are enabled, the composite expands into:
1. MonoConversionStage: normalize channels and sample rate.
2. VADSegmentationStage: split into speech segments.
3. BandFilterStage: drop segments not matching the target bandwidth.
4. UTMOSFilterStage: drop segments below the MOS threshold.
5. SIGMOSFilterStage: drop segments failing any active SIGMOS dimension.
6. SegmentConcatenationStage: concatenate surviving segments with silence gaps.
7. SpeakerSeparationStage: diarize and fan out one task per speaker.
8. Per-speaker filters: rerun VAD + Band + UTMOS + SIGMOS on each speaker’s audio.
9. TimestampMapperStage: project final boundaries back to original-file timestamps.
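The effect of the enable flags on this order can be illustrated with a minimal sketch. It is not the stage's actual implementation: the config keys on the left are hypothetical placeholders (only speaker_separation is named in this guide), and the sketch treats every flag independently, so it does not model any dependency between speaker separation and the per-speaker filters.

```python
# Default expansion order as listed above. The config keys are hypothetical
# placeholders; only `speaker_separation` appears in this guide.
DEFAULT_ORDER = [
    ("mono_conversion", "MonoConversionStage"),
    ("vad", "VADSegmentationStage"),
    ("band_filter", "BandFilterStage"),
    ("utmos", "UTMOSFilterStage"),
    ("sigmos", "SIGMOSFilterStage"),
    ("concatenation", "SegmentConcatenationStage"),
    ("speaker_separation", "SpeakerSeparationStage"),
    ("per_speaker_filters", "Per-speaker filters"),
    ("timestamp_mapper", "TimestampMapperStage"),
]


def expand(config: dict) -> list[str]:
    """Return the stage names that remain after applying `enable` flags.

    A sub-stage is kept unless its config section sets `enable` to False.
    """
    stages = []
    for key, stage_name in DEFAULT_ORDER:
        section = config.get(key, {})
        if section.get("enable", True):
            stages.append(stage_name)
    return stages
```

Calling expand({}) returns all nine entries in order, while expand({"speaker_separation": {"enable": False}}) returns the same list without SpeakerSeparationStage.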
When to Use the Composite vs Individual Stages
Basic Usage
Step 1: Pick a Config Source
Construct the stage from a YAML config file:
Or pass a config dict inline:
If neither config_path nor config is provided, the bundled default config is used.
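The three config sources can be pictured with the following sketch. It is a hedged illustration, not the stage's constructor: resolve_config and BUNDLED_DEFAULT are hypothetical names, the key names inside the stand-in default are assumed, and rejecting the case where both arguments are given is an assumption of this sketch (the stage may instead give one of them precedence).

```python
import copy
from typing import Optional

# Hypothetical stand-in for the bundled default config; the real defaults ship
# with the stage and contain more keys than this.
BUNDLED_DEFAULT = {"utmos": {"enable": True, "threshold": 3.4}}


def resolve_config(config_path: Optional[str] = None,
                   config: Optional[dict] = None) -> dict:
    """Pick the config source: YAML file, inline dict, or bundled default."""
    if config_path is not None and config is not None:
        # Assumption: this sketch rejects ambiguous input.
        raise ValueError("pass either config_path or config, not both")
    if config_path is not None:
        # PyYAML is imported lazily so the dict and default paths need no
        # third-party dependency.
        import yaml
        with open(config_path, "r", encoding="utf-8") as handle:
            return yaml.safe_load(handle)
    if config is not None:
        # Copy so later edits do not mutate the caller's dict.
        return copy.deepcopy(config)
    return copy.deepcopy(BUNDLED_DEFAULT)
```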
Step 2: Customize the YAML
Each top-level key maps to one sub-stage; set enable: false to skip it. The default configuration shipped with the stage:
The parameters mirror the sub-stage’s constructor arguments. See the per-stage pages linked at the bottom for parameter details.
The default UTMOS threshold in the YAML config is 3.4, while the standalone UTMOSFilterStage class default is 3.5. The composite uses the YAML value when constructed from the bundled config; tune as needed for your data.
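The practical effect of the 0.1 gap is that segments scoring between the two values are kept by the composite's bundled config but dropped by a standalone stage left at its class default. A small sketch, using made-up MOS values and assuming a segment is kept when its score is at or above the threshold:

```python
YAML_DEFAULT_THRESHOLD = 3.4   # bundled composite config, per the note above
CLASS_DEFAULT_THRESHOLD = 3.5  # standalone UTMOSFilterStage class default


def passes(mos: float, threshold: float) -> bool:
    # Assumption: a segment is kept when its score is at or above the threshold.
    return mos >= threshold


scores = [3.2, 3.45, 3.6]  # made-up MOS values for illustration only
kept_by_composite = [s for s in scores if passes(s, YAML_DEFAULT_THRESHOLD)]
kept_by_standalone = [s for s in scores if passes(s, CLASS_DEFAULT_THRESHOLD)]
```

Here the segment scored 3.45 survives only under the 3.4 threshold.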
Step 3: Disable Unneeded Stages
Any sub-stage that has an enable key can be skipped by setting enable: false. Common partial pipelines:
Common Configurations
High-Quality TTS Training Data
Strict thresholds across all dimensions, no narrowband, no high-reverb:
Permissive Web-Crawl Curation
Looser thresholds; preserve more data; rely on downstream training to filter further:
ASR Training (Single-Speaker Read Speech)
Skip speaker separation since each file is known to have one speaker:
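When the config is passed as an inline dict, one way to express this is to layer a small override on top of a base config. The helper below is a hypothetical sketch, not part of the library, and the base dict is a placeholder: only the speaker_separation.enable key is taken from this guide.

```python
import copy


def with_overrides(base: dict, overrides: dict) -> dict:
    """Return a copy of `base` with `overrides` merged in recursively."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested sections instead of replacing them wholesale.
            merged[key] = with_overrides(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical base config; key names other than `speaker_separation` are assumed.
base = {
    "utmos": {"enable": True, "threshold": 3.4},
    "speaker_separation": {"enable": True},
}

# Single-speaker read speech: turn off diarization, leave everything else alone.
single_speaker = with_overrides(base, {"speaker_separation": {"enable": False}})
```

The resulting dict could then be passed as the config argument, leaving the base config untouched for other pipelines.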
Complete Pipeline Example
A pipeline that uses AudioDataFilterStage as the entire processing chain:
For a complete end-to-end walkthrough including dataset download, see the ReadSpeech Tutorial.
Best Practices
- Start from the default config and tune one knob at a time: don’t tighten thresholds on five dimensions at once. You’ll lose visibility into which one rejected each dropped segment.
- Disable speaker separation when you can: it’s the most expensive sub-stage. If your input has known single-speaker audio, set speaker_separation.enable: false for a substantial speedup.
- Match resources to hardware: the cpus/gpus keys per sub-stage control parallelism. On a 16-CPU / 4-GPU node, the defaults work well; tune up for larger nodes.
- Use strict_sample_rate: false only when needed: auto-resampling can mask data-quality bugs (unexpected 8 kHz audio in a 48 kHz dataset). Default to strict and disable only when heterogeneity is expected.
- Inspect distributions before tightening thresholds: route a small sample through with most filters disabled to score the data, then pick thresholds from the percentile distributions.
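For the last point, picking a threshold from a percentile can be done with the standard library alone. The sketch below assumes you have already collected per-segment scores (for example MOS values) into a list; threshold_at_percentile is a hypothetical helper name, and how scores are exported from the pipeline is not covered here.

```python
import statistics


def threshold_at_percentile(scores: list[float], percentile: int) -> float:
    """Return the score below which roughly `percentile` percent of samples fall."""
    if not 1 <= percentile <= 99:
        raise ValueError("percentile must be between 1 and 99")
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99.
    cut_points = statistics.quantiles(scores, n=100, method="inclusive")
    return cut_points[percentile - 1]
```

For example, threshold_at_percentile(scores, 20) gives a threshold that would drop roughly the lowest-scoring fifth of the sample.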
Related Topics
- Preprocessing Stages: MonoConversionStage, SegmentConcatenationStage, TimestampMapperStage.
- VAD, Band Filter, UTMOS, SIGMOS, Speaker Separation: per-stage details.
- ReadSpeech Tutorial: end-to-end walkthrough.