# Audio Quality Filtering
Curate raw audio into clean, single-speaker training segments using a configurable suite of preprocessing, segmentation, and filtering stages. Use these stages individually for targeted curation, or compose them through AudioDataFilterStage for an end-to-end pipeline driven by a single YAML config.
## How it Works
A typical pipeline composes the following stages in order:
- Mono conversion normalizes channels and sample rate.
- Voice activity detection (VAD) splits each file into speech segments.
- Band filter drops segments that are not full-band (or not narrow-band, depending on the configured target).
- UTMOS filters segments below a perceived-quality threshold.
- SIGMOS filters segments by per-dimension quality scores (noise, overall, signal, coloration, discontinuity, loudness, reverb).
- Segment concatenation merges surviving segments back together with configurable silence between them.
- Speaker separation diarizes the concatenated audio and fans out one task per speaker.
- Per-speaker filters rerun VAD/Band/UTMOS/SIGMOS on each speaker’s audio independently.
- Timestamp mapping projects final segment boundaries back to positions in the original input file.
Each stage is independently usable. Use AudioDataFilterStage to compose all of them with a single YAML config, or assemble a custom subset stage-by-stage.
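The timestamp-mapping step above is plain offset arithmetic: each surviving segment keeps its original-file start/end, so a position in the concatenated audio can be walked back through the segment durations and inserted gaps. The sketch below is illustrative only and is not the library's implementation; the function name and the `(start, end)` tuple convention are assumptions for the example.

```python
def map_to_original(t, segments, gap=0.5):
    """Map time t (seconds) in the concatenated audio back to the original file.

    segments: kept segments as (start, end) in original-file seconds,
              in concatenation order.
    gap: silence inserted between segments during concatenation (seconds).
    """
    offset = 0.0  # start of the current segment within the concatenated audio
    for start, end in segments:
        dur = end - start
        if t <= offset + dur:
            # t falls inside this segment: shift into original-file coordinates
            return start + (t - offset)
        offset += dur + gap  # skip past this segment and the inserted silence
    raise ValueError("t lies beyond the end of the concatenated audio")
```

For example, with kept segments `[(1.0, 2.0), (5.0, 6.5)]` and a 0.5 s gap, time 0.5 in the concatenated audio maps to 1.5 in the original file, while time 1.8 lands inside the second segment.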
## Pipeline Stages

- Common stages (mono conversion, concatenation, timestamp mapper) — channel normalization, segment merging, and original-file timestamp mapping.
- VAD — split audio into speech segments using Silero VAD, with configurable fan-out.
- Band filter — classify and filter audio by spectral bandwidth (full-band vs. narrow-band).
- UTMOS — filter by predicted Mean Opinion Score using the no-reference utmos22_strong model.
- SIGMOS — filter by seven independent, configurable perceptual-quality dimensions.
- Speaker separation — diarize with offline or streaming SortFormer and fan out one task per speaker.
- AudioDataFilterStage — a single composite stage that decomposes into the full filtering pipeline from a YAML config.
## Quick Example
A complete VAD + UTMOS + SIGMOS + speaker separation pipeline assembled stage-by-stage:
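The sketch below shows the composition pattern only — each stage maps one task to zero or more tasks, and fan-out stages (like speaker separation) return several. All class and function names here are illustrative stand-ins, not NeMo Curator's actual API, and the model calls are stubbed.

```python
from dataclasses import dataclass, field

@dataclass
class AudioTask:
    path: str
    segments: list = field(default_factory=list)  # (start_sec, end_sec, mos)

def vad_stage(task):
    # Stub: a real stage would run Silero VAD on task.path.
    task.segments = [(0.0, 2.0, 4.1), (3.0, 4.0, 2.2)]
    return [task]

def utmos_stage(task, threshold=3.0):
    # Drop segments whose (stubbed) predicted MOS falls below the threshold.
    task.segments = [s for s in task.segments if s[2] >= threshold]
    return [task]

def speaker_separation_stage(task):
    # Fan out: one downstream task per diarized speaker (stubbed to two).
    return [AudioTask(f"{task.path}#spk{i}", list(task.segments)) for i in range(2)]

def run_pipeline(tasks, stages):
    # Apply each stage to every task, flattening the per-task outputs.
    for stage in stages:
        tasks = [out for t in tasks for out in stage(t)]
    return tasks

tasks = run_pipeline([AudioTask("sample.wav")],
                     [vad_stage, utmos_stage, speaker_separation_stage])
# One task per speaker remains, each carrying only the surviving segment.
```

The key design point is that every stage has the same one-to-many signature, so filters (which return the same task, pruned) and fan-out stages (which return several tasks) compose freely in any order.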
For a YAML-driven equivalent, use AudioDataFilterStage — it expands into the same pipeline from a single configuration file.
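As a rough illustration of the shape such a config can take, the fragment below groups per-stage settings under stage-named keys. The key names and values here are hypothetical — consult the schema shipped with your NeMo Curator release for the actual fields.

```yaml
# Hypothetical AudioDataFilterStage config sketch; key names are illustrative.
vad:
  enabled: true
band_filter:
  enabled: true
  target: full_band          # or: narrow_band
utmos:
  enabled: true
  min_score: 3.0
sigmos:
  enabled: true
  min_overall: 3.0
concatenation:
  silence_between_segments: 0.5   # seconds
speaker_separation:
  enabled: true
  mode: offline                   # or: streaming
```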
## Related Topics

- ReadSpeech Tutorial — end-to-end walkthrough of AudioDataFilterStage on the DNS Challenge ReadSpeech dataset.
- Quality Assessment — WER and duration filters for ASR-based curation.
- Audio Concepts — audio task model, manifests, and pipeline architecture.