Curate raw audio into clean, single-speaker training segments using a configurable suite of preprocessing, segmentation, and filtering stages. Use these stages individually for targeted curation, or compose them through AudioDataFilterStage for an end-to-end pipeline driven by a single YAML config.
A typical pipeline composes the following stages in order:
Each stage is independently usable, so a custom pipeline can assemble any subset of them stage-by-stage.
Channel normalization, segment merging, and original-file timestamp mapping (tags: mono-conversion, concatenation, timestamp-mapper)
Split audio into speech segments using Silero VAD (tags: silero, fan-out, configurable)
Classify and filter audio by spectral bandwidth (tags: full-band, narrow-band, sklearn)
Filter by predicted Mean Opinion Score using utmos22_strong (tags: mos, torch-hub, no-reference)
Filter by seven independent perceptual-quality dimensions (tags: onnx, multi-dimensional, configurable)
Diarize with offline or streaming SortFormer and fan out per speaker (tags: sortformer, streaming, diarization)
Single composite stage that decomposes into the full filtering pipeline from a YAML config (tags: composite, yaml-config, end-to-end)
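To illustrate what filtering by several independent quality dimensions can look like, here is a minimal pure-Python sketch. It does not use the library's API: the dimension names, the threshold values, and the `passes` helper are all hypothetical, and only three of the seven dimensions are shown for brevity. The idea is that each dimension has its own optional minimum, and a segment is kept only if it clears every enabled one.

```python
from typing import Dict, Optional

# Hypothetical per-dimension minimums; None disables the check for that dimension.
THRESHOLDS: Dict[str, Optional[float]] = {
    "noise": 4.0,
    "overall": 3.5,
    "reverb": None,
}


def passes(scores: Dict[str, float], thresholds: Dict[str, Optional[float]]) -> bool:
    """Keep a segment only if every enabled dimension meets its minimum."""
    return all(
        scores[name] >= minimum
        for name, minimum in thresholds.items()
        if minimum is not None
    )


# Toy scores standing in for model predictions on two segments.
clean = {"noise": 4.2, "overall": 3.7, "reverb": 2.0}
noisy = {"noise": 3.1, "overall": 3.9, "reverb": 4.5}

kept = [scores for scores in (clean, noisy) if passes(scores, THRESHOLDS)]
```

Because the `reverb` check is disabled, the first segment is kept despite its low reverb score, while the second is dropped on the `noise` dimension alone.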
A complete VAD + UTMOS + SIGMOS + speaker separation pipeline assembled stage-by-stage:
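As a minimal sketch of the stage-by-stage pattern, the example below composes a VAD fan-out, two score filters, and a per-speaker fan-out in plain Python. Everything in it is a hypothetical stand-in rather than the library's API: the `Segment` record, the `fan_out_stage` and `threshold_stage` helpers, the thresholds, and the lookup tables that replace real model outputs are all assumptions made for illustration.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Segment:
    """Hypothetical record passed between stages (not a library type)."""
    source: str
    start: float  # seconds, relative to the original file
    end: float
    speaker: str = ""


Span = Tuple[float, float]
Stage = Callable[[List[Segment]], List[Segment]]


def fan_out_stage(split: Callable[[Segment], List[Segment]]) -> Stage:
    """Build a stage that replaces each input with zero or more outputs."""
    return lambda items: [out for item in items for out in split(item)]


def threshold_stage(score: Callable[[Segment], float], minimum: float) -> Stage:
    """Build a stage that keeps segments scoring at or above `minimum`."""
    return lambda items: [item for item in items if score(item) >= minimum]


def run_pipeline(stages: List[Stage], items: List[Segment]) -> List[Segment]:
    """Apply the stages in order, feeding each stage's output to the next."""
    for stage in stages:
        items = stage(items)
    return items


def span(seg: Segment) -> Span:
    return (seg.start, seg.end)


# Toy lookup tables standing in for model outputs (assumed values).
SPEECH_SPANS: Dict[str, List[Span]] = {"a.wav": [(0.0, 2.0), (3.0, 5.5)]}
UTMOS: Dict[Span, float] = {(0.0, 2.0): 3.9, (3.0, 5.5): 2.1}
SIGMOS_OVERALL: Dict[Span, float] = {(0.0, 2.0): 3.6, (3.0, 5.5): 3.8}
SPEAKER_TURNS: Dict[Span, List[Tuple[str, float, float]]] = {
    (0.0, 2.0): [("spk0", 0.0, 1.2), ("spk1", 1.2, 2.0)],
}

stages: List[Stage] = [
    # VAD: one file fans out into one segment per speech span.
    fan_out_stage(lambda s: [replace(s, start=a, end=b) for a, b in SPEECH_SPANS[s.source]]),
    # UTMOS and SIGMOS: drop segments below the (assumed) minimum scores.
    threshold_stage(lambda s: UTMOS[span(s)], 3.0),
    threshold_stage(lambda s: SIGMOS_OVERALL[span(s)], 3.5),
    # Speaker separation: one segment fans out into one segment per speaker turn.
    fan_out_stage(lambda s: [replace(s, speaker=spk, start=a, end=b)
                             for spk, a, b in SPEAKER_TURNS[span(s)]]),
]

result = run_pipeline(stages, [Segment("a.wav", 0.0, 6.0)])
```

Running the filters before the speaker fan-out means diarization only sees segments that already passed the quality checks, and every surviving segment still carries timestamps relative to the original file.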
For a YAML-driven equivalent, use AudioDataFilterStage, which expands into the same pipeline from a single configuration file.
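The expansion from one configuration into an ordered list of stages can be pictured with the sketch below. It is not the AudioDataFilterStage implementation: the section names, the `enable` flag, the parameter keys, and the `expand` helper are all hypothetical, and the config is written as the mapping a YAML loader would return so the example needs no third-party packages.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical config, shown as the mapping a YAML loader would produce.
CONFIG: Dict[str, Dict[str, Any]] = {
    "vad": {"enable": True, "min_duration_sec": 2.0},
    "band_filter": {"enable": False},
    "utmos": {"enable": True, "mos_threshold": 3.0},
    "sigmos": {"enable": True, "overall_threshold": 3.5},
    "speaker_separation": {"enable": True},
}

# Assumed fixed execution order; the config only toggles and parameterizes stages.
STAGE_ORDER: List[str] = ["vad", "band_filter", "utmos", "sigmos", "speaker_separation"]


def expand(config: Dict[str, Dict[str, Any]]) -> List[Tuple[str, Dict[str, Any]]]:
    """Turn a config mapping into an ordered plan of (stage name, parameters)."""
    plan = []
    for name in STAGE_ORDER:
        section = config.get(name, {})
        if not section.get("enable", False):
            continue  # missing or disabled sections contribute no stage
        params = {key: value for key, value in section.items() if key != "enable"}
        plan.append((name, params))
    return plan


plan = expand(CONFIG)
```

Keeping the order in code rather than in the config file means a reordered or partially filled config still yields a valid pipeline.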
Example: AudioDataFilterStage on the DNS Challenge ReadSpeech dataset.