Learn how to curate the DNS Challenge Read Speech dataset (14,279 WAV files at 48 kHz, 19.3 hours total) using NeMo Curator’s AudioDataFilterStage. This tutorial walks you through automatic dataset download, end-to-end quality filtering, and segment extraction.
This tutorial demonstrates an end-to-end audio curation workflow:
AudioDataFilterStage with VAD, UTMOS, SIGMOS, band, and speaker-separation sub-stages.
extract_segments.py utility (no ffmpeg dependency).
What you will learn:
CreateInitialManifestReadSpeechStage into a pipeline.
--enable-vad, --enable-utmos, --enable-sigmos, --enable-band-filter, --enable-speaker-separation).
The complete working code for this tutorial is located at:
Accessing the code:
uv sync --extra audio_cuda12 for GPU, or audio_cpu for CPU-only). Refer to the Installation Guide.
The pipeline runs end-to-end on GPU in 1–2 hours for the full 14,279-file corpus on a single H100. For a fast smoke test, use --max-samples 10 (1–2 minutes wall clock).
Confirm the install with a 10-sample dry run that downloads the dataset and exercises VAD + UTMOS:
Expected wall-clock time on a single GPU: 1–2 minutes, dominated by model loading. Results land under ./dns_data/result/ as a JSONL manifest.
The full pipeline is defined in pipeline.yaml and decomposes into four stages:
The following table describes the key parameters defined in pipeline.yaml:
Default sample budget is 5,000 files. To process the full corpus:
Re-run against pre-downloaded data without re-fetching:
run.py uses Hydra to drive the same pipeline from pipeline.yaml:
Override individual sub-stage parameters from the command line:
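Conceptually, a dotted key=value override replaces one leaf of the nested configuration loaded from pipeline.yaml. The following is a simplified sketch of that idea in plain Python, not Hydra's actual parser. The parameter names auto_download and utmos_mos_threshold come from this tutorial, but the `filter` nesting and both helper functions are hypothetical.

```python
import copy
import json


def parse_value(text):
    """Interpret an override value as JSON when possible, else keep the string."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return text


def apply_overrides(config, overrides):
    """Return a copy of config with each dotted key=value override applied."""
    result = copy.deepcopy(config)
    for item in overrides:
        key, _, raw = item.partition("=")
        *parents, leaf = key.split(".")
        node = result
        for name in parents:
            # Create intermediate sections on demand.
            node = node.setdefault(name, {})
        node[leaf] = parse_value(raw)
    return result


# Hypothetical config layout; the real pipeline.yaml may nest keys differently.
base = {"auto_download": True, "filter": {"utmos_mos_threshold": 3.0}}
resolved = apply_overrides(
    base, ["auto_download=false", "filter.utmos_mos_threshold=3.5"]
)
print(resolved)
```

The base configuration is left untouched, which mirrors the workflow of keeping pipeline.yaml fixed and varying only the command-line overrides between runs.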
The pipeline writes one JSONL line per filtered segment. Each line includes the resolved timestamps, speaker ID, and the per-stage scores that survived filtering:
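As a minimal sketch of how such a manifest can be consumed, the snippet below parses JSONL records with the standard library and sums the segment durations. Only start_ms, end_ms, and utmos_mos are field names taken from this tutorial; audio_filepath and speaker_id are assumed names, and the sample records are invented for illustration.

```python
import json
from io import StringIO

# Hypothetical manifest content. audio_filepath and speaker_id are assumed
# field names; check a real output line for the actual schema.
SAMPLE_MANIFEST = """\
{"audio_filepath": "a.wav", "start_ms": 0, "end_ms": 2500, "speaker_id": "spk0", "utmos_mos": 3.9}
{"audio_filepath": "a.wav", "start_ms": 3000, "end_ms": 4200, "speaker_id": "spk1", "utmos_mos": 3.4}
"""


def read_manifest(stream):
    """Yield one dict per non-empty JSONL line."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def segment_seconds(record):
    """Segment duration in seconds, from the millisecond timestamps."""
    return (record["end_ms"] - record["start_ms"]) / 1000.0


records = list(read_manifest(StringIO(SAMPLE_MANIFEST)))
total_seconds = sum(segment_seconds(r) for r in records)
print(len(records), "segments,", total_seconds, "seconds")
```

To read a real manifest, pass an open file handle to `read_manifest` instead of the `StringIO` sample.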
Inspect distributions in pandas to validate the curation:
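One possible inspection, sketched under assumptions: the rows below are invented stand-ins for manifest records, the `retention` helper is hypothetical, and the comparison assumes a segment is kept when its utmos_mos is greater than or equal to the threshold. With a real manifest, the DataFrame could be loaded with `pd.read_json(path, lines=True)`.

```python
import pandas as pd

# Invented rows standing in for the JSONL manifest.
df = pd.DataFrame(
    {
        "start_ms": [0, 3000, 0, 1000],
        "end_ms": [2500, 4200, 8000, 6000],
        "utmos_mos": [3.9, 3.4, 4.2, 3.1],
    }
)
df["duration_s"] = (df["end_ms"] - df["start_ms"]) / 1000.0

# Summary statistics for the score and the segment duration.
summary = df[["utmos_mos", "duration_s"]].describe()


def retention(frame, thresholds):
    """Count segments and hours that would survive each candidate threshold."""
    rows = []
    for threshold in thresholds:
        kept = frame[frame["utmos_mos"] >= threshold]  # assumed comparison
        rows.append(
            {
                "threshold": threshold,
                "segments_kept": len(kept),
                "hours_kept": kept["duration_s"].sum() / 3600.0,
            }
        )
    return pd.DataFrame(rows)


table = retention(df, [3.0, 3.5, 4.0])
print(summary)
print(table)
```

A table like this makes the cost of a stricter threshold visible, in retained hours, before re-running the pipeline.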
Use the bundled extract_segments.py utility to slice the original WAVs into per-segment files according to the resolved start_ms/end_ms timestamps:
This utility uses soundfile directly, so no ffmpeg is required for wav, flac, or ogg outputs.
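To illustrate the timestamp-to-frame arithmetic behind segment extraction, here is a minimal sketch that swaps in Python's standard-library wave module in place of soundfile. It is not the bundled utility's code: both function names are hypothetical, and the wave module only handles uncompressed PCM WAV input.

```python
import wave


def ms_to_frame(ms, sample_rate):
    """Convert a millisecond offset to a frame index, rounding down."""
    return int(ms * sample_rate // 1000)


def extract_segment(src_path, dst_path, start_ms, end_ms):
    """Copy the [start_ms, end_ms) span of a PCM WAV file into a new file.

    Returns the number of frames written.
    """
    with wave.open(src_path, "rb") as src:
        rate = src.getframerate()
        start = ms_to_frame(start_ms, rate)
        # Clamp the end of the segment to the end of the file.
        stop = min(ms_to_frame(end_ms, rate), src.getnframes())
        if stop <= start:
            raise ValueError("empty or inverted segment")
        src.setpos(start)
        frames = src.readframes(stop - start)
        with wave.open(dst_path, "wb") as dst:
            # Preserve the source format so no resampling is involved.
            dst.setnchannels(src.getnchannels())
            dst.setsampwidth(src.getsampwidth())
            dst.setframerate(rate)
            dst.writeframes(frames)
    return stop - start
```

At 48 kHz, a segment from 250 ms to 750 ms spans frames 12,000 through 23,999, or 24,000 frames in total.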
--max-samples 10 confirms your environment in 1–2 minutes before committing to the full 1–2 hour corpus run.
--enable-* flags to compose pipelines: each filter is independently toggleable. Build up from VAD only, add UTMOS, then SIGMOS, then speaker separation as needed.
utmos_mos_threshold=3.0), inspect utmos_mos distribution in pandas, then re-run with the threshold you actually want.
pipeline.yaml, then override individual params on the command line for sweeps. Hydra captures the resolved config under .hydra/ for reproducibility.
auto_download=true to populate raw_data_dir, then use --no-auto-download (or auto_download=false in YAML) on subsequent runs in air-gapped environments.
AudioDataFilterStage Composite: full configuration reference for the filtering pipeline used in this tutorial.