DNS Challenge Read Speech Tutorial
Learn how to curate the DNS Challenge Read Speech dataset (14,279 WAV files at 48 kHz, 19.3 hours total) using NeMo Curator’s AudioDataFilterStage. This tutorial walks you through automatic dataset download, end-to-end quality filtering, and segment extraction.
Overview
This tutorial demonstrates an end-to-end audio curation workflow:
- Auto-download the DNS Challenge dataset (4.88 GB compressed, 6.3 GB extracted) and build an initial manifest.
- Run AudioDataFilterStage with VAD, UTMOS, SIGMOS, band, and speaker-separation sub-stages.
- Write a JSONL manifest of filtered single-speaker segments.
- Optionally extract segments as standalone WAV files using the bundled extract_segments.py utility (no ffmpeg dependency).
What you will learn:
- Wiring CreateInitialManifestReadSpeechStage into a pipeline.
- Toggling individual quality filters (--enable-vad, --enable-utmos, --enable-sigmos, --enable-band-filter, --enable-speaker-separation).
- Tuning UTMOS / SIGMOS thresholds and VAD windowing.
- Choosing between Python CLI and Hydra YAML drivers.
Working Example Location
The complete working code for this tutorial is located at:
Accessing the code:
Prerequisites
- NeMo Curator installed with audio extras (uv sync --extra audio_cuda12 for GPU, or audio_cpu for CPU-only). Refer to the Installation Guide.
- Python 3.10 or later.
- ~5 GB free disk for the compressed dataset; ~10 GB total during extraction.
- Optional but recommended: a GPU with at least 8 GB of memory for VAD/UTMOS/SIGMOS/SortFormer inference.
The pipeline runs end-to-end on GPU in 1–2 hours for the full 14,279-file corpus on a single H100. For a fast smoke test, use --max-samples 10 (1–2 minutes wall clock).
Pipeline Flow
Step-by-Step Walkthrough
Step 1: Quick Validation Run
Confirm the install with a 10-sample dry run that downloads the dataset and exercises VAD + UTMOS:
Expected wall-clock time on a single GPU: 1–2 minutes, dominated by model loading. Results land under ./dns_data/result/ as a JSONL manifest.
Step 2: Review the Pipeline Configuration
The full pipeline is defined in pipeline.yaml and decomposes into four stages:
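The authoritative schema is pipeline.yaml itself; the fragment below is only an illustrative sketch of that four-stage shape. It assumes Hydra's standard `_target_` convention, the import paths and trailing stage descriptions are placeholders, and only the class names and parameters already mentioned in this tutorial are real.

```yaml
# Illustrative sketch only; consult the tutorial's pipeline.yaml for the
# real schema. Import paths and the trailing stages are placeholders.
processors:
  - _target_: <package>.CreateInitialManifestReadSpeechStage
    raw_data_dir: ${raw_data_dir}
    auto_download: true
  - _target_: <package>.AudioDataFilterStage
    enable_vad: true
    enable_utmos: true
    enable_sigmos: true
    enable_band_filter: true
    enable_speaker_separation: true
    utmos_mos_threshold: 3.0
  # ...two further stages (defined in pipeline.yaml) resolve segments and
  # write the JSONL manifest.
```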
Step 3: Understand the Configuration Parameters
The following table describes the key parameters defined in pipeline.yaml:
Step 4: Run the Full Pipeline
The default sample budget is 5,000 files. To process the full 14,279-file corpus:
Re-run against pre-downloaded data without re-fetching:
Step 5: Drive with Hydra YAML
run.py uses Hydra to drive the same pipeline from pipeline.yaml:
Override individual sub-stage parameters from the command line:
Step 6: Inspect the Output Manifest
The pipeline writes one JSONL line per filtered segment. Each line includes the resolved timestamps, speaker ID, and the per-stage scores that survived filtering:
Inspect distributions in pandas to validate the curation:
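A minimal self-contained sketch of that inspection. The two sample manifest lines are stand-ins for your real ./dns_data/result/ output, and field names other than utmos_mos, start_ms, and end_ms are illustrative:

```python
import io
import json

import pandas as pd

# Two illustrative manifest lines; in practice, read the JSONL file(s)
# under ./dns_data/result/ instead. Only utmos_mos, start_ms, and end_ms
# are named in this tutorial; the other fields are placeholders.
SAMPLE = """\
{"audio_filepath": "clip_0001.wav", "start_ms": 250, "end_ms": 4750, "utmos_mos": 3.4}
{"audio_filepath": "clip_0002.wav", "start_ms": 0, "end_ms": 9600, "utmos_mos": 3.9}
"""

df = pd.DataFrame(json.loads(line) for line in io.StringIO(SAMPLE))

# UTMOS distribution: use this to choose a utmos_mos_threshold to re-run with.
print(df["utmos_mos"].describe())

# Segment durations in seconds, derived from the resolved timestamps.
df["duration_s"] = (df["end_ms"] - df["start_ms"]) / 1000
print(df["duration_s"].describe())
```

Swapping the SAMPLE string for an open file handle on your real manifest is the only change needed for a full-corpus inspection.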
Step 7: Extract Segments (Optional)
Use the bundled extract_segments.py utility to slice the original WAVs into per-segment files according to the resolved start_ms/end_ms timestamps:
This utility uses soundfile directly, so no ffmpeg is required for wav, flac, or ogg outputs.
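To see the slicing logic in isolation, here is a stdlib-only sketch using Python's wave module. The bundled utility itself uses soundfile; the helper name here is illustrative, and this version handles only WAV output:

```python
import wave


def extract_segment(src_path: str, dst_path: str, start_ms: int, end_ms: int) -> int:
    """Copy the [start_ms, end_ms) slice of src_path into dst_path.

    Returns the number of frames actually written. A stdlib-only sketch of
    what the bundled extract_segments.py does with soundfile.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        rate = params.framerate
        start = int(start_ms * rate / 1000)
        n_frames = int((end_ms - start_ms) * rate / 1000)
        src.setpos(start)
        frames = src.readframes(n_frames)  # may be shorter near end-of-file
        with wave.open(dst_path, "wb") as dst:
            dst.setparams(params)  # frame count is patched on close
            dst.writeframes(frames)
    return len(frames) // (params.sampwidth * params.nchannels)
```

Because the dataset is 48 kHz WAV, a start_ms of 250 maps to frame 12,000; the same arithmetic applies to any sample rate reported in the source header.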
Best Practices
- Start with a 10-sample run: --max-samples 10 confirms your environment in 1–2 minutes before committing to the full 1–2 hour corpus run.
- Use --enable-* flags to compose pipelines: each filter is independently toggleable. Build up from VAD only, then add UTMOS, SIGMOS, and speaker separation as needed.
- Inspect distributions before tightening thresholds: run with permissive defaults (utmos_mos_threshold=3.0), inspect the utmos_mos distribution in pandas, then re-run with the threshold you actually want.
- Use Hydra for repeatable runs: configure once in pipeline.yaml, then override individual params on the command line for sweeps. Hydra captures the resolved config under .hydra/ for reproducibility.
- Pre-download for offline environments: run once with auto_download=true to populate raw_data_dir, then use --no-auto-download (or auto_download=false in YAML) on subsequent runs in air-gapped environments.
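For the pre-download practice, a small sanity check can confirm the corpus is actually present before an air-gapped run. This helper is not part of NeMo Curator; the directory argument and expected file count are assumptions you should adjust:

```python
from pathlib import Path


def corpus_ready(raw_data_dir: str, expected_files: int = 14279) -> bool:
    """Return True when raw_data_dir already holds the expected WAV count.

    Illustrative helper (not part of NeMo Curator): use it to decide
    whether --no-auto-download is safe for an offline re-run.
    """
    found = sum(1 for _ in Path(raw_data_dir).rglob("*.wav"))
    print(f"{found}/{expected_files} WAV files under {raw_data_dir}")
    return found >= expected_files


# Example: corpus_ready("./dns_data") before launching with --no-auto-download.
```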
Related Topics
- AudioDataFilterStage Composite — full configuration reference for the filtering pipeline used in this tutorial.
- Audio Quality Filtering — index of the individual filter stages.
- ALM Tutorial — alternative audio-curation tutorial focused on audio-language model training data.
- Beginner Tutorial — simpler audio curation walkthrough.