DNS Challenge Read Speech Tutorial

Learn how to curate the DNS Challenge Read Speech dataset (14,279 WAV files at 48 kHz, 19.3 hours total) using NeMo Curator’s AudioDataFilterStage. This tutorial walks you through automatic dataset download, end-to-end quality filtering, and segment extraction.

Overview

This tutorial demonstrates an end-to-end audio curation workflow:

  1. Auto-download the DNS Challenge dataset (4.88 GB compressed, 6.3 GB extracted) and build an initial manifest.
  2. Run AudioDataFilterStage with VAD, UTMOS, SIGMOS, band, and speaker-separation sub-stages.
  3. Write a JSONL manifest of filtered single-speaker segments.
  4. Optionally extract segments as standalone WAV files using the bundled extract_segments.py utility (no ffmpeg dependency).

What you will learn:

  • Wiring CreateInitialManifestReadSpeechStage into a pipeline.
  • Toggling individual quality filters (--enable-vad, --enable-utmos, --enable-sigmos, --enable-band-filter, --enable-speaker-separation).
  • Tuning UTMOS / SIGMOS thresholds and VAD windowing.
  • Choosing between Python CLI and Hydra YAML drivers.

Working Example Location

The complete working code for this tutorial is located at:

<nemo_curator_repository>/tutorials/audio/readspeech/
├── README.md # Tutorial documentation
├── pipeline.py # argparse CLI driver
├── pipeline.yaml # Hydra config (full pipeline)
├── run.py # Hydra runner
└── extract_segments.py # Post-processing utility

Accessing the code:

$ git clone https://github.com/NVIDIA-NeMo/Curator.git
$ cd Curator/tutorials/audio/readspeech/

Prerequisites

  • NeMo Curator installed with audio extras (uv sync --extra audio_cuda12 for GPU, or audio_cpu for CPU-only). Refer to the Installation Guide.
  • Python 3.10 or later.
  • ~5 GB free disk for the compressed dataset; ~10 GB total during extraction.
  • Optional but recommended: a GPU with at least 8 GB of memory for VAD/UTMOS/SIGMOS/SortFormer inference.

The pipeline runs end-to-end on GPU in 1–2 hours for the full 14,279-file corpus on a single H100. For a fast smoke test, use --max-samples 10 (1–2 minutes wall clock).
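
Before launching, you can confirm that a suitable GPU is visible. A minimal check, assuming PyTorch is importable (the audio model stages require it):

import torch

# Confirm a CUDA device is visible and meets the ~8 GB recommendation.
if not torch.cuda.is_available():
    print("No CUDA device detected; expect much slower CPU execution.")
else:
    props = torch.cuda.get_device_properties(0)
    total_gb = props.total_memory / 1024**3
    print(f"{props.name}: {total_gb:.1f} GB")
    if total_gb < 8:
        print("Warning: below the recommended 8 GB for VAD/UTMOS/SIGMOS/SortFormer.")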

Pipeline Flow

CreateInitialManifestReadSpeechStage (download + manifest)
  → AudioDataFilterStage (Mono → VAD → Band → UTMOS → SIGMOS → Concat → SpeakerSep → ... → TimestampMapper)
  → AudioToDocumentStage → JsonlWriter (manifest.jsonl)
  → extract_segments.py (optional: write segment WAVs to disk)

Step-by-Step Walkthrough

Step 1: Quick Validation Run

Confirm the install with a 10-sample dry run that downloads the dataset and exercises VAD + UTMOS:

$ python pipeline.py \
> --raw_data_dir ./dns_data \
> --max-samples 10 \
> --enable-utmos \
> --enable-vad

Expected wall-clock time on a single GPU: 1–2 minutes, dominated by model loading. Results land under ./dns_data/result/ as a JSONL manifest.
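
A quick way to sanity-check the output, assuming the manifest lands under ./dns_data/result/ as described (the exact filename may vary, hence the glob):

import glob
import json

# Count curated segments and inspect the first record.
paths = glob.glob("./dns_data/result/*.jsonl")
records = [json.loads(line) for p in paths for line in open(p, encoding="utf-8")]
print(f"{len(records)} segments across {len(paths)} manifest file(s)")
print(json.dumps(records[0], indent=2))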

Step 2: Review the Pipeline Configuration

The full pipeline is defined in pipeline.yaml and decomposes into four stages:

processors:
  # Stage 0: Download dataset and create manifest
  - _target_: nemo_curator.stages.audio.datasets.readspeech.CreateInitialManifestReadSpeechStage
    raw_data_dir: ${raw_data_dir}
    max_samples: ${max_samples}
    auto_download: ${auto_download}

  # Stage 1: Apply audio filtering pipeline
  - _target_: nemo_curator.stages.audio.AudioDataFilterStage
    config:
      mono_conversion:
        output_sample_rate: ${sample_rate}
      vad:
        enable: ${enable_vad}
        min_duration_sec: ${vad_min_duration_sec}
        max_duration_sec: ${vad_max_duration_sec}
        threshold: ${vad_threshold}
      band_filter:
        enable: ${enable_band_filter}
        band_value: ${band_value}
      utmos:
        enable: ${enable_utmos}
        mos_threshold: ${utmos_mos_threshold}
      sigmos:
        enable: ${enable_sigmos}
        noise_threshold: ${sigmos_noise_threshold}
        ovrl_threshold: ${sigmos_ovrl_threshold}
      speaker_separation:
        enable: ${enable_speaker_separation}
      timestamp_mapper: {}

  # Stage 2: Convert AudioTask → DocumentBatch
  - _target_: nemo_curator.stages.audio.io.convert.AudioToDocumentStage

  # Stage 3: Write JSONL manifest with UTF-8 preserved
  - _target_: nemo_curator.stages.text.io.writer.JsonlWriter
    path: ${output_dir}
    write_kwargs:
      force_ascii: false
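
The ${...} placeholders are OmegaConf interpolations that Hydra resolves at load time. To preview the resolved stage configuration without running anything, a short sketch (assumes pipeline.yaml defines defaults for the remaining placeholders, and that omegaconf, which ships with Hydra, is installed):

from omegaconf import OmegaConf

# Load the tutorial config and supply the one required value.
cfg = OmegaConf.merge(
    OmegaConf.load("pipeline.yaml"),
    OmegaConf.from_dotlist(["raw_data_dir=./dns_data"]),
)
OmegaConf.resolve(cfg)  # replaces ${...} interpolations in place
print(OmegaConf.to_yaml(cfg.processors[1]))  # the AudioDataFilterStage entry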

Step 3: Understand the Configuration Parameters

The following table describes the key parameters defined in pipeline.yaml:

| Parameter | Default | Description |
| --- | --- | --- |
| raw_data_dir | required | Where to download the dataset (or where it already lives if auto_download=false). |
| output_dir | ${raw_data_dir}/result | Where to write the JSONL manifest. |
| max_samples | -1 | Number of files to process; -1 processes all 14,279. |
| execution_mode | streaming | batch runs stages sequentially; streaming runs them concurrently (needs enough GPU memory for all stages at once). |
| sample_rate | 48000 | Target sample rate for MonoConversionStage. |
| vad_threshold | 0.5 | Silero VAD confidence threshold. |
| utmos_mos_threshold | 3.4 | Drop segments with predicted MOS below this. |
| sigmos_noise_threshold | 4.0 | Drop segments with SIGMOS noise score below this. |
| sigmos_ovrl_threshold | 3.5 | Drop segments with SIGMOS overall score below this. |
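
To make the threshold semantics concrete, the sketch below mirrors the keep/drop decision the quality filters apply. This is an illustration only, not NeMo Curator code; the real stages compute these scores with model inference:

# Hypothetical stand-in for the per-segment pass/fail logic.
def keep_segment(scores: dict, cfg: dict) -> bool:
    if cfg["enable_utmos"] and scores["utmos_mos"] < cfg["utmos_mos_threshold"]:
        return False  # predicted MOS too low
    if cfg["enable_sigmos"]:
        if scores["sigmos_noise"] < cfg["sigmos_noise_threshold"]:
            return False  # too noisy
        if scores["sigmos_ovrl"] < cfg["sigmos_ovrl_threshold"]:
            return False  # overall quality too low
    return True

defaults = {
    "enable_utmos": True, "utmos_mos_threshold": 3.4,
    "enable_sigmos": True, "sigmos_noise_threshold": 4.0, "sigmos_ovrl_threshold": 3.5,
}
print(keep_segment({"utmos_mos": 4.21, "sigmos_noise": 4.55, "sigmos_ovrl": 4.10}, defaults))  # True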

Step 4: Run the Full Pipeline

By default, pipeline.py caps processing at 5,000 files. To process the full corpus, set --max-samples to -1:

$ python pipeline.py \
> --raw_data_dir ./dns_data \
> --max-samples -1 \
> --enable-utmos \
> --enable-vad \
> --enable-sigmos \
> --enable-band-filter \
> --enable-speaker-separation

Re-run against pre-downloaded data without re-fetching:

$ python pipeline.py \
> --raw_data_dir /path/to/existing/read_speech \
> --no-auto-download \
> --enable-utmos

Step 5: Drive with Hydra YAML

run.py uses Hydra to drive the same pipeline from pipeline.yaml:

# Default settings
$ python run.py --config-name pipeline raw_data_dir=./dns_data

# Process 1,000 samples
$ python run.py --config-name pipeline raw_data_dir=./dns_data max_samples=1000

Override individual sub-stage parameters from the command line:

# Looser MOS threshold; disable SIGMOS
$ python run.py --config-name pipeline \
> raw_data_dir=./dns_data \
> utmos_mos_threshold=3.0 \
> enable_sigmos=false

Step 6: Inspect the Output Manifest

The pipeline writes one JSONL line per filtered segment. Each line includes the resolved timestamps, the speaker ID, and the per-stage quality scores for the surviving segment:

{
  "audio_filepath": "/data/dns_data/read_speech/book_42_reader_0.wav",
  "start_ms": 1500,
  "end_ms": 4500,
  "speaker_id": 0,
  "num_speakers": 1,
  "duration_sec": 3.0,
  "utmos_mos": 4.21,
  "sigmos_noise": 4.55,
  "sigmos_ovrl": 4.10,
  "band_prediction": "full_band"
}

Inspect distributions in pandas to validate the curation:

import pandas as pd

df = pd.read_json("./dns_data/result/manifest.jsonl", lines=True)
print(df.describe())
print(df["utmos_mos"].quantile([0.1, 0.5, 0.9]))

Step 7: Extract Segments (Optional)

Use the bundled extract_segments.py utility to slice the original WAVs into per-segment files according to the resolved start_ms/end_ms timestamps:

$ python extract_segments.py \
> --manifest ./dns_data/result/manifest.jsonl \
> --output-dir ./dns_data/segments

This utility uses soundfile directly, so no ffmpeg is required for wav, flac, or ogg outputs.
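
For reference, the core of that approach is small enough to sketch. The following is a simplified illustration using soundfile's frame offsets, not the utility's actual code, and the output naming scheme here is made up:

import json
import os

import soundfile as sf

manifest = "./dns_data/result/manifest.jsonl"
out_dir = "./dns_data/segments"
os.makedirs(out_dir, exist_ok=True)

with open(manifest, encoding="utf-8") as f:
    for i, line in enumerate(f):
        rec = json.loads(line)
        src = rec["audio_filepath"]
        sr = sf.info(src).samplerate
        # Convert millisecond timestamps to frame offsets and read only that slice.
        start = rec["start_ms"] * sr // 1000
        stop = rec["end_ms"] * sr // 1000
        audio, _ = sf.read(src, start=start, stop=stop)
        base = os.path.splitext(os.path.basename(src))[0]
        sf.write(os.path.join(out_dir, f"{base}_seg{i:05d}.wav"), audio, sr)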

Best Practices

  • Start with a 10-sample run: --max-samples 10 confirms your environment in 1–2 minutes before committing to the full 1–2 hour corpus run.
  • Use --enable-* flags to compose pipelines: each filter is independently toggleable. Build up from VAD only, add UTMOS, then SIGMOS, then speaker separation as needed.
  • Inspect distributions before tightening thresholds: run with permissive defaults (utmos_mos_threshold=3.0), inspect utmos_mos distribution in pandas, then re-run with the threshold you actually want.
  • Use Hydra for repeatable runs: configure once in pipeline.yaml, then override individual params on the command line for sweeps. Hydra captures the resolved config under .hydra/ for reproducibility.
  • Pre-download for offline environments: run once with auto_download=true to populate raw_data_dir, then use --no-auto-download (or auto_download=false in YAML) on subsequent runs in air-gapped environments.