Audio Quality Filtering

Curate raw audio into clean, single-speaker training segments using a configurable suite of preprocessing, segmentation, and filtering stages. Use these stages individually for targeted curation, or compose them through AudioDataFilterStage for an end-to-end pipeline driven by a single YAML config.

How it Works

A typical pipeline composes the following stages in order:

  1. Mono conversion normalizes channels and sample rate.
  2. Voice activity detection (VAD) splits each file into speech segments.
  3. Band filter drops segments that are not full-band (or not narrow-band, depending on the configured target).
  4. UTMOS drops segments whose predicted mean opinion score (MOS) falls below a perceived-quality threshold.
  5. SIGMOS filters segments by per-dimension quality scores (noise, overall, signal, coloration, discontinuity, loudness, reverb); see the sketch after this list.
  6. Segment concatenation merges surviving segments back together with configurable silence between them.
  7. Speaker separation diarizes the concatenated audio and fans out one task per speaker.
  8. Per-speaker filters rerun VAD/Band/UTMOS/SIGMOS on each speaker’s audio independently.
  9. Timestamp mapping projects final segment boundaries back to positions in the original input file.
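
The Quick Example below gates only the noise and overall SIGMOS dimensions. A minimal sketch of a stricter per-dimension configuration follows; only noise_threshold and ovrl_threshold are confirmed by the example, so treat the remaining parameter names as assumptions and check them against the SIGMOSFilterStage signature:

from nemo_curator.stages.audio.filtering.sigmos import SIGMOSFilterStage

# Per-dimension SIGMOS gate. Only noise_threshold and ovrl_threshold
# appear in the Quick Example below; the other parameter names are
# assumed from the dimension list above and may differ in the real API.
sigmos_stage = SIGMOSFilterStage(
    noise_threshold=4.0,   # background noise
    ovrl_threshold=3.5,    # overall quality
    sig_threshold=3.5,     # speech signal quality (assumed name)
    col_threshold=3.5,     # coloration (assumed name)
    disc_threshold=3.5,    # discontinuity (assumed name)
    loud_threshold=3.5,    # loudness (assumed name)
    reverb_threshold=3.5,  # reverberation (assumed name)
)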

Each stage is independently usable. Use AudioDataFilterStage to compose all of them with a single YAML config, or assemble a custom subset stage-by-stage.
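
For example, a minimal sketch of a custom subset that only normalizes audio and keeps sufficiently long speech segments, reusing the stages and parameters from the Quick Example below (the output path is illustrative):

from nemo_curator.pipeline import Pipeline
from nemo_curator.backends.xenna import XennaExecutor
from nemo_curator.stages.audio.preprocessing.mono_conversion import MonoConversionStage
from nemo_curator.stages.audio.segmentation.vad_segmentation import VADSegmentationStage
from nemo_curator.stages.audio.io.convert import AudioToDocumentStage
from nemo_curator.stages.text.io.writer import JsonlWriter

# Targeted subset: normalize, segment with VAD, export -- no quality filtering.
pipeline = Pipeline(name="vad_only_curation")
pipeline.add_stage(MonoConversionStage(output_sample_rate=48000))
pipeline.add_stage(VADSegmentationStage(min_duration_sec=2.0, threshold=0.5))
pipeline.add_stage(AudioToDocumentStage())
pipeline.add_stage(JsonlWriter(path="./vad_segments"))  # illustrative output path
pipeline.run(XennaExecutor())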

Quick Example

A complete VAD + UTMOS + SIGMOS + speaker separation pipeline assembled stage-by-stage:

from nemo_curator.pipeline import Pipeline
from nemo_curator.backends.xenna import XennaExecutor
from nemo_curator.stages.audio.preprocessing.mono_conversion import MonoConversionStage
from nemo_curator.stages.audio.segmentation.vad_segmentation import VADSegmentationStage
from nemo_curator.stages.audio.filtering.utmos import UTMOSFilterStage
from nemo_curator.stages.audio.filtering.sigmos import SIGMOSFilterStage
from nemo_curator.stages.audio.preprocessing.concatenation import SegmentConcatenationStage
from nemo_curator.stages.audio.segmentation.speaker_separation import SpeakerSeparationStage
from nemo_curator.stages.audio.postprocessing.timestamp_mapper import TimestampMapperStage
from nemo_curator.stages.audio.io.convert import AudioToDocumentStage
from nemo_curator.stages.text.io.writer import JsonlWriter

pipeline = Pipeline(name="audio_quality_filtering")

# 1. Normalize channels and sample rate
pipeline.add_stage(MonoConversionStage(output_sample_rate=48000))

# 2. Split into speech segments
pipeline.add_stage(VADSegmentationStage(min_duration_sec=2.0, threshold=0.5))

# 3. Filter by perceptual quality (drop segments with MOS < 3.5)
pipeline.add_stage(UTMOSFilterStage(mos_threshold=3.5))

# 4. Filter by SIGMOS noise + overall thresholds
pipeline.add_stage(SIGMOSFilterStage(noise_threshold=4.0, ovrl_threshold=3.5))

# 5. Concatenate surviving segments
pipeline.add_stage(SegmentConcatenationStage(silence_duration_sec=0.5))

# 6. Diarize and fan out per speaker
pipeline.add_stage(SpeakerSeparationStage())

# 7. Map final boundaries back to original file timestamps
pipeline.add_stage(TimestampMapperStage())

# 8. Export
pipeline.add_stage(AudioToDocumentStage())
pipeline.add_stage(JsonlWriter(path="./curated_audio"))

executor = XennaExecutor()
pipeline.run(executor)

For a YAML-driven equivalent, use AudioDataFilterStage — it expands into the same pipeline from a single configuration file.
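
As a rough illustration, such a config might group per-stage parameters as follows. The key names mirror the constructor arguments from the Quick Example above and are assumptions rather than the confirmed schema; see the ReadSpeech tutorial for a working configuration:

# Hypothetical config sketch -- key names mirror the stage parameters
# above and are assumptions, not the confirmed AudioDataFilterStage schema.
mono_conversion:
  output_sample_rate: 48000
vad:
  min_duration_sec: 2.0
  threshold: 0.5
utmos:
  mos_threshold: 3.5
sigmos:
  noise_threshold: 4.0
  ovrl_threshold: 3.5
concatenation:
  silence_duration_sec: 0.5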

  • ReadSpeech Tutorial — end-to-end walkthrough of AudioDataFilterStage on the DNS Challenge ReadSpeech dataset.
  • Quality Assessment — WER and duration filters for ASR-based curation.
  • Audio Concepts — audio task model, manifests, and pipeline architecture.