# Audio Quality Filtering
Curate raw audio into clean, single-speaker training segments using a configurable suite of preprocessing, segmentation, and filtering stages. Use these stages individually for targeted curation, or compose them through AudioDataFilterStage for an end-to-end pipeline driven by a single YAML config.
## How it Works
A typical pipeline composes the following stages in order:
- Mono conversion normalizes channels and sample rate.
- Voice activity detection (VAD) splits each file into speech segments.
- Band filter drops segments that are not full-band (or not narrow-band, depending on the configured target).
- UTMOS filters segments below a perceived-quality threshold.
- SIGMOS filters segments by per-dimension quality scores (noise, overall, signal, coloration, discontinuity, loudness, reverb).
- Segment concatenation merges surviving segments back together with configurable silence between them.
- Speaker separation diarizes the concatenated audio and fans out one task per speaker.
- Per-speaker filters rerun VAD/Band/UTMOS/SIGMOS on each speaker’s audio independently.
- Timestamp mapping projects final segment boundaries back to positions in the original input file.
Each stage is independently usable. Use AudioDataFilterStage to compose all of them with a single YAML config, or assemble a custom subset stage-by-stage.
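The timestamp-mapping step above is plain offset arithmetic: each surviving segment keeps its original-file start/end, so a position in the concatenated audio can be walked back through the segment durations and inserted gaps. The sketch below is illustrative only and is not the library's implementation; the function name and the `(start, end)` tuple convention are assumptions for the example.

```python
def map_to_original(t, segments, gap=0.5):
    """Map time t (seconds) in the concatenated audio back to the original file.

    segments: kept segments as (start, end) in original-file seconds,
              in concatenation order.
    gap: silence inserted between segments during concatenation (seconds).
    """
    offset = 0.0  # start of the current segment within the concatenated audio
    for start, end in segments:
        dur = end - start
        if t <= offset + dur:
            # t falls inside this segment: shift into original-file coordinates
            return start + (t - offset)
        offset += dur + gap  # skip past this segment and the inserted silence
    raise ValueError("t lies beyond the end of the concatenated audio")
```

For example, with kept segments `[(1.0, 2.0), (5.0, 6.5)]` and a 0.5 s gap, time 0.5 in the concatenated audio maps to 1.5 in the original file, while time 1.8 lands inside the second segment.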
## Pipeline Stages

- Common stages (mono conversion, concatenation, timestamp mapper) — channel normalization, segment merging, and original-file timestamp mapping.
- VAD — split audio into speech segments using Silero VAD, with configurable fan-out.
- Band filter — classify and filter audio by spectral bandwidth (full-band vs. narrow-band).
- UTMOS — filter by predicted Mean Opinion Score using the no-reference utmos22_strong model.
- SIGMOS — filter by seven independent, configurable perceptual-quality dimensions.
- Speaker separation — diarize with offline or streaming SortFormer and fan out one task per speaker.
- AudioDataFilterStage — a single composite stage that decomposes into the full filtering pipeline from a YAML config.
## Quick Example
A complete VAD + UTMOS + SIGMOS + speaker separation pipeline assembled stage-by-stage:
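The sketch below shows the composition pattern only — each stage maps one task to zero or more tasks, and fan-out stages (like speaker separation) return several. All class and function names here are illustrative stand-ins, not NeMo Curator's actual API, and the model calls are stubbed.

```python
from dataclasses import dataclass, field

@dataclass
class AudioTask:
    path: str
    segments: list = field(default_factory=list)  # (start_sec, end_sec, mos)

def vad_stage(task):
    # Stub: a real stage would run Silero VAD on task.path.
    task.segments = [(0.0, 2.0, 4.1), (3.0, 4.0, 2.2)]
    return [task]

def utmos_stage(task, threshold=3.0):
    # Drop segments whose (stubbed) predicted MOS falls below the threshold.
    task.segments = [s for s in task.segments if s[2] >= threshold]
    return [task]

def speaker_separation_stage(task):
    # Fan out: one downstream task per diarized speaker (stubbed to two).
    return [AudioTask(f"{task.path}#spk{i}", list(task.segments)) for i in range(2)]

def run_pipeline(tasks, stages):
    # Apply each stage to every task, flattening the per-task outputs.
    for stage in stages:
        tasks = [out for t in tasks for out in stage(t)]
    return tasks

tasks = run_pipeline([AudioTask("sample.wav")],
                     [vad_stage, utmos_stage, speaker_separation_stage])
# One task per speaker remains, each carrying only the surviving segment.
```

The key design point is that every stage has the same one-to-many signature, so filters (which return the same task, pruned) and fan-out stages (which return several tasks) compose freely in any order.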
For a YAML-driven equivalent, use AudioDataFilterStage — it expands into the same pipeline from a single configuration file.
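As a rough illustration of the shape such a config can take, the fragment below groups per-stage settings under stage-named keys. The key names and values here are hypothetical — consult the schema shipped with your NeMo Curator release for the actual fields.

```yaml
# Hypothetical AudioDataFilterStage config sketch; key names are illustrative.
vad:
  enabled: true
band_filter:
  enabled: true
  target: full_band          # or: narrow_band
utmos:
  enabled: true
  min_score: 3.0
sigmos:
  enabled: true
  min_overall: 3.0
concatenation:
  silence_between_segments: 0.5   # seconds
speaker_separation:
  enabled: true
  mode: offline                   # or: streaming
```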
## Related Topics

- ReadSpeech Tutorial — end-to-end walkthrough of AudioDataFilterStage on the DNS Challenge ReadSpeech dataset.
- Quality Assessment — WER and duration filters for ASR-based curation.
- Audio Concepts — audio task model, manifests, and pipeline architecture.