Copyright © 2026, NVIDIA Corporation.

Audio Quality Filtering

Curate raw audio into clean, single-speaker training segments using a configurable suite of preprocessing, segmentation, and filtering stages. Use these stages individually for targeted curation, or compose them through AudioDataFilterStage for an end-to-end pipeline driven by a single YAML config.

How it Works

A typical pipeline composes the following stages in order:

  1. Mono conversion normalizes channels and sample rate.
  2. Voice activity detection (VAD) splits each file into speech segments.
  3. Band filter drops segments that are not full-band (or not narrow-band, depending on the configured target).
  4. UTMOS filter drops segments whose predicted Mean Opinion Score (MOS) falls below a threshold.
  5. SIGMOS filter drops segments that fall below thresholds on per-dimension quality scores (noise, overall, signal, coloration, discontinuity, loudness, reverb).
  6. Segment concatenation merges surviving segments back together with configurable silence between them.
  7. Speaker separation diarizes the concatenated audio and fans out one task per speaker.
  8. Per-speaker filters rerun VAD/Band/UTMOS/SIGMOS on each speaker’s audio independently.
  9. Timestamp mapping projects final segment boundaries back to positions in the original input file.

Each stage is independently usable. Use AudioDataFilterStage to compose all of them with a single YAML config, or assemble a custom subset stage-by-stage.
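
Steps 6 and 9 interact: once surviving segments are concatenated with inserted silence, any later boundary lives on the concatenated timeline and has to be projected back to the original file. The following is a minimal sketch of that bookkeeping in plain Python. It is not the TimestampMapperStage implementation; the `Segment` class and both helper functions are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """A surviving speech segment, in seconds, relative to the original file."""
    orig_start: float
    orig_end: float


def build_layout(segments, silence_sec):
    """Return (concat_start, concat_end, segment) for each segment after concatenation."""
    layout = []
    cursor = 0.0
    for seg in segments:
        duration = seg.orig_end - seg.orig_start
        layout.append((cursor, cursor + duration, seg))
        # Assumption: a fixed silence gap follows every segment.
        cursor += duration + silence_sec
    return layout


def to_original_time(t, layout):
    """Map a time on the concatenated timeline to the original file.

    Returns None when t falls inside inserted silence.
    """
    for concat_start, concat_end, seg in layout:
        if concat_start <= t <= concat_end:
            return seg.orig_start + (t - concat_start)
    return None


segments = [Segment(3.0, 7.0), Segment(12.0, 15.0)]
layout = build_layout(segments, silence_sec=0.5)

print(to_original_time(2.0, layout))   # 5.0
print(to_original_time(4.25, layout))  # None (inside the inserted silence)
print(to_original_time(5.0, layout))   # 12.5
```

Keeping the original offsets alongside each concatenated span is what makes the final projection a simple lookup, regardless of how many filtering passes run in between.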

Pipeline Stages

Preprocessing Stages

Channel normalization, segment merging, and original-file timestamp mapping.
Tags: mono-conversion, concatenation, timestamp-mapper

VAD Segmentation

Split audio into speech segments using Silero VAD.
Tags: silero, fan-out, configurable

Band Filter

Classify and filter audio by spectral bandwidth.
Tags: full-band, narrow-band, sklearn

UTMOS Filter

Filter by predicted Mean Opinion Score using utmos22_strong.
Tags: mos, torch-hub, no-reference

SIGMOS Filter

Filter by seven independent perceptual-quality dimensions.
Tags: onnx, multi-dimensional, configurable

Speaker Separation

Diarize with offline or streaming SortFormer and fan out per speaker.
Tags: sortformer, streaming, diarization

AudioDataFilterStage Composite

Single composite stage that decomposes into the full filtering pipeline from a YAML config.
Tags: composite, yaml-config, end-to-end
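
To illustrate what filtering on independent quality dimensions can look like, here is a small sketch in plain Python. It assumes that higher scores are better and that a segment is kept only when every configured dimension meets its minimum; the actual SIGMOS filter stage may combine thresholds differently. The function name, dictionary keys, and score values are hypothetical.

```python
def passes_thresholds(scores, thresholds):
    """Keep a segment only if every configured dimension meets its minimum score."""
    for name, minimum in thresholds.items():
        if minimum is None:
            continue  # dimension not configured, so it is not checked
        # A missing score is treated as a failure for a configured dimension.
        if scores.get(name, float("-inf")) < minimum:
            return False
    return True


# Hypothetical thresholds: only noise and overall quality are configured.
thresholds = {"noise": 4.0, "ovrl": 3.5, "reverb": None}

# Hypothetical per-segment scores.
segments = [
    {"id": "seg_0", "scores": {"noise": 4.3, "ovrl": 3.9, "reverb": 2.1}},
    {"id": "seg_1", "scores": {"noise": 3.2, "ovrl": 4.1, "reverb": 4.0}},
]

kept = [s["id"] for s in segments if passes_thresholds(s["scores"], thresholds)]
print(kept)  # ['seg_0']
```

Here seg_1 is dropped on noise alone even though its overall score is high, which is the practical difference between a multi-dimensional filter and a single MOS threshold.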

Quick Example

A complete VAD + UTMOS + SIGMOS + speaker separation pipeline assembled stage-by-stage:

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.backends.xenna import XennaExecutor
from nemo_curator.stages.audio.preprocessing.mono_conversion import MonoConversionStage
from nemo_curator.stages.audio.segmentation.vad_segmentation import VADSegmentationStage
from nemo_curator.stages.audio.filtering.utmos import UTMOSFilterStage
from nemo_curator.stages.audio.filtering.sigmos import SIGMOSFilterStage
from nemo_curator.stages.audio.preprocessing.concatenation import SegmentConcatenationStage
from nemo_curator.stages.audio.segmentation.speaker_separation import SpeakerSeparationStage
from nemo_curator.stages.audio.postprocessing.timestamp_mapper import TimestampMapperStage
from nemo_curator.stages.audio.io.convert import AudioToDocumentStage
from nemo_curator.stages.text.io.writer import JsonlWriter

pipeline = Pipeline(name="audio_quality_filtering")

# 1. Normalize channels and sample rate
pipeline.add_stage(MonoConversionStage(output_sample_rate=48000))

# 2. Split into speech segments
pipeline.add_stage(VADSegmentationStage(min_duration_sec=2.0, threshold=0.5))

# 3. Filter by perceptual quality (drop segments with MOS < 3.5)
pipeline.add_stage(UTMOSFilterStage(mos_threshold=3.5))

# 4. Filter by SIGMOS noise + overall thresholds
pipeline.add_stage(SIGMOSFilterStage(noise_threshold=4.0, ovrl_threshold=3.5))

# 5. Concatenate surviving segments
pipeline.add_stage(SegmentConcatenationStage(silence_duration_sec=0.5))

# 6. Diarize and fan out per speaker
pipeline.add_stage(SpeakerSeparationStage())

# 7. Map final boundaries back to original file timestamps
pipeline.add_stage(TimestampMapperStage())

# 8. Export
pipeline.add_stage(AudioToDocumentStage())
pipeline.add_stage(JsonlWriter(path="./curated_audio"))

executor = XennaExecutor()
pipeline.run(executor)
```

For a YAML-driven equivalent, use AudioDataFilterStage — it expands into the same pipeline from a single configuration file.
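
To make the `threshold` and `min_duration_sec` arguments in the example above more concrete, the sketch below shows one common way a speech-probability threshold and a minimum duration interact. It is a generic illustration in plain Python, not the logic of Silero VAD or VADSegmentationStage; the function name, the frame length, and the probability values are assumptions.

```python
def speech_segments(probs, frame_sec, threshold=0.5, min_duration_sec=2.0):
    """Group consecutive frames with probability >= threshold into (start, end) spans.

    Spans shorter than min_duration_sec are discarded.
    """
    spans = []
    start = None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i  # speech begins at this frame
        elif p < threshold and start is not None:
            spans.append((start * frame_sec, i * frame_sec))
            start = None  # speech ended before this frame
    if start is not None:
        # Speech runs through the end of the input.
        spans.append((start * frame_sec, len(probs) * frame_sec))
    return [(s, e) for s, e in spans if e - s >= min_duration_sec]


# Hypothetical per-frame speech probabilities, one value per 0.5 s frame.
probs = [0.1, 0.9, 0.8, 0.7, 0.2, 0.6, 0.1, 0.9, 0.9, 0.9, 0.9, 0.9]
print(speech_segments(probs, frame_sec=0.5))  # [(3.5, 6.0)]
```

The two shorter speech runs (1.5 s and 0.5 s) clear the probability threshold but are removed by the minimum-duration check, so only the 2.5 s run survives.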

Related Topics

  • ReadSpeech Tutorial — end-to-end walkthrough of AudioDataFilterStage on the DNS Challenge ReadSpeech dataset.
  • Quality Assessment — WER and duration filters for ASR-based curation.
  • Audio Concepts — audio task model, manifests, and pipeline architecture.