
UTMOS Filter


Filter audio segments based on their predicted Mean Opinion Score (MOS) using the utmos22_strong model. UTMOS is the primary perceptual-quality predictor in the audio quality-filtering pipeline.

Understanding UTMOS

What MOS Measures

Mean Opinion Score is a 1.0–5.0 perceptual-quality scale originally defined for human listening tests. UTMOS is a learned no-reference predictor that produces an MOS estimate directly from waveform input — no clean reference signal required, unlike PESQ or POLQA.

| MOS Range | Quality Level | Recommended Use |
| --- | --- | --- |
| 4.0–5.0 | Excellent | High-quality TTS / voice cloning training data |
| 3.5–4.0 | Good | General ASR / TTS training (default threshold range) |
| 3.0–3.5 | Acceptable | Permissive thresholds for large web-scraped datasets |
| 2.0–3.0 | Poor | Review required; usually filtered out |
| < 2.0 | Bad | Strong candidate for removal |

A common starting point is mos_threshold=3.5, which drops obviously distorted, noisy, or clipped audio while keeping most usable training material.

When to Use UTMOS vs SIGMOS

  • UTMOS produces a single composite quality score. Use it as the first, cheap filter to drop obviously bad audio.
  • SIGMOS produces seven independent dimension scores (noise, signal, reverb, etc.). Use it after UTMOS for fine-grained control over which kinds of degradation to allow.

In a typical pipeline both are stacked: UTMOS first as a coarse cut, SIGMOS second to enforce specific quality requirements.
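As a plain-Python illustration of that stacking (the function and dimension names here are hypothetical, not the NeMo Curator API), the two-stage decision reduces to a coarse composite cut followed by per-dimension checks:

```python
def passes_quality_gates(utmos_mos, sigmos_scores,
                         mos_threshold=3.5,
                         sigmos_thresholds=None):
    """Illustrative two-stage gate: coarse UTMOS cut first,
    then per-dimension SIGMOS floors on the survivors."""
    # Stage 1: cheap composite-score cut.
    if utmos_mos < mos_threshold:
        return False
    # Stage 2: fine-grained per-dimension thresholds.
    for dim, floor in (sigmos_thresholds or {}).items():
        if sigmos_scores.get(dim, 0.0) < floor:
            return False
    return True

# A segment can pass the UTMOS cut yet fail a SIGMOS noise floor:
keep = passes_quality_gates(
    utmos_mos=3.8,
    sigmos_scores={"noise": 2.9, "ovrl": 3.4},
    sigmos_thresholds={"noise": 4.0, "ovrl": 3.0},
)  # → False
```

Ordering the cheap gate first means most bad segments never reach the more expensive seven-dimension scoring.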

Basic UTMOS Filtering

Step 1: Configure the Stage

```python
from nemo_curator.stages.audio.filtering.utmos import UTMOSFilterStage

utmos = UTMOSFilterStage(mos_threshold=3.5)

pipeline.add_stage(utmos)
```

The stage accepts either an in-memory waveform (waveform + sample_rate) or a path (audio_filepath). Multi-channel input is automatically converted to mono, and any sample rate is resampled to 16 kHz before scoring.
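The two accepted input forms can be pictured as manifest entries. The resolution logic below is a sketch for illustration, not the stage's actual implementation; the field names follow the description above:

```python
def resolve_audio_input(entry: dict):
    """Sketch: prefer an in-memory waveform, fall back to a file path."""
    if "waveform" in entry and "sample_rate" in entry:
        return ("waveform", entry["sample_rate"])
    if "audio_filepath" in entry:
        return ("filepath", None)
    raise ValueError("entry carries neither a waveform nor a filepath")

# In-memory form: waveform samples plus their native sample rate.
kind, sr = resolve_audio_input(
    {"waveform": [0.0, 0.1, -0.1], "sample_rate": 44100}
)  # → ("waveform", 44100)

# Path form: the stage loads the file itself.
kind2, _ = resolve_audio_input({"audio_filepath": "clip_001.wav"})
```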

Step 2: Inspect the MOS Distribution Before Filtering

For unfamiliar datasets, run UTMOS in score-only mode first by setting mos_threshold=None:

```python
# Score every segment without filtering
pipeline.add_stage(UTMOSFilterStage(mos_threshold=None))
```

Export the resulting manifest with AudioToDocumentStage + JsonlWriter, then plot the utmos_mos distribution (in pandas, numpy, or your preferred tool) before choosing a real threshold. This avoids over-filtering datasets that are systematically lower-quality than UTMOS’s training distribution.
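A minimal sketch of that inspection step, assuming each exported JSONL line carries a utmos_mos field (pandas or matplotlib would work equally well for larger datasets):

```python
import json
import statistics

def mos_summary(jsonl_lines):
    """Summarize the utmos_mos distribution from exported manifest lines."""
    scores = sorted(json.loads(line)["utmos_mos"] for line in jsonl_lines)
    n = len(scores)
    return {
        "count": n,
        "mean": statistics.fmean(scores),
        "p10": scores[int(0.10 * (n - 1))],
        "median": scores[n // 2],
        "p90": scores[int(0.90 * (n - 1))],
    }

# Toy manifest standing in for the JsonlWriter output:
lines = [json.dumps({"utmos_mos": m}) for m in (2.1, 3.0, 3.4, 3.7, 4.2)]
summary = mos_summary(lines)  # median 3.4, mean 3.28
```

If the 10th percentile already sits near your candidate threshold, the cut is mild; if the median falls below it, you are about to discard most of the dataset.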

Step 3: Apply the Tuned Threshold

```python
pipeline.add_stage(UTMOSFilterStage(mos_threshold=3.5))
```

Segments with predicted MOS below mos_threshold are dropped; segments at or above the threshold pass through unchanged.

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| mos_threshold | float \| None | 3.5 | Minimum MOS to keep. Set to None to score without filtering (useful for distribution analysis). |
| sample_rate | int | 16000 | Target sample rate for UTMOS inference. The model is trained at 16 kHz; do not change unless you have a custom checkpoint. |

The default resource allocation is Resources(cpus=1.0, gpus=0.5). UTMOS is small; fractional-GPU allocation lets it share a device with other inference stages.

Behavior Notes

  • Model fetch: the model is downloaded via torch.hub from tarepan/SpeechMOS:v1.2.0 on first use.
  • Offline environments: if torch.hub access is unavailable, the stage logs the error and passes the input through unchanged. For air-gapped environments, pre-download the model on a machine with network access and point the TORCH_HOME environment variable at the cached copy.
  • Multi-channel handling: stereo and multi-channel input is converted to mono internally before scoring; you do not need to insert MonoConversionStage solely for UTMOS.
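The internal downmix can be pictured as a per-sample average across channels (the 16 kHz resampling that follows is not shown). This is an illustrative sketch, not the stage's code:

```python
def downmix_to_mono(channels):
    """Average corresponding samples across all channels."""
    return [sum(samples) / len(samples) for samples in zip(*channels)]

# Stereo example: two channels, three samples each.
left = [0.5, 0.25, -0.5]
right = [0.5, 0.75, 0.0]
mono = downmix_to_mono([left, right])  # → [0.5, 0.5, -0.25]
```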

Domain-Specific Tuning

Voice Cloning / TTS

TTS training quality is sensitive to background noise, breath sounds, and clipping. Use a strict threshold:

```python
UTMOSFilterStage(mos_threshold=4.0)
```

General ASR

ASR is more robust to mild quality degradation than TTS. Default works well:

```python
UTMOSFilterStage(mos_threshold=3.5)
```

Web-Scraped Audio (Permissive)

Web crawls often have systematically lower audio quality. Lowering the threshold preserves more data; pair with stricter SIGMOS thresholds for targeted dimensions:

```python
UTMOSFilterStage(mos_threshold=3.0)
# Then SIGMOSFilterStage(noise_threshold=4.0, ovrl_threshold=3.0) downstream
```

Complete UTMOS Pipeline Example

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.backends.xenna import XennaExecutor
from nemo_curator.stages.audio.preprocessing.mono_conversion import MonoConversionStage
from nemo_curator.stages.audio.segmentation.vad_segmentation import VADSegmentationStage
from nemo_curator.stages.audio.filtering.utmos import UTMOSFilterStage
from nemo_curator.stages.audio.io.convert import AudioToDocumentStage
from nemo_curator.stages.text.io.writer import JsonlWriter

pipeline = Pipeline(name="utmos_filtering")

# 1. Normalize input
pipeline.add_stage(MonoConversionStage(output_sample_rate=48000))

# 2. Segment into speech chunks
pipeline.add_stage(VADSegmentationStage(min_duration_sec=2.0))

# 3. Filter by UTMOS (drop MOS < 3.5)
pipeline.add_stage(UTMOSFilterStage(mos_threshold=3.5))

# 4. Export filtered manifest
pipeline.add_stage(AudioToDocumentStage())
pipeline.add_stage(JsonlWriter(path="./utmos_filtered"))

executor = XennaExecutor()
pipeline.run(executor)
```

Best Practices

  • Inspect before filtering: always run with mos_threshold=None first on a representative sample. Pick the threshold from the actual distribution, not from the table above.
  • Stack UTMOS before SIGMOS: UTMOS is cheaper than SIGMOS (single score vs seven dimensions). Run UTMOS first as a coarse cut, then SIGMOS for fine-grained dimension filtering.
  • Match threshold to downstream model: TTS (4.0+), ASR (3.5), permissive curation (3.0). The expected use of the data dictates the threshold.
  • Don’t change sample_rate: the UTMOS model is trained at 16 kHz. Override only with a custom checkpoint trained at a different rate.
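One way to turn "pick the threshold from the actual distribution" into code is to choose the MOS value that retains a target fraction of segments. This is a sketch; the retention target is an assumption you set per project:

```python
def threshold_for_retention(scores, keep_fraction):
    """Return the MOS threshold that keeps roughly keep_fraction of segments."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    ranked = sorted(scores, reverse=True)  # best first
    cutoff_index = max(0, round(keep_fraction * len(ranked)) - 1)
    return ranked[cutoff_index]

# Keep the best ~70% of a small sample of predicted MOS values:
scores = [2.4, 2.9, 3.1, 3.3, 3.6, 3.8, 4.0, 4.1, 4.3, 4.6]
threshold = threshold_for_retention(scores, keep_fraction=0.7)  # → 3.3
```

The resulting value can then be passed as mos_threshold, after sanity-checking it against the quality-level table above.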