UTMOS Filter
Filter audio segments based on their predicted Mean Opinion Score (MOS) using the utmos22_strong model. UTMOS is the primary perceptual-quality predictor in the audio quality-filtering pipeline.
Understanding UTMOS
What MOS Measures
Mean Opinion Score is a 1.0–5.0 perceptual-quality scale originally defined for human listening tests. UTMOS is a learned no-reference predictor that produces an MOS estimate directly from waveform input — no clean reference signal required, unlike PESQ or POLQA.
A common starting point is `mos_threshold=3.5`: it drops obviously distorted, noisy, or clipped audio while keeping most usable training material.
When to Use UTMOS vs SIGMOS
- UTMOS produces a single composite quality score. Use it as the first cheap filter to drop obviously-bad audio.
- SIGMOS produces seven independent dimension scores (noise, signal, reverb, etc.). Use it after UTMOS for fine-grained control over which kinds of degradation to allow.
In a typical pipeline both are stacked: UTMOS first as a coarse cut, SIGMOS second to enforce specific quality requirements.
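The stacked ordering can be sketched with plain functions standing in for the two stages. This is a minimal illustration only: the `utmos_mos` field name comes from this page, while the SIGMOS field and threshold names are invented for the example.

```python
# Cheap coarse cut first: one composite score per segment.
def utmos_pass(seg, threshold=3.5):
    return seg["utmos_mos"] >= threshold

# Fine-grained second pass: per-dimension requirements
# (dimension field names here are illustrative, not the real schema).
def sigmos_pass(seg, min_noise=3.0, min_reverb=3.0):
    return seg["sig_noise"] >= min_noise and seg["sig_reverb"] >= min_reverb

segments = [
    {"utmos_mos": 2.4, "sig_noise": 4.0, "sig_reverb": 4.0},  # fails coarse cut
    {"utmos_mos": 3.8, "sig_noise": 2.5, "sig_reverb": 4.0},  # fails noise dimension
    {"utmos_mos": 3.8, "sig_noise": 3.5, "sig_reverb": 3.5},  # passes both
]
kept = [s for s in segments if utmos_pass(s) and sigmos_pass(s)]
print(len(kept))  # 1
```

Running UTMOS first means the more expensive per-dimension scoring only sees segments that already cleared the composite cut.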
Basic UTMOS Filtering
Step 1: Configure the Stage
The stage accepts either an in-memory waveform (`waveform` + `sample_rate`) or a path (`audio_filepath`). Multi-channel input is automatically converted to mono, and any sample rate is resampled to 16 kHz before scoring.
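A minimal configuration sketch. `UTMOSFilterStage` is a hypothetical stand-in name, defined inline as a dataclass so the example is self-contained; only the `mos_threshold` parameter is taken from this page.

```python
from dataclasses import dataclass
from typing import Optional

# Stand-in for the stage's configuration surface; substitute the real
# class from your pipeline. Only mos_threshold comes from this page.
@dataclass
class UTMOSFilterStage:
    mos_threshold: Optional[float] = 3.5  # None = score-only, no filtering

# Upstream input may be waveform + sample_rate or audio_filepath; either
# way it is mono-mixed and resampled to 16 kHz before scoring.
stage = UTMOSFilterStage(mos_threshold=3.5)
```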
Step 2: Inspect the MOS Distribution Before Filtering
For unfamiliar datasets, run UTMOS in score-only mode first by setting `mos_threshold=None`:
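For example (hypothetical `UTMOSFilterStage` name, as elsewhere on this page):

```python
# Score-only mode: annotate every segment with utmos_mos, drop nothing.
utmos_probe = UTMOSFilterStage(mos_threshold=None)
```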
Export the resulting manifest with AudioToDocumentStage + JsonlWriter, then plot the utmos_mos distribution (in pandas, numpy, or your preferred tool) before choosing a real threshold. This avoids over-filtering datasets that are systematically lower-quality than UTMOS’s training distribution.
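The inspection step can be sketched with only the standard library; inline records stand in for the exported JSONL manifest (the `utmos_mos` field name comes from this page, the values are illustrative).

```python
import json
from statistics import quantiles

# Toy stand-in for the manifest exported via AudioToDocumentStage + JsonlWriter;
# each line is one segment record carrying the score added in score-only mode.
manifest_lines = [
    '{"audio_filepath": "a.wav", "utmos_mos": 2.1}',
    '{"audio_filepath": "b.wav", "utmos_mos": 3.4}',
    '{"audio_filepath": "c.wav", "utmos_mos": 3.9}',
    '{"audio_filepath": "d.wav", "utmos_mos": 4.3}',
]

scores = [json.loads(line)["utmos_mos"] for line in manifest_lines]

# Quartiles of the MOS distribution; pick the threshold from these numbers,
# not from a generic recommendation.
q1, median, q3 = quantiles(scores, n=4)
print(min(scores), q1, median, q3, max(scores))
```

On a real dataset, read the exported `.jsonl` files line by line instead of the inline list, and plot a histogram before committing to a threshold.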
Step 3: Apply the Tuned Threshold
Segments with predicted MOS below mos_threshold are dropped; segments at or above the threshold pass through unchanged.
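The keep/drop rule can be written as a plain predicate (an illustration of the semantics, not the stage's actual implementation); note that a score exactly at the threshold passes.

```python
# A segment survives iff its predicted MOS is >= the threshold.
def passes_utmos(segment: dict, mos_threshold: float = 3.5) -> bool:
    return segment["utmos_mos"] >= mos_threshold

segments = [
    {"id": "s1", "utmos_mos": 2.8},  # dropped: below threshold
    {"id": "s2", "utmos_mos": 3.5},  # kept: exactly at threshold passes
    {"id": "s3", "utmos_mos": 4.1},  # kept
]
kept = [s["id"] for s in segments if passes_utmos(s)]
print(kept)  # ['s2', 's3']
```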
Parameters
The default resource allocation is Resources(cpus=1.0, gpus=0.5). UTMOS is small; fractional-GPU allocation lets it share a device with other inference stages.
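The allocation can be sketched as follows; `Resources` is defined inline here as a stand-in mirroring the spec quoted above, not imported from the real package.

```python
from dataclasses import dataclass

# Stand-in mirroring Resources(cpus=1.0, gpus=0.5) quoted above.
@dataclass
class Resources:
    cpus: float = 1.0
    gpus: float = 0.5

# Default: half a GPU, so UTMOS can share a device with another stage.
default = Resources()

# Example override: a full GPU when the stage runs alone.
dedicated = Resources(cpus=1.0, gpus=1.0)
```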
Behavior Notes
- Model fetch: the model is downloaded via `torch.hub` from `tarepan/SpeechMOS:v1.2.0` on first use.
- Offline environments: if `torch.hub` access is unavailable, the stage logs the error and passes the input through unchanged. Pre-cache the model in an air-gapped environment by setting the `TORCH_HOME` environment variable.
- Multi-channel handling: stereo and multi-channel input is converted to mono internally before scoring; you do not need to insert `MonoConversionStage` solely for UTMOS.
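One way to pre-cache the model, using the `torch.hub` coordinates from the note above (run on a machine with network access, then copy the cache directory to the air-gapped host and set `TORCH_HOME` there to the same path):

```python
import os

# Point the hub cache at a directory you can ship to the offline machine.
os.environ["TORCH_HOME"] = "/opt/models/torch"

import torch

# Downloads utmos22_strong into $TORCH_HOME/hub on first call; later calls,
# including offline ones, load from that cache.
model = torch.hub.load("tarepan/SpeechMOS:v1.2.0", "utmos22_strong", trust_repo=True)
```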
Domain-Specific Tuning
Voice Cloning / TTS
TTS training quality is sensitive to background noise, breath sounds, and clipping. Use a strict threshold:
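For example, matching the 4.0+ guidance in Best Practices below (hypothetical `UTMOSFilterStage` name, as elsewhere on this page):

```python
# Strict cut for voice cloning / TTS training data.
tts_utmos = UTMOSFilterStage(mos_threshold=4.0)
```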
General ASR
ASR is more robust to mild quality degradation than TTS. Default works well:
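For example (hypothetical `UTMOSFilterStage` name):

```python
# Default threshold is appropriate for general ASR curation.
asr_utmos = UTMOSFilterStage(mos_threshold=3.5)
```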
Web-Scraped Audio (Permissive)
Web crawls often have systematically lower audio quality. Lowering the threshold preserves more data; pair with stricter SIGMOS thresholds for targeted dimensions:
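For example (hypothetical stage and parameter names; the SIGMOS dimension thresholds shown here are placeholders, not documented defaults):

```python
# Permissive composite cut, tightened per-dimension afterwards.
web_utmos = UTMOSFilterStage(mos_threshold=3.0)
web_sigmos = SIGMOSFilterStage(noise_threshold=3.5, reverb_threshold=3.5)
```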
Complete UTMOS Pipeline Example
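A sketch of an end-to-end pipeline. `Pipeline`, `VADSegmentationStage`, `SIGMOSFilterStage`, `UTMOSFilterStage`, and the `output_dir` parameter are hypothetical stand-ins; `AudioToDocumentStage` and `JsonlWriter` are the export stages referenced in Step 2. Adapt the names to your pipeline's actual API.

```python
# All names except AudioToDocumentStage and JsonlWriter are stand-ins.
pipeline = Pipeline(
    stages=[
        VADSegmentationStage(),               # upstream: produce speech segments
        UTMOSFilterStage(mos_threshold=3.5),  # coarse perceptual-quality cut
        SIGMOSFilterStage(),                  # optional fine-grained second pass
        AudioToDocumentStage(),               # audio records -> document records
        JsonlWriter(output_dir="filtered/"),  # write the filtered manifest
    ]
)
pipeline.run()
```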
Best Practices
- Inspect before filtering: always run with `mos_threshold=None` first on a representative sample. Pick the threshold from the actual distribution, not from the table above.
- Stack UTMOS before SIGMOS: UTMOS is cheaper than SIGMOS (single score vs seven dimensions). Run UTMOS first as a coarse cut, then SIGMOS for fine-grained dimension filtering.
- Match threshold to downstream model: TTS (4.0+), ASR (3.5), permissive curation (3.0). The expected use of the data dictates the threshold.
- Don’t change `sample_rate`: the UTMOS model is trained at 16 kHz. Override only with a custom checkpoint trained at a different rate.
Related Topics
- SIGMOS Filter — independent perceptual-quality dimensions; commonly stacked after UTMOS.
- VAD Segmentation — typical upstream stage producing the segments UTMOS scores.
- AudioDataFilterStageComposite — bundles UTMOS into the standard pipeline.