Filter audio segments based on their predicted Mean Opinion Score (MOS) using the utmos22_strong model. UTMOS is the primary perceptual-quality predictor in the audio quality-filtering pipeline.
Mean Opinion Score is a 1.0–5.0 perceptual-quality scale originally defined for human listening tests. UTMOS is a learned no-reference predictor that produces an MOS estimate directly from waveform input — no clean reference signal required, unlike PESQ or POLQA.
A common starting point is mos_threshold=3.5, which drops obviously distorted, noisy, or clipped audio while keeping most usable training material.
In a typical pipeline UTMOS and SIGMOS are stacked: UTMOS first as a coarse cut, then SIGMOS to enforce specific quality requirements.
The stage accepts either an in-memory waveform (waveform + sample_rate) or a path (audio_filepath). Multi-channel input is automatically converted to mono, and any sample rate is resampled to 16 kHz before scoring.
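The preprocessing described above can be illustrated with a minimal pure-Python sketch. It is not the stage's implementation: the function names to_mono and resample_linear are hypothetical, and linear interpolation is a deliberately naive stand-in with no anti-aliasing filter. A real pipeline would more likely use a band-limited resampler such as torchaudio.functional.resample.

```python
def to_mono(channels):
    """Average equal-length channels sample by sample into one mono track."""
    n_channels = len(channels)
    return [sum(frame) / n_channels for frame in zip(*channels)]


def resample_linear(samples, src_rate, dst_rate=16_000):
    """Naive linear-interpolation resampler (illustration only, no anti-aliasing)."""
    if src_rate == dst_rate or not samples:
        return list(samples)
    out_len = max(1, round(len(samples) * dst_rate / src_rate))
    step = src_rate / dst_rate  # distance in source samples per output sample
    last = len(samples) - 1
    out = []
    for i in range(out_len):
        pos = i * step
        lo = min(int(pos), last)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


# Stereo input at 32 kHz: downmix first, then bring it to 16 kHz.
stereo = [[0.0, 1.0, 0.0, -1.0], [0.0, 0.0, 0.0, 0.0]]
mono_16k = resample_linear(to_mono(stereo), src_rate=32_000)
```

Downmixing before resampling keeps the work proportional to one channel rather than to all of them.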
For unfamiliar datasets, run UTMOS in score-only mode first by setting mos_threshold=None.
Export the resulting manifest with AudioToDocumentStage + JsonlWriter, then plot the utmos_mos distribution (in pandas, numpy, or your preferred tool) before choosing a real threshold. This avoids over-filtering datasets that are systematically lower-quality than UTMOS’s training distribution.
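As a rough sketch of that inspection step, the snippet below summarizes the score distribution using only the Python standard library. It assumes the exported manifest holds one JSON object per line with a utmos_mos field; the helper names load_scores, summarize, and kept_fraction are hypothetical and not part of the pipeline.

```python
import json
import statistics


def load_scores(jsonl_lines, field="utmos_mos"):
    """Collect numeric scores from JSONL lines, skipping blanks and missing fields."""
    scores = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        value = json.loads(line).get(field)
        if value is not None:
            scores.append(float(value))
    return scores


def summarize(scores):
    """Return count, extremes, median, and the 10th/90th percentiles."""
    if len(scores) < 2:
        raise ValueError("need at least two scores to compute quantiles")
    ordered = sorted(scores)
    deciles = statistics.quantiles(ordered, n=10, method="inclusive")
    return {
        "count": len(ordered),
        "min": ordered[0],
        "p10": deciles[0],
        "median": statistics.median(ordered),
        "p90": deciles[-1],
        "max": ordered[-1],
    }


def kept_fraction(scores, threshold):
    """Fraction of segments that would survive a given threshold (score >= threshold)."""
    return sum(score >= threshold for score in scores) / len(scores)
```

Given a file handle over the exported manifest, load_scores(handle) followed by kept_fraction(scores, 3.5) shows how much data a candidate threshold would keep before any filtering run is committed.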
Segments with predicted MOS below mos_threshold are dropped; segments at or above the threshold pass through unchanged.
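The pass/drop rule can be written down in a few lines. This is a sketch of the semantics stated above, not the stage's code: passes_mos_filter and filter_segments are hypothetical names, segments are modeled as plain dicts carrying a utmos_mos field, and treating mos_threshold=None as "keep everything" is an assumption based on the score-only mode described earlier.

```python
from typing import Optional


def passes_mos_filter(mos: float, mos_threshold: Optional[float]) -> bool:
    """True if the segment is kept; a None threshold means score-only mode."""
    if mos_threshold is None:
        return True
    # At-or-above passes, so a score exactly equal to the threshold is kept.
    return mos >= mos_threshold


def filter_segments(segments, mos_threshold, field="utmos_mos"):
    """Return the kept segments unchanged and in their original order."""
    return [seg for seg in segments if passes_mos_filter(seg[field], mos_threshold)]


segments = [
    {"id": "a", "utmos_mos": 2.9},
    {"id": "b", "utmos_mos": 3.5},
    {"id": "c", "utmos_mos": 4.2},
]
kept = filter_segments(segments, mos_threshold=3.5)
```

The boundary case matters when comparing runs: segment "b" scores exactly 3.5 and is retained.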
The default resource allocation is Resources(cpus=1.0, gpus=0.5). UTMOS is small; fractional-GPU allocation lets it share a device with other inference stages.
The model is loaded through torch.hub from tarepan/SpeechMOS:v1.2.0 on first use.
If torch.hub access is unavailable, the stage logs the error and passes the input through unchanged. For an air-gapped environment, pre-cache the model and set the TORCH_HOME environment variable to the cache location.
A separate MonoConversionStage is not needed solely for UTMOS, because the stage converts multi-channel input to mono itself.
TTS training quality is sensitive to background noise, breath sounds, and clipping. Use a strict threshold.
ASR is more robust to mild quality degradation than TTS, so the default threshold works well.
Web crawls often have systematically lower audio quality. Lowering the threshold preserves more data; pair it with stricter SIGMOS thresholds for targeted dimensions.
Run with mos_threshold=None first on a representative sample. Pick the threshold from the actual distribution, not from the table above.
sample_rate: the UTMOS model is trained at 16 kHz. Override only with a custom checkpoint trained at a different rate.
AudioDataFilterStage Composite: bundles UTMOS into the standard pipeline.
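For the air-gapped case mentioned earlier, one possible way to pre-cache the model is sketched below. The helper names configure_torch_cache and prefetch_utmos are hypothetical. The torch.hub.load call follows the usage published for tarepan/SpeechMOS, but whether the stage resolves the model from the same cache layout is an assumption to verify in your environment.

```python
import os
from pathlib import Path


def configure_torch_cache(cache_dir):
    """Create the cache directory and point TORCH_HOME at it.

    Call this before anything triggers a torch.hub download, so the
    hub files land under <cache_dir>/hub.
    """
    path = Path(cache_dir).expanduser().resolve()
    path.mkdir(parents=True, exist_ok=True)
    os.environ["TORCH_HOME"] = str(path)
    return path


def prefetch_utmos():
    """Download the UTMOS predictor once, on a machine with network access."""
    import torch  # imported lazily so the cache setup works without torch installed

    return torch.hub.load(
        "tarepan/SpeechMOS:v1.2.0", "utmos22_strong", trust_repo=True
    )
```

A typical flow would be to run configure_torch_cache and prefetch_utmos on a connected machine, copy the cache directory to the offline host, and export TORCH_HOME there to the copied location before starting the pipeline.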