Filter audio segments using SIGMOS (Signal-based Mean Opinion Score) — a multi-dimensional perceptual-quality model that produces seven independent scores per audio clip. Unlike UTMOS (a single composite MOS), SIGMOS lets you target specific kinds of degradation independently.
Each dimension is independently configurable on a 0.0–5.0 scale (higher = better). Setting any threshold to None disables that dimension; a segment passes only if all active thresholds are met.
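The pass rule above can be sketched in a few lines of plain Python. This is an illustration of the logic only, not the library's implementation: the function name `passes` and the bare dimension keys are hypothetical, and it assumes a threshold is "met" when the score is greater than or equal to it.

```python
from typing import Mapping, Optional


def passes(scores: Mapping[str, float],
           thresholds: Mapping[str, Optional[float]]) -> bool:
    """Return True only if every active (non-None) threshold is met.

    Assumption: "met" means score >= threshold.
    """
    for dim, minimum in thresholds.items():
        if minimum is None:
            # A None threshold disables the dimension entirely.
            continue
        if scores[dim] < minimum:
            return False
    return True


# Only noise and ovrl are active here; reverb is disabled and never checked.
thresholds = {"noise": 4.0, "ovrl": 3.5, "reverb": None}
print(passes({"noise": 4.2, "ovrl": 3.6, "reverb": 1.0}, thresholds))
print(passes({"noise": 4.2, "ovrl": 3.4, "reverb": 4.9}, thresholds))
```

A disabled dimension cannot cause a drop, however low its score is.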
The table below provides starting points; tune by inspecting per-dimension distributions on your data.
The default configuration enables only noise_threshold=4.0 and ovrl_threshold=3.5; every other threshold defaults to None. Activate additional dimensions only when you are targeting a specific failure mode in your data.
Leaving every threshold at its default still filters on noise and overall quality. To run SIGMOS in score-only mode and capture all seven dimensions for analysis, disable filtering by setting the two active defaults to None:
Each output AudioTask will carry seven new fields (sigmos_noise, sigmos_ovrl, etc.) regardless of which thresholds are active.
Export the scored manifest and inspect distributions per dimension:
Use the percentiles to choose thresholds — for example, set noise_threshold at the 25th percentile to drop the bottom quarter of the data on noise.
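One way to compute those percentiles with the standard library is sketched below. It assumes the exported manifest is a JSON Lines file with one record per segment; `load_manifest` and `quartiles` are hypothetical helper names, not part of the library.

```python
import json
import statistics
from typing import Dict, Iterable, List, Mapping


def load_manifest(path: str) -> List[dict]:
    """Read a JSON Lines manifest (assumed format) into a list of records."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def quartiles(records: Iterable[Mapping], field: str) -> Dict[str, float]:
    """Return the 25th, 50th, and 75th percentiles of one score field."""
    values = [r[field] for r in records if r.get(field) is not None]
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"p25": q1, "p50": q2, "p75": q3}


# In-memory stand-in for a scored manifest.
records = [{"sigmos_noise": v} for v in (1.0, 2.0, 3.0, 4.0, 5.0)]
stats = quartiles(records, "sigmos_noise")
# Using p25 as noise_threshold would drop roughly the bottom quarter on noise.
print(stats["p25"])
```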
A segment is dropped if any active threshold fails. Setting any threshold to None disables that dimension.
The default resource allocation is Resources(cpus=1.0, gpus=0.5).
TTS training is sensitive to noise, reverb, and clipping. Activate the relevant dimensions:
Far-field recordings have heavy reverb and variable noise. Loosen reverb but tighten signal cleanliness:
Web audio is heterogeneous. Start permissive and tighten dimensions one at a time after inspecting failure modes:
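To see which dimension is responsible for most drops before tightening anything, you can count failures per dimension on a scored manifest. The sketch below is a hypothetical helper, not a library function. It assumes that each threshold named `<dim>_threshold` corresponds to a score field named `sigmos_<dim>` and that a score fails when it is below the threshold; the values are illustrative only.

```python
from typing import Dict, Iterable, Mapping, Optional


def failure_counts(records: Iterable[Mapping[str, float]],
                   thresholds: Mapping[str, Optional[float]]) -> Dict[str, int]:
    """Count, per active dimension, how many records fall below its threshold."""
    counts = {dim: 0 for dim, t in thresholds.items() if t is not None}
    for rec in records:
        for dim in counts:
            # Assumed naming: threshold "noise" pairs with field "sigmos_noise".
            if rec[f"sigmos_{dim}"] < thresholds[dim]:
                counts[dim] += 1
    return counts


# Illustrative scores and thresholds, not recommended settings.
records = [
    {"sigmos_noise": 4.5, "sigmos_ovrl": 3.9},
    {"sigmos_noise": 3.2, "sigmos_ovrl": 3.6},
    {"sigmos_noise": 3.0, "sigmos_ovrl": 2.8},
]
print(failure_counts(records, {"noise": 4.0, "ovrl": 3.5, "reverb": None}))
```

A dimension that accounts for nearly all failures is the one to inspect by ear before tightening it further.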
A pipeline that stacks UTMOS (cheap) and SIGMOS (fine-grained):
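A minimal sketch of that stacking idea, using hypothetical scorer callables rather than the library's stage classes: the cheap UTMOS check runs first, and the fine-grained SIGMOS scorer is called only on clips that survive it. The function names, the score dictionaries, and the threshold values are all assumptions made for illustration.

```python
from typing import Callable, Iterable, List, Mapping, Optional


def passes(scores: Mapping[str, float],
           thresholds: Mapping[str, Optional[float]]) -> bool:
    """True if every active (non-None) threshold is met (score >= threshold)."""
    return all(scores[dim] >= t for dim, t in thresholds.items() if t is not None)


def cascade(clips: Iterable[str],
            utmos_score: Callable[[str], float],
            utmos_min: float,
            sigmos_scores: Callable[[str], Mapping[str, float]],
            sigmos_thresholds: Mapping[str, Optional[float]]) -> List[str]:
    """Keep clips that pass the cheap check and then the fine-grained check."""
    kept = []
    for clip in clips:
        if utmos_score(clip) < utmos_min:
            continue  # Rejected early; the second scorer is never called.
        if passes(sigmos_scores(clip), sigmos_thresholds):
            kept.append(clip)
    return kept


# Stand-in scorers backed by lookup tables (illustrative values only).
fake_utmos = {"a.wav": 4.0, "b.wav": 2.0, "c.wav": 3.8}.get
fake_sigmos = {
    "a.wav": {"noise": 4.5, "ovrl": 3.9},
    "c.wav": {"noise": 3.0, "ovrl": 3.9},
}.__getitem__

kept = cascade(["a.wav", "b.wav", "c.wav"], fake_utmos, 3.5,
               fake_sigmos, {"noise": 4.0, "ovrl": 3.5})
print(kept)
```

Ordering the stages this way means the more detailed model only spends compute on clips that the single-score filter has not already rejected.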
AudioDataFilterStage composite: bundles SIGMOS with the standard defaults.