SIGMOS Filter

Filter audio segments using SIGMOS (Signal-based Mean Opinion Score) — a multi-dimensional perceptual-quality model that produces seven independent scores per audio clip. Unlike UTMOS (a single composite MOS), SIGMOS lets you target specific kinds of degradation independently.

Understanding SIGMOS

The Seven Quality Dimensions

Each dimension is independently configurable on a 0.0–5.0 scale (higher = better). Setting any threshold to None disables that dimension; a segment passes only if all active thresholds are met.

Dimension      Field          Threshold Param   What it Measures
Noise          sigmos_noise   noise_threshold   Background noise floor (higher score = quieter background).
Overall        sigmos_ovrl    ovrl_threshold    Aggregate quality, similar to UTMOS but on a different scale.
Signal         sigmos_sig     sig_threshold     Cleanliness of the speech signal itself.
Coloration     sigmos_col     col_threshold     Spectral coloration / EQ artifacts (e.g., telephony narrowing).
Discontinuity  sigmos_disc    disc_threshold    Glitches, dropouts, click and pop artifacts.
Loudness       sigmos_loud    loud_threshold    Perceived loudness consistency.
Reverb         sigmos_reverb  reverb_threshold  Reverberation amount (higher = drier, less echoey).
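
The pass rule described above can be sketched in plain Python. This is an illustration only, not the stage's implementation: the helper name passes and the sample scores are hypothetical, and it assumes "met" means score >= threshold (the stage may use a strict comparison).

```python
from typing import Mapping, Optional

# Output field -> threshold parameter, as listed in the table above.
FIELD_TO_PARAM = {
    "sigmos_noise": "noise_threshold",
    "sigmos_ovrl": "ovrl_threshold",
    "sigmos_sig": "sig_threshold",
    "sigmos_col": "col_threshold",
    "sigmos_disc": "disc_threshold",
    "sigmos_loud": "loud_threshold",
    "sigmos_reverb": "reverb_threshold",
}


def passes(scores: Mapping[str, float],
           thresholds: Mapping[str, Optional[float]]) -> bool:
    """Return True when every active (non-None) threshold is met.

    Assumption: a threshold is met when score >= threshold.
    """
    for field, param in FIELD_TO_PARAM.items():
        minimum = thresholds.get(param)
        if minimum is None:  # missing or None: dimension disabled
            continue
        if scores[field] < minimum:
            return False
    return True


# Hypothetical scores for one segment (illustrative values only).
segment = {
    "sigmos_noise": 4.2, "sigmos_ovrl": 3.6, "sigmos_sig": 3.9,
    "sigmos_col": 3.1, "sigmos_disc": 4.4, "sigmos_loud": 3.8,
    "sigmos_reverb": 2.7,
}

defaults = {"noise_threshold": 4.0, "ovrl_threshold": 3.5}
with_reverb = {**defaults, "reverb_threshold": 3.0}

print(passes(segment, defaults))     # True: both active thresholds are met
print(passes(segment, with_reverb))  # False: reverb 2.7 is below 3.0
```

The same segment is kept or dropped depending only on which dimensions are active, which is the behavior a single composite score cannot express.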

Threshold Guidelines

The table below provides starting points; tune by inspecting per-dimension distributions on your data.

Threshold Param   Permissive  Default  Strict
noise_threshold   3.5         4.0      4.5
ovrl_threshold    3.0         3.5      4.0
sig_threshold     None        None     3.5
col_threshold     None        None     3.0
disc_threshold    None        None     4.0
loud_threshold    None        None     3.0
reverb_threshold  None        None     3.0

The default configuration enables only noise_threshold=4.0 and ovrl_threshold=3.5. Activate additional dimensions only to target a specific failure mode in your data.
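
One way to keep these starting points next to your pipeline code is a small preset table. The sketch below copies the values from the guideline table; the names PRESETS and active_thresholds are hypothetical helpers, not part of the library.

```python
from typing import Dict, Optional

Preset = Dict[str, Optional[float]]

# Values copied from the guideline table; None leaves a dimension disabled.
PRESETS: Dict[str, Preset] = {
    "permissive": {
        "noise_threshold": 3.5, "ovrl_threshold": 3.0,
        "sig_threshold": None, "col_threshold": None,
        "disc_threshold": None, "loud_threshold": None,
        "reverb_threshold": None,
    },
    "default": {
        "noise_threshold": 4.0, "ovrl_threshold": 3.5,
        "sig_threshold": None, "col_threshold": None,
        "disc_threshold": None, "loud_threshold": None,
        "reverb_threshold": None,
    },
    "strict": {
        "noise_threshold": 4.5, "ovrl_threshold": 4.0,
        "sig_threshold": 3.5, "col_threshold": 3.0,
        "disc_threshold": 4.0, "loud_threshold": 3.0,
        "reverb_threshold": 3.0,
    },
}


def active_thresholds(name: str) -> Preset:
    """Return only the thresholds that a preset actually enforces."""
    return {param: value
            for param, value in PRESETS[name].items()
            if value is not None}


print(sorted(active_thresholds("default")))
```

Because the keys match the threshold parameters documented on this page, a preset could be unpacked into the stage constructor, for example SIGMOSFilterStage(**PRESETS["strict"]).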

When to Use SIGMOS vs UTMOS

  • UTMOS is single-score, fast, and a good first cut.
  • SIGMOS is multi-dimensional and lets you keep audio with one kind of acceptable degradation while rejecting another. Use SIGMOS when you need to enforce specific quality requirements (e.g., “no reverb” or “no click artifacts”) that a single MOS score can’t express.

Basic SIGMOS Filtering

Step 1: Score the Dataset

Two thresholds are active by default (noise_threshold=4.0 and ovrl_threshold=3.5); the other five default to None. To run SIGMOS in score-only mode and capture all seven dimensions for every segment, set the two active defaults to None:

from nemo_curator.stages.audio.filtering.sigmos import SIGMOSFilterStage

# Score all dimensions without filtering.
# Assumes `pipeline` is an existing Pipeline instance (see the complete example).
sigmos = SIGMOSFilterStage(noise_threshold=None, ovrl_threshold=None)
pipeline.add_stage(sigmos)

Each output AudioTask will carry seven new fields (sigmos_noise, sigmos_ovrl, etc.) regardless of which thresholds are active.

Step 2: Inspect Per-Dimension Distributions

Export the scored manifest and inspect distributions per dimension:

import pandas as pd

df = pd.read_json("./scored.jsonl", lines=True)

# Print the 10th, 25th, 50th, and 90th percentiles for each dimension
for dim in ["sigmos_noise", "sigmos_ovrl", "sigmos_sig", "sigmos_col",
            "sigmos_disc", "sigmos_loud", "sigmos_reverb"]:
    print(dim, df[dim].quantile([0.1, 0.25, 0.5, 0.9]).values)

Use the percentiles to choose thresholds — for example, set noise_threshold at the 25th percentile to drop the bottom quarter of the data on noise.
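
As a minimal sketch of the percentile rule, the helper below derives a threshold from a list of scores and shows which scores would survive it. The function name percentile and the sample scores are hypothetical. It uses linear interpolation, which is also the default for pandas' Series.quantile, so df["sigmos_noise"].quantile(0.25) should give a comparable value on a real manifest.

```python
from typing import Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """Linearly interpolated percentile of `values`, with q in [0, 100]."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100
    lower = int(position)                     # index at or below the position
    upper = min(lower + 1, len(ordered) - 1)  # neighbor above, clamped
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


# Hypothetical sigmos_noise scores (illustrative values only).
noise_scores = [2.9, 3.4, 3.8, 4.0, 4.1, 4.3, 4.4, 4.6, 4.7]

noise_threshold = percentile(noise_scores, 25)
kept = [score for score in noise_scores if score >= noise_threshold]

print(noise_threshold)               # candidate value for noise_threshold
print(len(kept), len(noise_scores))  # segments kept vs. total
```

On small samples the share actually dropped can differ from the nominal quarter, so it is worth checking the kept count rather than relying on the percentile alone.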

Step 3: Apply Tuned Thresholds

sigmos = SIGMOSFilterStage(
    noise_threshold=4.0,   # Reject noisy audio
    ovrl_threshold=3.5,    # Aggregate quality floor
    reverb_threshold=3.0,  # Reject heavily reverberant audio
)
pipeline.add_stage(sigmos)

A segment is dropped if any active threshold fails. Setting any threshold to None disables that dimension.
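
When tuning, it can help to see which dimension is responsible for each drop. The sketch below counts failures per dimension over scored rows. It is an offline analysis aid, not part of the stage: failed_dimensions and the sample rows are hypothetical, thresholds are keyed by output field name for brevity, and a failure is assumed to mean score < threshold.

```python
from collections import Counter
from typing import Dict, List, Mapping, Optional


def failed_dimensions(scores: Mapping[str, float],
                      thresholds: Mapping[str, Optional[float]]) -> List[str]:
    """List the fields whose active threshold is not met (assumes score < threshold fails)."""
    return [field
            for field, minimum in thresholds.items()
            if minimum is not None and scores[field] < minimum]


# Thresholds keyed by output field; None disables the dimension.
thresholds: Dict[str, Optional[float]] = {
    "sigmos_noise": 4.0,
    "sigmos_ovrl": 3.5,
    "sigmos_reverb": 3.0,
    "sigmos_disc": None,
}

# Hypothetical scored rows (illustrative values only).
rows = [
    {"sigmos_noise": 4.4, "sigmos_ovrl": 3.9, "sigmos_reverb": 3.6, "sigmos_disc": 4.1},
    {"sigmos_noise": 3.2, "sigmos_ovrl": 3.6, "sigmos_reverb": 3.4, "sigmos_disc": 4.5},
    {"sigmos_noise": 4.1, "sigmos_ovrl": 3.1, "sigmos_reverb": 2.2, "sigmos_disc": 2.0},
]

failure_counts = Counter(
    field for row in rows for field in failed_dimensions(row, thresholds)
)
dropped = sum(1 for row in rows if failed_dimensions(row, thresholds))

print(dict(failure_counts))
print(dropped, "of", len(rows), "segments dropped")
```

A row can fail several dimensions at once, so the per-dimension counts may add up to more than the number of dropped segments.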

Parameters

Parameter         Type           Default        Description
model_dir         str            (cached path)  Directory used to download the SIGMOS ONNX model on first use.
model_path        str | None     None           Direct path to a local SIGMOS .onnx file. Overrides model_dir when set.
noise_threshold   float | None   4.0            Minimum noise score; None disables.
ovrl_threshold    float | None   3.5            Minimum overall score; None disables.
sig_threshold     float | None   None           Minimum signal score; None disables.
col_threshold     float | None   None           Minimum coloration score; None disables.
disc_threshold    float | None   None           Minimum discontinuity score; None disables.
loud_threshold    float | None   None           Minimum loudness score; None disables.
reverb_threshold  float | None   None           Minimum reverb score; None disables.

The default resource allocation is Resources(cpus=1.0, gpus=0.5).

Domain-Specific Tuning

Voice Cloning / TTS

TTS training is sensitive to noise, reverb, and discontinuities such as clicks and dropouts. Activate the relevant dimensions:

SIGMOSFilterStage(
    noise_threshold=4.5,
    ovrl_threshold=4.0,
    reverb_threshold=3.5,
    disc_threshold=4.0,  # No clicks or dropouts
)

Far-Field / Conference Audio

Far-field recordings have heavy reverb and variable noise. Loosen the reverb and noise thresholds but require clean speech. Because ovrl_threshold is not passed here, its default of 3.5 still applies:

SIGMOSFilterStage(
    noise_threshold=3.5,   # accept some noise
    sig_threshold=3.5,     # but the speech itself must be clean
    reverb_threshold=2.5,  # reverb expected; only reject extreme cases
)

Web-Scraped Audio

Web audio is heterogeneous. Start permissive and tighten dimensions one at a time after inspecting failure modes:

SIGMOSFilterStage(
    noise_threshold=3.5,
    ovrl_threshold=3.0,
)

Complete SIGMOS Pipeline Example

A pipeline that stacks UTMOS (cheap) and SIGMOS (fine-grained):

from nemo_curator.pipeline import Pipeline
from nemo_curator.backends.xenna import XennaExecutor
from nemo_curator.stages.audio.preprocessing.mono_conversion import MonoConversionStage
from nemo_curator.stages.audio.segmentation.vad_segmentation import VADSegmentationStage
from nemo_curator.stages.audio.filtering.utmos import UTMOSFilterStage
from nemo_curator.stages.audio.filtering.sigmos import SIGMOSFilterStage
from nemo_curator.stages.audio.io.convert import AudioToDocumentStage
from nemo_curator.stages.text.io.writer import JsonlWriter

pipeline = Pipeline(name="sigmos_filtering")

pipeline.add_stage(MonoConversionStage(output_sample_rate=48000))
pipeline.add_stage(VADSegmentationStage(min_duration_sec=2.0))

# Coarse cut with UTMOS first
pipeline.add_stage(UTMOSFilterStage(mos_threshold=3.5))

# Fine-grained dimension filtering
pipeline.add_stage(
    SIGMOSFilterStage(
        noise_threshold=4.0,
        ovrl_threshold=3.5,
        reverb_threshold=3.0,
    )
)

pipeline.add_stage(AudioToDocumentStage())
pipeline.add_stage(JsonlWriter(path="./curated_audio"))

executor = XennaExecutor()
pipeline.run(executor)

Best Practices

  • Score before filtering: SIGMOS is more expensive than UTMOS, so always run with all thresholds disabled first to inspect distributions before committing to thresholds.
  • Activate one dimension at a time: enabling all seven thresholds aggressively will leave very little data. Activate one or two relevant dimensions, then add more if specific failure modes survive.
  • Stack UTMOS first: run UTMOS as a cheap upstream cut to drop obviously-bad segments before paying for SIGMOS scoring.
  • Match the dimension to the use case: don’t enforce reverb thresholds on data captured in a hall; don’t enforce noise thresholds on field recordings if mild noise is acceptable.
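
To make the "one dimension at a time" advice concrete, the sketch below estimates how much data survives as thresholds are enabled in sequence, using scores from a score-only pass. The helpers retention and stepwise_retention and the sample rows are hypothetical, and a threshold is assumed to be met when score >= threshold.

```python
from typing import Dict, List, Mapping, Sequence, Tuple


def retention(rows: Sequence[Mapping[str, float]],
              thresholds: Mapping[str, float]) -> float:
    """Fraction of rows meeting every threshold (assumes score >= threshold passes)."""
    if not rows:
        return 0.0
    kept = sum(
        all(row[field] >= minimum for field, minimum in thresholds.items())
        for row in rows
    )
    return kept / len(rows)


def stepwise_retention(rows: Sequence[Mapping[str, float]],
                       ordered: Sequence[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Enable thresholds one at a time, in order, and record the fraction kept."""
    active: Dict[str, float] = {}
    report = []
    for field, minimum in ordered:
        active[field] = minimum
        report.append((field, retention(rows, active)))
    return report


# Hypothetical scored rows (illustrative values only).
rows = [
    {"sigmos_noise": 4.5, "sigmos_ovrl": 4.0, "sigmos_reverb": 3.5},
    {"sigmos_noise": 4.2, "sigmos_ovrl": 3.6, "sigmos_reverb": 2.4},
    {"sigmos_noise": 3.1, "sigmos_ovrl": 3.7, "sigmos_reverb": 3.8},
    {"sigmos_noise": 4.1, "sigmos_ovrl": 3.2, "sigmos_reverb": 3.3},
]

report = stepwise_retention(
    rows,
    [("sigmos_noise", 4.0), ("sigmos_ovrl", 3.5), ("sigmos_reverb", 3.0)],
)
for field, fraction in report:
    print(f"after enabling {field}: {fraction:.0%} kept")
```

A sharp drop at one step points to the dimension that deserves a closer look before its threshold is committed to the pipeline.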