
Copyright © 2026, NVIDIA Corporation.

nemo_curator.stages.audio.filtering.utmos

UTMOS (UTokyo-SaruLab MOS Prediction) filter stage.

Filters audio segments based on UTMOS predicted Mean Opinion Score. Uses the utmos22_strong model from tarepan/SpeechMOS via torch.hub.

Accepts in-memory (waveform + sample_rate) or audio_filepath input. Audio is resampled to 16 kHz internally for UTMOS inference.

Module Contents

Classes

Name | Description
UTMOSFilterStage | UTMOS quality assessment filter stage.

Functions

Name | Description
_load_waveform_tensor | Extract a mono waveform tensor (1, N) and sample_rate from an item.

Data

_UTMOS_ENTRYPOINT

_UTMOS_REPO

_UTMOS_TARGET_SR

API

class nemo_curator.stages.audio.filtering.utmos.UTMOSFilterStage(
mos_threshold: float | None = 3.5,
sample_rate: int = _UTMOS_TARGET_SR,
name: str = 'UTMOSFilter',
batch_size: int = 1,
resources: nemo_curator.stages.resources.Resources = (lambda: Resources(cpus=1.0...
)
Dataclass

Bases: ProcessingStage[AudioTask, AudioTask]

UTMOS quality assessment filter stage.

Filters audio segments based on the UTMOS predicted MOS score. The model (utmos22_strong) is loaded via torch.hub from tarepan/SpeechMOS. Audio is resampled to 16 kHz for inference.

Parameters:

mos_threshold
float | None. Defaults to 3.5

Minimum predicted MOS required to pass the filter (set to None to disable filtering)

sample_rate
int. Defaults to _UTMOS_TARGET_SR

Target sample rate for UTMOS inference (default 16000)
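
The threshold semantics described for mos_threshold can be sketched in a few lines. This is a minimal illustration, not the stage's own code: passes_mos_threshold is a hypothetical helper, and treating a score exactly equal to the threshold as passing is an assumption based on the word "minimum".

```python
from typing import Optional


def passes_mos_threshold(mos: float, mos_threshold: Optional[float] = 3.5) -> bool:
    """Hypothetical helper: decide whether a predicted MOS clears the filter.

    A threshold of None disables filtering, so every score passes.
    Assumption: a score equal to the threshold counts as passing.
    """
    if mos_threshold is None:
        return True
    return mos >= mos_threshold


print(passes_mos_threshold(4.1))        # True
print(passes_mos_threshold(2.9))        # False
print(passes_mos_threshold(1.0, None))  # True (filtering disabled)
```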

batch_size: int = 1
mos_threshold: float | None = 3.5
name: str = 'UTMOSFilter'
resources: Resources
sample_rate: int = _UTMOS_TARGET_SR
nemo_curator.stages.audio.filtering.utmos.UTMOSFilterStage.__post_init__()
nemo_curator.stages.audio.filtering.utmos.UTMOSFilterStage._ensure_model() -> None
nemo_curator.stages.audio.filtering.utmos.UTMOSFilterStage._process_single(
task: nemo_curator.tasks.AudioTask
) -> nemo_curator.tasks.AudioTask | None

Run UTMOS scoring on a single (non-nested) task.

nemo_curator.stages.audio.filtering.utmos.UTMOSFilterStage.inputs() -> tuple[list[str], list[str]]
nemo_curator.stages.audio.filtering.utmos.UTMOSFilterStage.outputs() -> tuple[list[str], list[str]]
nemo_curator.stages.audio.filtering.utmos.UTMOSFilterStage.process(
task: nemo_curator.tasks.AudioTask
) -> nemo_curator.tasks.AudioTask | list[nemo_curator.tasks.AudioTask]

Process a single AudioTask and filter by UTMOS MOS score.

When task.data contains a "segments" key (nested mode from VAD), each segment is evaluated individually and only survivors are kept.
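
The nested mode described above might look roughly like the following sketch. It works on a plain dict rather than a real AudioTask, and filter_segments and score_fn are hypothetical names introduced for illustration; what the stage does when no segment survives is not stated on this page, so the sketch simply returns an empty list.

```python
from typing import Any, Callable, Optional


def filter_segments(
    data: dict[str, Any],
    score_fn: Callable[[dict[str, Any]], float],
    mos_threshold: Optional[float] = 3.5,
) -> dict[str, Any]:
    """Hypothetical sketch: keep only segments whose score clears the threshold."""
    if "segments" not in data:
        # Non-nested input: nothing to filter per segment in this sketch.
        return data
    survivors = []
    for segment in data["segments"]:
        mos = score_fn(segment)  # stand-in for UTMOS inference on one segment
        if mos_threshold is None or mos >= mos_threshold:
            survivors.append(segment)
    # Return a shallow copy so the caller's dict is left untouched.
    return {**data, "segments": survivors}


# Toy data with precomputed scores in place of real model predictions.
data = {"segments": [{"id": "a", "mos": 4.1}, {"id": "b", "mos": 2.9}]}
kept = filter_segments(data, score_fn=lambda seg: seg["mos"])
print([seg["id"] for seg in kept["segments"]])  # ['a']
```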

nemo_curator.stages.audio.filtering.utmos.UTMOSFilterStage.setup(
_: nemo_curator.backends.base.WorkerMetadata | None = None
) -> None
nemo_curator.stages.audio.filtering.utmos.UTMOSFilterStage.setup_on_node(
_node_info: nemo_curator.backends.base.NodeInfo | None = None,
_worker_metadata: nemo_curator.backends.base.WorkerMetadata | None = None
) -> None
nemo_curator.stages.audio.filtering.utmos.UTMOSFilterStage.teardown() -> None
nemo_curator.stages.audio.filtering.utmos._load_waveform_tensor(
item: dict[str, typing.Any],
task_id: str
) -> tuple[torch.Tensor, int] | None

Extract a mono waveform tensor (1, N) and sample_rate from an item.

Supports waveform (Tensor/ndarray) + sample_rate or audio_filepath. Returns None if unavailable.
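
The input selection described above can be illustrated with a small dispatch sketch. pick_audio_source is a hypothetical helper, not part of the library, and preferring the in-memory waveform over audio_filepath when both are present is an assumption; the actual loading and mono conversion are left out.

```python
from typing import Any, Optional


def pick_audio_source(item: dict[str, Any]) -> Optional[str]:
    """Hypothetical sketch: decide which input form an item provides.

    Returns "waveform", "audio_filepath", or None if neither is usable.
    Assumption: in-memory audio takes precedence over a file path.
    """
    if item.get("waveform") is not None and item.get("sample_rate") is not None:
        return "waveform"
    if item.get("audio_filepath"):
        return "audio_filepath"
    return None


print(pick_audio_source({"waveform": [0.0, 0.1], "sample_rate": 16000}))  # waveform
print(pick_audio_source({"audio_filepath": "clip.wav"}))                  # audio_filepath
print(pick_audio_source({"text": "no audio here"}))                       # None
```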

nemo_curator.stages.audio.filtering.utmos._UTMOS_ENTRYPOINT = 'utmos22_strong'
nemo_curator.stages.audio.filtering.utmos._UTMOS_REPO = 'tarepan/SpeechMOS:v1.2.0'
nemo_curator.stages.audio.filtering.utmos._UTMOS_TARGET_SR = 16000
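
For context, these constants map directly onto a torch.hub call. The sketch below follows the loading pattern shown in the upstream SpeechMOS README; whether the stage calls torch.hub with exactly these arguments is not shown on this page, so treat it as an assumption. load_utmos_predictor and predict_mos are hypothetical helper names, the constants are re-declared locally rather than imported, and the first call needs network access to download the model.

```python
UTMOS_REPO = "tarepan/SpeechMOS:v1.2.0"
UTMOS_ENTRYPOINT = "utmos22_strong"
UTMOS_TARGET_SR = 16000


def load_utmos_predictor():
    """Fetch the UTMOS predictor through torch.hub (downloads on first use)."""
    import torch  # imported lazily so the constants are usable without torch

    return torch.hub.load(UTMOS_REPO, UTMOS_ENTRYPOINT, trust_repo=True)


def predict_mos(predictor, waveform, sample_rate: int = UTMOS_TARGET_SR) -> float:
    """Score one mono waveform tensor of shape (1, N) and return a Python float."""
    import torch

    with torch.no_grad():
        score = predictor(waveform, sample_rate)
    return float(score.item())
```

A call such as predict_mos(load_utmos_predictor(), waveform) would then yield a single predicted MOS for a (1, N) tensor.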