Copyright © 2026, NVIDIA Corporation.

nemo_curator.stages.audio.inference.sortformer

Module Contents

Classes

Name | Description
InferenceSortformerStage | Speaker diarization inference using Streaming Sortformer (NeMo).

Functions

Name | Description
_parse_sortformer_segments | Convert Sortformer output segments to list of {start, end, speaker} dicts.
_write_rttm | Write diarization segments to an RTTM file.

API

class nemo_curator.stages.audio.inference.sortformer.InferenceSortformerStage(
model_name: str = 'nvidia/diar_streaming_sort...,
model_path: str | None = None,
cache_dir: str | None = None,
diar_model: typing.Any | None = None,
filepath_key: str = 'audio_filepath',
diar_segments_key: str = 'diar_segments',
rttm_out_dir: str | None = None,
chunk_len: int = 340,
chunk_left_context: int = 1,
chunk_right_context: int = 40,
fifo_len: int = 40,
spkcache_update_period: int = 300,
spkcache_len: int = 188,
inference_batch_size: int = 1,
name: str = 'Sortformer_inference',
batch_size: int = 1,
resources: nemo_curator.stages.resources.Resources = (lambda: Resources(cpus=1.0...
)
Dataclass

Bases: ProcessingStage[AudioTask, AudioTask]

Speaker diarization inference using Streaming Sortformer (NeMo).

Uses the NeMo SortformerEncLabelModel for end-to-end neural speaker diarization with streaming support. See: https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2.1

Parameters:

model_name
str. Defaults to 'nvidia/diar_streaming_sortformer_4spk-v2.1'

Hugging Face model id. Defaults to “nvidia/diar_streaming_sortformer_4spk-v2.1”.

model_path
str | None. Defaults to None

Local path to a .nemo checkpoint file; if set, takes precedence over model_name.

cache_dir
str | None. Defaults to None

Directory for caching downloaded model weights. Defaults to HF hub default.

diar_model
Any | None. Defaults to None

Pre-loaded SortformerEncLabelModel; if provided, setup() is a no-op.

filepath_key
str. Defaults to 'audio_filepath'

Key in data for path to audio file. Defaults to “audio_filepath”.

diar_segments_key
str. Defaults to 'diar_segments'

Key in output data for diarization segments list. Defaults to “diar_segments”.

rttm_out_dir
str | None. Defaults to None

Optional directory to write RTTM files. Defaults to None.

chunk_len
int. Defaults to 340

Streaming chunk size in 80 ms frames. Defaults to 340 (27.2 s of audio per chunk; ~30.4 s latency once the default chunk_right_context of 40 frames is added).

chunk_left_context
int. Defaults to 1

Left context frames. Defaults to 1.

chunk_right_context
int. Defaults to 40

Right context frames. Defaults to 40.

fifo_len
int. Defaults to 40

FIFO queue size in frames. Defaults to 40.

spkcache_update_period
int. Defaults to 300

Speaker cache update period in frames. Defaults to 300.

spkcache_len
int. Defaults to 188

Speaker cache size in frames. Defaults to 188.

inference_batch_size
int. Defaults to 1

Batch size passed to diarize(). Defaults to 1.

name
str. Defaults to 'Sortformer_inference'

Stage name. Defaults to “Sortformer_inference”.
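
The latency figure quoted for chunk_len can be reproduced with a little arithmetic. The sketch below assumes that latency is the chunk size plus the right context, each counted in 80 ms frames; that reading is consistent with the ~30.4 s quoted for the defaults, but it is an interpretation rather than something this page states. The function name streaming_latency_seconds is hypothetical and is not part of nemo_curator.

```python
FRAME_SECONDS = 0.08  # one frame = 80 ms, per the chunk_len description


def streaming_latency_seconds(chunk_len: int = 340, chunk_right_context: int = 40) -> float:
    """Assumed latency model: frames buffered before a chunk can be emitted."""
    return (chunk_len + chunk_right_context) * FRAME_SECONDS


# Defaults from the parameter list: (340 + 40) * 0.08 s
print(round(streaming_latency_seconds(), 1))  # 30.4
# Audio covered by the chunk alone, without right context: 340 * 0.08 s
print(round(340 * FRAME_SECONDS, 1))  # 27.2
```

Smaller chunk_len and chunk_right_context values would lower this figure under the same assumption.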

batch_size: int = 1
cache_dir: str | None = None
chunk_left_context: int = 1
chunk_len: int = 340
chunk_right_context: int = 40
diar_model: Any | None = None
diar_segments_key: str = 'diar_segments'
fifo_len: int = 40
filepath_key: str = 'audio_filepath'
inference_batch_size: int = 1
model_name: str = 'nvidia/diar_streaming_sortformer_4spk-v2.1'
model_path: str | None = None
name: str = 'Sortformer_inference'
resources: Resources
rttm_out_dir: str | None = None
spkcache_len: int = 188
spkcache_update_period: int = 300

nemo_curator.stages.audio.inference.sortformer.InferenceSortformerStage._configure_streaming() -> None

Apply streaming configuration to the loaded model.

nemo_curator.stages.audio.inference.sortformer.InferenceSortformerStage._extend_pos_enc_for_long_audio(
max_len: int = 30000
) -> None

Extend RelPositionalEncoding buffer to handle long audio files.

NeMo’s streaming Sortformer initialises pos_enc sized for one chunk (~35 conformer frames). Files longer than a few seconds overflow it at inference time. extend_pe() is a NeMo method that resizes the buffer safely, but it is not called automatically, so this method calls it explicitly. max_len=30000 covers ~1000 s at any subsampling.

nemo_curator.stages.audio.inference.sortformer.InferenceSortformerStage._resolve_model_path() -> str

Resolve the path to the .nemo checkpoint from the HF cache.

nemo_curator.stages.audio.inference.sortformer.InferenceSortformerStage.diarize(
audio_paths: list[str]
) -> list[list[dict[str, typing.Any]]]

Run Sortformer on a list of audio files.

Returns a list (one entry per file) of segment lists [{start, end, speaker}].
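
To make the return shape concrete, here is a small sketch that consumes a value shaped like the output of diarize(). The sample data is made up for illustration, the helper talk_time_per_speaker is hypothetical, and the speaker labels are assumed to be strings (the page does not say what type they are).

```python
from collections import defaultdict

# Illustrative data only: one inner list per audio file,
# each segment in the documented {start, end, speaker} shape.
results = [
    [
        {"start": 0.0, "end": 2.5, "speaker": "speaker_0"},
        {"start": 2.5, "end": 4.0, "speaker": "speaker_1"},
        {"start": 4.0, "end": 6.0, "speaker": "speaker_0"},
    ],
    [],  # a file for which no segments were produced
]


def talk_time_per_speaker(segments):
    """Sum segment durations (end - start) for each speaker label."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    return dict(totals)


per_file = [talk_time_per_speaker(segments) for segments in results]
print(per_file)
```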

nemo_curator.stages.audio.inference.sortformer.InferenceSortformerStage.inputs() -> tuple[list[str], list[str]]

nemo_curator.stages.audio.inference.sortformer.InferenceSortformerStage.outputs() -> tuple[list[str], list[str]]

nemo_curator.stages.audio.inference.sortformer.InferenceSortformerStage.process(
task: nemo_curator.tasks.AudioTask
) -> nemo_curator.tasks.AudioTask

Run speaker diarization on the audio file in the task.
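
The data flow implied by filepath_key and diar_segments_key can be sketched with a plain dict standing in for the task's data. This is a rough illustration under assumptions: process_sketch and fake_diarize are hypothetical names, the real stage works on an AudioTask rather than a dict, and whether it copies or mutates the data is not stated on this page.

```python
def fake_diarize(audio_paths):
    """Stand-in for diarize(): returns one made-up segment list per input path."""
    return [[{"start": 0.0, "end": 1.0, "speaker": "speaker_0"}] for _ in audio_paths]


def process_sketch(data, diarize_fn,
                   filepath_key="audio_filepath",
                   diar_segments_key="diar_segments"):
    """Read the audio path from the data, diarize it, store segments under the output key."""
    segments = diarize_fn([data[filepath_key]])[0]
    # Return a new dict here; the real stage may instead update the task in place.
    return {**data, diar_segments_key: segments}


entry = {"audio_filepath": "/path/to/example.wav"}  # hypothetical path
out = process_sketch(entry, fake_diarize)
print(sorted(out))
```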

nemo_curator.stages.audio.inference.sortformer.InferenceSortformerStage.setup(
_worker_metadata: nemo_curator.backends.base.WorkerMetadata | None = None
) -> None

Load Sortformer model from Hugging Face or a local .nemo file.
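
The parameter descriptions imply an order of precedence among diar_model, model_path and model_name. The sketch below only encodes that documented order; choose_model_source and its return labels are hypothetical and do not reflect how setup() is actually written.

```python
DEFAULT_MODEL_NAME = "nvidia/diar_streaming_sortformer_4spk-v2.1"


def choose_model_source(diar_model=None, model_path=None, model_name=DEFAULT_MODEL_NAME):
    """Return a (source, location) pair following the documented precedence."""
    if diar_model is not None:
        # A pre-loaded model makes setup() a no-op, so nothing is loaded.
        return ("preloaded", None)
    if model_path is not None:
        # A local .nemo checkpoint takes precedence over model_name.
        return ("local_file", model_path)
    # Otherwise fall back to the Hugging Face model id.
    return ("hugging_face", model_name)


print(choose_model_source())
print(choose_model_source(model_path="/models/example.nemo"))  # hypothetical path
```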

nemo_curator.stages.audio.inference.sortformer.InferenceSortformerStage.setup_on_node(
_node_info: nemo_curator.backends.base.NodeInfo | None = None,
_worker_metadata: nemo_curator.backends.base.WorkerMetadata | None = None
) -> None

Pre-download model weights on the node so workers load from cache.

nemo_curator.stages.audio.inference.sortformer._parse_sortformer_segments(
raw_segments: list
) -> list[dict[str, typing.Any]]

Convert Sortformer output segments to list of {start, end, speaker} dicts.

Handles both string format (“start end speaker”) and objects with start/end/speaker attributes.
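
A minimal sketch of handling both input shapes follows. It is not the library's implementation: parse_segments_sketch is a hypothetical name, and converting times to float and the speaker label to str are assumptions made for the example.

```python
from types import SimpleNamespace
from typing import Any


def parse_segments_sketch(raw_segments: list) -> list[dict[str, Any]]:
    """Normalise segments given as 'start end speaker' strings or as objects."""
    parsed = []
    for seg in raw_segments:
        if isinstance(seg, str):
            # String form: three whitespace-separated fields.
            start, end, speaker = seg.split(maxsplit=2)
        else:
            # Object form: read the start/end/speaker attributes.
            start, end, speaker = seg.start, seg.end, seg.speaker
        parsed.append({"start": float(start), "end": float(end), "speaker": str(speaker)})
    return parsed


mixed = ["0.00 1.52 speaker_0", SimpleNamespace(start=1.52, end=3.0, speaker="speaker_1")]
parsed = parse_segments_sketch(mixed)
print(parsed)
```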

nemo_curator.stages.audio.inference.sortformer._write_rttm(
segments: list[dict[str, typing.Any]],
sess_name: str,
rttm_out_dir: str
) -> None

Write diarization segments to an RTTM file.
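
For readers unfamiliar with RTTM, the sketch below writes segments as standard ten-field SPEAKER lines (type, file id, channel, onset, duration, then placeholder and speaker-name fields). It is an illustration under assumptions: write_rttm_sketch is a hypothetical name, the output file name, channel value and three-decimal formatting are guesses, and unlike _write_rttm it returns the written path for convenience.

```python
import os
import tempfile


def write_rttm_sketch(segments, sess_name, rttm_out_dir):
    """Write one SPEAKER line per segment to <rttm_out_dir>/<sess_name>.rttm."""
    os.makedirs(rttm_out_dir, exist_ok=True)
    path = os.path.join(rttm_out_dir, f"{sess_name}.rttm")
    with open(path, "w", encoding="utf-8") as f:
        for seg in segments:
            duration = seg["end"] - seg["start"]  # RTTM stores onset and duration
            f.write(
                f"SPEAKER {sess_name} 1 {seg['start']:.3f} {duration:.3f} "
                f"<NA> <NA> {seg['speaker']} <NA> <NA>\n"
            )
    return path


with tempfile.TemporaryDirectory() as out_dir:
    rttm_path = write_rttm_sketch(
        [{"start": 0.0, "end": 1.5, "speaker": "speaker_0"}], "session1", out_dir
    )
    with open(rttm_path, encoding="utf-8") as f:
        rttm_lines = f.read().splitlines()

print(rttm_lines)
```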