nemo_curator.stages.audio.inference.sortformer

Module Contents

Classes

Name	Description
`InferenceSortformerStage`	Speaker diarization inference using Streaming Sortformer (NeMo).

Functions

Name	Description
`_parse_sortformer_segments`	Convert Sortformer output segments to list of {start, end, speaker} dicts.
`_write_rttm`	Write diarization segments to an RTTM file.

API

class nemo_curator.stages.audio.inference.sortformer.InferenceSortformerStage(
    model_name: str = 'nvidia/diar_streaming_sort...,
    model_path: str | None = None,
    cache_dir: str | None = None,
    diar_model: typing.Any | None = None,
    filepath_key: str = 'audio_filepath',
    diar_segments_key: str = 'diar_segments',
    rttm_out_dir: str | None = None,
    chunk_len: int = 340,
    chunk_left_context: int = 1,
    chunk_right_context: int = 40,
    fifo_len: int = 40,
    spkcache_update_period: int = 300,
    spkcache_len: int = 188,
    inference_batch_size: int = 1,
    name: str = 'Sortformer_inference',
    batch_size: int = 1,
    resources: nemo_curator.stages.resources.Resources = (lambda: Resources(cpus=1.0...
)

Dataclass

Bases: ProcessingStage[AudioTask, AudioTask]

Speaker diarization inference using Streaming Sortformer (NeMo).

Uses the NeMo SortformerEncLabelModel for end-to-end neural speaker diarization with streaming support. See: https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2.1

Parameters:

model_name

str. Defaults to 'nvidia/diar_streaming_sortformer_4spk-v2.1'.

Hugging Face model id.

model_path

str | None. Defaults to None.

Local path to a .nemo checkpoint file; if set, takes precedence over model_name.

cache_dir

str | None. Defaults to None.

Directory for caching downloaded model weights. When None, the Hugging Face hub default cache location is used.

diar_model

Any | None. Defaults to None.

Pre-loaded SortformerEncLabelModel; if provided, setup() is a no-op.

filepath_key

str. Defaults to 'audio_filepath'.

Key in the task data that holds the path to the audio file.

diar_segments_key

str. Defaults to 'diar_segments'.

Key in the output data under which the list of diarization segments is stored.

rttm_out_dir

str | None. Defaults to None.

Optional directory in which to write RTTM files.

chunk_len

int. Defaults to 340.

Streaming chunk size, in 80 ms frames. The default of 340 corresponds to a latency of about 30.4 s.

chunk_left_context

int. Defaults to 1.

Number of left context frames.

chunk_right_context

int. Defaults to 40.

Number of right context frames.

fifo_len

int. Defaults to 40.

FIFO queue size, in frames.

spkcache_update_period

int. Defaults to 300.

Speaker cache update period, in frames.

spkcache_len

int. Defaults to 188.

Speaker cache size, in frames.

inference_batch_size

int. Defaults to 1.

Batch size passed to diarize().

name

str. Defaults to 'Sortformer_inference'.

Stage name.

batch_size

int = 1

cache_dir

str | None = None

chunk_left_context

int = 1

chunk_len

int = 340

chunk_right_context

int = 40

diar_model

Any | None = None

diar_segments_key

str = 'diar_segments'

fifo_len

int = 40

filepath_key

str = 'audio_filepath'

inference_batch_size

int = 1

model_name

str = 'nvidia/diar_streaming_sortformer_4spk-v2.1'

model_path

str | None = None

name

str = 'Sortformer_inference'

resources

Resources

rttm_out_dir

str | None = None

spkcache_len

int = 188

spkcache_update_period

int = 300
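
The "~30.4 s latency" quoted for chunk_len can be reproduced with a little arithmetic. The sketch below assumes that latency is the chunk plus its right context, each frame being 80 ms; that formula is an assumption used for illustration (it matches the documented defaults), and streaming_latency_seconds is a hypothetical helper, not part of this module.

```python
FRAME_SECONDS = 0.08  # one frame is 80 ms, per the chunk_len description


def streaming_latency_seconds(chunk_len: int = 340, chunk_right_context: int = 40) -> float:
    """Assumed latency model: the chunk and its right context are buffered first."""
    return (chunk_len + chunk_right_context) * FRAME_SECONDS


# With the documented defaults: (340 + 40) * 0.08 s
default_latency = streaming_latency_seconds()
print(round(default_latency, 2))
```

Under the same assumption, smaller chunk_len and chunk_right_context values shorten the latency proportionally.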

nemo_curator.stages.audio.inference.sortformer.InferenceSortformerStage._configure_streaming() -> None

Apply streaming configuration to the loaded model.

nemo_curator.stages.audio.inference.sortformer.InferenceSortformerStage._extend_pos_enc_for_long_audio(
    max_len: int = 30000
) -> None

Extend RelPositionalEncoding buffer to handle long audio files.

NeMo’s streaming Sortformer initialises pos_enc sized for one chunk (~35 conformer frames). Files longer than a few seconds overflow it at inference time. extend_pe() is a NeMo method that resizes the buffer safely, but it is not called automatically. max_len=30000 covers ~1000 s at any subsampling.

nemo_curator.stages.audio.inference.sortformer.InferenceSortformerStage._resolve_model_path() -> str

Resolve the path to the .nemo checkpoint from the HF cache.

nemo_curator.stages.audio.inference.sortformer.InferenceSortformerStage.diarize(
    audio_paths: list[str]
) -> list[list[dict[str, typing.Any]]]

Run Sortformer on a list of audio files.

Returns a list (one entry per file) of segment lists [{start, end, speaker}].
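
To illustrate the nested return shape, here is a minimal sketch that totals speaking time per speaker for each file. The sample data is made up, and it assumes that start and end are numbers in seconds and that speaker is a string label; speaking_time_per_speaker is a hypothetical helper, not part of this module.

```python
from collections import defaultdict
from typing import Any


def speaking_time_per_speaker(segments: list[dict[str, Any]]) -> dict[str, float]:
    """Sum (end - start) for every speaker label in one file's segment list."""
    totals: dict[str, float] = defaultdict(float)
    for seg in segments:
        totals[seg["speaker"]] += float(seg["end"]) - float(seg["start"])
    return dict(totals)


# Hypothetical diarize() output for two files; the second has no speech.
results = [
    [
        {"start": 0.0, "end": 2.5, "speaker": "speaker_0"},
        {"start": 2.5, "end": 4.0, "speaker": "speaker_1"},
        {"start": 4.0, "end": 5.0, "speaker": "speaker_0"},
    ],
    [],
]

per_file = [speaking_time_per_speaker(segments) for segments in results]
print(per_file)
```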

nemo_curator.stages.audio.inference.sortformer.InferenceSortformerStage.inputs() -> tuple[list[str], list[str]]

nemo_curator.stages.audio.inference.sortformer.InferenceSortformerStage.outputs() -> tuple[list[str], list[str]]

nemo_curator.stages.audio.inference.sortformer.InferenceSortformerStage.process(
    task: nemo_curator.tasks.AudioTask
) -> nemo_curator.tasks.AudioTask

Run speaker diarization on the audio file in the task.

nemo_curator.stages.audio.inference.sortformer.InferenceSortformerStage.setup(
    _worker_metadata: nemo_curator.backends.base.WorkerMetadata | None = None
) -> None

Load Sortformer model from Hugging Face or a local .nemo file.
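
The parameter descriptions imply an order of precedence among the three model sources: a pre-loaded diar_model makes setup() a no-op, and model_path takes precedence over model_name. The sketch below only illustrates that decision order; choose_model_source and its return labels are hypothetical and do not reflect the actual implementation of setup().

```python
from typing import Any, Optional


def choose_model_source(
    model_name: str,
    model_path: Optional[str] = None,
    diar_model: Optional[Any] = None,
) -> tuple:
    """Return a (source, location) pair following the documented precedence."""
    if diar_model is not None:
        # A pre-loaded model is used as-is; nothing needs to be loaded.
        return ("preloaded", None)
    if model_path is not None:
        # A local .nemo checkpoint takes precedence over the hub id.
        return ("local", model_path)
    return ("hub", model_name)


print(choose_model_source("nvidia/diar_streaming_sortformer_4spk-v2.1"))
```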

nemo_curator.stages.audio.inference.sortformer.InferenceSortformerStage.setup_on_node(
    _node_info: nemo_curator.backends.base.NodeInfo | None = None,
    _worker_metadata: nemo_curator.backends.base.WorkerMetadata | None = None
) -> None

Pre-download model weights on the node so workers load from cache.

nemo_curator.stages.audio.inference.sortformer._parse_sortformer_segments(
    raw_segments: list
) -> list[dict[str, typing.Any]]

Convert Sortformer output segments to list of {start, end, speaker} dicts.

Handles both string format (“start end speaker”) and objects with start/end/speaker attributes.
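
A minimal sketch of the two input shapes described above, assuming the string form is whitespace-separated and that start and end should be converted to floats. parse_segments is a hypothetical stand-in written for illustration; the real function may differ in details such as rounding or validation.

```python
from types import SimpleNamespace
from typing import Any


def parse_segments(raw_segments: list) -> list[dict[str, Any]]:
    """Normalise "start end speaker" strings and attribute-style objects."""
    parsed = []
    for seg in raw_segments:
        if isinstance(seg, str):
            # String form: "start end speaker"
            start, end, speaker = seg.split()[:3]
        else:
            # Object form: anything exposing .start, .end and .speaker
            start, end, speaker = seg.start, seg.end, seg.speaker
        parsed.append({"start": float(start), "end": float(end), "speaker": str(speaker)})
    return parsed


mixed = ["0.00 1.25 speaker_0", SimpleNamespace(start=1.25, end=3.0, speaker="speaker_1")]
print(parse_segments(mixed))
```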

nemo_curator.stages.audio.inference.sortformer._write_rttm(
    segments: list[dict[str, typing.Any]],
    sess_name: str,
    rttm_out_dir: str
) -> None

Write diarization segments to an RTTM file.
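
For readers unfamiliar with RTTM, each SPEAKER line stores an onset and a duration rather than a start and an end. The sketch below shows that conversion under a few assumptions: the output file is named after sess_name with an .rttm extension, the channel is fixed to 1, and times are written with three decimals. write_rttm_sketch is a hypothetical helper, not the module's _write_rttm.

```python
import os
import tempfile
from typing import Any


def write_rttm_sketch(segments: list[dict[str, Any]], sess_name: str, rttm_out_dir: str) -> str:
    """Write one SPEAKER line per segment and return the path of the file."""
    os.makedirs(rttm_out_dir, exist_ok=True)
    path = os.path.join(rttm_out_dir, f"{sess_name}.rttm")  # assumed naming scheme
    with open(path, "w", encoding="utf-8") as f:
        for seg in segments:
            onset = float(seg["start"])
            duration = float(seg["end"]) - onset  # RTTM stores duration, not end time
            f.write(
                f"SPEAKER {sess_name} 1 {onset:.3f} {duration:.3f} "
                f"<NA> <NA> {seg['speaker']} <NA> <NA>\n"
            )
    return path


segments = [
    {"start": 0.0, "end": 2.5, "speaker": "speaker_0"},
    {"start": 2.5, "end": 4.0, "speaker": "speaker_1"},
]
with tempfile.TemporaryDirectory() as tmp:
    out_path = write_rttm_sketch(segments, "meeting_01", tmp)
    with open(out_path, encoding="utf-8") as f:
        lines = f.read().splitlines()
print(lines)
```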
