nemo_curator.stages.audio.segmentation.speaker_separation

Speaker separation stage using NeMo SortFormer diarization model.

Performs speaker diarization and separates audio by speaker, creating separate AudioTask outputs for each speaker.

Module Contents

Classes

Name | Description
SpeakerSeparationStage | Speaker separation stage using NeMo SortFormer diarization model.

Functions

Name | Description
_pydub_to_waveform_sr | Convert PyDub AudioSegment to (waveform, sample_rate). Output is canonical format only.

API

class nemo_curator.stages.audio.segmentation.speaker_separation.SpeakerSeparationStage(
model_path: str = 'nvidia/diar_sortformer_4sp...,
exclude_overlaps: bool = True,
min_duration: float = 0.8,
gap_threshold: float = 0.1,
buffer_time: float = 0.5,
name: str = 'SpeakerSeparation',
batch_size: int = 1,
resources: nemo_curator.stages.resources.Resources = (lambda: Resources(cpus=1.0...
)
Dataclass

Bases: ProcessingStage[AudioTask, AudioTask]

Speaker separation stage using NeMo SortFormer diarization model.

Separates audio by speaker and creates separate AudioTask outputs for each speaker’s segments. Downloads the NeMo model from HuggingFace Hub (nvidia/diar_sortformer_4spk-v1).
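As a minimal sketch of how the stage might be constructed, the snippet below collects the documented defaults into a dictionary and passes them to the constructor. The import path and keyword names are taken from this page; `STAGE_KWARGS` and `build_stage` are hypothetical helper names, and whether construction triggers a model download or GPU check is not stated here.

```python
# Keyword arguments mirroring the defaults documented on this page.
STAGE_KWARGS = {
    "model_path": "nvidia/diar_sortformer_4spk-v1",
    "exclude_overlaps": True,
    "min_duration": 0.8,
    "gap_threshold": 0.1,
    "buffer_time": 0.5,
}


def build_stage(**overrides):
    """Hypothetical helper: instantiate the stage with defaults plus overrides.

    Requires nemo_curator and its audio dependencies to be installed.
    """
    # Imported lazily so the snippet can be loaded without the package present.
    from nemo_curator.stages.audio.segmentation.speaker_separation import (
        SpeakerSeparationStage,
    )

    return SpeakerSeparationStage(**{**STAGE_KWARGS, **overrides})
```

Calling `build_stage(min_duration=1.5)` would, under these assumptions, return a stage that keeps only longer segments while leaving the other defaults untouched.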

Parameters:

model_path
str. Defaults to 'nvidia/diar_sortformer_4spk-v1'

HuggingFace model ID or path to NeMo diarization model

exclude_overlaps
bool. Defaults to True

Whether to exclude overlapping speaker regions

min_duration
float. Defaults to 0.8

Minimum segment duration in seconds

gap_threshold
float. Defaults to 0.1

Gap threshold for merging speaker segments

buffer_time
float. Defaults to 0.5

Buffer time around speaker segments
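To make the roles of gap_threshold and min_duration concrete, here is a small sketch of one plausible post-processing pass over diarization output. It is not the stage's actual implementation: the `Segment` type, the `merge_and_filter` function, the inclusive comparisons, and the assumption that gap_threshold is measured in seconds are all illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """Hypothetical diarization segment: speaker label plus start/end in seconds."""

    speaker: str
    start: float
    end: float


def merge_and_filter(segments, gap_threshold=0.1, min_duration=0.8):
    """Merge same-speaker segments separated by small gaps, then drop short ones."""
    merged = []
    # Sort by speaker, then start time, so each speaker's segments are adjacent.
    for seg in sorted(segments, key=lambda s: (s.speaker, s.start)):
        last = merged[-1] if merged else None
        same_speaker = last is not None and last.speaker == seg.speaker
        if same_speaker and seg.start - last.end <= gap_threshold:
            # Gap is small enough: extend the previous segment instead of adding a new one.
            merged[-1] = Segment(last.speaker, last.start, max(last.end, seg.end))
        else:
            merged.append(seg)
    # Keep only segments that meet the minimum duration after merging.
    return [s for s in merged if s.end - s.start >= min_duration]
```

In this sketch merging happens before filtering, so two short adjacent fragments from the same speaker can survive together even though each would be dropped on its own.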

_INHERITED_DROP_KEYS
batch_size: int = 1
buffer_time: float = 0.5
exclude_overlaps: bool = True
gap_threshold: float = 0.1
min_duration: float = 0.8
model_path: str = 'nvidia/diar_sortformer_4spk-v1'
name: str = 'SpeakerSeparation'
resources: Resources

nemo_curator.stages.audio.segmentation.speaker_separation.SpeakerSeparationStage.__post_init__()

nemo_curator.stages.audio.segmentation.speaker_separation.SpeakerSeparationStage._build_speaker_tasks(
speaker_audio_data: dict,
item: dict,
task: nemo_curator.tasks.AudioTask
) -> list[nemo_curator.tasks.AudioTask]

Build AudioTask list from speaker audio data.

nemo_curator.stages.audio.segmentation.speaker_separation.SpeakerSeparationStage._check_gpu_availability(
gpus: float
) -> None
staticmethod

nemo_curator.stages.audio.segmentation.speaker_separation.SpeakerSeparationStage._initialize_separator() -> None

nemo_curator.stages.audio.segmentation.speaker_separation.SpeakerSeparationStage.inputs() -> tuple[list[str], list[str]]

nemo_curator.stages.audio.segmentation.speaker_separation.SpeakerSeparationStage.outputs() -> tuple[list[str], list[str]]

nemo_curator.stages.audio.segmentation.speaker_separation.SpeakerSeparationStage.process(
task: nemo_curator.tasks.AudioTask
) -> list[nemo_curator.tasks.AudioTask]

Separate audio by speaker.

Returns: list[AudioTask]

List of AudioTask objects, one per speaker.
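The one-input, many-outputs shape of process can be illustrated with plain Python. The sketch below groups diarization segments by speaker and emits one record per speaker; the record fields (`speaker_id`, `segments`, `duration`) and the function name are hypothetical and do not reflect the actual AudioTask schema, which this page does not describe.

```python
from collections import defaultdict


def group_by_speaker(segments):
    """Group (speaker, start, end) tuples into one record per speaker."""
    by_speaker = defaultdict(list)
    for speaker, start, end in segments:
        by_speaker[speaker].append((start, end))
    # One output record per speaker, with that speaker's segments in time order.
    return [
        {
            "speaker_id": speaker,
            "segments": sorted(spans),
            "duration": sum(end - start for start, end in spans),
        }
        for speaker, spans in sorted(by_speaker.items())
    ]
```

A recording with two detected speakers would yield a list of two records under this scheme, regardless of how many individual segments each speaker contributed.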

nemo_curator.stages.audio.segmentation.speaker_separation.SpeakerSeparationStage.ray_stage_spec() -> dict[str, typing.Any]

nemo_curator.stages.audio.segmentation.speaker_separation.SpeakerSeparationStage.setup(
_: nemo_curator.backends.base.WorkerMetadata | None = None
) -> None

nemo_curator.stages.audio.segmentation.speaker_separation.SpeakerSeparationStage.setup_on_node(
_node_info: typing.Any = None,
_worker_metadata: typing.Any = None
) -> None

nemo_curator.stages.audio.segmentation.speaker_separation.SpeakerSeparationStage.teardown() -> None

nemo_curator.stages.audio.segmentation.speaker_separation._pydub_to_waveform_sr(
seg: pydub.AudioSegment
) -> tuple[torch.Tensor, int]

Convert PyDub AudioSegment to (waveform, sample_rate). Output is canonical format only.
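For readers unfamiliar with this kind of conversion, the sketch below shows the general idea using only the standard library. It relies on attributes that pydub's AudioSegment exposes (`get_array_of_samples()`, `channels`, `sample_width`, `frame_rate`) but returns nested lists rather than a torch.Tensor. The channels-first layout and the scaling to [-1.0, 1.0) are assumptions, since this page does not define the "canonical format", and both function names are hypothetical.

```python
from array import array


def samples_to_channels(samples, channels, sample_width):
    """Convert interleaved signed PCM samples to per-channel floats in [-1.0, 1.0)."""
    full_scale = float(1 << (8 * sample_width - 1))  # e.g. 32768.0 for 16-bit audio
    # Interleaved layout is [L0, R0, L1, R1, ...]; a strided slice de-interleaves it.
    return [
        [s / full_scale for s in samples[ch::channels]]
        for ch in range(channels)
    ]


def to_waveform_sr(seg):
    """Return (waveform, sample_rate) from an object with AudioSegment-like attributes."""
    waveform = samples_to_channels(
        seg.get_array_of_samples(), seg.channels, seg.sample_width
    )
    return waveform, seg.frame_rate
```

Because `to_waveform_sr` only touches those four attributes, it can be exercised with a small stand-in object when pydub is not installed.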