Diarize multi-speaker audio and fan out one task per detected speaker so that downstream stages can score each speaker’s audio independently. NeMo Curator ships two diarization stages built on NVIDIA’s SortFormer family. Both target up to 4 speakers per file; choose based on whether your workload is offline batch curation or streaming/online.
Diarization answers “who spoke when?” — it segments an audio stream into per-speaker regions, identifying that speaker A talks 0.0–3.5s, speaker B talks 3.5–7.0s, speaker A returns at 7.0–9.0s, and so on. The output is one AudioTask per speaker, each containing only that speaker’s audio.
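To make the "who spoke when" output concrete, the sketch below groups the example timeline above by speaker and slices a mono sample list accordingly. It uses plain Python lists rather than NeMo Curator types; Turn, split_by_speaker, and the toy 10 Hz sample rate are illustrative assumptions, not part of the library.

```python
from collections import defaultdict
from typing import NamedTuple


class Turn(NamedTuple):
    """One diarized region (hypothetical type, for illustration only)."""
    start: float   # seconds
    end: float     # seconds
    speaker: str


def split_by_speaker(samples, sample_rate, turns):
    """Concatenate each speaker's regions into one sample list per speaker."""
    per_speaker = defaultdict(list)
    for turn in sorted(turns, key=lambda t: t.start):
        lo = round(turn.start * sample_rate)
        hi = round(turn.end * sample_rate)
        per_speaker[turn.speaker].extend(samples[lo:hi])
    return dict(per_speaker)


# The timeline from the text: A 0.0-3.5 s, B 3.5-7.0 s, A 7.0-9.0 s.
turns = [Turn(0.0, 3.5, "A"), Turn(3.5, 7.0, "B"), Turn(7.0, 9.0, "A")]
sample_rate = 10                         # toy rate so the numbers stay readable
samples = list(range(9 * sample_rate))   # 9 s of placeholder "audio"

tracks = split_by_speaker(samples, sample_rate, turns)
durations = {spk: len(x) / sample_rate for spk, x in tracks.items()}
```

Here speaker A ends up with 5.5 seconds of audio (two turns joined) and speaker B with 3.5 seconds.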
This unlocks per-speaker filtering: pipelines can rerun VAD, UTMOS, SIGMOS, and the band filter separately on each speaker, dropping individual low-quality speakers without losing the rest of the recording.
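A minimal sketch of that per-speaker control flow, assuming each fanned-out task can be reduced to a small record of quality scores. SpeakerTrack, the score name, and the thresholds are hypothetical stand-ins with made-up values; a real pipeline would use the corresponding NeMo Curator stages instead.

```python
from dataclasses import dataclass, field


@dataclass
class SpeakerTrack:
    """Hypothetical stand-in for one per-speaker task."""
    speaker_id: int
    duration_sec: float
    scores: dict = field(default_factory=dict)  # e.g. {"utmos": 3.8}


def keep_good_speakers(tracks, min_scores, min_duration_sec=1.0):
    """Return tracks that meet every threshold; the rest are dropped one by one."""
    kept = []
    for track in tracks:
        long_enough = track.duration_sec >= min_duration_sec
        # A missing score counts as a failure rather than a pass.
        passes = all(
            track.scores.get(name, float("-inf")) >= floor
            for name, floor in min_scores.items()
        )
        if long_enough and passes:
            kept.append(track)
    return kept


# Illustrative values only.
tracks = [
    SpeakerTrack(0, 42.0, {"utmos": 3.9}),
    SpeakerTrack(1, 17.5, {"utmos": 2.1}),   # low quality: dropped
    SpeakerTrack(2, 0.4, {"utmos": 4.2}),    # too short: dropped
]
kept = keep_good_speakers(tracks, min_scores={"utmos": 3.0})
kept_ids = [t.speaker_id for t in kept]
```

Only speaker 0 survives, while the recording itself is not discarded.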
For most curation pipelines, SpeakerSeparationStage (offline) is the right choice. Use the streaming variant only when you need bounded latency or RTTM output.
The stage produces a fan-out list of AudioTask objects, one per detected speaker, each carrying:
speaker_id: speaker identifier (0, 1, 2, …)
num_speakers: total speakers found in this file
duration_sec: duration of this speaker’s audio
waveform: that speaker’s torch tensor with overlapping regions removed (when exclude_overlaps=True)
GPU is required (Resources(cpus=1.0, gpus=1.0) by default).
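The effect of exclude_overlaps=True on a speaker's waveform can be pictured as interval subtraction: keep only the time in which no other speaker is active. The sketch below works on (start, end) tuples in seconds and is interval arithmetic only; how the stage implements this internally is not described here, and subtract_intervals and exclusive_regions are hypothetical helper names.

```python
def subtract_intervals(base, cuts):
    """Remove every `cuts` interval from every `base` interval.

    Intervals are (start, end) tuples in seconds; returns the remaining pieces.
    """
    pieces = []
    for start, end in sorted(base):
        cursor = start
        for cut_start, cut_end in sorted(cuts):
            if cut_end <= cursor or cut_start >= end:
                continue  # this cut does not touch the remaining piece
            if cut_start > cursor:
                pieces.append((cursor, cut_start))
            cursor = max(cursor, cut_end)
        if cursor < end:
            pieces.append((cursor, end))
    return pieces


def exclusive_regions(turns_by_speaker):
    """For each speaker, keep only time where no other speaker is active."""
    result = {}
    for speaker, own in turns_by_speaker.items():
        others = [
            interval
            for other, intervals in turns_by_speaker.items()
            if other != speaker
            for interval in intervals
        ]
        result[speaker] = subtract_intervals(own, others)
    return result


turns = {0: [(0.0, 4.0)], 1: [(3.0, 7.0)]}   # both speakers talk during 3.0-4.0 s
clean = exclusive_regions(turns)
```

With these inputs, speaker 0 keeps 0.0 to 3.0 s and speaker 1 keeps 4.0 to 7.0 s; the shared second is removed from both.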
Speaker overlap regions (where multiple speakers talk simultaneously) and short gaps between same-speaker turns affect output quality:
SpeakerSeparationStage Parameters
The streaming variant (InferenceSortformerStage) is purpose-built for two use cases:
For pure offline curation, SpeakerSeparationStage is faster and more accurate.
This stage does not fan out per speaker — instead it writes a diar_segments list onto the input AudioTask. Use it as a metadata-enriching stage; downstream code consumes the diar_segments field directly.
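Because the segments stay on the task as metadata, downstream code can walk the list directly. The sketch below assumes each entry is a dict with start, end, and speaker keys, which is a guess at the schema rather than something confirmed above; segments_to_rttm and talk_time are hypothetical helpers. It formats the segments as standard RTTM SPEAKER lines and totals talk time per speaker.

```python
def segments_to_rttm(file_id, diar_segments):
    """Format segments as RTTM SPEAKER lines (onset and duration in seconds)."""
    lines = []
    for seg in sorted(diar_segments, key=lambda s: s["start"]):
        duration = seg["end"] - seg["start"]
        lines.append(
            f"SPEAKER {file_id} 1 {seg['start']:.3f} {duration:.3f} "
            f"<NA> <NA> {seg['speaker']} <NA> <NA>"
        )
    return lines


def talk_time(diar_segments):
    """Total seconds attributed to each speaker label."""
    totals = {}
    for seg in diar_segments:
        length = seg["end"] - seg["start"]
        totals[seg["speaker"]] = totals.get(seg["speaker"], 0.0) + length
    return totals


# Assumed field names; check the actual diar_segments schema before relying on them.
diar_segments = [
    {"start": 0.0, "end": 3.5, "speaker": "speaker_0"},
    {"start": 3.5, "end": 7.0, "speaker": "speaker_1"},
    {"start": 7.0, "end": 9.0, "speaker": "speaker_0"},
]
rttm = segments_to_rttm("call_001", diar_segments)
totals = talk_time(diar_segments)
```

The first RTTM line for this input reads: SPEAKER call_001 1 0.000 3.500 <NA> <NA> speaker_0 <NA> <NA>.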
chunk_len controls latency vs accuracy:
Streaming-mode evaluation on CallHome-eng0 (139 files) at the default settings: 6.2% macro DER, 6.0% weighted DER at a 0.25-second collar.
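The two figures differ in how files are averaged. The sketch below assumes the usual meanings: macro DER is the unweighted mean of per-file DER, and weighted DER is total error time divided by total reference speech time. The text above does not spell out the exact definitions used in the evaluation, and the numbers in the example are made up, not CallHome results.

```python
def macro_der(files):
    """Unweighted mean of per-file DER values."""
    return sum(f["error_sec"] / f["speech_sec"] for f in files) / len(files)


def weighted_der(files):
    """Total error time over total reference speech time."""
    total_error = sum(f["error_sec"] for f in files)
    total_speech = sum(f["speech_sec"] for f in files)
    return total_error / total_speech


# Made-up numbers for illustration.
files = [
    {"error_sec": 2.0, "speech_sec": 20.0},    # short file, DER 10%
    {"error_sec": 9.0, "speech_sec": 180.0},   # long file, DER 5%
]
macro = macro_der(files)        # (0.10 + 0.05) / 2
weighted = weighted_der(files)  # 11 / 200
```

In this toy case macro DER is 7.5% and weighted DER is 5.5%. Under these definitions, a macro value above the weighted one is what the arithmetic produces when shorter files tend to have higher error rates.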
InferenceSortformerStage Parameters
Default resource allocation: Resources(cpus=1.0, gpu_memory_gb=8.0).
A pipeline that diarizes, then runs per-speaker quality filters:
SpeakerSeparationStage is faster and more accurate than InferenceSortformerStage for batch curation.
Keep exclude_overlaps=True for training data: overlapping speech is hard for downstream models; only disable it when explicitly preserving natural conversation.
SegmentConcatenationStage and TimestampMapperStage are typically paired with speaker separation.
AudioDataFilterStage Composite: bundles offline speaker separation with per-speaker filters into the standard pipeline.