nemo_curator.stages.audio.inference.sortformer
Bases: ProcessingStage[AudioTask, AudioTask]
Speaker diarization inference using Streaming Sortformer (NeMo).
Uses the NeMo SortformerEncLabelModel for end-to-end neural speaker diarization with streaming support. See: https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2.1
Parameters:
Hugging Face model id. Defaults to “nvidia/diar_streaming_sortformer_4spk-v2.1”.
Local path to a .nemo checkpoint file; if set, takes precedence over model_name.
Directory for caching downloaded model weights. Defaults to HF hub default.
Pre-loaded SortformerEncLabelModel; if provided, setup() is a no-op.
Key in data for path to audio file. Defaults to “audio_filepath”.
Key in output data for diarization segments list. Defaults to “diar_segments”.
Optional directory to write RTTM files. Defaults to None.
Streaming chunk size in 80 ms frames. Defaults to 340 (~30.4 s latency when combined with the default 40 right context frames).
Left context frames. Defaults to 1.
Right context frames. Defaults to 40.
FIFO queue size in frames. Defaults to 40.
Speaker cache update period in frames. Defaults to 300.
Speaker cache size in frames. Defaults to 188.
Batch size passed to diarize(). Defaults to 1.
Stage name. Defaults to “Sortformer_inference”.
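The latency figure quoted for the chunk size can be reproduced with a short calculation. The sketch below assumes latency is the chunk length plus the right context, both counted in 80 ms frames; this matches the documented ~30.4 s for the defaults, but the formula and the helper name `streaming_latency_seconds` are illustrative assumptions, not part of the stage.

```python
FRAME_SECONDS = 0.08  # one frame is 80 ms, per the chunk size description


def streaming_latency_seconds(chunk_len: int, right_context: int) -> float:
    """Approximate latency in seconds (assumed: chunk frames plus right context frames)."""
    return (chunk_len + right_context) * FRAME_SECONDS


# Documented defaults: chunk size 340, right context 40.
latency = streaming_latency_seconds(340, 40)
print(f"{latency:.1f} s")
```

Smaller chunk and right context values would lower the latency under the same assumption.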
Apply streaming configuration to the loaded model.
Extend RelPositionalEncoding buffer to handle long audio files.
NeMo’s streaming Sortformer initialises pos_enc sized for one chunk (~35 conformer frames). Files longer than a few seconds overflow it at inference time. extend_pe() is a NeMo method that resizes the buffer safely, but it is not called automatically. max_len=30000 covers ~1000 s at any subsampling.
Resolve the path to the .nemo checkpoint from the HF cache.
Run Sortformer on a list of audio files.
Returns a list (one entry per file) of segment lists [{start, end, speaker}].
Run speaker diarization on the audio file in the task.
Load Sortformer model from Hugging Face or a local .nemo file.
Pre-download model weights on the node so workers load from cache.
Convert Sortformer output segments to list of {start, end, speaker} dicts.
Handles both string format (“start end speaker”) and objects with start/end/speaker attributes.
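A minimal sketch of the two input shapes described above, not the stage's actual implementation. The helper name `to_segment_dict` is hypothetical, and `SimpleNamespace` stands in for whatever segment object the model returns.

```python
from types import SimpleNamespace


def to_segment_dict(seg):
    """Normalise one segment to a {start, end, speaker} dict."""
    if isinstance(seg, str):
        # String format: "start end speaker", separated by whitespace.
        start, end, speaker = seg.split()[:3]
        return {"start": float(start), "end": float(end), "speaker": speaker}
    # Object format: anything exposing start/end/speaker attributes.
    return {
        "start": float(seg.start),
        "end": float(seg.end),
        "speaker": str(seg.speaker),
    }


raw_segments = [
    "0.00 1.52 speaker_0",
    SimpleNamespace(start=1.6, end=3.2, speaker="speaker_1"),  # stand-in object
]
segments = [to_segment_dict(s) for s in raw_segments]
print(segments)
```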
Write diarization segments to an RTTM file.
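For illustration, segments in the {start, end, speaker} form could be serialised to RTTM roughly as follows. This is a sketch under assumptions: the function name `write_rttm`, the channel value of 1, and the three-decimal formatting are choices made here, not confirmed details of the stage. The field layout follows the common RTTM SPEAKER record, which stores onset and duration rather than an end time.

```python
import os
import tempfile


def write_rttm(segments, file_id, rttm_path):
    """Write one SPEAKER line per segment (assumed layout, channel fixed to 1)."""
    with open(rttm_path, "w", encoding="utf-8") as f:
        for seg in segments:
            duration = seg["end"] - seg["start"]
            f.write(
                f"SPEAKER {file_id} 1 {seg['start']:.3f} {duration:.3f} "
                f"<NA> <NA> {seg['speaker']} <NA> <NA>\n"
            )


demo_segments = [
    {"start": 0.0, "end": 1.52, "speaker": "speaker_0"},
    {"start": 1.6, "end": 3.2, "speaker": "speaker_1"},
]

# Write to a temporary directory and read the result back.
with tempfile.TemporaryDirectory() as tmp_dir:
    rttm_path = os.path.join(tmp_dir, "demo.rttm")
    write_rttm(demo_segments, "demo", rttm_path)
    with open(rttm_path, encoding="utf-8") as f:
        rttm_lines = f.read().splitlines()

print(rttm_lines[0])
```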