NeMo Speaker Diarization API#

Model Classes#

class nemo.collections.asr.models.ClusteringDiarizer(
cfg: DictConfig | Any,
speaker_model=None,
)[source]#

Bases: Module, Model, DiarizationMixin

Inference model class for offline speaker diarization. This class handles the functionality required for diarization: speech activity detection, segmentation, embedding extraction, clustering, resegmentation, and scoring. All parameters are passed through the config file.

diarize(
paths2audio_files: List[str] = None,
batch_size: int = 0,
)[source]#

Diarize files provided through paths2audio_files or through the manifest file given in the config.

Parameters:
  • paths2audio_files (List[str]) – list of paths to audio files to diarize.

  • batch_size (int) – batch size used for extraction of speaker embeddings and VAD computation.
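As a minimal sketch, a manifest for diarization can be written as newline-delimited JSON. The field names below follow NeMo's common manifest convention but should be treated as an assumption; check the config documentation for your NeMo version:

```python
import json

# Hypothetical example paths; replace with real audio files on disk.
audio_files = ["/data/session1.wav", "/data/session2.wav"]

def build_manifest_line(path):
    # Field names follow NeMo's common diarization-manifest convention;
    # treat the exact keys as an assumption and check your NeMo version.
    return {
        "audio_filepath": path,
        "offset": 0,
        "duration": None,
        "label": "infer",
        "text": "-",
        "num_speakers": None,
        "rttm_filepath": None,
    }

with open("diar_manifest.json", "w") as f:
    for path in audio_files:
        f.write(json.dumps(build_manifest_line(path)) + "\n")
```

The resulting diar_manifest.json can then be referenced from the diarizer config before instantiating ClusteringDiarizer.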

classmethod list_available_models()[source]#

Should list all pre-trained models available via NVIDIA NGC cloud. Note: There is no check that requires model names and aliases to be unique. In the case of a collision, whatever model (or alias) is listed first in the returned list will be instantiated.

Returns:

A list of PretrainedModelInfo entries

classmethod restore_from(
restore_path: str,
override_config_path: str | None = None,
map_location: device | None = None,
)[source]#

Restores model instance (weights and configuration) from a .nemo file

Parameters:
  • restore_path – path to .nemo file from which model should be instantiated

  • override_config_path – path to a yaml config that will override the internal config file or an OmegaConf / DictConfig object representing the model config.

  • map_location – Optional torch.device() to map the instantiated model to a device. By default (None), it will select a GPU if available, falling back to CPU otherwise.

  • strict – Passed to load_state_dict. By default True

  • return_config – If set to true, will return just the underlying config of the restored model as an OmegaConf DictConfig object without instantiating the model.

  • trainer – An optional Trainer object, passed to the model constructor.

  • save_restore_connector – An optional SaveRestoreConnector object that defines the implementation of the restore_from() method.

save_to(save_path: str)[source]#
Saves model instance (weights and configuration) into an EFF archive or .nemo file.

You can use the restore_from() method to fully restore the instance from a .nemo file.

A .nemo file is an archive (tar.gz) containing the following:

  • model_config.yaml – the model configuration in .yaml format. You can deserialize this into the cfg argument for the model’s constructor.

  • model_weights.ckpt – the model checkpoint.

Parameters:

save_path – Path to .nemo file where model instance should be saved
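Since save_to() produces a tar archive, the stored configuration can be inspected without instantiating the model. A minimal stdlib-only sketch (the member layout is an assumption that may vary across NeMo versions):

```python
import tarfile

def read_nemo_config_text(nemo_path):
    """Read model_config.yaml out of a .nemo archive without
    instantiating the model. Relies only on the documented fact that
    .nemo is a tar.gz archive; member layout may vary by NeMo version."""
    with tarfile.open(nemo_path, "r:*") as tar:
        for member in tar.getmembers():
            if member.name.endswith("model_config.yaml"):
                extracted = tar.extractfile(member)
                return extracted.read().decode("utf-8")
    raise FileNotFoundError(f"model_config.yaml not found in {nemo_path}")
```

For a full restore of weights plus configuration, use restore_from() instead.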

property verbose: bool#
class nemo.collections.asr.models.SortformerEncLabelModel(
cfg: DictConfig,
trainer: Trainer = None,
)[source]#

Bases: ModelPT, ExportableEncDecModel, SpkDiarizationMixin

Encoder class for the Sortformer diarization model. The model class provides methods for setting up data, training, validation, and the model forward pass.

This model class expects config dict for:
  • preprocessor

  • Transformer Encoder

  • FastConformer Encoder

  • Sortformer Modules

add_rttms_mask_mats(
rttms_mask_mats,
device: device,
)[source]#

Checks whether rttms_mask_mats is empty before adding it to the list.

Parameters:

rttms_mask_mats (List[torch.Tensor]) – List of PyTorch tensors containing the rttms mask matrices.

diarize(
audio: str | List[str] | ndarray | DataLoader,
sample_rate: int | None = None,
batch_size: int = 1,
include_tensor_outputs: bool = False,
postprocessing_yaml: str | None = None,
num_workers: int = 0,
verbose: bool = True,
override_config: DiarizeConfig | None = None,
) List[List[str]] | Tuple[List[List[str]], List[Tensor]][source]#

One-click runner function for diarization.

Parameters:
  • audio – A single path, a list of paths to audio files, or a path to a manifest file.

  • batch_size – (int) Batch size to use during inference. Larger values improve throughput but use more memory.

  • include_tensor_outputs – (bool) Include raw speaker activity probabilities in the output. See Returns for more details.

  • postprocessing_yaml – (Optional[str]) Path to a .yaml file with postprocessing parameters.

  • num_workers – (int) Number of workers for DataLoader.

  • verbose – (bool) Whether to display tqdm progress bar.

  • override_config – (Optional[DiarizeConfig]) A config to override the default config.

Returns:

A list of lists of speech segments with corresponding speaker indices, in the format “[begin_seconds, end_seconds, speaker_index]”. If include_tensor_outputs is True, a tuple of the above list and a list of tensors of raw speaker activity probabilities.

Return type:

List[List[str]] if include_tensor_outputs is False, otherwise Tuple[List[List[str]], List[Tensor]]
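To illustrate the documented output format, the segment lists returned by diarize() can be converted to RTTM SPEAKER lines. This sketch assumes each segment is serialized as "&lt;begin&gt; &lt;end&gt; &lt;speaker&gt;"; adjust the parsing if your NeMo version formats segments differently:

```python
def segments_to_rttm(segments, uri="session0"):
    # Assumes each segment is a string "<begin> <end> <speaker>";
    # this serialization is an assumption, not a guaranteed contract.
    lines = []
    for seg in segments:
        begin, end, speaker = seg.split()
        begin, end = float(begin), float(end)
        # RTTM SPEAKER record: uri, channel, onset, duration, speaker.
        lines.append(
            f"SPEAKER {uri} 1 {begin:.3f} {end - begin:.3f} "
            f"<NA> <NA> {speaker} <NA> <NA>"
        )
    return lines
```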

forward(audio_signal, audio_signal_length)[source]#

Forward pass for training and inference.

Parameters:
  • audio_signal (torch.Tensor) – Tensor containing audio waveform Shape: (batch_size, num_samples)

  • audio_signal_length (torch.Tensor) – Tensor containing lengths of audio waveforms Shape: (batch_size,)

Returns:

Sorted tensor containing predicted speaker labels

Shape: (batch_size, max. diar frame count, num_speakers)

Return type:

preds (torch.Tensor)

forward_for_export(
chunk,
chunk_lengths,
spkcache,
spkcache_lengths,
fifo,
fifo_lengths,
)[source]#

This forward pass is for ONNX model export.

Parameters:
  • chunk (torch.Tensor) – Tensor containing audio waveform. The term “chunk” refers to the “input buffer” in the speech processing pipeline. The size of chunk (input buffer) determines the latency introduced by buffering. Shape: (batch_size, feature frame count, dimension)

  • chunk_lengths (torch.Tensor) – Tensor containing lengths of audio waveforms Shape: (batch_size,)

  • spkcache (torch.Tensor) – Tensor containing speaker cache embeddings from start Shape: (batch_size, spkcache_len, emb_dim)

  • spkcache_lengths (torch.Tensor) – Tensor containing lengths of speaker cache Shape: (batch_size,)

  • fifo (torch.Tensor) – Tensor containing embeddings from latest chunks Shape: (batch_size, fifo_len, emb_dim)

  • fifo_lengths (torch.Tensor) – Tensor containing lengths of FIFO queue embeddings Shape: (batch_size,)

Returns:

Sorted tensor containing predicted speaker labels

Shape: (batch_size, max. diar frame count, num_speakers)

chunk_pre_encode_embs (torch.Tensor): Tensor containing pre-encoded embeddings from the chunk

Shape: (batch_size, num_frames, emb_dim)

chunk_pre_encode_lengths (torch.Tensor): Tensor containing lengths of pre-encoded embeddings

from the chunk (=input buffer). Shape: (batch_size,)

Return type:

spkcache_fifo_chunk_preds (torch.Tensor)

forward_infer(emb_seq, emb_seq_length)[source]#

The main forward pass for offline diarization inference.

Parameters:
  • emb_seq (torch.Tensor) – Tensor containing FastConformer encoder states (embedding vectors). Shape: (batch_size, diar_frame_count, emb_dim)

  • emb_seq_length (torch.Tensor) – Tensor containing lengths of FastConformer encoder states. Shape: (batch_size,)

Returns:

Sorted tensor containing Sigmoid values for predicted speaker labels.

Shape: (batch_size, diar_frame_count, num_speakers)

Return type:

preds (torch.Tensor)

forward_streaming(
processed_signal,
processed_signal_length,
)[source]#

The main forward pass for diarization inference in streaming mode.

Parameters:
  • processed_signal (torch.Tensor) – Tensor containing audio waveform Shape: (batch_size, num_samples)

  • processed_signal_length (torch.Tensor) – Tensor containing lengths of audio waveforms Shape: (batch_size,)

Returns:

Tensor containing predicted speaker labels for the current chunk and all previous chunks. Shape: (batch_size, pred_len, num_speakers)

Return type:

total_preds (torch.Tensor)

forward_streaming_step(
processed_signal,
processed_signal_length,
streaming_state,
total_preds,
drop_extra_pre_encoded=0,
left_offset=0,
right_offset=0,
)[source]#

One-step forward pass for diarization inference in streaming mode.

Parameters:
  • processed_signal (torch.Tensor) – Tensor containing audio waveform Shape: (batch_size, num_samples)

  • processed_signal_length (torch.Tensor) – Tensor containing lengths of audio waveforms Shape: (batch_size,)

  • streaming_state (SortformerStreamingState) –

    Tensor variables that contain the streaming state of the model.

    Find more details in the SortformerStreamingState class in sortformer_modules.py.

    spkcache#

    Speaker cache to store embeddings from start

    Type:

    torch.Tensor

    spkcache_lengths#

    Lengths of the speaker cache

    Type:

    torch.Tensor

    spkcache_preds#

    The speaker predictions for the speaker cache parts

    Type:

    torch.Tensor

    fifo#

FIFO queue to save the embeddings from the latest chunks

    Type:

    torch.Tensor

    fifo_lengths#

    Lengths of the FIFO queue

    Type:

    torch.Tensor

    fifo_preds#

    The speaker predictions for the FIFO queue parts

    Type:

    torch.Tensor

    spk_perm#

    Speaker permutation information for the speaker cache

    Type:

    torch.Tensor

  • total_preds (torch.Tensor) – Tensor containing total predicted speaker activity probabilities Shape: (batch_size, cumulative pred length, num_speakers)

  • left_offset (int) – left offset for the current chunk

  • right_offset (int) – right offset for the current chunk

Returns:

Tensor variables that contain the updated streaming state of the model from this function call.

total_preds (torch.Tensor):

Tensor containing the updated total predicted speaker activity probabilities. Shape: (batch_size, cumulative pred length, num_speakers)

Return type:

streaming_state (SortformerStreamingState)
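The spkcache/FIFO bookkeeping described above can be illustrated with a toy sketch (not NeMo code; the real implementation additionally compresses the speaker cache and tracks per-part predictions, both omitted here):

```python
class ToyStreamingState:
    """Toy illustration of the spkcache/FIFO pattern: embeddings from
    the latest chunks enter a bounded FIFO, and entries evicted from
    the FIFO move into a speaker cache that persists from the start
    of the stream."""

    def __init__(self, fifo_len=4):
        self.fifo = []      # embeddings from the latest chunks
        self.spkcache = []  # embeddings kept from the start
        self.fifo_len = fifo_len

    def push_chunk(self, chunk_embs):
        self.fifo.extend(chunk_embs)
        # Evict the oldest FIFO entries into the speaker cache once
        # the FIFO exceeds its capacity.
        while len(self.fifo) > self.fifo_len:
            self.spkcache.append(self.fifo.pop(0))
```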

frontend_encoder(
processed_signal,
processed_signal_length,
bypass_pre_encode: bool = False,
)[source]#

Generate encoder outputs from frontend encoder.

Parameters:
  • processed_signal (torch.Tensor) – tensor containing audio-feature (mel spectrogram, mfcc, etc.).

  • processed_signal_length (torch.Tensor) – tensor containing lengths of audio signal in integers.

Returns:

Tensor containing encoder outputs.

emb_seq_length (torch.Tensor): tensor containing lengths of encoder outputs.

Return type:

emb_seq (torch.Tensor)

property input_names#
property input_types: Dict[str, NeuralType] | None#

Define these to enable input neural type checks

classmethod list_available_models() List[PretrainedModelInfo][source]#

This method returns a list of pre-trained models which can be instantiated directly from NVIDIA’s NGC cloud.

Returns:

List of available pre-trained models.

multi_validation_epoch_end(
outputs: list,
dataloader_idx: int = 0,
)[source]#

Adds support for multiple validation datasets. Should be overridden by a subclass to obtain appropriate logs for each of the dataloaders.

Parameters:
  • outputs – Same as that provided by LightningModule.on_validation_epoch_end() for a single dataloader.

  • dataloader_idx – int representing the index of the dataloader.

Returns:

A dictionary of values, optionally containing a sub-dict log, such that the values in the log will be pre-pended by the dataloader prefix.

on_validation_epoch_end() dict[str, dict[str, Tensor]] | None[source]#

Run validation with sync_dist=True.

oom_safe_feature_extraction(
input_signal,
input_signal_length,
)[source]#

This function divides the input signal into smaller sub-batches and processes them sequentially to prevent out-of-memory errors during feature extraction.

Parameters:
  • input_signal (torch.Tensor) – The input audio signal.

  • input_signal_length (torch.Tensor) – The lengths of the input audio signals.

Returns:

A tuple of (processed_signal, processed_signal_length) where processed_signal is the aggregated audio signal tensor (length matches original batch size) and processed_signal_length contains the lengths of the processed signals.
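The sub-batching idea is simple to sketch in isolation (illustrative only; the hypothetical helper below is not part of the NeMo API):

```python
def iter_subbatches(batch, max_subbatch_size):
    # Sketch of the idea behind oom_safe_feature_extraction: yield
    # fixed-size slices so each forward pass sees a bounded batch;
    # the caller re-concatenates results in the original order.
    for start in range(0, len(batch), max_subbatch_size):
        yield batch[start:start + max_subbatch_size]
```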

property output_names#
property output_types: Dict[str, NeuralType]#

Define these to enable output neural type checks

process_signal(
audio_signal,
audio_signal_length,
)[source]#

Extract audio features from time-series signal for further processing in the model.

This function performs the following steps:

  1. Moves the audio signal to the correct device.

  2. Normalizes the time-series audio signal.

  3. Extracts audio features from the time-series audio signal using the model’s preprocessor.

Parameters:
  • audio_signal (torch.Tensor) – The input audio signal. Shape: (batch_size, num_samples)

  • audio_signal_length (torch.Tensor) – The length of each audio signal in the batch. Shape: (batch_size,)

Returns:

The preprocessed audio signal.

Shape: (batch_size, num_features, num_frames)

processed_signal_length (torch.Tensor): The length of each processed signal.

Shape: (batch_size,)

Return type:

processed_signal (torch.Tensor)

setup_test_data(
test_data_config: DictConfig | Dict | None,
)[source]#

(Optionally) Sets up the data loader to be used in testing.

Parameters:

test_data_config – test data layer parameters.

setup_training_data(
train_data_config: DictConfig | Dict | None,
)[source]#

Sets up the data loader to be used in training.

Parameters:

train_data_config – training data layer parameters.

setup_validation_data(
val_data_layer_config: DictConfig | Dict | None,
)[source]#

Sets up the data loader to be used in validation.

Parameters:

val_data_layer_config – validation data layer parameters.

streaming_export(output: str)[source]#

Exports the model for streaming inference.

streaming_input_examples()[source]#

Input tensor examples for exporting the streaming version of the model.

test_batch()[source]#

Perform batch testing on the model.

This method iterates through the test data loader, making predictions for each batch, and calculates various evaluation metrics. It handles both single and multi-sample batches.

test_dataloader()[source]#

Get the test dataloader.

test_step(
batch: list,
batch_idx: int,
dataloader_idx: int = 0,
)[source]#

Performs a single test step.

This method processes a batch of data during the test phase. It forward-passes the audio signal through the model, computes various evaluation metrics, and stores these metrics for later aggregation.

Parameters:
  • batch (list) – A list containing the following elements: - audio_signal (torch.Tensor): The input audio signal. - audio_signal_length (torch.Tensor): The length of each audio signal in the batch. - targets (torch.Tensor): The target labels for the batch. - target_lens (torch.Tensor): The length of each target sequence in the batch.

  • batch_idx (int) – The index of the current batch.

  • dataloader_idx (int, optional) – The index of the dataloader in case of multiple test dataloaders. Defaults to 0.

Returns:

A dictionary containing various evaluation metrics for this batch.

Return type:

dict

training_step(
batch: list,
batch_idx: int,
) dict[source]#

Performs a single training step.

Parameters:
  • batch (list) – A list containing the following elements: - audio_signal (torch.Tensor): The input audio signal in time-series format. - audio_signal_length (torch.Tensor): The length of each audio signal in the batch. - targets (torch.Tensor): The target labels for the batch. - target_lens (torch.Tensor): The length of each target sequence in the batch.

  • batch_idx (int) – The index of the current batch.

Returns:

A dictionary containing the ‘loss’ key with the calculated loss value.

Return type:

(dict)

validation_step(
batch: list,
batch_idx: int,
dataloader_idx: int = 0,
)[source]#

Performs a single validation step.

This method processes a batch of data during the validation phase. It forward passes the audio signal through the model, computes various validation metrics, and stores these metrics for later aggregation.

Parameters:
  • batch (list) – A list containing the following elements: - audio_signal (torch.Tensor): The input audio signal. - audio_signal_length (torch.Tensor): The length of each audio signal in the batch. - targets (torch.Tensor): The target labels for the batch. - target_lens (torch.Tensor): The length of each target sequence in the batch.

  • batch_idx (int) – The index of the current batch.

  • dataloader_idx (int, optional) – The index of the dataloader in case of multiple validation dataloaders. Defaults to 0.

Returns:

A dictionary containing various validation metrics for this batch.

Return type:

dict

Mixins#

class nemo.collections.asr.parts.mixins.DiarizationMixin[source]#

Bases: VerificationMixin

abstractmethod diarize(
paths2audio_files: List[str],
batch_size: int = 1,
) List[str][source]#

Takes paths to audio files and returns speaker labels.

Parameters:

paths2audio_files – paths to audio files to be diarized.

Returns:

Speaker labels

class nemo.collections.asr.parts.mixins.diarization.SpkDiarizationMixin[source]#

Bases: ABC

An abstract class for diarize-able models.

Creates a template function diarize() that provides an interface to perform diarization of audio tensors or filepaths.

diarize(
audio: str | List[str] | ndarray | List[ndarray] | DataLoader,
sample_rate: int | None = None,
batch_size: int = 1,
include_tensor_outputs: bool = False,
postprocessing_yaml: str | None = None,
num_workers: int = 1,
verbose: bool = False,
override_config: DiarizeConfig | None = None,
**config_kwargs,
) List[Any] | List[List[Any]] | Tuple[Any] | Tuple[List[Any]][source]#

Takes paths to audio files and returns speaker labels

diarize_generator(
audio: str | List[str] | ndarray | List[ndarray] | DataLoader,
override_config: DiarizeConfig | None,
)[source]#

A generator version of the diarize() function.