NeMo Speaker Diarization API

class nemo.collections.asr.models.ClusteringDiarizer(*args: Any, **kwargs: Any)

Bases: torch.nn.Module, nemo.core.classes.common.Model, nemo.collections.asr.parts.mixins.mixins.DiarizationMixin

Inference model class for offline speaker diarization. This class handles the functionality required for diarization: Speech Activity Detection, Segmentation, Embedding Extraction, Clustering, Resegmentation, and Scoring. All parameters are passed through a config file.

diarize(paths2audio_files: Optional[List[str]] = None, batch_size: int = 0)

Diarize files provided through paths2audio_files or through the manifest file.

Parameters
  • paths2audio_files (List[str]) – list of paths to audio files to diarize

  • batch_size (int) – batch size used for speaker embedding extraction and VAD computation
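
A minimal usage sketch is shown below; the config file name, manifest path, and config keys are placeholders and depend on the NeMo release:

    from omegaconf import OmegaConf
    from nemo.collections.asr.models import ClusteringDiarizer

    cfg = OmegaConf.load("diar_infer.yaml")                  # hypothetical inference config
    cfg.diarizer.manifest_filepath = "input_manifest.json"   # hypothetical manifest of audio files
    diarizer = ClusteringDiarizer(cfg=cfg)
    diarizer.diarize()  # runs VAD, segmentation, embedding extraction, clustering, and scoring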

classmethod list_available_models()

Should list all pre-trained models available via the NVIDIA NGC cloud. Note: there is no check that requires model names and aliases to be unique. In the case of a collision, whichever model (or alias) is listed first in this returned list will be instantiated.

Returns

A list of PretrainedModelInfo entries
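
For example, the entries can be iterated as below; the returned list may be empty depending on the class and NeMo release:

    from nemo.collections.asr.models import ClusteringDiarizer

    for model_info in ClusteringDiarizer.list_available_models() or []:
        print(model_info.pretrained_model_name)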

classmethod restore_from(restore_path: str, override_config_path: Optional[str] = None, map_location: Optional[torch.device] = None, strict: bool = False)

Restores model instance (weights and configuration) from a .nemo file

Parameters
  • restore_path – path to .nemo file from which model should be instantiated

  • override_config_path – path to a yaml config that will override the internal config file or an OmegaConf / DictConfig object representing the model config.

  • map_location – Optional torch.device() to map the instantiated model to a device. By default (None), it will select a GPU if available, falling back to CPU otherwise.

  • strict – Passed to load_state_dict. Defaults to False, as shown in the method signature.

  • return_config – If set to true, will return just the underlying config of the restored model as an OmegaConf DictConfig object without instantiating the model.

  • trainer – An optional Trainer object, passed to the model constructor.

  • save_restore_connector – An optional SaveRestoreConnector object that defines the implementation of the restore_from() method.
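
A hedged example of restoring a model from a local .nemo archive (the file name below is a placeholder):

    import torch
    from nemo.collections.asr.models import ClusteringDiarizer

    model = ClusteringDiarizer.restore_from(
        restore_path="my_diarizer.nemo",
        map_location=torch.device("cpu"),  # force CPU; omit to auto-select a GPU when available
    )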

save_to(save_path: str)
Saves model instance (weights and configuration) into an EFF archive or a .nemo file.

You can use the restore_from() method to fully restore the instance from a .nemo file.

A .nemo file is an archive (tar.gz) with the following contents:

model_config.yaml – model configuration in .yaml format. You can deserialize this into the cfg argument for the model's constructor.
model_weights.ckpt – model checkpoint

Parameters

save_path – Path to .nemo file where model instance should be saved
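
A short save/restore round trip, assuming model is an instantiated ClusteringDiarizer and the path is a placeholder:

    model.save_to("diarizer_checkpoint.nemo")
    restored = ClusteringDiarizer.restore_from("diarizer_checkpoint.nemo")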

class nemo.collections.asr.models.EncDecDiarLabelModel(*args: Any, **kwargs: Any)

Bases: nemo.core.classes.modelPT.ModelPT, nemo.collections.asr.models.asr_model.ExportableEncDecModel

Encoder-decoder class for the multiscale diarization decoder (MSDD). This model class creates training and validation methods for setting up data and performing the model forward pass.

This model class expects config dict for:
  • preprocessor

  • msdd_model

  • speaker_model

_init_segmentation_info()

Initialize segmentation settings: window, shift and multiscale weights.

_init_speaker_model()

Initialize the speaker embedding model with a model name or path passed through the config. Note that the speaker embedding model is loaded into self.msdd to enable multi-GPU and multi-node training. In addition, the speaker embedding model is also saved with the MSDD model when .ckpt files are saved.

add_speaker_model_config(cfg)

Add the config dictionary of the speaker model to the model's config dictionary. This is required to save and load the speaker model together with the MSDD model.

Parameters

cfg (DictConfig) – DictConfig variable that contains the hyperparameters of the MSDD model.

compute_accuracies()

Calculate F1 score and accuracy of the predicted sigmoid values.

Returns

f1_score (float): F1 score of the estimated diarized speaker label sequences.

simple_acc (float): Accuracy of the predicted speaker labels: (total # of correct labels) / (total # of sigmoid values)

Return type

f1_score (float), simple_acc (float)
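
The sketch below is illustrative only (not the library's internal code); it shows how an F1 score and the simple accuracy described above could be computed from sigmoid outputs and binary speaker targets, with made-up tensor names and shapes:

    import torch
    from sklearn.metrics import f1_score

    logits = torch.randn(4, 50, 2)              # (batch, frames, speakers) dummy scores
    targets = torch.randint(0, 2, (4, 50, 2))   # dummy binary speaker labels

    preds = (torch.sigmoid(logits) > 0.5).int().flatten()
    labels = targets.int().flatten()
    f1 = f1_score(labels.numpy(), preds.numpy())
    simple_acc = (preds == labels).float().mean().item()  # correct labels / total sigmoid values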

forward_infer(input_signal, input_signal_length, emb_vectors, targets)

Wrapper function for inference case.

get_cluster_avg_embs_model(embs: torch.Tensor, clus_label_index: torch.Tensor, ms_seg_counts: torch.Tensor, scale_mapping) → torch.Tensor

Calculate the cluster-average speaker embedding based on the ground-truth speaker labels (i.e., cluster labels).

Parameters
  • embs (Tensor) – Merged embeddings without zero-padding in the batch. See ms_seg_counts for details. Shape: (Total number of segments in the batch, emb_dim)

  • clus_label_index (Tensor) – Merged ground-truth cluster labels from all scales with zero-padding. Each scale’s index can be retrieved by using segment index in ms_seg_counts. Shape: (batch_size, maximum total segment count among the samples in the batch)

  • ms_seg_counts (Tensor) –

    Cumulative sum of the number of segments in each scale. This information is needed to reconstruct multi-scale input tensors during forward propagation.

    Example: batch_size=3, scale_n=6, emb_dim=192

    ms_seg_counts =
    [[ 8,  9, 12, 16, 25, 51],
     [11, 13, 14, 17, 25, 51],
     [ 9,  9, 11, 16, 23, 50]]

    Counts of merged segments: (121, 131, 118)
    embs has a shape of (370, 192)
    clus_label_index has a shape of (3, 131)

    Shape: (batch_size, scale_n)

Returns

Multi-scale cluster-average speaker embedding vectors. These embedding vectors are used as reference for each speaker to predict the speaker label for the given multi-scale embedding sequences. Shape: (batch_size, scale_n, emb_dim, self.num_spks_per_model)

Return type

ms_avg_embs (Tensor)
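
As an illustrative sketch (not library code), the per-sample, per-scale segment counts in ms_seg_counts can be used to split the merged, unpadded embedding tensor from the example above:

    import torch

    emb_dim = 192
    ms_seg_counts = torch.tensor([[ 8,  9, 12, 16, 25, 51],
                                  [11, 13, 14, 17, 25, 51],
                                  [ 9,  9, 11, 16, 23, 50]])
    total_segments = int(ms_seg_counts.sum())       # 121 + 131 + 118 = 370
    embs = torch.randn(total_segments, emb_dim)     # merged embeddings, no zero-padding

    # one chunk per (sample, scale) pair, in row-major order
    chunks = torch.split(embs, ms_seg_counts.flatten().tolist())
    print(len(chunks), chunks[0].shape)             # 18 chunks; the first has shape (8, 192)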

get_ms_emb_seq(embs: torch.Tensor, scale_mapping: torch.Tensor, ms_seg_counts: torch.Tensor) → torch.Tensor

Reshape the given tensor and organize the embedding sequence based on the original sequence counts. Repeat the embeddings according to the scale_mapping information so that the final embedding sequence has an identical length across all scales.

Parameters
  • embs (Tensor) – Merged embeddings without zero-padding in the batch. See ms_seg_counts for details. Shape: (Total number of segments in the batch, emb_dim)

  • scale_mapping (Tensor) –

    The element at the m-th row and n-th column of the scale mapping matrix indicates the (m+1)-th scale segment index that has the closest center distance to the (n+1)-th segment in the base scale.

    Example:

    scale_mapping_argmat[2][101] = 85

    This means that the 86th segment in the 3rd scale (Python index 2) is mapped to the 102nd segment in the base scale. Longer segments are therefore bound to have more repeated indices, since multiple base-scale segments (the base scale has the shortest segment length) fall into the range of a longer segment. At the same time, each row contains N indices, where N is the number of segments in the base scale (i.e., the finest scale).

    Shape: (batch_size, scale_n, self.diar_window_length)

  • ms_seg_counts (Tensor) –

    Cumulative sum of the number of segments in each scale. This information is needed to reconstruct the multi-scale input matrix during forward propagation.

    Example: batch_size=3, scale_n=6, emb_dim=192

    ms_seg_counts =
    [[ 8,  9, 12, 16, 25, 51],
     [11, 13, 14, 17, 25, 51],
     [ 9,  9, 11, 16, 23, 50]]

    In this function, ms_seg_counts is used to get the actual length of each embedding sequence without zero-padding.

Returns

Multi-scale embedding sequence that is mapped, matched and repeated. The longer scales are less repeated, while shorter scales are more frequently repeated following the scale mapping tensor.

Return type

ms_emb_seq (Tensor)
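
The following toy snippet (illustrative only, with made-up sizes) shows the repetition idea: base-scale positions index into a coarser scale's embeddings through a scale-mapping row, so each coarse embedding is repeated for every base-scale segment it covers:

    import torch

    emb_dim = 4
    coarse_embs = torch.randn(3, emb_dim)                 # 3 segments in a coarser scale
    scale_mapping_row = torch.tensor([0, 0, 1, 1, 2, 2])  # base segment -> coarse segment index
    repeated = coarse_embs[scale_mapping_row]             # shape: (6, emb_dim), matches base-scale length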

get_ms_mel_feat(processed_signal: torch.Tensor, processed_signal_len: torch.Tensor, ms_seg_timestamps: torch.Tensor, ms_seg_counts: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]

Load acoustic features from audio segments for each scale and save them into a torch.Tensor matrix. In addition, create variables containing the multiscale subsegmentation information.

Note: self.emb_batch_size determines the number of embedding tensors attached to the computational graph. If self.emb_batch_size is greater than 0, the speaker embedding model is trained simultaneously. Due to GPU memory constraints, only a subset of embedding tensors can be attached to the computational graph. By default, the graph-attached embeddings are selected randomly by torch.randperm. The default value of self.emb_batch_size is 0.
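
A hedged illustration of this selection, with made-up sizes and names (only the torch.randperm idea is taken from the note above):

    import torch

    emb_batch_size = 4                                   # stand-in for self.emb_batch_size
    embs = torch.randn(10, 192, requires_grad=True)
    perm = torch.randperm(embs.shape[0])
    attach_mask = torch.zeros(embs.shape[0], dtype=torch.bool)
    attach_mask[perm[:emb_batch_size]] = True
    # rows selected by the mask stay attached to the graph; the rest are detached
    embs_out = torch.where(attach_mask.unsqueeze(1), embs, embs.detach())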

Parameters
  • processed_signal (Tensor) – Zero-padded Feature input. Shape: (batch_size, feat_dim, the longest feature sequence length)

  • processed_signal_len (Tensor) – The actual length of the feature input without zero-padding. Shape: (batch_size,)

  • ms_seg_timestamps (Tensor) – Timestamps of the base-scale segments. Shape: (batch_size, scale_n, number of base-scale segments, self.num_spks_per_model)

  • ms_seg_counts (Tensor) – Cumulative sum of the number of segments in each scale. This information is needed to reconstruct the multi-scale input matrix during forward propagating. Shape: (batch_size, scale_n)

Returns

ms_mel_feat (Tensor):

Feature input stream split into the same length. Shape: (total number of segments, feat_dim, self.frame_per_sec * the-longest-scale-length)

ms_mel_feat_len (Tensor):

The actual length of feature without zero-padding. Shape: (total number of segments,)

seq_len (Tensor):

The length of the input embedding sequences. Shape: (total number of segments,)

detach_ids (tuple):

Tuple containing both detached embedding indices and attached embedding indices.

Return type

ms_mel_feat (Tensor)

setup_test_data(test_data_config: Optional[Union[omegaconf.DictConfig, Dict]])

(Optionally) Sets up the data loader to be used in testing

Parameters

test_data_layer_config – test data layer parameters.

Returns:

setup_training_data(train_data_config: Optional[Union[omegaconf.DictConfig, Dict]])

Sets up the data loader to be used in training

Parameters

train_data_layer_config – training data layer parameters.

Returns:

setup_validation_data(val_data_layer_config: Optional[Union[omegaconf.DictConfig, Dict]])

Sets up the data loader to be used in validation

Parameters

val_data_layer_config – validation data layer parameters.

Returns:

class nemo.collections.asr.parts.mixins.mixins.DiarizationMixin

Bases: abc.ABC

abstract diarize(paths2audio_files: List[str], batch_size: int = 1) → List[str]

Takes paths to audio files and returns speaker labels.

Parameters

paths2audio_files – paths to audio fragments to be diarized

Returns

Speaker labels
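
A hypothetical subclass sketch showing the required signature (the class and its return values are placeholders, not part of NeMo):

    from typing import List
    from nemo.collections.asr.parts.mixins.mixins import DiarizationMixin

    class MyDiarizer(DiarizationMixin):
        def diarize(self, paths2audio_files: List[str], batch_size: int = 1) -> List[str]:
            # placeholder logic: return one speaker-label string per input audio file
            return ["speaker_0"] * len(paths2audio_files)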
