NeMo Speaker Diarization API#
Model Classes#
- class nemo.collections.asr.models.ClusteringDiarizer(*args: Any, **kwargs: Any)[source]#
Bases: Module, Model, DiarizationMixin
Inference model class for offline speaker diarization. This class handles the functionality required for diarization: speech activity detection, segmentation, embedding extraction, clustering, resegmentation, and scoring. All parameters are passed through a config file.
- diarize(paths2audio_files: Optional[List[str]] = None, batch_size: int = 0)[source]#
Diarize files provided through paths2audio_files or a manifest file.
- Parameters
paths2audio_files (List[str]) – list of paths to audio files
batch_size (int) – batch size used for extraction of speaker embeddings and VAD computation
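A minimal usage sketch (the config path and its contents are illustrative assumptions, not part of this reference):

```python
from omegaconf import OmegaConf
from nemo.collections.asr.models import ClusteringDiarizer

# Hypothetical path to a diarization inference YAML (e.g. adapted from
# the NeMo examples); it must point to a valid input manifest.
cfg = OmegaConf.load("diar_infer.yaml")

diarizer = ClusteringDiarizer(cfg=cfg)
# With no arguments, diarize() uses the manifest from the config and runs
# VAD, segmentation, embedding extraction, clustering, and scoring.
diarizer.diarize()
```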
- classmethod list_available_models()[source]#
Should list all pre-trained models available via the NVIDIA NGC cloud. Note: There is no check that requires model names and aliases to be unique. In the case of a collision, whichever model (or alias) is listed first in the returned list will be instantiated.
- Returns
A list of PretrainedModelInfo entries
- classmethod restore_from(restore_path: str, override_config_path: Optional[str] = None, map_location: Optional[torch.device] = None, strict: bool = False)[source]#
Restores model instance (weights and configuration) from a .nemo file
- Parameters
restore_path – path to .nemo file from which model should be instantiated
override_config_path – path to a yaml config that will override the internal config file or an OmegaConf / DictConfig object representing the model config.
map_location – Optional torch.device() to map the instantiated model to a device. By default (None), it will select a GPU if available, falling back to CPU otherwise.
strict – Passed to load_state_dict. By default False
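For example, restoring a saved diarizer might look like the following sketch (the .nemo path is a placeholder):

```python
import torch
from nemo.collections.asr.models import ClusteringDiarizer

# Placeholder path to a previously saved .nemo archive.
model = ClusteringDiarizer.restore_from(
    restore_path="my_diarizer.nemo",
    map_location=torch.device("cpu"),  # force CPU; None prefers GPU if available
)
```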
- save_to(save_path: str)#
- Saves model instance (weights and configuration) into EFF archive or .nemo file.
You can use the “restore_from” method to fully restore the instance from a .nemo file.
- .nemo file is an archive (tar.gz) with the following:
model_config.yaml – model configuration in .yaml format. You can deserialize this into the cfg argument for the model’s constructor.
model_weights.ckpt – model checkpoint
- Parameters
save_path – Path to .nemo file where model instance should be saved
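A sketch of the save/restore round trip, assuming diarizer is an initialized ClusteringDiarizer instance (paths are placeholders):

```python
# Write model_config.yaml and the model weights into a single archive.
diarizer.save_to("my_diarizer.nemo")

# Later, fully restore the instance from the archive.
restored = ClusteringDiarizer.restore_from("my_diarizer.nemo")
```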
- property verbose: bool#
- class nemo.collections.asr.models.EncDecDiarLabelModel(*args: Any, **kwargs: Any)[source]#
Bases: ModelPT, ExportableEncDecModel
Encoder-decoder class for the multiscale diarization decoder (MSDD). This model class creates training and validation methods for setting up data and performing the model forward pass.
- This model class expects a config dict for the following components (see the skeleton sketch after this list):
preprocessor
msdd_model
speaker_model
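A skeleton of the expected config structure (keys only; all hyperparameter values are omitted, and the _target_ shown is an assumption based on common NeMo preprocessor usage, not prescribed by this reference):

```python
from omegaconf import OmegaConf

cfg = OmegaConf.create({
    "preprocessor": {
        # Typically a mel-spectrogram preprocessor config in NeMo setups.
        "_target_": "nemo.collections.asr.modules.AudioToMelSpectrogramPreprocessor",
    },
    "msdd_model": {},     # MSDD decoder hyperparameters
    "speaker_model": {},  # speaker embedding model settings
})
```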
- add_speaker_model_config(cfg)[source]#
Add config dictionary of the speaker model to the model’s config dictionary. This is required to save and load speaker model with MSDD model.
- Parameters
cfg (DictConfig) – DictConfig type variable that contains hyperparameters of the MSDD model.
- compute_accuracies()[source]#
Calculate F1 score and accuracy of the predicted sigmoid values.
- Returns
f1_score (float) – F1 score of the estimated diarized speaker label sequences.
simple_acc (float) – Accuracy of predicted speaker labels: (total # of correct labels) / (total # of sigmoid values)
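To illustrate the two metrics (a standalone sketch, not the method’s internal code), F1 and simple accuracy over thresholded sigmoid values can be computed like this:

```python
import torch

# Hypothetical sigmoid outputs and binary speaker-label targets.
sigmoid_vals = torch.tensor([0.9, 0.2, 0.7, 0.4])
targets = torch.tensor([1.0, 0.0, 1.0, 1.0])

preds = (sigmoid_vals > 0.5).float()
tp = ((preds == 1) & (targets == 1)).sum().float()
fp = ((preds == 1) & (targets == 0)).sum().float()
fn = ((preds == 0) & (targets == 1)).sum().float()

f1_score = 2 * tp / (2 * tp + fp + fn)          # 0.8 for this toy input
simple_acc = (preds == targets).float().mean()  # 0.75: 3 of 4 labels correct
```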
- forward(features, feature_length, ms_seg_timestamps, ms_seg_counts, clus_label_index, scale_mapping, targets)[source]#
- forward_infer(input_signal, input_signal_length, emb_vectors, targets)[source]#
Wrapper function for inference case.
- get_cluster_avg_embs_model(embs: torch.Tensor, clus_label_index: torch.Tensor, ms_seg_counts: torch.Tensor, scale_mapping) torch.Tensor #
Calculate the cluster-average speaker embedding based on the ground-truth speaker labels (i.e., cluster labels).
- Parameters
embs (Tensor) – Merged embeddings without zero-padding in the batch. See ms_seg_counts for details. Shape: (Total number of segments in the batch, emb_dim)
clus_label_index (Tensor) – Merged ground-truth cluster labels from all scales with zero-padding. Each scale’s index can be retrieved by using segment index in ms_seg_counts. Shape: (batch_size, maximum total segment count among the samples in the batch)
ms_seg_counts (Tensor) –
Cumulative sum of the number of segments in each scale. This information is needed to reconstruct multi-scale input tensors during forward propagation.
- Example: batch_size=3, scale_n=6, emb_dim=192
ms_seg_counts =
[[ 8,  9, 12, 16, 25, 51],
 [11, 13, 14, 17, 25, 51],
 [ 9,  9, 11, 16, 23, 50]]
Counts of merged segments: (121, 131, 118)
embs has shape of (370, 192)
clus_label_index has shape of (3, 131)
Shape: (batch_size, scale_n)
- Returns
Multi-scale cluster-average speaker embedding vectors. These embedding vectors are used as reference for each speaker to predict the speaker label for the given multi-scale embedding sequences. Shape: (batch_size, scale_n, emb_dim, self.num_spks_per_model)
- Return type
ms_avg_embs (Tensor)
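The shape bookkeeping from the docstring example can be reproduced with a short sketch (random embeddings stand in for real ones):

```python
import torch

ms_seg_counts = torch.tensor([[ 8,  9, 12, 16, 25, 51],
                              [11, 13, 14, 17, 25, 51],
                              [ 9,  9, 11, 16, 23, 50]])
emb_dim = 192

per_sample = ms_seg_counts.sum(dim=1)               # tensor([121, 131, 118])
embs = torch.randn(int(per_sample.sum()), emb_dim)  # merged, unpadded: (370, 192)

# Split merged embeddings back into per-sample chunks, then per-scale chunks.
sample_chunks = torch.split(embs, per_sample.tolist())
scale_chunks = [torch.split(chunk, counts.tolist())
                for chunk, counts in zip(sample_chunks, ms_seg_counts)]
```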
- get_ms_emb_seq(embs: torch.Tensor, scale_mapping: torch.Tensor, ms_seg_counts: torch.Tensor) torch.Tensor [source]#
Reshape the given tensor and organize the embedding sequence based on the original sequence counts. Repeat the embeddings according to the scale_mapping information so that the final embedding sequence has the same length for all scales.
- Parameters
embs (Tensor) – Merged embeddings without zero-padding in the batch. See ms_seg_counts for details. Shape: (Total number of segments in the batch, emb_dim)
scale_mapping (Tensor) –
The element at the m-th row and the n-th column of the scale mapping matrix indicates the (m+1)-th scale segment index which has the closest center distance to the (n+1)-th segment in the base scale.
Example:
scale_mapping_argmat[2][101] = 85
In the above example, the 86th segment in the 3rd scale (python index is 2) is mapped to the 102nd segment in the base scale. Thus, longer segments are bound to have more repeated indices, since multiple base-scale segments (the base scale has the shortest segment length) fall into the range of a longer segment. At the same time, each row contains N indices, where N is the number of segments in the base scale (i.e., the finest scale). Shape: (batch_size, scale_n, self.diar_window_length)
ms_seg_counts (Tensor) –
Cumulative sum of the number of segments in each scale. This information is needed to reconstruct the multi-scale input matrix during forward propagation.
- Example: batch_size=3, scale_n=6, emb_dim=192
ms_seg_counts =
[[ 8,  9, 12, 16, 25, 51],
 [11, 13, 14, 17, 25, 51],
 [ 9,  9, 11, 16, 23, 50]]
In this function, ms_seg_counts is used to get the actual length of each embedding sequence without zero-padding.
- Returns
Multi-scale embedding sequence that is mapped, matched and repeated. The longer scales are less repeated, while shorter scales are more frequently repeated following the scale mapping tensor.
- Return type
ms_emb_seq (Tensor)
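A toy sketch of the repetition step (all names and values here are made up for illustration): indexing a coarse scale’s embeddings with its scale-mapping row repeats each coarse embedding once per mapped base-scale segment, so every scale ends up with a base-scale-length sequence.

```python
import torch

emb_dim = 4
# Hypothetical embeddings for one coarse scale containing 3 segments.
coarse_embs = torch.randn(3, emb_dim)
# Hypothetical mapping row: base-scale segment -> closest coarse segment index.
scale_mapping_row = torch.tensor([0, 0, 1, 1, 2, 2])

# Each coarse embedding is repeated for every base-scale segment mapped to it.
expanded = coarse_embs[scale_mapping_row]  # shape: (6, 4), base-scale length 6
```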
- get_ms_mel_feat(processed_signal: torch.Tensor, processed_signal_len: torch.Tensor, ms_seg_timestamps: torch.Tensor, ms_seg_counts: torch.Tensor) Tuple[torch.Tensor, torch.Tensor, torch.Tensor, Tuple[torch.Tensor, torch.Tensor]] #
Load acoustic features from audio segments for each scale and save them into a torch.Tensor matrix. In addition, create variables containing the multiscale subsegmentation information.
Note: self.emb_batch_size determines the number of embedding tensors attached to the computational graph. If self.emb_batch_size is greater than 0, speaker embedding models are simultaneously trained. Due to the constraint of GPU memory size, only a subset of embedding tensors can be attached to the computational graph. By default, the graph-attached embeddings are selected randomly by torch.randperm. The default value of self.emb_batch_size is 0.
- Parameters
processed_signal (Tensor) – Zero-padded Feature input. Shape: (batch_size, feat_dim, the longest feature sequence length)
processed_signal_len (Tensor) – The actual length of feature input without zero-padding. Shape: (batch_size,)
ms_seg_timestamps (Tensor) – Timestamps of the base-scale segments. Shape: (batch_size, scale_n, number of base-scale segments, self.num_spks_per_model)
ms_seg_counts (Tensor) – Cumulative sum of the number of segments in each scale. This information is needed to reconstruct the multi-scale input matrix during forward propagation. Shape: (batch_size, scale_n)
- Returns
ms_mel_feat (Tensor) – Feature input stream split into the same length. Shape: (total number of segments, feat_dim, self.frame_per_sec * the-longest-scale-length)
ms_mel_feat_len (Tensor) – The actual length of each feature without zero-padding. Shape: (total number of segments,)
seq_len (Tensor) – The length of the input embedding sequences. Shape: (total number of segments,)
detach_ids (tuple) – Tuple containing both detached embedding indices and attached embedding indices
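The emb_batch_size note can be sketched as follows (illustrative only; variable names are assumptions, not the method’s actual code): a random subset of segment indices stays attached to the computational graph while the rest are detached to save GPU memory.

```python
import torch

total_segments, emb_batch_size = 10, 4

perm = torch.randperm(total_segments)
attached_ids = perm[:emb_batch_size]  # embeddings kept in the graph (trainable)
detached_ids = perm[emb_batch_size:]  # embeddings detached to bound GPU memory
detach_ids = (detached_ids, attached_ids)
```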
- property input_types: Optional[Dict[str, NeuralType]]#
Define these to enable input neural type checks
- classmethod list_available_models() List[PretrainedModelInfo] [source]#
This method returns a list of pre-trained models that can be instantiated directly from NVIDIA’s NGC cloud.
- Returns
List of available pre-trained models.
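A usage sketch (assuming the standard PretrainedModelInfo fields):

```python
from nemo.collections.asr.models import EncDecDiarLabelModel

for model_info in EncDecDiarLabelModel.list_available_models():
    print(model_info.pretrained_model_name)
```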
- multi_test_epoch_end(outputs: List[Dict[str, torch.Tensor]], dataloader_idx: int = 0)[source]#
Adds support for multiple test datasets. Should be overridden by subclasses to obtain appropriate logs for each of the dataloaders.
- Parameters
outputs – Same as that provided by LightningModule.test_epoch_end() for a single dataloader.
dataloader_idx – int representing the index of the dataloader.
- Returns
A dictionary of values, optionally containing a sub-dict log, such that the values in the log will be prepended with the dataloader prefix.
- multi_validation_epoch_end(outputs: list, dataloader_idx: int = 0)[source]#
Adds support for multiple validation datasets. Should be overridden by subclasses to obtain appropriate logs for each of the dataloaders.
- Parameters
outputs – Same as that provided by LightningModule.validation_epoch_end() for a single dataloader.
dataloader_idx – int representing the index of the dataloader.
- Returns
A dictionary of values, optionally containing a sub-dict log, such that the values in the log will be prepended with the dataloader prefix.
- property output_types: Dict[str, NeuralType]#
Define these to enable output neural type checks
- setup_multiple_test_data(test_data_config)[source]#
MSDD does not use the multiple_test_data template. This function is a placeholder to prevent errors.
- setup_test_data(test_data_config: Optional[Union[omegaconf.DictConfig, Dict]])[source]#
(Optionally) Sets up the data loader to be used during testing.
- Parameters
test_data_config – test data layer parameters.
- setup_training_data(train_data_config: Optional[Union[omegaconf.DictConfig, Dict]])[source]#
Sets up the data loader to be used in training.
- Parameters
train_data_config – training data layer parameters.