NeMo Speaker Diarization API#
Model Classes#
- class nemo.collections.asr.models.ClusteringDiarizer(*args: Any, **kwargs: Any)#
Bases: Module, Model, DiarizationMixin
Inference model class for offline speaker diarization. This class handles the functionality required for diarization: speech activity detection, segmentation, embedding extraction, clustering, resegmentation, and scoring. All parameters are passed through the config file.
- diarize(
- paths2audio_files: List[str] | None = None,
- batch_size: int = 0,
Diarize files provided through paths2audio_files or through the manifest file in the config.
- Parameters:
paths2audio_files (List[str]) – list of paths to audio files to diarize
batch_size (int) – batch size used for speaker embedding extraction and VAD computation
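A minimal usage sketch (the config filename, manifest path, and output directory below are assumptions; see the offline diarization inference configs shipped with NeMo for the real field names):

```python
from omegaconf import OmegaConf
from nemo.collections.asr.models import ClusteringDiarizer

# Load an inference config that specifies the input manifest, VAD model,
# speaker embedding model, and clustering parameters.
cfg = OmegaConf.load("diar_infer_telephonic.yaml")      # hypothetical local path
cfg.diarizer.manifest_filepath = "input_manifest.json"  # hypothetical manifest
cfg.diarizer.out_dir = "./diar_outputs"

diarizer = ClusteringDiarizer(cfg=cfg)
diarizer.diarize()  # writes RTTM files with speaker labels to the output directory
```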
- classmethod list_available_models()#
Should list all pre-trained models available via NVIDIA NGC cloud. Note: there is no check that requires model names and aliases to be unique. In the case of a collision, whichever model (or alias) is listed first in the returned list will be instantiated.
- Returns:
A list of PretrainedModelInfo entries
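A short usage sketch (the attribute names follow NeMo's PretrainedModelInfo; the `or []` guard against an empty listing is an assumption):

```python
from nemo.collections.asr.models import ClusteringDiarizer

# Print the name and NGC location of each published pretrained checkpoint.
for info in ClusteringDiarizer.list_available_models() or []:
    print(info.pretrained_model_name, info.location)
```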
- classmethod restore_from(
- restore_path: str,
- override_config_path: str | None = None,
- map_location: torch.device | None = None,
- strict: bool = False,
Restores model instance (weights and configuration) from a .nemo file
- Parameters:
restore_path – path to .nemo file from which model should be instantiated
override_config_path – path to a yaml config that will override the internal config file or an OmegaConf / DictConfig object representing the model config.
map_location – Optional torch.device() to map the instantiated model to a device. By default (None), it will select a GPU if available, falling back to CPU otherwise.
strict – Passed to torch.nn.Module.load_state_dict. Defaults to False, per the signature above.
return_config – If set to true, will return just the underlying config of the restored model as an OmegaConf DictConfig object without instantiating the model.
trainer – An optional Trainer object, passed to the model constructor.
save_restore_connector – An optional SaveRestoreConnector object that defines the implementation of the restore_from() method.
- save_to(save_path: str)#
- Saves model instance (weights and configuration) into an EFF archive or a .nemo file.
You can use the restore_from method to fully restore the instance from the .nemo file.
- A .nemo file is an archive (tar.gz) with the following contents:
model_config.yaml – model configuration in .yaml format; you can deserialize this into the cfg argument for the model’s constructor
model_weights.ckpt – model checkpoint
- Parameters:
save_path – Path to .nemo file where model instance should be saved
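A minimal save/restore round trip (file paths are assumptions; cfg is a diarizer config as in the earlier sketch):

```python
from nemo.collections.asr.models import ClusteringDiarizer

diarizer = ClusteringDiarizer(cfg=cfg)  # cfg: diarizer config, as sketched above
diarizer.save_to("my_diarizer.nemo")    # writes a tar.gz archive with config and weights

# Later, rebuild the same model instance from the archive.
restored = ClusteringDiarizer.restore_from(restore_path="my_diarizer.nemo")
```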
- class nemo.collections.asr.models.EncDecDiarLabelModel(*args: Any, **kwargs: Any)#
Bases: ModelPT, ExportableEncDecModel
Encoder-decoder class for the multiscale diarization decoder (MSDD). This model class creates training and validation methods for setting up data and performing the model forward pass.
- This model class expects a config dict with the following sections (see the sketch after this list):
preprocessor
msdd_model
speaker_model
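A schematic sketch of the three expected config sections; the section contents below are placeholders, since the real schema is defined by the MSDD training configs shipped with NeMo:

```python
from omegaconf import OmegaConf

cfg = OmegaConf.create({
    "preprocessor": {},   # acoustic front end (e.g., mel-spectrogram settings)
    "msdd_model": {},     # multiscale diarization decoder hyperparameters
    "speaker_model": {},  # speaker embedding model used to extract embeddings
})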
- _init_segmentation_info()#
Initialize segmentation settings: window, shift and multiscale weights.
- _init_speaker_model()#
Initialize the speaker embedding model with the model name or path passed through the config. Note that the speaker embedding model is loaded into self.msdd to enable multi-GPU and multi-node training. In addition, the speaker embedding model is also saved with the MSDD model when .ckpt files are saved.
- add_speaker_model_config(cfg)#
Add the config dictionary of the speaker model to the model’s config dictionary. This is required to save and load the speaker model together with the MSDD model.
- Parameters:
cfg (DictConfig) – DictConfig variable that contains the hyperparameters of the MSDD model.
- compute_accuracies()#
Calculate F1 score and accuracy of the predicted sigmoid values.
- Returns:
f1_score (float) – F1 score of the estimated diarized speaker label sequences.
simple_acc (float) – Accuracy of the predicted speaker labels: (total number of correct labels) / (total number of sigmoid values)
- forward_infer(
- input_signal,
- input_signal_length,
- emb_vectors,
- targets,
Wrapper function for the inference case.
- get_cluster_avg_embs_model(
- embs: torch.Tensor,
- clus_label_index: torch.Tensor,
- ms_seg_counts: torch.Tensor,
- scale_mapping,
Calculate the cluster-average speaker embedding based on the ground-truth speaker labels (i.e., cluster labels).
- Parameters:
embs (Tensor) – Merged embeddings without zero-padding in the batch. See ms_seg_counts for details. Shape: (Total number of segments in the batch, emb_dim)
clus_label_index (Tensor) – Merged ground-truth cluster labels from all scales with zero-padding. Each scale’s index can be retrieved by using segment index in ms_seg_counts. Shape: (batch_size, maximum total segment count among the samples in the batch)
ms_seg_counts (Tensor) –
Cumulative sum of the number of segments in each scale. This information is needed to reconstruct multi-scale input tensors during forward propagation.
- Example: batch_size=3, scale_n=6, emb_dim=192
ms_seg_counts =
[[ 8,  9, 12, 16, 25, 51],
 [11, 13, 14, 17, 25, 51],
 [ 9,  9, 11, 16, 23, 50]]
Counts of merged segments: (121, 131, 118), so embs has shape (370, 192) and clus_label_index has shape (3, 131).
Shape: (batch_size, scale_n)
- Returns:
Multi-scale cluster-average speaker embedding vectors. These embedding vectors are used as reference for each speaker to predict the speaker label for the given multi-scale embedding sequences. Shape: (batch_size, scale_n, emb_dim, self.num_spks_per_model)
- Return type:
ms_avg_embs (Tensor)
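An illustrative sketch (not the library’s implementation) of the cluster-average computation for a single scale of a single sample, assuming embs holds that scale’s segment embeddings and labels holds the per-segment cluster (speaker) labels:

```python
import torch

emb_dim, num_spks = 192, 2
embs = torch.randn(121, emb_dim)             # one sample's segments at one scale
labels = torch.randint(0, num_spks, (121,))  # ground-truth cluster labels

# Average all segment embeddings belonging to each speaker.
avg_embs = torch.stack(
    [embs[labels == spk].mean(dim=0) for spk in range(num_spks)],
    dim=1,
)  # shape: (emb_dim, num_spks)
```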
- get_ms_emb_seq(
- embs: torch.Tensor,
- scale_mapping: torch.Tensor,
- ms_seg_counts: torch.Tensor,
Reshape the given tensor and organize the embedding sequence based on the original sequence counts. Repeat the embeddings according to the scale_mapping information so that the final embedding sequence has the same length for all scales.
- Parameters:
embs (Tensor) – Merged embeddings without zero-padding in the batch. See ms_seg_counts for details. Shape: (Total number of segments in the batch, emb_dim)
scale_mapping (Tensor) –
The element at the m-th row and the n-th column of the scale mapping matrix indicates the (m+1)-th scale segment index which has the closest center distance to the (n+1)-th segment in the base scale.
Example: scale_mapping_argmat[2][101] = 85
In the above example, the 86th segment in the 3rd scale (Python index 2) is mapped to the 102nd segment in the base scale. Longer segments are thus bound to repeat more often, since multiple base-scale segments (the base scale has the shortest segment length) fall within the range of a longer segment. At the same time, each row contains N indices, where N is the number of segments in the base scale (i.e., the finest scale). Shape: (batch_size, scale_n, self.diar_window_length)
ms_seg_counts (Tensor) –
Cumulative sum of the number of segments in each scale. This information is needed to reconstruct the multi-scale input matrix during forward propagation.
- Example: batch_size=3, scale_n=6, emb_dim=192
ms_seg_counts =
[[ 8,  9, 12, 16, 25, 51],
 [11, 13, 14, 17, 25, 51],
 [ 9,  9, 11, 16, 23, 50]]
In this function, ms_seg_counts is used to get the actual length of each embedding sequence without zero-padding.
- Returns:
Multi-scale embedding sequence that is mapped, matched and repeated. The longer scales are less repeated, while shorter scales are more frequently repeated following the scale mapping tensor.
- Return type:
ms_emb_seq (Tensor)
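An illustrative per-sample sketch (the names and single-sample layout are assumptions, not the library code) of how scale_mapping gathers and repeats coarser-scale embeddings so that every scale matches the base-scale length:

```python
import torch

emb_dim, scale_n, base_len = 192, 6, 51
# One sample's embeddings per scale, using the segment counts from the example above.
scale_embs = [torch.randn(n, emb_dim) for n in (8, 9, 12, 16, 25, 51)]
# Row s maps each base-scale segment to its closest scale-s segment index.
scale_mapping = torch.stack(
    [torch.randint(0, scale_embs[s].shape[0], (base_len,)) for s in range(scale_n)]
)  # shape: (scale_n, base_len)

ms_emb_seq = torch.stack(
    [scale_embs[s][scale_mapping[s]] for s in range(scale_n)]  # gather = repeat
)  # shape: (scale_n, base_len, emb_dim)
```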
- get_ms_mel_feat(
- processed_signal: torch.Tensor,
- processed_signal_len: torch.Tensor,
- ms_seg_timestamps: torch.Tensor,
- ms_seg_counts: torch.Tensor,
Load acoustic features from the audio segments of each scale and save them into a torch.Tensor matrix. In addition, create variables containing the multiscale subsegmentation information.
Note: self.emb_batch_size determines the number of embedding tensors attached to the computational graph. If self.emb_batch_size is greater than 0, the speaker embedding model is trained simultaneously. Due to the constraint of GPU memory size, only a subset of embedding tensors can be attached to the computational graph. By default, the graph-attached embeddings are selected randomly by torch.randperm. The default value of self.emb_batch_size is 0.
- Parameters:
processed_signal (Tensor) – Zero-padded feature input. Shape: (batch_size, feat_dim, the longest feature sequence length)
processed_signal_len (Tensor) – The actual length of the feature input without zero-padding. Shape: (batch_size,)
ms_seg_timestamps (Tensor) – Timestamps of the base-scale segments. Shape: (batch_size, scale_n, number of base-scale segments, self.num_spks_per_model)
ms_seg_counts (Tensor) – Cumulative sum of the number of segments in each scale. This information is needed to reconstruct the multi-scale input matrix during forward propagation. Shape: (batch_size, scale_n)
- Returns:
- ms_mel_feat (Tensor):
Feature input stream split into segments of equal length. Shape: (total number of segments, feat_dim, self.frame_per_sec * the-longest-scale-length)
- ms_mel_feat_len (Tensor):
The actual length of each feature segment without zero-padding. Shape: (total number of segments,)
- seq_len (Tensor):
The length of the input embedding sequences. Shape: (total number of segments,)
- detach_ids (tuple):
Tuple containing both detached embedding indices and attached embedding indices.
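A sketch of the random attach/detach selection described in the note above (variable names are assumptions; this is not the library’s code):

```python
import torch

total_segments, emb_batch_size = 370, 64
perm = torch.randperm(total_segments)
attach_ids = perm[:emb_batch_size]  # embeddings kept on the computational graph
detach_ids = perm[emb_batch_size:]  # embeddings computed without gradient tracking
```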
- setup_test_data(
- test_data_config: omegaconf.DictConfig | Dict | None,
(Optionally) Sets up the data loader to be used in testing.
- Parameters:
test_data_config – test data layer parameters.
- setup_training_data(
- train_data_config: omegaconf.DictConfig | Dict | None,
Sets up the data loader to be used in training.
- Parameters:
train_data_config – training data layer parameters.
- setup_validation_data(
- val_data_layer_config: omegaconf.DictConfig | Dict | None,
Sets up the data loader to be used in validation.
- Parameters:
val_data_layer_config – validation data layer parameters.
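A hedged wiring sketch for these setup methods (the pretrained model name is published on NGC, but the config fields shown are assumptions; the real schema is defined by the model’s dataset class):

```python
from omegaconf import OmegaConf
from nemo.collections.asr.models import EncDecDiarLabelModel

model = EncDecDiarLabelModel.from_pretrained("diar_msdd_telephonic")
model.setup_test_data(OmegaConf.create({
    "manifest_filepath": "test_manifest.json",  # hypothetical manifest
    "batch_size": 4,
}))
```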
Mixins#
- class nemo.collections.asr.parts.mixins.mixins.DiarizationMixin#
Bases: VerificationMixin
- abstract diarize(
- paths2audio_files: List[str],
- batch_size: int = 1,
Takes paths to audio files and returns speaker labels.
- Parameters:
paths2audio_files – paths to the audio fragments to be diarized
- Returns:
Speaker labels
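A hypothetical subclass sketch showing how the abstract method might be satisfied (the class and its placeholder logic are illustrations only):

```python
from typing import List
from nemo.collections.asr.parts.mixins.mixins import DiarizationMixin

class MyDiarizer(DiarizationMixin):
    def diarize(self, paths2audio_files: List[str], batch_size: int = 1):
        # Placeholder: return one speaker-label list per input file.
        return [["speaker_0"] for _ in paths2audio_files]
```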