NeMo Speaker Recognition API#
Model Classes#
- class nemo.collections.asr.models.label_models.EncDecSpeakerLabelModel(*args: Any, **kwargs: Any)[source]#
Bases: ModelPT, ExportableEncDecModel
Encoder-decoder class for speaker label models. The model class creates training and validation methods for setting up data and performing the model forward pass. Expects a config dict for:
preprocessor
Jasper/Quartznet Encoder
Speaker Decoder
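As a starting point, a pretrained speaker model can be restored from NGC via from_pretrained (inherited from ModelPT). This is a minimal sketch; "titanet_large" is just one example checkpoint name (see list_available_models below).

```python
# A minimal sketch, assuming NeMo is installed and NGC is reachable.
# "titanet_large" is one example checkpoint; any name returned by
# EncDecSpeakerLabelModel.list_available_models() works here.
import nemo.collections.asr as nemo_asr

speaker_model = nemo_asr.models.EncDecSpeakerLabelModel.from_pretrained(
    model_name="titanet_large"
)
speaker_model.eval()
```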
- batch_inference(manifest_filepath, batch_size=32, sample_rate=16000, device='cuda')#
Perform batch inference on EncDecSpeakerLabelModel. To perform inference on a single audio file, one can use infer_file, get_label, or get_embedding.
- To map predicted labels, one can do:
arg_values = logits.argmax(axis=1)
pred_labels = list(map(lambda t: trained_labels[t], arg_values))
- Parameters
manifest_filepath – Path to manifest file
batch_size – batch size to perform batch inference
sample_rate – sample rate of audio files in manifest file
device – compute device to perform operations.
- Returns
The variables below all follow the audio file order in the manifest file.
embs: embeddings of the files provided in the manifest file
logits: logits of the final layer of the EncDecSpeakerLabel model
gt_labels: labels from the manifest file (needed for speaker enrollment and testing)
trained_labels: classification labels sorted in the order that they are mapped by the trained model
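The sketch below (continuing with speaker_model from the earlier example) runs batch inference over a manifest and maps the returned logits back to the trained class labels; the manifest path is a placeholder.

```python
# Placeholder manifest path; speaker_model is the EncDecSpeakerLabelModel
# loaded in the earlier sketch. Return values follow the manifest order.
embs, logits, gt_labels, trained_labels = speaker_model.batch_inference(
    manifest_filepath="audio_manifest.json",
    batch_size=32,
    sample_rate=16000,
    device="cuda",
)

# Map each file's highest-scoring logit to its trained class label.
arg_values = logits.argmax(axis=1)
pred_labels = [trained_labels[t] for t in arg_values]
```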
- forward_for_export(processed_signal, processed_signal_len)[source]#
This forward is used when the model needs to be exported to ONNX format. The inputs cache_last_channel and cache_last_time must be passed when exporting streaming models.
- Parameters
input – Tensor that represents a batch of raw audio signals, of shape [B, T], where T represents timesteps.
length – Vector of length B that contains the individual lengths of the audio sequences.
cache_last_channel – Tensor of shape [N, B, T, H] which contains the cache for the last channel layers.
cache_last_time – Tensor of shape [N, B, H, T] which contains the cache for the last time layers. N is the number of layers that need caching, B is the batch size, H is the hidden size of the activations, and T is the length of the cache.
- Returns
the output of the model
- get_embedding(path2audio_file)[source]#
Returns the speaker embeddings for a provided audio file.
- Parameters
path2audio_file – path to an audio wav file
- Returns
speaker embeddings (audio representations)
- Return type
emb
- get_label(path2audio_file)[source]#
Returns the label of path2audio_file from the classes the model was trained on.
- Parameters
path2audio_file – path to an audio wav file
- Returns
predicted label from the classes the model was trained on
- Return type
label
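For single-file inference, get_embedding and get_label can be used as in this sketch (the wav path is a placeholder, and speaker_model is the model loaded in the earlier sketch):

```python
# Placeholder wav path; speaker_model as loaded in the earlier sketch.
emb = speaker_model.get_embedding("speaker1_utt1.wav")  # embedding tensor
label = speaker_model.get_label("speaker1_utt1.wav")    # trained class label
print(emb.shape, label)
```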
- infer_file(path2audio_file)#
- Parameters
path2audio_file – path to an audio wav file
- Returns
emb: speaker embeddings (audio representations)
logits: logits of the final layer
- Return type
(emb, logits)
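infer_file returns both quantities in one call, as sketched below (same placeholder path; the (emb, logits) order follows the Returns section above):

```python
# Placeholder wav path; returns the embedding and the final-layer logits.
emb, logits = speaker_model.infer_file("speaker1_utt1.wav")
```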
- property input_types: Optional[Dict[str, NeuralType]]#
Define these to enable input neural type checks
- classmethod list_available_models() List[PretrainedModelInfo] [source]#
This method returns a list of pre-trained models which can be instantiated directly from NVIDIA’s NGC cloud.
- Returns
List of available pre-trained models.
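A quick sketch for discovering checkpoint names before calling from_pretrained:

```python
# Print the name of every pretrained speaker checkpoint published on NGC.
import nemo.collections.asr as nemo_asr

for info in nemo_asr.models.EncDecSpeakerLabelModel.list_available_models():
    print(info.pretrained_model_name)
```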
- multi_test_epoch_end(outputs, dataloader_idx: int = 0)[source]#
Adds support for multiple test datasets. Should be overridden by subclasses to obtain appropriate logs for each of the dataloaders.
- Parameters
outputs – Same as that provided by LightningModule.on_test_epoch_end() for a single dataloader.
dataloader_idx – int representing the index of the dataloader.
- Returns
A dictionary of values, optionally containing a sub-dict log, such that the values in the log will be prepended by the dataloader prefix.
- multi_validation_epoch_end(outputs, dataloader_idx: int = 0)[source]#
Adds support for multiple validation datasets. Should be overridden by subclasses to obtain appropriate logs for each of the dataloaders.
- Parameters
outputs – Same as that provided by LightningModule.on_validation_epoch_end() for a single dataloader.
dataloader_idx – int representing the index of the dataloader.
- Returns
A dictionary of values, optionally containing a sub-dict log, such that the values in the log will be prepended by the dataloader prefix.
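As a sketch of the override contract described above, a subclass might aggregate a per-dataloader metric like this (the class name and the "val_loss" key are illustrative, not part of the API):

```python
# Illustrative override; "val_loss" is a hypothetical per-batch metric key.
import torch
from nemo.collections.asr.models import EncDecSpeakerLabelModel

class MySpeakerModel(EncDecSpeakerLabelModel):
    def multi_validation_epoch_end(self, outputs, dataloader_idx: int = 0):
        # Average the metric across all batches of this dataloader.
        val_loss = torch.stack([x["val_loss"] for x in outputs]).mean()
        # Values under "log" are prepended with the dataloader prefix.
        return {"log": {"val_loss": val_loss}}
```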
- property output_types: Optional[Dict[str, NeuralType]]#
Define these to enable output neural type checks
- setup_test_data(test_data_layer_params: Optional[Union[omegaconf.DictConfig, Dict]])[source]#
(Optionally) Sets up the data loader to be used during testing.
- Parameters
test_data_layer_params – test data layer parameters.
- setup_training_data(train_data_layer_config: Optional[Union[omegaconf.DictConfig, Dict]])[source]#
Sets up the data loader to be used during training.
- Parameters
train_data_layer_config – training data layer parameters.
- setup_validation_data(val_data_layer_config: Optional[Union[omegaconf.DictConfig, Dict]])[source]#
Sets up the data loader to be used during validation.
- Parameters
val_data_layer_config – validation data layer parameters.
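A hedged sketch of wiring the training dataloader from a dict config; the keys shown mirror typical NeMo train_ds configs and should be checked against the speaker-recognition example YAMLs.

```python
# The config keys below are assumptions based on typical NeMo train_ds
# configs; verify them against the speaker-recognition example YAMLs.
from omegaconf import OmegaConf

train_cfg = OmegaConf.create({
    "manifest_filepath": "train_manifest.json",  # placeholder path
    "sample_rate": 16000,
    "batch_size": 32,
    "shuffle": True,
})
speaker_model.setup_training_data(train_cfg)
```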
- verify_speakers(path2audio_file1, path2audio_file2, threshold=0.7)#
Verify if two audio files are from the same speaker or not.
- Parameters
path2audio_file1 – path to audio wav file of speaker 1
path2audio_file2 – path to audio wav file of speaker 2
threshold – cosine similarity score used as a threshold to distinguish two embeddings (default = 0.7)
- Returns
True if both audio files are from the same speaker, False otherwise
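Finally, a sketch of verifying two recordings against each other (placeholder paths; 0.7 is the default cosine-similarity threshold):

```python
# Placeholder wav paths; True when the cosine similarity of the two
# embeddings exceeds the threshold.
same_speaker = speaker_model.verify_speakers(
    "speaker1_utt1.wav", "speaker2_utt1.wav", threshold=0.7
)
print("same speaker" if same_speaker else "different speakers")
```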