NeMo Speaker Recognition API#
Model Classes#
- class nemo.collections.asr.models.label_models.EncDecSpeakerLabelModel(*args: Any, **kwargs: Any)[source]#
Bases: ModelPT, ExportableEncDecModel
Encoder-decoder class for speaker label models. The model class creates training and validation methods for setting up data and performing the model forward pass. Expects a config dict for:
preprocessor
Jasper/Quartznet Encoder
Speaker Decoder
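As a quick orientation, here is a minimal sketch of restoring a pretrained checkpoint from NGC (assuming the titanet_large checkpoint name; list_available_models() below shows the full set):

```python
from nemo.collections.asr.models import EncDecSpeakerLabelModel

# Download and restore a pretrained speaker model from NGC;
# "titanet_large" is one published checkpoint name.
model = EncDecSpeakerLabelModel.from_pretrained(model_name="titanet_large")
model.eval()
```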
- batch_inference(manifest_filepath, batch_size=32, sample_rate=16000, device='cuda')#
Perform batch inference on EncDecSpeakerLabelModel. To perform inference on a single audio file, one can use infer_file, get_label, or get_embedding.
- To map predicted labels, one can do
  arg_values = logits.argmax(axis=1)
  pred_labels = list(map(lambda t: trained_labels[t], arg_values))
- Parameters
manifest_filepath – Path to manifest file
batch_size – batch size to perform batch inference
sample_rate – sample rate of audio files in manifest file
device – compute device to perform operations.
- Returns
The variables below all follow the audio file order in the manifest file.
embs – embeddings of the files provided in the manifest file
logits – logits of the final layer of the EncDecSpeakerLabelModel
gt_labels – labels from the manifest file (needed for speaker enrollment and testing)
trained_labels – classification labels, sorted in the order in which they are mapped by the trained model
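A minimal usage sketch, assuming a pretrained checkpoint restored as above and a hypothetical manifest path:

```python
from nemo.collections.asr.models import EncDecSpeakerLabelModel

model = EncDecSpeakerLabelModel.from_pretrained(model_name="titanet_large")

# "enroll_manifest.json" is a hypothetical path; each line of the
# manifest is a JSON record with "audio_filepath", "duration",
# and "label" fields.
embs, logits, gt_labels, trained_labels = model.batch_inference(
    manifest_filepath="enroll_manifest.json",
    batch_size=32,
    sample_rate=16000,
    device="cuda",
)

# Map argmax indices back to the labels the model was trained on.
arg_values = logits.argmax(axis=1)
pred_labels = [trained_labels[t] for t in arg_values]
```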
- forward_for_export(processed_signal, processed_signal_len)[source]#
This forward is used when we need to export the model to ONNX format. The inputs cache_last_channel and cache_last_time need to be passed when exporting streaming models; when they are passed, the inputs are only run through the encoder part, and the ONNX conversion does not currently fully work for this case.
- Parameters
input – Tensor that represents a batch of raw audio signals, of shape [B, T], where T represents timesteps.
length – Vector of length B that contains the individual lengths of the audio sequences.
cache_last_channel – Tensor of shape [N, B, T, H] which contains the cache for last-channel layers.
cache_last_time – Tensor of shape [N, B, H, T] which contains the cache for last-time layers. N is the number of such layers that need caching, B is the batch size, H is the hidden size of the activations, and T is the length of the cache.
- Returns
the output of the model
- get_embedding(path2audio_file)[source]#
Returns the speaker embeddings for a provided audio file.
- Parameters
path2audio_file – path to an audio wav file
- Returns
speaker embeddings (Audio representations)
- Return type
emb
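For example, a short sketch with a hypothetical wav path (the embedding size depends on the checkpoint; titanet_large produces 192-dimensional embeddings):

```python
from nemo.collections.asr.models import EncDecSpeakerLabelModel

model = EncDecSpeakerLabelModel.from_pretrained(model_name="titanet_large")
# "speaker1.wav" is a hypothetical path to a mono wav file.
emb = model.get_embedding("speaker1.wav")
print(emb.shape)  # e.g. torch.Size([1, 192]) for titanet_large
```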
- get_label(path2audio_file)[source]#
Returns the label of path2audio_file from the classes the model was trained on.
- Parameters
path2audio_file – path to an audio wav file
- Returns
label corresponding to the trained model
- Return type
label
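A short sketch, assuming a classification checkpoint saved locally (the .nemo path and wav path are hypothetical):

```python
from nemo.collections.asr.models import EncDecSpeakerLabelModel

# get_label is meaningful for a model trained as a classifier over a
# fixed speaker set; both paths here are hypothetical.
model = EncDecSpeakerLabelModel.restore_from("speaker_classifier.nemo")
label = model.get_label("unknown_speaker.wav")
print(label)
```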
- infer_file(path2audio_file)#
- Parameters
path2audio_file – path to an audio wav file
- Returns
emb – speaker embeddings (audio representations)
logits – logits from the final layer
- Return type
emb, logits
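For example (hypothetical wav path), infer_file returns both outputs at once:

```python
from nemo.collections.asr.models import EncDecSpeakerLabelModel

model = EncDecSpeakerLabelModel.from_pretrained(model_name="titanet_large")
# Returns the embedding and the final-layer logits for a single file.
emb, logits = model.infer_file("speaker1.wav")
```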
- property input_types: Optional[Dict[str, NeuralType]]#
Define these to enable input neural type checks
- classmethod list_available_models() List[PretrainedModelInfo] [source]#
This method returns a list of pre-trained models which can be instantiated directly from NVIDIA’s NGC cloud.
- Returns
List of available pre-trained models.
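For example:

```python
from nemo.collections.asr.models import EncDecSpeakerLabelModel

# Print the NGC names of all published speaker checkpoints.
for info in EncDecSpeakerLabelModel.list_available_models():
    print(info.pretrained_model_name)
```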
- multi_test_epoch_end(outputs, dataloader_idx: int = 0)[source]#
Adds support for multiple test datasets. Should be overridden by subclass so as to obtain appropriate logs for each of the dataloaders.
- Parameters
outputs – Same as that provided by LightningModule.test_epoch_end() for a single dataloader.
dataloader_idx – int representing the index of the dataloader.
- Returns
A dictionary of values, optionally containing a sub-dict log, such that the values in the log will be prepended with the dataloader prefix.
- multi_validation_epoch_end(outputs, dataloader_idx: int = 0)[source]#
Adds support for multiple validation datasets. Should be overridden by subclass so as to obtain appropriate logs for each of the dataloaders.
- Parameters
outputs – Same as that provided by LightningModule.validation_epoch_end() for a single dataloader.
dataloader_idx – int representing the index of the dataloader.
- Returns
A dictionary of values, optionally containing a sub-dict log, such that the values in the log will be prepended with the dataloader prefix.
- property output_types: Optional[Dict[str, NeuralType]]#
Define these to enable output neural type checks
- setup_test_data(test_data_layer_params: Optional[Union[omegaconf.DictConfig, Dict]])[source]#
(Optionally) Sets up the data loader to be used during testing.
- Parameters
test_data_layer_params – test data layer parameters.
- setup_training_data(train_data_layer_config: Optional[Union[omegaconf.DictConfig, Dict]])[source]#
Sets up the data loader to be used during training.
- Parameters
train_data_layer_config – training data layer parameters.
- setup_validation_data(val_data_layer_config: Optional[Union[omegaconf.DictConfig, Dict]])[source]#
Sets up the data loader to be used during validation.
- Parameters
val_data_layer_config – validation data layer parameters.
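For illustration, a hypothetical training config sketch; the key names follow the train_ds sections of the speaker recognition example configs, and all values here are assumptions:

```python
from omegaconf import OmegaConf
from nemo.collections.asr.models import EncDecSpeakerLabelModel

model = EncDecSpeakerLabelModel.from_pretrained(model_name="titanet_large")

# Hypothetical manifest path; each line carries "audio_filepath",
# "duration", and "label" fields.
train_config = OmegaConf.create({
    "manifest_filepath": "train_manifest.json",
    "sample_rate": 16000,
    "labels": None,     # typically inferred from the manifest
    "batch_size": 64,
    "shuffle": True,
})
model.setup_training_data(train_data_layer_config=train_config)
```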
- verify_speakers(path2audio_file1, path2audio_file2, threshold=0.7)#
Verify if two audio files are from the same speaker or not.
- Parameters
path2audio_file1 – path to audio wav file of speaker 1
path2audio_file2 – path to audio wav file of speaker 2
threshold – cosine similarity score used as a threshold to distinguish two embeddings (default = 0.7)
- Returns
True if both audio files are from the same speaker, False otherwise
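For example, a minimal sketch with hypothetical file paths:

```python
from nemo.collections.asr.models import EncDecSpeakerLabelModel

model = EncDecSpeakerLabelModel.from_pretrained(model_name="titanet_large")
# 0.7 is the documented default cosine-similarity threshold.
same = model.verify_speakers("spk1_utt1.wav", "spk1_utt2.wav", threshold=0.7)
print("same speaker" if same else "different speakers")
```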