NeMo Speaker Recognition API#

Model Classes#

class nemo.collections.asr.models.label_models.EncDecSpeakerLabelModel(*args: Any, **kwargs: Any)[source]#

Bases: nemo.core.classes.modelPT.ModelPT, nemo.collections.asr.models.asr_model.ExportableEncDecModel

Encoder-decoder class for speaker label models. The class provides the training and validation methods, sets up the data loaders, and performs the model forward pass. It expects a config dict with entries for:

  • preprocessor

  • Jasper/Quartznet Encoder

  • Speaker Decoder

batch_inference(manifest_filepath, batch_size=32, sample_rate=16000, device='cuda')#

Perform batch inference on EncDecSpeakerLabelModel. To perform inference on a single audio file, one can use infer_file, get_label or get_embedding.

To map predicted labels, one can do

    arg_values = logits.argmax(axis=1)
    pred_labels = list(map(lambda t: mapped_labels[t], arg_values))

Parameters
  • manifest_filepath – Path to manifest file

  • batch_size – batch size to perform batch inference

  • sample_rate – sample rate of audio files in manifest file

  • device – compute device to perform operations.
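As a rough illustration of what manifest_filepath might point to, the sketch below writes a small JSON-lines manifest. The field names (audio_filepath, offset, duration, label) and the file paths are assumptions based on the manifest layout commonly used for NeMo speaker tasks; this page does not specify them, so check them against your own data pipeline.

```python
import json
import os
import tempfile

# Hypothetical utterances: (path to wav file, duration in seconds, speaker label).
utterances = [
    ("/data/spk1/utt1.wav", 3.2, "spk1"),
    ("/data/spk2/utt1.wav", 4.1, "spk2"),
]


def write_manifest(path, items):
    """Write one JSON object per line; the field names are assumed, not confirmed here."""
    with open(path, "w", encoding="utf-8") as f:
        for audio_path, duration, label in items:
            entry = {
                "audio_filepath": audio_path,
                "offset": 0,
                "duration": duration,
                "label": label,
            }
            f.write(json.dumps(entry) + "\n")


manifest_path = os.path.join(tempfile.mkdtemp(), "manifest.json")
write_manifest(manifest_path, utterances)

# Read the manifest back to confirm that each line parses as its own JSON object.
with open(manifest_path, encoding="utf-8") as f:
    manifest_lines = [json.loads(line) for line in f]
```

The resulting path is what would be passed as manifest_filepath, with sample_rate matching the audio files it lists.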

Returns

The variables below all follow the audio file order in the manifest file.

  • embs – embeddings of files provided in manifest file

  • logits – logits of final layer of EncDecSpeakerLabelModel

  • gt_labels – labels from manifest file (needed for speaker enrollment and testing)

  • mapped_labels – classification labels sorted in the order that they are mapped by the trained model
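The label mapping described above can be sketched in plain Python. This is a minimal stand-in, assuming logits is a batch of per-class scores and mapped_labels lists the class names in the model's order. The values and speaker names are made up for illustration, and real logits would be a tensor or array with its own argmax method.

```python
# Hypothetical logits for three audio files over three classes.
logits = [
    [0.1, 2.3, -1.0],
    [1.7, 0.2, 0.4],
    [-0.5, 0.0, 3.1],
]
# Hypothetical class labels in the order the trained model maps them.
mapped_labels = ["alice", "bob", "carol"]


def argmax(row):
    """Index of the largest value in a row (first index wins on ties)."""
    return max(range(len(row)), key=lambda i: row[i])


# Pure-Python equivalent of logits.argmax(axis=1).
arg_values = [argmax(row) for row in logits]
# Translate each class index into its label.
pred_labels = [mapped_labels[t] for t in arg_values]
```

Because the returned values follow the manifest order, pred_labels can be compared element by element with gt_labels.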

evaluation_step(batch, batch_idx, dataloader_idx: int = 0, tag: str = 'val')[source]#
static extract_labels(data_layer_config)[source]#
forward(input_signal, input_signal_length)[source]#
forward_for_export(processed_signal, processed_signal_len)[source]#

This forward is used when the model needs to be exported to ONNX format. The inputs cache_last_channel and cache_last_time must be passed when exporting streaming models. When they are passed, the inputs are only passed through the encoder part, and ONNX conversion currently does not fully work for this case.

Parameters
  • input – Tensor that represents a batch of raw audio signals, of shape [B, T]. T here represents timesteps.

  • length – Vector of length B, that contains the individual lengths of the audio sequences.

  • cache_last_channel – Tensor of shape [N, B, T, H] which contains the cache for last channel layers

  • cache_last_time – Tensor of shape [N, B, H, T] which contains the cache for last time layers

For the cache tensors, N is the number of layers that need caching, B is the batch size, H is the hidden size of activations, and T is the length of the cache.

Returns

the output of the model

get_embedding(path2audio_file)[source]#

Returns the speaker embeddings for a provided audio file.

Parameters

path2audio_file – path to an audio wav file

Returns

emb – speaker embeddings (audio representations)

get_label(path2audio_file)[source]#

Returns the label of path2audio_file from the classes the model was trained on.

Parameters

path2audio_file – path to an audio wav file

Returns

label – label corresponding to the trained model

infer_file(path2audio_file)#
Parameters

path2audio_file – path to an audio wav file

Returns
  • emb – speaker embeddings (audio representations)

  • logits – logits of the final layer

property input_types: Optional[Dict[str, nemo.core.neural_types.neural_type.NeuralType]]#

Define these to enable input neural type checks

classmethod list_available_models() List[nemo.core.classes.common.PretrainedModelInfo][source]#

Returns a list of pre-trained models that can be instantiated directly from NVIDIA’s NGC cloud.

Returns

List of available pre-trained models.

multi_evaluation_epoch_end(outputs, dataloader_idx: int = 0, tag: str = 'val')[source]#
multi_test_epoch_end(outputs, dataloader_idx: int = 0)[source]#

Adds support for multiple test datasets. Should be overridden by the subclass to obtain appropriate logs for each of the dataloaders.

Parameters
  • outputs – Same as that provided by LightningModule.validation_epoch_end() for a single dataloader.

  • dataloader_idx – int representing the index of the dataloader.

Returns

A dictionary of values, optionally containing a sub-dict log, such that the keys in the log will be prepended with the dataloader prefix.

multi_validation_epoch_end(outputs, dataloader_idx: int = 0)[source]#

Adds support for multiple validation datasets. Should be overridden by the subclass to obtain appropriate logs for each of the dataloaders.

Parameters
  • outputs – Same as that provided by LightningModule.validation_epoch_end() for a single dataloader.

  • dataloader_idx – int representing the index of the dataloader.

Returns

A dictionary of values, optionally containing a sub-dict log, such that the keys in the log will be prepended with the dataloader prefix.
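To make the return value concrete, the sketch below shows one way a per-dataloader prefix could be applied to the log sub-dict. The helper add_dataloader_prefix, the metric names, and the prefix "dev_clean_" are hypothetical and only illustrate the described behaviour; this is not the library's own implementation.

```python
def add_dataloader_prefix(result, prefix):
    """Return a copy of result whose 'log' keys start with prefix."""
    prefixed = dict(result)
    if "log" in result:
        prefixed["log"] = {prefix + key: value for key, value in result["log"].items()}
    return prefixed


# Hypothetical epoch-end output for one validation dataloader.
epoch_result = {"val_loss": 0.42, "log": {"val_loss": 0.42, "val_acc": 0.91}}
prefixed_result = add_dataloader_prefix(epoch_result, "dev_clean_")
```

With a distinct prefix per dataloader, metrics that share a name across datasets stay distinguishable in the logs.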

property output_types: Optional[Dict[str, nemo.core.neural_types.neural_type.NeuralType]]#

Define these to enable output neural type checks

setup_test_data(test_data_layer_params: Optional[Union[omegaconf.DictConfig, Dict]])[source]#

(Optionally) Sets up the data loader to be used in testing.

Parameters

test_data_layer_params – test data layer parameters.

setup_training_data(train_data_layer_config: Optional[Union[omegaconf.DictConfig, Dict]])[source]#

Sets up the data loader to be used in training.

Parameters

train_data_layer_config – training data layer parameters.
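For orientation, a data layer config passed as a plain dict might look like the sketch below. The keys shown (manifest_filepath, sample_rate, batch_size, shuffle), the path, and the missing_keys helper are assumptions for illustration; the full set of supported keys is defined by the model's config and is not listed on this page.

```python
# Hypothetical training data layer parameters; key names are assumed.
train_data_layer_config = {
    "manifest_filepath": "/data/train_manifest.json",  # hypothetical path
    "sample_rate": 16000,
    "batch_size": 32,
    "shuffle": True,
}


def missing_keys(config, required=("manifest_filepath", "sample_rate", "batch_size")):
    """List the assumed-required keys that are absent from config."""
    return [key for key in required if key not in config]


# An empty list means every key assumed to be required is present.
missing = missing_keys(train_data_layer_config)
```

A quick check like this before calling the setup method can surface a misspelled key earlier than a data loader error would.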


setup_validation_data(val_data_layer_config: Optional[Union[omegaconf.DictConfig, Dict]])[source]#

Sets up the data loader to be used in validation.

Parameters

val_data_layer_config – validation data layer parameters.

test_dataloader()[source]#
test_step(batch, batch_idx, dataloader_idx: int = 0)[source]#
training_step(batch, batch_idx)[source]#
validation_step(batch, batch_idx, dataloader_idx: int = 0)[source]#
verify_speakers(path2audio_file1, path2audio_file2, threshold=0.7)#

Verify if two audio files are from the same speaker or not.

Parameters
  • path2audio_file1 – path to audio wav file of speaker 1

  • path2audio_file2 – path to audio wav file of speaker 2

  • threshold – cosine similarity score used as a threshold to distinguish two embeddings (default = 0.7)

Returns

True if both audio files are from same speaker, False otherwise
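The thresholding idea can be sketched with plain cosine similarity. This is a minimal illustration under stated assumptions: the embeddings are short made-up vectors, the helper names are hypothetical, and the comparison uses >= on the raw cosine score. The library may normalize or rescale scores before comparing, so this sketch is not its exact decision rule.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def same_speaker(emb1, emb2, threshold=0.7):
    """Treat two embeddings as the same speaker if their similarity reaches threshold."""
    return cosine_similarity(emb1, emb2) >= threshold


# Hypothetical embeddings: emb_a and emb_b point in similar directions, emb_c does not.
emb_a = [0.9, 0.1, 0.4]
emb_b = [0.8, 0.2, 0.5]
emb_c = [-0.3, 0.9, -0.2]

match_ab = same_speaker(emb_a, emb_b)
match_ac = same_speaker(emb_a, emb_c)
```

Raising the threshold makes the check stricter (fewer false accepts, more false rejects), so the default of 0.7 is best tuned on held-out trials for a given model.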