NeMo Speaker Recognition API#

Model Classes#

class nemo.collections.asr.models.label_models.EncDecSpeakerLabelModel(*args: Any, **kwargs: Any)[source]#

Bases: ModelPT, ExportableEncDecModel

Encoder-decoder class for speaker label models. The model class creates training and validation methods for setting up data and performing the model forward pass. Expects a config dict for:

  • preprocessor

  • Jasper/QuartzNet Encoder

  • Speaker Decoder
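For reference, a minimal sketch of inspecting these three config sections on a pretrained checkpoint; the model name "titanet_large" is an assumption, pick any name returned by list_available_models():

from nemo.collections.asr.models import EncDecSpeakerLabelModel

# Load a pretrained speaker model from NGC and inspect its config sections.
model = EncDecSpeakerLabelModel.from_pretrained(model_name="titanet_large")
print(model.cfg.preprocessor["_target_"])  # audio preprocessor class
print(model.cfg.encoder["_target_"])       # encoder class
print(model.cfg.decoder["_target_"])       # speaker decoder class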

batch_inference(manifest_filepath, batch_size=32, sample_rate=16000, device='cuda')#

Perform batch inference on EncDecSpeakerLabelModel. To perform inference on a single audio file, one can use infer_file, get_label, or get_embedding.

To map predicted labels, one can do:

arg_values = logits.argmax(axis=1)
pred_labels = list(map(lambda t: trained_labels[t], arg_values))

Parameters
  • manifest_filepath – Path to manifest file

  • batch_size – batch size used for batch inference

  • sample_rate – sample rate of audio files in manifest file

  • device – compute device on which to perform operations.

Returns

All returned values follow the audio file order in the manifest file.

  • embs – embeddings of the files provided in the manifest file

  • logits – logits from the final layer of the EncDecSpeakerLabelModel

  • gt_labels – labels from the manifest file (needed for speaker enrollment and testing)

  • trained_labels – classification labels, sorted in the order in which they are mapped by the trained model
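A short usage sketch; the manifest path and the model name "titanet_large" are assumptions, and the return order follows the description above:

from nemo.collections.asr.models import EncDecSpeakerLabelModel

model = EncDecSpeakerLabelModel.from_pretrained(model_name="titanet_large")
embs, logits, gt_labels, trained_labels = model.batch_inference(
    manifest_filepath="enrollment_manifest.json", batch_size=32
)
# Map the highest-scoring logit of each file to its trained class name.
arg_values = logits.argmax(axis=1)
pred_labels = [trained_labels[t] for t in arg_values]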

evaluation_step(batch, batch_idx, dataloader_idx: int = 0, tag: str = 'val')[source]#
static extract_labels(data_layer_config)[source]#
forward(input_signal, input_signal_length)[source]#
forward_for_export(processed_signal, processed_signal_len)[source]#

This forward is used when we need to export the model to ONNX format. The inputs cache_last_channel and cache_last_time need to be passed only when exporting streaming models.

Parameters
  • input – Tensor that represents a batch of raw audio signals, of shape [B, T]. T here represents timesteps.

  • length – Vector of length B that contains the individual lengths of the audio sequences.

  • cache_last_channel – Tensor of shape [N, B, T, H] which contains the cache for last channel layers.

  • cache_last_time – Tensor of shape [N, B, H, T] which contains the cache for last time layers. N is the number of such layers which need caching, B is the batch size, H is the hidden size of activations, and T is the length of the cache.

Returns

the output of the model
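Since the class also derives from ExportableEncDecModel, ONNX export is usually driven through the inherited export() helper, which calls forward_for_export internally. A minimal sketch; the output filename and model name are assumptions:

from nemo.collections.asr.models import EncDecSpeakerLabelModel

model = EncDecSpeakerLabelModel.from_pretrained(model_name="titanet_large")
# Writes an ONNX graph; forward_for_export defines the traced computation.
model.export("speaker_model.onnx")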

get_embedding(path2audio_file)[source]#

Returns the speaker embeddings for a provided audio file.

Parameters

path2audio_file – path to an audio wav file

Returns

speaker embeddings (audio representations)

Return type

emb
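A minimal sketch; the wav path and model name are assumptions:

from nemo.collections.asr.models import EncDecSpeakerLabelModel

model = EncDecSpeakerLabelModel.from_pretrained(model_name="titanet_large")
emb = model.get_embedding("speaker1_utt1.wav")
print(emb.shape)  # embedding size is model-dependent, e.g. 192 for titanet_large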

get_label(path2audio_file)[source]#

Returns the label of path2audio_file from the classes the model was trained on.

Parameters

path2audio_file – path to an audio wav file

Returns

label corresponding to the trained model

Return type

label
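A minimal sketch; the wav path and model name are assumptions, and the returned label is always one of the classes the loaded model was trained on:

from nemo.collections.asr.models import EncDecSpeakerLabelModel

model = EncDecSpeakerLabelModel.from_pretrained(model_name="titanet_large")
label = model.get_label("unknown_speaker.wav")
print(label)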

infer_file(path2audio_file)#
Parameters

path2audio_file – path to an audio wav file

Returns

  • emb – speaker embeddings (audio representations)

  • logits – logits from the final layer

Return type

emb, logits
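A minimal sketch; the wav path and model name are assumptions, and the (emb, logits) return order follows the description above:

from nemo.collections.asr.models import EncDecSpeakerLabelModel

model = EncDecSpeakerLabelModel.from_pretrained(model_name="titanet_large")
emb, logits = model.infer_file("utterance.wav")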

property input_types: Optional[Dict[str, NeuralType]]#

Define these to enable input neural type checks

classmethod list_available_models() → List[PretrainedModelInfo][source]#

This method returns a list of pre-trained models that can be instantiated directly from NVIDIA's NGC cloud.

Returns

List of available pre-trained models.
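A minimal sketch listing the available checkpoints and loading one of them by name:

from nemo.collections.asr.models import EncDecSpeakerLabelModel

for info in EncDecSpeakerLabelModel.list_available_models():
    print(info.pretrained_model_name)

# Pick any name printed above; "titanet_large" is an assumption here.
model = EncDecSpeakerLabelModel.from_pretrained(model_name="titanet_large")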

multi_evaluation_epoch_end(outputs, dataloader_idx: int = 0, tag: str = 'val')[source]#
multi_test_epoch_end(outputs, dataloader_idx: int = 0)[source]#

Adds support for multiple test datasets. Should be overridden by subclasses to obtain appropriate logs for each of the dataloaders.

Parameters
  • outputs – Same as that provided by LightningModule.on_validation_epoch_end() for a single dataloader.

  • dataloader_idx – int representing the index of the dataloader.

Returns

A dictionary of values, optionally containing a sub-dict log, such that the values in the log will be prepended with the dataloader prefix.

multi_validation_epoch_end(outputs, dataloader_idx: int = 0)[source]#

Adds support for multiple validation datasets. Should be overridden by subclasses to obtain appropriate logs for each of the dataloaders.

Parameters
  • outputs – Same as that provided by LightningModule.on_validation_epoch_end() for a single dataloader.

  • dataloader_idx – int representing the index of the dataloader.

Returns

A dictionary of values, optionally containing a sub-dict log, such that the values in the log will be prepended with the dataloader prefix.

property output_types: Optional[Dict[str, NeuralType]]#

Define these to enable output neural type checks

setup_test_data(test_data_layer_params: Optional[Union[omegaconf.DictConfig, Dict]])[source]#

(Optionally) Sets up the data loader to be used during testing.

Parameters

test_data_layer_params – test data layer parameters.

setup_training_data(train_data_layer_config: Optional[Union[omegaconf.DictConfig, Dict]])[source]#

Sets up the data loader to be used during training.

Parameters

train_data_layer_config – training data layer parameters.
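For illustration, a hedged sketch of a training data layer config; the field names follow the common NeMo ASR dataset config layout, but the exact set of fields depends on your model config and dataset class:

from omegaconf import OmegaConf

from nemo.collections.asr.models import EncDecSpeakerLabelModel

model = EncDecSpeakerLabelModel.from_pretrained(model_name="titanet_large")
train_ds = OmegaConf.create({
    "manifest_filepath": "train_manifest.json",  # hypothetical path
    "sample_rate": 16000,
    "labels": None,   # assumption: labels can be extracted from the manifest
    "batch_size": 64,
    "shuffle": True,
})
model.setup_training_data(train_data_layer_config=train_ds)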

setup_validation_data(val_data_layer_config: Optional[Union[omegaconf.DictConfig, Dict]])[source]#

Sets up the data loader to be used during validation.

Parameters

val_data_layer_config – validation data layer parameters.

test_dataloader()[source]#
test_step(batch, batch_idx, dataloader_idx: int = 0)[source]#
training_step(batch, batch_idx)[source]#
validation_step(batch, batch_idx, dataloader_idx: int = 0)[source]#
verify_speakers(path2audio_file1, path2audio_file2, threshold=0.7)#

Verify if two audio files are from the same speaker or not.

Parameters
  • path2audio_file1 – path to audio wav file of speaker 1

  • path2audio_file2 – path to audio wav file of speaker 2

  • threshold – cosine similarity score used as a threshold to distinguish two embeddings (default = 0.7)

Returns

True if both audio files are from same speaker, False otherwise
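A minimal sketch; the wav paths and model name are assumptions:

from nemo.collections.asr.models import EncDecSpeakerLabelModel

model = EncDecSpeakerLabelModel.from_pretrained(model_name="titanet_large")
# Compares the cosine similarity of the two embeddings against the threshold.
same = model.verify_speakers("speaker1_utt1.wav", "speaker1_utt2.wav", threshold=0.7)
print("same speaker" if same else "different speakers")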