NeMo ASR collection API

Model Classes

class nemo.collections.asr.models.EncDecCTCModel(*args: Any, **kwargs: Any)[source]

Bases: pytorch_lightning.LightningModule, nemo.core.classes.common.Model

Base class for encoder decoder CTC-based models.

change_vocabulary(new_vocabulary: List[str])[source]

Changes vocabulary used during CTC decoding process. Use this method when fine-tuning from a pre-trained model. This method changes only the decoder and leaves the encoder and pre-processing modules unchanged. For example, you would use it if you want to use a pretrained encoder when fine-tuning on data in another language, or when you need the model to learn capitalization, punctuation and/or special characters.

If new_vocabulary == self.decoder.vocabulary then nothing will be changed.

Parameters

new_vocabulary – list containing the new vocabulary. Must contain at least 2 elements. Typically, this is the target alphabet.

Returns: None
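
For example, a minimal sketch of swapping the vocabulary on a pretrained checkpoint (the model name and target alphabet below are illustrative assumptions):

    import nemo.collections.asr as nemo_asr

    # Any name returned by EncDecCTCModel.list_available_models() works here.
    asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="QuartzNet15x5Base-En")

    # Replace the decoder vocabulary with a new target alphabet (at least 2 symbols).
    asr_model.change_vocabulary(new_vocabulary=[" ", "a", "b", "c", "'"])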

forward(input_signal=None, input_signal_length=None, processed_signal=None, processed_signal_length=None)[source]

Forward pass of the model.

Parameters
  • input_signal – Tensor that represents a batch of raw audio signals, of shape [B, T]. T here represents timesteps, with 1 second of audio represented as self.sample_rate number of floating point values.

  • input_signal_length – Vector of length B, that contains the individual lengths of the audio sequences.

  • processed_signal – Tensor that represents a batch of processed audio signals, of shape (B, D, T) that has undergone processing via some DALI preprocessor.

  • processed_signal_length – Vector of length B, that contains the individual lengths of the processed audio sequences.

Returns

A tuple of 3 elements - 1) The log probabilities tensor of shape [B, T, D]. 2) The lengths of the acoustic sequence after propagation through the encoder, of shape [B]. 3) The greedy token predictions of the model of shape [B, T] (via argmax)

property input_types

Define these to enable input neural type checks

classmethod list_available_models() → Optional[nemo.core.classes.common.PretrainedModelInfo][source]

This method returns a list of pre-trained models which can be instantiated directly from NVIDIA’s NGC cloud.

Returns

List of available pre-trained models.
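
For example, a quick sketch of listing and loading a checkpoint (the attribute name follows PretrainedModelInfo and the model name is an assumption):

    import nemo.collections.asr as nemo_asr

    # Print the names of downloadable checkpoints for this class.
    for info in nemo_asr.models.EncDecCTCModel.list_available_models():
        print(info.pretrained_model_name)

    # Instantiate one of them directly from NGC.
    asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="QuartzNet15x5Base-En")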

property output_types

Define these to enable output neural type checks

setup_test_data(test_data_config: Optional[Union[omegaconf.DictConfig, Dict]])[source]

Sets up the test data loader via a Dict-like object.

Parameters

test_data_config – A config that contains the information regarding construction of the ASR test dataset.

Supported datasets are described in the Datasets section of this page.
setup_training_data(train_data_config: Optional[Union[omegaconf.DictConfig, Dict]])[source]

Sets up the training data loader via a Dict-like object.

Parameters

train_data_config – A config that contains the information regarding construction of an ASR Training dataset.

Supported datasets are described in the Datasets section of this page.
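
A minimal sketch of such a config (keys mirror the character-dataset parameters documented in the Datasets section; the manifest path and model name are hypothetical):

    import nemo.collections.asr as nemo_asr
    from omegaconf import OmegaConf

    asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="QuartzNet15x5Base-En")

    train_cfg = OmegaConf.create({
        "manifest_filepath": "train_manifest.json",
        "sample_rate": 16000,
        "labels": list(asr_model.decoder.vocabulary),
        "batch_size": 32,
        "shuffle": True,
    })
    asr_model.setup_training_data(train_data_config=train_cfg)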
setup_validation_data(val_data_config: Optional[Union[omegaconf.DictConfig, Dict]])[source]

Sets up the validation data loader via a Dict-like object.

Parameters

val_data_config – A config that contains the information regarding construction of the ASR validation dataset.

Supported datasets are described in the Datasets section of this page.
test_dataloader()[source]
test_step(batch, batch_idx, dataloader_idx=0)[source]
training_step(batch, batch_nb)[source]
transcribe(paths2audio_files: List[str], batch_size: int = 4, logprobs=False) → List[str]

Uses greedy decoding to transcribe audio files. Use this method for debugging and prototyping.

Parameters
  • paths2audio_files – (a list) of paths to audio files. Recommended length per file is between 5 and 25 seconds, though a file several hours long can be passed if enough GPU memory is available.

  • batch_size – (int) batch size to use during inference. Larger values improve throughput but use more memory.

  • logprobs – (bool) pass True to get log probabilities instead of transcripts.

Returns

A list of transcriptions (or raw log probabilities if logprobs is True) in the same order as paths2audio_files
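
For example (the model name and wav paths are hypothetical):

    import nemo.collections.asr as nemo_asr

    asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="QuartzNet15x5Base-En")

    # Transcripts come back in the same order as the input paths.
    files = ["speech1.wav", "speech2.wav"]
    for path, text in zip(files, asr_model.transcribe(paths2audio_files=files, batch_size=2)):
        print(path, "->", text)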

validation_step(batch, batch_idx, dataloader_idx=0)[source]
class nemo.collections.asr.models.EncDecCTCModelBPE(*args: Any, **kwargs: Any)[source]

Bases: pytorch_lightning.LightningModule, nemo.core.classes.common.Model

Encoder decoder CTC-based models with Byte Pair Encoding.

change_vocabulary(new_tokenizer_dir: str, new_tokenizer_type: str)[source]

Changes vocabulary of the tokenizer used during CTC decoding process. Use this method when fine-tuning from a pre-trained model. This method changes only the decoder and leaves the encoder and pre-processing modules unchanged. For example, you would use it if you want to use a pretrained encoder when fine-tuning on data in another language, or when you need the model to learn capitalization, punctuation and/or special characters.

Parameters
  • new_tokenizer_dir – Path to the new tokenizer directory.

  • new_tokenizer_type – Either bpe or wpe. bpe is used for SentencePiece tokenizers, whereas wpe is used for BertTokenizer.

Returns: None
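
A minimal sketch (the model name and tokenizer directory are hypothetical; the directory must contain a trained tokenizer of the matching type):

    import nemo.collections.asr as nemo_asr

    bpe_model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained(model_name="stt_en_citrinet_256")
    bpe_model.change_vocabulary(new_tokenizer_dir="tokenizers/target_lang_bpe", new_tokenizer_type="bpe")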

classmethod list_available_models() → Optional[nemo.core.classes.common.PretrainedModelInfo][source]

This method returns a list of pre-trained models which can be instantiated directly from NVIDIA’s NGC cloud.

Returns

List of available pre-trained models.

class nemo.collections.asr.models.EncDecRNNTModel(*args: Any, **kwargs: Any)[source]

Bases: pytorch_lightning.LightningModule, nemo.core.classes.common.Model

Base class for encoder decoder RNNT-based models.

change_decoding_strategy(decoding_cfg: omegaconf.DictConfig)[source]

Changes decoding strategy used during RNNT decoding process.

Parameters

decoding_cfg – A config for the RNNT decoding strategy. If the decoding type needs to be changed (from, say, greedy to beam decoding), the new config can be passed here.
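
For example, a sketch of switching from greedy to beam decoding (the model name is illustrative, and the decoding config field names may vary across NeMo versions):

    import copy

    import nemo.collections.asr as nemo_asr
    from omegaconf import open_dict

    rnnt_model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained(
        model_name="stt_en_conformer_transducer_large"
    )

    # Copy the active decoding config and switch strategies.
    decoding_cfg = copy.deepcopy(rnnt_model.cfg.decoding)
    with open_dict(decoding_cfg):
        decoding_cfg.strategy = "beam"
        decoding_cfg.beam.beam_size = 4
    rnnt_model.change_decoding_strategy(decoding_cfg)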

change_vocabulary(new_vocabulary: List[str], decoding_cfg: Optional[omegaconf.DictConfig] = None)[source]

Changes vocabulary used during RNNT decoding process. Use this method when fine-tuning from a pre-trained model. This method changes only the decoder and leaves the encoder and pre-processing modules unchanged. For example, you would use it if you want to use a pretrained encoder when fine-tuning on data in another language, or when you need the model to learn capitalization, punctuation and/or special characters.

Parameters
  • new_vocabulary – list with new vocabulary. Must contain at least 2 elements. Typically, this is target alphabet.

  • decoding_cfg – An optional config for the decoding strategy. If the decoding type needs to be changed (from, say, greedy to beam decoding), the config can be passed here.

Returns: None

forward(input_signal=None, input_signal_length=None, processed_signal=None, processed_signal_length=None)[source]

Forward pass of the model. Note that for RNNT Models, the forward pass of the model is a 3 step process, and this method only performs the first step - forward of the acoustic model.

Please refer to the training_step in order to see the full forward step for training - which performs the forward of the acoustic model, the prediction network and then the joint network. Finally, it computes the loss and possibly computes the detokenized text via the decoding step.

Please refer to the validation_step in order to see the full forward step for inference - which performs the forward of the acoustic model, the prediction network and then the joint network. Finally, it computes the decoded tokens via the decoding step and possibly computes the batch metrics.

Parameters
  • input_signal – Tensor that represents a batch of raw audio signals, of shape [B, T]. T here represents timesteps, with 1 second of audio represented as self.sample_rate number of floating point values.

  • input_signal_length – Vector of length B, that contains the individual lengths of the audio sequences.

  • processed_signal – Tensor that represents a batch of processed audio signals, of shape (B, D, T) that has undergone processing via some DALI preprocessor.

  • processed_signal_length – Vector of length B, that contains the individual lengths of the processed audio sequences.

Returns

A tuple of 2 elements - 1) The log probabilities tensor of shape [B, T, D]. 2) The lengths of the acoustic sequence after propagation through the encoder, of shape [B].

property input_types

Define these to enable input neural type checks

classmethod list_available_models() → Optional[nemo.core.classes.common.PretrainedModelInfo][source]

This method returns a list of pre-trained models which can be instantiated directly from NVIDIA’s NGC cloud.

Returns

List of available pre-trained models.

multi_test_epoch_end(outputs, dataloader_idx: int = 0)[source]

Adds support for multiple test datasets. Should be overridden by subclasses to obtain appropriate logs for each of the dataloaders.

Parameters
  • outputs – Same as that provided by LightningModule.validation_epoch_end() for a single dataloader.

  • dataloader_idx – int representing the index of the dataloader.

Returns

A dictionary of values, optionally containing a sub-dict log, such that the values in the log will be prepended with the dataloader prefix.

multi_validation_epoch_end(outputs, dataloader_idx: int = 0)[source]

Adds support for multiple validation datasets. Should be overridden by subclasses to obtain appropriate logs for each of the dataloaders.

Parameters
  • outputs – Same as that provided by LightningModule.validation_epoch_end() for a single dataloader.

  • dataloader_idx – int representing the index of the dataloader.

Returns

A dictionary of values, optionally containing a sub-dict log, such that the values in the log will be prepended with the dataloader prefix.

on_after_backward()[source]
property output_types

Define these to enable output neural type checks

setup_test_data(test_data_config: Optional[Union[omegaconf.DictConfig, Dict]])[source]

Sets up the test data loader via a Dict-like object.

Parameters

test_data_config – A config that contains the information regarding construction of the ASR test dataset.

Supported datasets are described in the Datasets section of this page.
setup_training_data(train_data_config: Optional[Union[omegaconf.DictConfig, Dict]])[source]

Sets up the training data loader via a Dict-like object.

Parameters

train_data_config – A config that contains the information regarding construction of an ASR Training dataset.

Supported datasets are described in the Datasets section of this page.
setup_validation_data(val_data_config: Optional[Union[omegaconf.DictConfig, Dict]])[source]

Sets up the validation data loader via a Dict-like object.

Parameters

val_data_config – A config that contains the information regarding construction of the ASR validation dataset.

Supported datasets are described in the Datasets section of this page.
test_step(batch, batch_idx, dataloader_idx=0)[source]
training_step(batch, batch_nb)[source]
transcribe(paths2audio_files: List[str], batch_size: int = 4, return_hypotheses: bool = False) → List[str]

Uses greedy decoding to transcribe audio files. Use this method for debugging and prototyping.

Parameters
  • paths2audio_files – (a list) of paths to audio files. Recommended length per file is between 5 and 25 seconds, though a file several hours long can be passed if enough GPU memory is available.

  • batch_size – (int) batch size to use during inference. Larger values improve throughput but use more memory.

  • return_hypotheses – (bool) Whether to return hypotheses instead of text. With hypotheses, post-processing such as extracting timestamps or rescoring is possible.

Returns

A list of transcriptions in the same order as paths2audio_files

validation_step(batch, batch_idx, dataloader_idx=0)[source]
class nemo.collections.asr.models.EncDecRNNTBPEModel(*args: Any, **kwargs: Any)[source]

Bases: pytorch_lightning.LightningModule, nemo.core.classes.common.Model

Base class for encoder decoder RNNT-based models with subword tokenization.

change_decoding_strategy(decoding_cfg: omegaconf.DictConfig)[source]

Changes decoding strategy used during RNNT decoding process.

Parameters

decoding_cfg – A config for the RNNT decoding strategy. If the decoding type needs to be changed (from, say, greedy to beam decoding), the new config can be passed here.

change_vocabulary(new_tokenizer_dir: str, new_tokenizer_type: str, decoding_cfg: Optional[omegaconf.DictConfig] = None)[source]

Changes vocabulary used during RNNT decoding process. Use this method when fine-tuning from a pre-trained model. This method changes only the decoder and leaves the encoder and pre-processing modules unchanged. For example, you would use it if you want to use a pretrained encoder when fine-tuning on data in another language, or when you need the model to learn capitalization, punctuation and/or special characters.

Parameters
  • new_tokenizer_dir – Directory path to tokenizer.

  • new_tokenizer_type – Type of tokenizer. Can be either bpe or wpe.

  • decoding_cfg – An optional config for the decoding strategy. If the decoding type needs to be changed (from, say, greedy to beam decoding), the config can be passed here.

Returns: None

classmethod list_available_models() → Optional[nemo.core.classes.common.PretrainedModelInfo][source]

This method returns a list of pre-trained models which can be instantiated directly from NVIDIA’s NGC cloud.

Returns

List of available pre-trained models.

class nemo.collections.asr.models.EncDecClassificationModel(*args: Any, **kwargs: Any)[source]

Bases: pytorch_lightning.LightningModule, nemo.core.classes.common.Model

Encoder decoder Classification models.

change_labels(new_labels: List[str])[source]

Changes labels used by the decoder model. Use this method when fine-tuning from a pre-trained model. This method changes only the decoder and leaves the encoder and pre-processing modules unchanged. For example, you would use it if you want to use a pretrained encoder when fine-tuning on another dataset.

If new_labels == self.decoder.vocabulary then nothing will be changed.

Parameters

new_labels – list containing the new labels. Must contain at least 2 elements. Typically, this is the set of labels for the dataset.

Returns: None

forward(input_signal, input_signal_length)[source]
property input_types

Define these to enable input neural type checks

classmethod list_available_models() → Optional[List[nemo.core.classes.common.PretrainedModelInfo]][source]

This method returns a list of pre-trained models which can be instantiated directly from NVIDIA’s NGC cloud.

Returns

List of available pre-trained models.

multi_test_epoch_end(outputs, dataloader_idx: int = 0)[source]

Adds support for multiple test datasets. Should be overridden by subclasses to obtain appropriate logs for each of the dataloaders.

Parameters
  • outputs – Same as that provided by LightningModule.validation_epoch_end() for a single dataloader.

  • dataloader_idx – int representing the index of the dataloader.

Returns

A dictionary of values, optionally containing a sub-dict log, such that the values in the log will be prepended with the dataloader prefix.

multi_validation_epoch_end(outputs, dataloader_idx: int = 0)[source]

Adds support for multiple validation datasets. Should be overridden by subclasses to obtain appropriate logs for each of the dataloaders.

Parameters
  • outputs – Same as that provided by LightningModule.validation_epoch_end() for a single dataloader.

  • dataloader_idx – int representing the index of the dataloader.

Returns

A dictionary of values, optionally containing a sub-dict log, such that the values in the log will be prepended with the dataloader prefix.

property output_types

Define these to enable output neural type checks

setup_test_data(test_data_config: Optional[Union[omegaconf.DictConfig, Dict]])[source]

(Optionally) Sets up the data loader to be used in testing.

Parameters

test_data_config – test data layer parameters.

Returns:

setup_training_data(train_data_config: Optional[Union[omegaconf.DictConfig, Dict]])[source]

Sets up the data loader to be used in training.

Parameters

train_data_config – training data layer parameters.

Returns:

setup_validation_data(val_data_config: Optional[Union[omegaconf.DictConfig, Dict]])[source]

Sets up the data loader to be used in validation.

Parameters

val_data_config – validation data layer parameters.

Returns:

test_dataloader()[source]
test_step(batch, batch_idx, dataloader_idx=0)[source]
training_step(batch, batch_nb)[source]
transcribe(paths2audio_files: List[str], batch_size: int = 4, logprobs=False) → List[str]

Generate class labels for provided audio files. Use this method for debugging and prototyping.

Parameters
  • paths2audio_files – (a list) of paths to audio files. Recommended length per file is approximately 1 second.

  • batch_size – (int) batch size to use during inference. Larger values improve throughput but use more memory.

  • logprobs – (bool) pass True to get log probabilities instead of class labels.

Returns

A list of transcriptions (or raw log probabilities if logprobs is True) in the same order as paths2audio_files

validation_step(batch, batch_idx, dataloader_idx=0)[source]
class nemo.collections.asr.models.EncDecSpeakerLabelModel(*args: Any, **kwargs: Any)[source]

Bases: pytorch_lightning.LightningModule, nemo.core.classes.common.Model

Encoder decoder class for speaker label models. The model class creates training and validation methods for setting up data and performing the model forward pass. Expects a config dict for:
  • preprocessor

  • Jasper/QuartzNet Encoder

  • Speaker Decoder

forward(input_signal, input_signal_length)[source]
property input_types

Define these to enable input neural type checks

classmethod list_available_models() → List[nemo.core.classes.common.PretrainedModelInfo][source]

This method returns a list of pre-trained models which can be instantiated directly from NVIDIA’s NGC cloud.

Returns

List of available pre-trained models.

multi_test_epoch_end(outputs, dataloader_idx: int = 0)[source]

Adds support for multiple test datasets. Should be overridden by subclasses to obtain appropriate logs for each of the dataloaders.

Parameters
  • outputs – Same as that provided by LightningModule.validation_epoch_end() for a single dataloader.

  • dataloader_idx – int representing the index of the dataloader.

Returns

A dictionary of values, optionally containing a sub-dict log, such that the values in the log will be prepended with the dataloader prefix.

multi_validation_epoch_end(outputs, dataloader_idx: int = 0)[source]

Adds support for multiple validation datasets. Should be overridden by subclasses to obtain appropriate logs for each of the dataloaders.

Parameters
  • outputs – Same as that provided by LightningModule.validation_epoch_end() for a single dataloader.

  • dataloader_idx – int representing the index of the dataloader.

Returns

A dictionary of values, optionally containing a sub-dict log, such that the values in the log will be prepended with the dataloader prefix.

property output_types

Define these to enable output neural type checks

setup_finetune_model(model_config: omegaconf.DictConfig)[source]

Sets up training data, validation data, and test data with the newly provided config. This checks for the labels set up during training from scratch; if None, it sets up labels for the provided fine-tune data from the manifest files.

Parameters

model_config – cfg which has train_ds, optional validation_ds, optional test_ds, and the mandatory encoder and decoder model params. Make sure you set num_classes correctly for the fine-tune data.

Returns: None

setup_test_data(test_data_layer_params: Optional[Union[omegaconf.DictConfig, Dict]])[source]

(Optionally) Sets up the data loader to be used in testing.

Parameters

test_data_layer_params – test data layer parameters.

Returns:

setup_training_data(train_data_layer_config: Optional[Union[omegaconf.DictConfig, Dict]])[source]

Sets up the data loader to be used in training.

Parameters

train_data_layer_config – training data layer parameters.

Returns:

setup_validation_data(val_data_layer_config: Optional[Union[omegaconf.DictConfig, Dict]])[source]

Sets up the data loader to be used in validation.

Parameters

val_data_layer_config – validation data layer parameters.

Returns:

test_dataloader()[source]
test_step(batch, batch_idx, dataloader_idx: int = 0)[source]
training_step(batch, batch_idx)[source]
validation_step(batch, batch_idx, dataloader_idx: int = 0)[source]

Modules

class nemo.collections.asr.modules.ConvASREncoder(*args: Any, **kwargs: Any)[source]

Bases: torch.nn.Module, nemo.core.classes.common.Typing, nemo.core.classes.common.Serialization, nemo.core.classes.common.FileIO

Convolutional encoder for ASR models. With this class you can implement JasperNet and QuartzNet models.

Based on these papers:

https://arxiv.org/pdf/1904.03288.pdf https://arxiv.org/pdf/1910.10261.pdf

property disabled_deployment_input_names

Implement this method to return a set of input names disabled for export

property disabled_deployment_output_names

Implement this method to return a set of output names disabled for export

forward(audio_signal, length=None)[source]
input_example()[source]

Generates input examples for tracing etc.

Returns

A tuple of input examples.

property input_types

Returns definitions of module input ports.

property output_types

Returns definitions of module output ports.

classmethod restore_from(restore_path: str)[source]

Restores module/model with weights

save_to(save_path: str)[source]

Saves module/model with weights

class nemo.collections.asr.modules.ConvASRDecoder(*args: Any, **kwargs: Any)[source]

Bases: torch.nn.Module, nemo.core.classes.common.Typing, nemo.core.classes.common.Serialization, nemo.core.classes.common.FileIO

Simple ASR Decoder for use with CTC-based models such as JasperNet and QuartzNet

Based on these papers:

https://arxiv.org/pdf/1904.03288.pdf https://arxiv.org/pdf/1910.10261.pdf https://arxiv.org/pdf/2005.04290.pdf
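
For instance, a minimal sketch of constructing a CTC decoder head (the sizes and vocabulary are illustrative; feat_in must match the encoder’s output channels):

    from nemo.collections.asr.modules import ConvASRDecoder

    # A 28-symbol character set: space, a-z, apostrophe.
    vocabulary = list(" abcdefghijklmnopqrstuvwxyz'")
    decoder = ConvASRDecoder(feat_in=1024, num_classes=len(vocabulary), vocabulary=vocabulary)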

forward(encoder_output)[source]
input_example()[source]

Generates input examples for tracing etc.

Returns

A tuple of input examples.

property input_types

Define these to enable input neural type checks

property num_classes_with_blank
property output_types

Define these to enable output neural type checks

classmethod restore_from(restore_path: str)[source]

Restores module/model with weights

save_to(save_path: str)[source]

Saves module/model with weights

property vocabulary
class nemo.collections.asr.modules.ConvASRDecoderClassification(*args: Any, **kwargs: Any)[source]

Bases: torch.nn.Module, nemo.core.classes.common.Typing, nemo.core.classes.common.Serialization, nemo.core.classes.common.FileIO

Simple ASR Decoder for use with classification models such as JasperNet and QuartzNet

Based on these papers:

https://arxiv.org/pdf/2005.04290.pdf

forward(encoder_output)[source]
input_example()[source]

Generates input examples for tracing etc.

Returns

A tuple of input examples.

property input_types

Define these to enable input neural type checks

property num_classes
property output_types

Define these to enable output neural type checks

class nemo.collections.asr.modules.SpeakerDecoder(*args: Any, **kwargs: Any)[source]

Bases: torch.nn.Module, nemo.core.classes.common.Typing, nemo.core.classes.common.Serialization, nemo.core.classes.common.FileIO

Speaker Decoder creates the final neural layers that map from the outputs of the Jasper Encoder to the embedding layer, followed by a speaker-based softmax loss.

Parameters
  • feat_in (int) – Number of channels being input to this module

  • num_classes (int) – Number of unique speakers in the dataset

  • emb_sizes (list) – Shapes of intermediate embedding layers (speaker embeddings are taken from the first of these layers). Defaults to [1024, 1024].

  • pool_mode (str) – Pooling strategy type. Options are ‘gram’, ‘xvector’, ‘superVector’. Defaults to ‘xvector’.

  • init_mode (str) – Describes how neural network parameters are initialized. Options are [‘xavier_uniform’, ‘xavier_normal’, ‘kaiming_uniform’, ‘kaiming_normal’]. Defaults to “xavier_uniform”.

affineLayer(inp_shape, out_shape, learn_mean=True)[source]
forward(encoder_output)[source]
input_example()[source]

Generates input examples for tracing etc.

Returns

A tuple of input examples.

property input_types

Define these to enable input neural type checks

property output_types

Define these to enable output neural type checks

class nemo.collections.asr.modules.ConformerEncoder(*args: Any, **kwargs: Any)[source]

Bases: torch.nn.Module, nemo.core.classes.common.Typing, nemo.core.classes.common.Serialization, nemo.core.classes.common.FileIO

The Conformer encoder for ASR models. Based on the paper ‘Conformer: Convolution-augmented Transformer for Speech Recognition’ by Anmol Gulati et al. https://arxiv.org/abs/2005.08100

Parameters
  • feat_in (int) – the size of feature channels

  • n_layers (int) – number of layers of ConformerBlock

  • d_model (int) – the hidden size of the model

  • feat_out (int) – the size of the output features. Defaults to -1 (meaning feat_out equals d_model).

  • subsampling (str) – the method of subsampling, choices=[‘vggnet’, ‘striding’] Defaults to striding.

  • subsampling_factor (int) – the subsampling factor, which should be a power of 2. Defaults to 4.

  • subsampling_conv_channels (int) – the size of the convolutions in the subsampling module Defaults to -1 which would set it to d_model.

  • ff_expansion_factor (int) – the expansion factor in feed forward layers Defaults to 4.

  • self_attention_model (str) – type of the attention layer and positional encoding. ‘rel_pos’: relative positional embedding and Transformer-XL. ‘abs_pos’: absolute positional embedding and Transformer. Defaults to rel_pos.

  • pos_emb_max_len (int) – the maximum length of positional embeddings. Defaults to 5000.

  • n_heads (int) – number of heads in multi-headed attention layers Defaults to 4.

  • xscaling (bool) – enables scaling the inputs to the multi-headed attention layers by sqrt(d_model) Defaults to True.

  • untie_biases (bool) – whether to not share (untie) the bias weights between layers of Transformer-XL Defaults to True.

  • conv_kernel_size (int) – the size of the convolutions in the convolutional modules Defaults to 31.

  • dropout (float) – the dropout rate used in all layers except the attention layers Defaults to 0.1.

  • dropout_emb (float) – the dropout rate used for the positional embeddings Defaults to 0.1.

  • dropout_att (float) – the dropout rate used for the attention layer Defaults to 0.0.
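
A minimal construction sketch (the sizes are illustrative, not a published configuration; feat_in should match the preprocessor’s feature count):

    from nemo.collections.asr.modules import ConformerEncoder

    encoder = ConformerEncoder(feat_in=80, n_layers=4, d_model=256, n_heads=4)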

forward(audio_signal, length=None)[source]
input_example()[source]

Generates input examples for tracing etc.

Returns

A tuple of input examples.

property input_types

Returns definitions of module input ports.

static make_pad_mask(seq_lens, max_time, device=None)[source]

Make masking for padding.

property output_types

Returns definitions of module output ports.

Parts

class nemo.collections.asr.parts.jasper.JasperBlock(*args: Any, **kwargs: Any)[source]

Bases: torch.nn.Module

Constructs a single “Jasper” block. With modified parameters, also constructs other blocks for models such as QuartzNet and Citrinet.

  • For Jasper : separable flag should be False

  • For QuartzNet : separable flag should be True

  • For Citrinet : separable flag and se flag should be True

Note that the above are general distinctions; each model has intricate differences that extend over multiple such blocks.

For further information about the differences between models which use JasperBlock, please review the configs for ASR models found in the ASR examples directory.

Parameters
  • inplanes – Number of input channels.

  • planes – Number of output channels.

  • repeat – Number of repeated sub-blocks (R) for this block.

  • kernel_size – Convolution kernel size across all repeated sub-blocks.

  • kernel_size_factor – Floating point scale value that is multiplied with kernel size, then rounded down to nearest odd integer to compose the kernel size. Defaults to 1.0.

  • stride – Stride of the convolutional layers.

  • dilation – Integer which defines the dilation factor of the kernel. Note that when dilation > 1, stride must be equal to 1.

  • padding – String representing type of padding. Currently only supports “same” padding, which symmetrically pads the input tensor with zeros.

  • dropout – Floating point value that determines the percentage of the output that is zeroed out.

  • activation – String representing activation functions. Valid activation functions are : {“hardtanh”: nn.Hardtanh, “relu”: nn.ReLU, “selu”: nn.SELU, “swish”: Swish}. Defaults to “relu”.

  • residual – Bool that determines whether a residual branch should be added or not. All residual branches are constructed using a pointwise convolution kernel, which may or may not perform strided convolution depending on the parameter residual_mode.

  • groups – Number of groups for Grouped Convolutions. Defaults to 1.

  • separable – Bool flag that describes whether Time-Channel depthwise separable convolution should be constructed, or ordinary convolution should be constructed.

  • heads – Number of “heads” for the masked convolution. Defaults to -1, which disables it.

  • normalization – String that represents type of normalization performed. Can be one of “batch”, “group”, “instance” or “layer” to compute BatchNorm1D, GroupNorm1D, InstanceNorm or LayerNorm (which are special cases of GroupNorm1D).

  • norm_groups – Number of groups used for GroupNorm (if normalization == “group”).

  • residual_mode – String argument which describes whether the residual branch should be simply added (“add”) or should first stride, then add (“stride_add”). Required when performing stride on parallel branch as well as utilizing residual add.

  • residual_panes – Number of residual panes, used for Jasper-DR models. Please refer to the paper.

  • conv_mask – Bool flag which determines whether to utilize masked convolutions or not. In general, it should be set to True.

  • se – Bool flag that determines whether Squeeze-and-Excitation layer should be used.

  • se_reduction_ratio – Integer value which determines to what extent the hidden dimension of the SE intermediate step should be reduced. Larger values reduce the number of parameters, but also limit the effectiveness of SE layers.

  • se_context_window – Integer value determining the number of timesteps that should be utilized in order to compute the averaged context window. Defaults to -1, which means it uses global context - such that all timesteps are averaged. If any positive integer is used, it will utilize limited context window of that size.

  • se_interpolation_mode – String used for interpolation mode of timestep dimension for SE blocks. Used only if context window is > 1. The modes available for resizing are: nearest, linear (3D-only), bilinear, area.

  • stride_last – Bool flag that determines whether all repeated blocks should stride at once, (stride of S^R when this flag is False) or just the last repeated block should stride (stride of S when this flag is True).

  • quantize – Bool flag whether to quantize the Convolutional blocks.

forward(input_: Tuple[List[torch.Tensor], Optional[torch.Tensor]])[source]

Forward pass of the module.

Parameters

input – The input is a tuple of two values - the preprocessed audio signal as well as the lengths of the audio signal. The audio signal is padded to the shape [B, D, T] and the lengths are a torch vector of length B.

Returns

The output of the block after processing the input through repeat number of sub-blocks, as well as the lengths of the encoded audio after padding/striding.

Mixins

class nemo.collections.asr.parts.mixins.ASRBPEMixin[source]

Bases: abc.ABC

ASR BPE Mixin class that sets up a Tokenizer via a config

This mixin class adds the method _setup_tokenizer(…), which can be used by ASR models which depend on subword tokenization.

The _setup_tokenizer method adds the following attributes to the class -
  • tokenizer_cfg: The resolved config supplied to the tokenizer (with dir and type arguments).

  • tokenizer_dir: The directory path to the tokenizer vocabulary + additional metadata.

  • tokenizer_type: The type of the tokenizer. Currently supports bpe and wpe.

  • vocab_path: Resolved path to the vocabulary text file.

In addition to these variables, the method will also instantiate and preserve a tokenizer (subclass of TokenizerSpec) if successful, and assign it to self.tokenizer.
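
For example, once a BPE model is constructed, the preserved tokenizer can be used directly (the model name is illustrative; text_to_ids/ids_to_text are standard TokenizerSpec helpers):

    import nemo.collections.asr as nemo_asr

    bpe_model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained(model_name="stt_en_citrinet_256")

    ids = bpe_model.tokenizer.text_to_ids("hello world")
    print(ids, bpe_model.tokenizer.ids_to_text(ids))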

Datasets

Character Encoding Datasets

class nemo.collections.asr.data.audio_to_text.AudioToCharDataset(*args: Any, **kwargs: Any)[source]

Bases: torch.utils.data.Dataset, nemo.core.classes.common.Typing, nemo.core.classes.common.Serialization

Dataset that loads tensors via a json file containing paths to audio files, transcripts, and durations (in seconds). Each new line is a different sample. Example below:

{"audio_filepath": "/path/to/audio.wav", "text_filepath": "/path/to/audio.txt", "duration": 23.147}
…
{"audio_filepath": "/path/to/audio.wav", "text": "the transcription", "offset": 301.75, "duration": 0.82, "utt": "utterance_id", "ctm_utt": "en_4156", "side": "A"}

Parameters
  • manifest_filepath – Path to manifest json as described above. Can be comma-separated paths.

  • labels – String containing all the possible characters to map to

  • sample_rate (int) – Sample rate to resample loaded audio to

  • int_values (bool) – If true, load samples as 32-bit integers. Defaults to False.

  • augmentor (nemo.collections.asr.parts.perturb.AudioAugmentor) – An AudioAugmentor object used to augment loaded audio

  • max_duration – If audio exceeds this length, do not include in dataset

  • min_duration – If audio is less than this length, do not include in dataset

  • max_utts – Limit number of utterances

  • blank_index – blank character index, default = -1

  • unk_index – unk_character index, default = -1

  • normalize – whether to normalize transcript text. Defaults to True.

  • bos_id – Id of beginning of sequence symbol to append if not None

  • eos_id – Id of end of sequence symbol to append if not None

  • load_audio – Boolean flag indicating whether or not to load audio.

  • add_misc – True if an additional info dict should be added.
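
A small sketch of writing such a manifest (paths, transcripts, and durations are hypothetical):

    import json

    # Each manifest line is a standalone JSON object.
    samples = [
        {"audio_filepath": "/data/sample1.wav", "text": "the transcription", "duration": 2.4},
        {"audio_filepath": "/data/sample2.wav", "text": "another utterance", "duration": 3.1},
    ]
    with open("train_manifest.json", "w") as f:
        for sample in samples:
            f.write(json.dumps(sample) + "\n")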

property output_types

Returns definitions of module output ports.

class nemo.collections.asr.data.audio_to_text.TarredAudioToCharDataset(*args: Any, **kwargs: Any)[source]

Bases: torch.utils.data.Dataset, nemo.core.classes.common.Typing, nemo.core.classes.common.Serialization

A similar Dataset to the AudioToCharDataset, but which loads tarred audio files.

Accepts a single comma-separated JSON manifest file (in the same style as for the AudioToCharDataset), as well as the path(s) to the tarball(s) containing the wav files. Each line of the manifest should contain the information for one audio file, including at least the transcript and name of the audio file within the tarball.

Valid formats for the audio_tar_filepaths argument include: (1) a single string that can be brace-expanded, e.g. ‘path/to/audio.tar’ or ‘path/to/audio_{1..100}.tar.gz’, or (2) a list of file paths that will not be brace-expanded, e.g. [‘audio_1.tar’, ‘audio_2.tar’, …].

See the WebDataset documentation for more information about accepted data and input formats.

If using multiple workers, the number of shards should be divisible by world_size to ensure an even split among workers. If it is not divisible, logging will give a warning but training will proceed. In addition, if using multiprocessing, each shard MUST HAVE THE SAME NUMBER OF ENTRIES after filtering is applied. We currently do not check for this, but your program may hang if the shards are uneven!

Notice that a few arguments are different from the AudioToCharDataset; for example, shuffle (bool) has been replaced by shuffle_n (int).

Additionally, please note that the len() of this DataLayer is assumed to be the length of the manifest after filtering. An incorrect manifest length may lead to some DataLoader issues down the line.

Parameters
  • audio_tar_filepaths – Either a list of audio tarball filepaths, or a string (can be brace-expandable).

  • manifest_filepath (str) – Path to the manifest.

  • labels (list) – List of characters that can be output by the ASR model. For Jasper, this is the 28 character set {a-z ‘}. The CTC blank symbol is automatically added later for models using ctc.

  • sample_rate (int) – Sample rate to resample loaded audio to

  • int_values (bool) – If true, load samples as 32-bit integers. Defaults to False.

  • augmentor (nemo.collections.asr.parts.perturb.AudioAugmentor) – An AudioAugmentor object used to augment loaded audio

  • shuffle_n (int) – How many samples to look ahead and load to be shuffled. See WebDataset documentation for more details. Defaults to 0.

  • min_duration (float) – Dataset parameter. All training files which have a duration less than min_duration are dropped. Note: Duration is read from the manifest JSON. Defaults to 0.1.

  • max_duration (float) – Dataset parameter. All training files which have a duration more than max_duration are dropped. Note: Duration is read from the manifest JSON. Defaults to None.

  • max_utts (int) – Limit number of utterances. 0 means no maximum.

  • blank_index (int) – Blank character index, defaults to -1.

  • unk_index (int) – Unknown character index, defaults to -1.

  • normalize (bool) – Dataset parameter. Whether to use automatic text cleaning. It is highly recommended to manually clean text for best results. Defaults to True.

  • trim (bool) – Whether to trim silence from the beginning and end of the audio signal using librosa.effects.trim(). Defaults to False.

  • bos_id (id) – Dataset parameter. Beginning of string symbol id used for seq2seq models. Defaults to None.

  • eos_id (id) – Dataset parameter. End of string symbol id used for seq2seq models. Defaults to None.

  • pad_id (id) – Token used to pad when collating samples in batches. If this is None, pads using 0s. Defaults to None.

  • shard_strategy (str) – Tarred dataset shard distribution strategy, chosen as a str value during ddp.

    • scatter: The default shard strategy applied by WebDataset, where each node gets a unique set of shards, which are permanently pre-allocated and never changed at runtime.

    • replicate: Optional shard strategy, where each node gets all of the set of shards available in the tarred dataset, which are permanently pre-allocated and never changed at runtime. The benefit of replication is that it allows each node to sample data points from the entire dataset independently of other nodes, and reduces dependence on the value of shuffle_n.

      Note: The replicated strategy allows every node to sample the entire set of available tarfiles, and therefore more than one node may sample the same tarfile, and even sample the same data points! As such, there is no guarantee that all samples in the dataset will be sampled at least once during 1 epoch.

  • global_rank (int) – Worker rank, used for partitioning shards. Defaults to 0.

  • world_size (int) – Total number of processes, used for partitioning shards. Defaults to 0.

class nemo.collections.asr.data.audio_to_text.AudioToCharWithDursDataset(*args: Any, **kwargs: Any)[source]

Bases: torch.utils.data.Dataset, nemo.core.classes.common.Typing, nemo.core.classes.common.Serialization

Dataset that loads tensors via a json file containing paths to audio files, transcripts, and durations (in seconds). Each new line is a different sample. Example below:

{"audio_filepath": "/path/to/audio.wav", "text_filepath": "/path/to/audio.txt", "duration": 23.147}
…
{"audio_filepath": "/path/to/audio.wav", "text": "the transcription", "offset": 301.75, "duration": 0.82, "utt": "utterance_id", "ctm_utt": "en_4156", "side": "A"}

Additionally, the user provides a path to precomputed durations, which is a pickled python dict with ‘tags’ and ‘durs’ keys, both of which are lists of example values. A tag is a unique example identifier: a wav filename without suffix. Durations are an additional tuple of two tensors, grapheme durations and blank durations. Example below:

{'tags': ['LJ050-0234', 'LJ019-0373'], 'durs': [(graphemes_durs0, blanks_durs0), (graphemes_durs1, blanks_durs1)]}

Parameters
  • **kwargs – Passed to AudioToCharDataset constructor.

  • durs_path (str) – Path to the pickled ‘[(tag, durs)]’ durations file described above.

  • rep (bool) – True to repeat text graphemes according to durs.

  • vocab – Vocabulary config (parser + set of graphemes to use). Constructor propagates these to self.make_vocab function call to build a complete vocabulary.

static make_vocab(notation='chars', punct=True, spaces=False, stresses=False)[source]

Constructs vocabulary from given parameters.

Parameters
  • notation (str) – Either ‘chars’ or ‘phonemes’ as general notation.

  • punct (bool) – True to reserve a grapheme for basic punctuation.

  • spaces (bool) – True to prepend spaces to every punctuation symbol.

  • stresses (bool) – True to use phoneme codes with stresses (0-2).

Returns

(vocabs.Base) Vocabulary

property output_types

Returns definitions of module output ports.

Subword Encoding Datasets

class nemo.collections.asr.data.audio_to_text.AudioToBPEDataset(*args: Any, **kwargs: Any)[source]

Bases: torch.utils.data.Dataset, nemo.core.classes.common.Typing, nemo.core.classes.common.Serialization

Dataset that loads tensors via a json file containing paths to audio files, transcripts, and durations (in seconds). Each new line is a different sample. Example below:

{"audio_filepath": "/path/to/audio.wav", "text_filepath": "/path/to/audio.txt", "duration": 23.147}
…
{"audio_filepath": "/path/to/audio.wav", "text": "the transcription", "offset": 301.75, "duration": 0.82, "utt": "utterance_id", "ctm_utt": "en_4156", "side": "A"}

In practice, the dataset and manifest used for character encoding and byte pair encoding are exactly the same. The only difference lies in how the dataset tokenizes the text in the manifest.

Parameters
  • manifest_filepath – Path to manifest json as described above. Can be comma-separated paths.

  • tokenizer – A subclass of the Tokenizer wrapper found in the common collection, nemo.collections.common.tokenizers.TokenizerSpec. ASR Models support a subset of all available tokenizers.

  • sample_rate (int) – Sample rate to resample loaded audio to

  • int_values (bool) – If true, load samples as 32-bit integers. Defaults to False.

  • augmentor (nemo.collections.asr.parts.perturb.AudioAugmentor) – An AudioAugmentor object used to augment loaded audio

  • max_duration – If audio exceeds this length, do not include in dataset

  • min_duration – If audio is less than this length, do not include in dataset

  • max_utts – Limit number of utterances

  • trim – Whether to trim silence segments

  • load_audio – Boolean flag indicating whether or not to load audio.

  • add_misc – True if an additional info dict should be added.

  • use_start_end_token – Boolean which dictates whether to add [BOS] and [EOS] tokens to beginning and ending of speech respectively.

property output_types

Returns definitions of module output ports.

class nemo.collections.asr.data.audio_to_text.TarredAudioToBPEDataset(*args: Any, **kwargs: Any)[source]

Bases: torch.utils.data.Dataset, nemo.core.classes.common.Typing, nemo.core.classes.common.Serialization

A similar Dataset to the AudioToBPEDataset, but which loads tarred audio files.

Accepts a single comma-separated JSON manifest file (in the same style as for the AudioToBPEDataset), as well as the path(s) to the tarball(s) containing the wav files. Each line of the manifest should contain the information for one audio file, including at least the transcript and name of the audio file within the tarball.

Valid formats for the audio_tar_filepaths argument include: (1) a single string that can be brace-expanded, e.g. ‘path/to/audio.tar’ or ‘path/to/audio_{1..100}.tar.gz’, or (2) a list of file paths that will not be brace-expanded, e.g. [‘audio_1.tar’, ‘audio_2.tar’, …].

See the WebDataset documentation for more information about accepted data and input formats.

If using multiple workers, the number of shards should be divisible by world_size to ensure an even split among workers. If it is not divisible, logging will give a warning but training will proceed. In addition, if using multiprocessing, each shard MUST HAVE THE SAME NUMBER OF ENTRIES after filtering is applied. We currently do not check for this, but your program may hang if the shards are uneven!

Notice that a few arguments are different from the AudioToBPEDataset; for example, shuffle (bool) has been replaced by shuffle_n (int).

Additionally, please note that the len() of this DataLayer is assumed to be the length of the manifest after filtering. An incorrect manifest length may lead to some DataLoader issues down the line.

Parameters
  • audio_tar_filepaths – Either a list of audio tarball filepaths, or a string (can be brace-expandable).

  • manifest_filepath (str) – Path to the manifest.

  • tokenizer (TokenizerSpec) – Either a Word Piece Encoding tokenizer (BERT), or a Sentence Piece Encoding tokenizer (BPE). The CTC blank symbol is automatically added later for models using ctc.

  • sample_rate (int) – Sample rate to resample loaded audio to

  • int_values (bool) – If true, load samples as 32-bit integers. Defaults to False.

  • augmentor (nemo.collections.asr.parts.perturb.AudioAugmentor) – An AudioAugmentor object used to augment loaded audio

  • shuffle_n (int) – How many samples to look ahead and load to be shuffled. See WebDataset documentation for more details. Defaults to 0.

  • min_duration (float) – Dataset parameter. All training files which have a duration less than min_duration are dropped. Note: Duration is read from the manifest JSON. Defaults to 0.1.

  • max_duration (float) – Dataset parameter. All training files which have a duration more than max_duration are dropped. Note: Duration is read from the manifest JSON. Defaults to None.

  • max_utts (int) – Limit number of utterances. 0 means no maximum.

  • trim (bool) – Whether to trim silence from the beginning and end of the audio signal using librosa.effects.trim(). Defaults to False.

  • pad_id (id) – Token used to pad when collating samples in batches. If this is None, pads using 0s. Defaults to None.

  • shard_strategy (str) – Tarred dataset shard distribution strategy, chosen as a str value during ddp.

    • scatter: The default shard strategy applied by WebDataset, where each node gets a unique set of shards, which are permanently pre-allocated and never changed at runtime.

    • replicate: Optional shard strategy, where each node gets all of the set of shards available in the tarred dataset, which are permanently pre-allocated and never changed at runtime. The benefit of replication is that it allows each node to sample data points from the entire dataset independently of other nodes, and reduces dependence on the value of shuffle_n.

      Note: The replicated strategy allows every node to sample the entire set of available tarfiles, and therefore more than one node may sample the same tarfile, and even sample the same data points! As such, there is no guarantee that all samples in the dataset will be sampled at least once during 1 epoch.

  • global_rank (int) – Worker rank, used for partitioning shards. Defaults to 0.

  • world_size (int) – Total number of processes, used for partitioning shards. Defaults to 0.

Audio Preprocessors

class nemo.collections.asr.modules.AudioToMelSpectrogramPreprocessor(*args: Any, **kwargs: Any)[source]

Bases: torch.nn.Module, nemo.core.classes.common.Typing, nemo.core.classes.common.Serialization, nemo.core.classes.common.FileIO

Featurizer module that converts wavs to mel spectrograms. We don’t use torchaudio’s implementation here because the original implementation is not the same, so for the sake of backwards-compatibility this will use the old FilterbankFeatures for now.

Parameters
  • sample_rate – Sample rate of the input audio data. Defaults to 16000.
  • window_size (float) – Size of window for fft in seconds Defaults to 0.02

  • window_stride (float) – Stride of window for fft in seconds Defaults to 0.01

  • n_window_size (int) – Size of window for fft in samples Defaults to None. Use one of window_size or n_window_size.

  • n_window_stride (int) – Stride of window for fft in samples Defaults to None. Use one of window_stride or n_window_stride.

  • window (str) – Windowing function for fft. can be one of [‘hann’, ‘hamming’, ‘blackman’, ‘bartlett’] Defaults to “hann”

  • normalize (str) – Can be one of [‘per_feature’, ‘all_features’]; all other options disable feature normalization. ‘all_features’ normalizes the entire spectrogram to be mean 0 with std 1. ‘per_feature’ normalizes per channel / freq instead. Defaults to “per_feature”

  • n_fft (int) – Length of FT window. If None, it uses the smallest power of 2 that is larger than n_window_size. Defaults to None

  • preemph (float) – Amount of pre-emphasis to add to audio. Can be disabled by passing None. Defaults to 0.97

  • features (int) – Number of mel spectrogram freq bins to output. Defaults to 64

  • lowfreq (int) – Lower bound on mel basis in Hz. Defaults to 0

  • highfreq (int) – Upper bound on mel basis in Hz. Defaults to None

  • log (bool) – Log features. Defaults to True

  • log_zero_guard_type (str) – Need to avoid taking the log of zero. There are two options: “add” or “clamp”. Defaults to “add”.

  • log_zero_guard_value (float, or str) – Add or clamp requires the number to add with or clamp to. log_zero_guard_value can either be a float or “tiny” or “eps”. torch.finfo is used if “tiny” or “eps” is passed. Defaults to 2**-24.

  • dither (float) – Amount of white-noise dithering. Defaults to 1e-5

  • pad_to (int) – Ensures that the output size of the time dimension is a multiple of pad_to. Defaults to 16

  • frame_splicing (int) – Defaults to 1

  • stft_exact_pad (bool) – If True, uses pytorch_stft and convolutions with padding such that num_frames = num_samples / hop_length. If False, stft_conv will be used to determine how stft will be performed. Defaults to False

  • stft_conv (bool) – If True, uses pytorch_stft and convolutions. If False, uses torch.stft. Defaults to False

  • pad_value (float) – The value that shorter mels are padded with. Defaults to 0

  • mag_power (float) – The power that the linear spectrogram is raised to prior to multiplication with mel basis. Defaults to 2 for a power spec
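
A minimal construction sketch (values mirror common 16 kHz recipes and are illustrative):

    from nemo.collections.asr.modules import AudioToMelSpectrogramPreprocessor

    preprocessor = AudioToMelSpectrogramPreprocessor(
        sample_rate=16000,
        window_size=0.02,
        window_stride=0.01,
        features=64,
    )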

property filter_banks
get_features(input_signal, length)[source]
property input_types

Returns definitions of module input ports.

property output_types

Returns definitions of module output ports.

processed_signal:
  0: AxisType(BatchTag)
  1: AxisType(MelSpectrogramSignalTag)
  2: AxisType(ProcessedTimeTag)

processed_length:
  0: AxisType(BatchTag)

classmethod restore_from(restore_path: str)[source]

Restores module/model with weights

save_to(save_path: str)[source]

Saves module/model with weights

class nemo.collections.asr.modules.AudioToMFCCPreprocessor(*args: Any, **kwargs: Any)[source]

Bases: torch.nn.Module, nemo.core.classes.common.Typing, nemo.core.classes.common.Serialization, nemo.core.classes.common.FileIO

Preprocessor that converts wavs to MFCCs. Uses torchaudio.transforms.MFCC.

Parameters
  • sample_rate – The sample rate of the audio. Defaults to 16000.
  • window_size – Size of window for fft in seconds. Used to calculate the win_length arg for mel spectrogram. Defaults to 0.02

  • window_stride – Stride of window for fft in seconds. Used to calculate the hop_length arg for mel spect. Defaults to 0.01

  • n_window_size – Size of window for fft in samples Defaults to None. Use one of window_size or n_window_size.

  • n_window_stride – Stride of window for fft in samples Defaults to None. Use one of window_stride or n_window_stride.

  • window – Windowing function for fft. can be one of [‘hann’, ‘hamming’, ‘blackman’, ‘bartlett’, ‘none’, ‘null’]. Defaults to ‘hann’

  • n_fft – Length of FT window. If None, it uses the smallest power of 2 that is larger than n_window_size. Defaults to None

  • lowfreq (int) – Lower bound on mel basis in Hz. Defaults to 0

  • highfreq (int) – Upper bound on mel basis in Hz. Defaults to None

  • n_mels – Number of mel filterbanks. Defaults to 64

  • n_mfcc – Number of coefficients to retain. Defaults to 64

  • dct_type – Type of discrete cosine transform to use

  • norm – Type of norm to use

  • log – Whether to use log-mel spectrograms instead of db-scaled. Defaults to True.

get_features(input_signal, length)[source]
property input_types

Returns definitions of module input ports.

property output_types

Returns definitions of module output ports.

classmethod restore_from(restore_path: str)[source]

Restores module/model with weights

save_to(save_path: str)[source]

Saves module/model with weights

Audio Augmentors

class nemo.collections.asr.modules.SpectrogramAugmentation(*args: Any, **kwargs: Any)[source]

Bases: torch.nn.Module, nemo.core.classes.common.Typing, nemo.core.classes.common.Serialization, nemo.core.classes.common.FileIO

Performs time and freq cuts in one of two ways. SpecAugment zeroes out vertical and horizontal sections as described in SpecAugment (https://arxiv.org/abs/1904.08779). Arguments for use with SpecAugment are freq_masks, time_masks, freq_width, and time_width. SpecCutout zeroes out rectangles as described in Cutout (https://arxiv.org/abs/1708.04552). Arguments for use with Cutout are rect_masks, rect_freq, and rect_time.

Parameters
  • freq_masks (int) – how many frequency segments should be cut. Defaults to 0.
  • time_masks (int) – how many time segments should be cut Defaults to 0.

  • freq_width (int) – maximum number of frequencies to be cut in one segment. Defaults to 10.

  • time_width (int) – maximum number of time steps to be cut in one segment Defaults to 10.

  • rect_masks (int) – how many rectangular masks should be cut Defaults to 0.

  • rect_freq (int) – maximum size of cut rectangles along the frequency dimension Defaults to 5.

  • rect_time (int) – maximum size of cut rectangles along the time dimension Defaults to 25.
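
For example, a sketch of applying SpecAugment-style masking to a batch of spectrograms (mask counts and widths are illustrative):

    import torch

    from nemo.collections.asr.modules import SpectrogramAugmentation

    spec_augment = SpectrogramAugmentation(freq_masks=2, time_masks=2, freq_width=15, time_width=25)
    spectrogram = torch.randn(4, 64, 200)  # [batch, features, time]
    augmented = spec_augment(input_spec=spectrogram)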

forward(input_spec)[source]
property input_types

Returns definitions of module input types

property output_types

Returns definitions of module output types

classmethod restore_from(restore_path: str)[source]

Restores module/model with weights

save_to(save_path: str)[source]

Saves module/model with weights

class nemo.collections.asr.modules.CropOrPadSpectrogramAugmentation(*args: Any, **kwargs: Any)[source]

Bases: torch.nn.Module, nemo.core.classes.common.Typing, nemo.core.classes.common.Serialization, nemo.core.classes.common.FileIO

Pad or crop the incoming spectrogram to a certain shape.

Parameters

audio_length – the final number of timesteps that is required. The signal will be either padded or cropped temporally to this size.

forward
property input_types

Returns definitions of module input ports.

property output_types

Returns definitions of module output ports.

classmethod restore_from(restore_path: str)[source]

Restores module/model with weights

save_to(save_path: str)[source]

Saves module/model with weights

class nemo.collections.asr.parts.perturb.SpeedPerturbation(sr, resample_type, min_speed_rate=0.9, max_speed_rate=1.1, num_rates=5, rng=None)[source]

Bases: nemo.collections.asr.parts.perturb.Perturbation

Performs Speed Augmentation by re-sampling the data to a different sampling rate, which does not preserve pitch.

Note: This is a very slow operation for online augmentation. If space allows, it is preferable to pre-compute and save the files to augment the dataset.

Parameters
  • sr – Original sampling rate.

  • resample_type – Type of resampling operation that will be performed. For better speed using resampy’s fast resampling method, use resample_type=’kaiser_fast’. For high-quality resampling, set resample_type=’kaiser_best’. To use scipy.signal.resample, set resample_type=’fft’ or resample_type=’scipy’

  • min_speed_rate – Minimum sampling rate modifier.

  • max_speed_rate – Maximum sampling rate modifier.

  • num_rates – Number of discrete rates to allow. Can be a positive or negative integer. If a positive integer greater than 0 is provided, the range of speed rates will be discretized into num_rates values. If a negative integer or 0 is provided, the full range of speed rates will be sampled uniformly. Note: If a positive integer is provided and the resultant discretized range of rates contains the value ‘1.0’, then those samples with rate=1.0 will not be augmented at all and are simply skipped. This is to avoid unnecessary augmentation and reduce computation time. The effective augmentation chance in such a case is prob * ((num_rates - 1) / num_rates) * 100%, where prob is the global probability of a sample being augmented.

  • rng – Random seed number.

max_augmentation_length(length)[source]
perturb(data)[source]
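
A minimal usage sketch, assuming the AudioSegment helper from nemo.collections.asr.parts.segment and a local file named sample.wav (both placeholders):

    from nemo.collections.asr.parts.perturb import SpeedPerturbation
    from nemo.collections.asr.parts.segment import AudioSegment

    # Resample-based speed perturbation in the default 0.9x-1.1x range.
    speed = SpeedPerturbation(sr=16000, resample_type="kaiser_fast")
    audio = AudioSegment.from_file("sample.wav", target_sr=16000)
    speed.perturb(audio)  # modifies the segment's samples in place
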
class nemo.collections.asr.parts.perturb.TimeStretchPerturbation(min_speed_rate=0.9, max_speed_rate=1.1, num_rates=5, n_fft=512, rng=None)[source]

Bases: nemo.collections.asr.parts.perturb.Perturbation

Time-stretch an audio series by a fixed rate while preserving pitch, based on [1, 2].

Note: This is a simplified implementation, intended primarily for reference and pedagogical purposes. It makes no attempt to handle transients, and is likely to produce audible artifacts.

References:
[1] Ellis, D. P. W. “A phase vocoder in Matlab.” Columbia University, 2002. (http://www.ee.columbia.edu/~dpwe/resources/matlab/pvoc/)
[2] librosa.effects.time_stretch (https://librosa.github.io/librosa/generated/librosa.effects.time_stretch.html)

Parameters
  • min_speed_rate – Minimum sampling rate modifier.

  • max_speed_rate – Maximum sampling rate modifier.

  • num_rates – Number of discrete rates to allow. Can be a positive or negative integer. If a positive integer is provided, the range of speed rates will be discretized into num_rates values. If a negative integer or 0 is provided, the full range of speed rates will be sampled uniformly. Note: If a positive integer is provided and the resultant discretized range of rates contains the value ‘1.0’, then those samples with rate=1.0 will not be augmented at all and are simply skipped. This is to avoid unnecessary augmentation and increased computation time. The effective augmentation chance in such a case is prob * ((num_rates - 1) / num_rates) * 100 %, where prob is the global probability of a sample being augmented.

  • n_fft – Number of fft filters to be computed.

  • rng – Random seed number.

max_augmentation_length(length)[source]
perturb(data)[source]
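
The same pitch-preserving effect can be reproduced directly with librosa.effects.time_stretch (reference [2] above); the file name here is a placeholder:

    import librosa

    y, sr = librosa.load("sample.wav", sr=16000)
    # rate > 1 speeds the audio up, rate < 1 slows it down; pitch is preserved.
    y_fast = librosa.effects.time_stretch(y, rate=1.1)
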
class nemo.collections.asr.parts.perturb.GainPerturbation(min_gain_dbfs=- 10, max_gain_dbfs=10, rng=None)[source]

Bases: nemo.collections.asr.parts.perturb.Perturbation

Applies random gain to the audio.

Parameters
  • min_gain_dbfs (float) – Minimum gain level in dBFS

  • max_gain_dbfs (float) – Maximum gain level in dBFS

  • rng – Random number generator

perturb(data)[source]
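
The core operation is a dB-to-linear conversion; a sketch of the idea (not the module's exact code):

    import numpy as np

    def apply_gain(samples: np.ndarray, gain_db: float) -> np.ndarray:
        # A gain of +6 dB roughly doubles the amplitude; -6 dB roughly halves it.
        return samples * 10.0 ** (gain_db / 20.0)
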
class nemo.collections.asr.parts.perturb.ImpulsePerturbation(manifest_path=None, rng=None, audio_tar_filepaths=None, shuffle_n=128, shift_impulse=False)[source]

Bases: nemo.collections.asr.parts.perturb.Perturbation

Convolves audio with a Room Impulse Response.

Parameters
  • manifest_path (list) – Manifest file for RIRs

  • audio_tar_filepaths (list) – Tar files, if RIR audio files are tarred

  • shuffle_n (int) – Shuffle parameter for shuffling buffered files from the tar files

  • shift_impulse (bool) – Shift impulse response to adjust for delay at the beginning

perturb(data)[source]
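
Conceptually, the perturbation convolves the utterance with an impulse response; a simplified sketch using scipy (the peak renormalization strategy here is an assumption, not necessarily what the class does):

    import numpy as np
    from scipy.signal import fftconvolve

    def convolve_rir(samples: np.ndarray, rir: np.ndarray) -> np.ndarray:
        out = fftconvolve(samples, rir, mode="full")[: len(samples)]
        # Rescale so the reverberant audio keeps the original peak level.
        return out * (np.abs(samples).max() / max(np.abs(out).max(), 1e-9))
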
class nemo.collections.asr.parts.perturb.ShiftPerturbation(min_shift_ms=- 5.0, max_shift_ms=5.0, rng=None)[source]

Bases: nemo.collections.asr.parts.perturb.Perturbation

Perturbs audio by shifting the audio in time by a random amount between min_shift_ms and max_shift_ms. The final length of the audio is kept unaltered by padding the audio with zeros.

Parameters
  • min_shift_ms (float) – Minimum time in milliseconds by which audio will be shifted

  • max_shift_ms (float) – Maximum time in milliseconds by which audio will be shifted

  • rng – Random number generator

perturb(data)[source]
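
A sketch of the shift-with-zero-padding idea (illustrative only):

    import numpy as np

    def shift_audio(samples: np.ndarray, shift_ms: float, sr: int) -> np.ndarray:
        n = min(int(abs(shift_ms) / 1000.0 * sr), len(samples))
        if n == 0:
            return samples
        pad = np.zeros(n, dtype=samples.dtype)
        if shift_ms > 0:                                # delay: pad front, drop tail
            return np.concatenate([pad, samples[:-n]])
        return np.concatenate([samples[n:], pad])       # advance: drop head, pad tail
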
class nemo.collections.asr.parts.perturb.NoisePerturbation(manifest_path=None, min_snr_db=10, max_snr_db=50, max_gain_db=300.0, rng=None, audio_tar_filepaths=None, shuffle_n=100, orig_sr=16000)[source]

Bases: nemo.collections.asr.parts.perturb.Perturbation

Perturbation that adds noise to input audio.

Parameters
  • manifest_path (str) – Manifest file with paths to noise files

  • min_snr_db (float) – Minimum SNR of audio after noise is added

  • max_snr_db (float) – Maximum SNR of audio after noise is added

  • max_gain_db (float) – Maximum gain that can be applied on the noise sample

  • audio_tar_filepaths (list) – Tar files, if noise audio files are tarred

  • shuffle_n (int) – Shuffle parameter for shuffling buffered files from the tar files

  • orig_sr (int) – Original sampling rate of the noise files

  • rng – Random number generator

get_one_noise_sample(target_sr)[source]
property orig_sr
perturb(data)[source]
perturb_with_foreground_noise(data, noise, data_rms=None, max_noise_dur=2, max_additions=1)[source]
perturb_with_input_noise(data, noise, data_rms=None)[source]
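
The essence of SNR-controlled mixing can be sketched as follows (a simplification; the class additionally handles resampling, gain limits, and foreground segments):

    import numpy as np

    def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
        # Scale the noise so the speech-to-noise RMS ratio matches snr_db.
        speech_rms = np.sqrt(np.mean(speech ** 2))
        noise_rms = np.sqrt(np.mean(noise ** 2)) + 1e-9
        scaled = noise * (speech_rms / noise_rms) / (10.0 ** (snr_db / 20.0))
        n = min(len(speech), len(scaled))
        return speech[:n] + scaled[:n]
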
class nemo.collections.asr.parts.perturb.WhiteNoisePerturbation(min_level=- 90, max_level=- 46, rng=None)[source]

Bases: nemo.collections.asr.parts.perturb.Perturbation

Perturbation that adds white noise to an audio file in the training dataset.

Parameters
  • min_level (int) – Minimum level in dB at which white noise should be added

  • max_level (int) – Maximum level in dB at which white noise should be added

  • rng – Random number generator

perturb(data)[source]
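
White noise at a random dB level reduces to a few lines of numpy (illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    level_db = rng.uniform(-90, -46)          # a level inside [min_level, max_level]
    noise = rng.standard_normal(16000) * 10.0 ** (level_db / 20.0)
    # noisy = clean + noise, with both arrays the same length
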
class nemo.collections.asr.parts.perturb.RirAndNoisePerturbation(rir_manifest_path=None, rir_prob=0.5, noise_manifest_paths=None, min_snr_db=0, max_snr_db=50, rir_tar_filepaths=None, rir_shuffle_n=100, noise_tar_filepaths=None, apply_noise_rir=False, orig_sample_rate=None, max_additions=5, max_duration=2.0, bg_noise_manifest_paths=None, bg_min_snr_db=10, bg_max_snr_db=50, bg_noise_tar_filepaths=None, bg_orig_sample_rate=None)[source]

Bases: nemo.collections.asr.parts.perturb.Perturbation

RIR augmentation with additive foreground and background noise. In this implementation audio data is augmented by first convolving the audio with a Room Impulse Response and then adding foreground noise and background noise at various SNRs. RIR, foreground and background noises should either be supplied with a manifest file or as tarred audio files (faster).

Different sets of noise audio files can be supplied based on the original sampling rate of the noise. This is useful while training a mixed sample rate model. For example, when training a mixed model with 8 kHz and 16 kHz audio with a target sampling rate of 16 kHz, one would want to augment 8 kHz data with 8 kHz noise rather than 16 kHz noise.

Parameters
  • rir_manifest_path – Manifest file for RIRs

  • rir_tar_filepaths – Tar files, if RIR audio files are tarred

  • rir_prob – Probability of applying a RIR

  • noise_manifest_paths – Foreground noise manifest path

  • min_snr_db – Min SNR for foreground noise

  • max_snr_db – Max SNR for foreground noise

  • noise_tar_filepaths – Tar files, if noise files are tarred

  • apply_noise_rir – Whether to convolve foreground noise with a random RIR

  • orig_sample_rate – Original sampling rate of foreground noise audio

  • max_additions – Max number of times foreground noise is added to an utterance

  • max_duration – Max duration of foreground noise

  • bg_noise_manifest_paths – Background noise manifest path

  • bg_min_snr_db – Min SNR for background noise

  • bg_max_snr_db – Max SNR for background noise

  • bg_noise_tar_filepaths – Tar files, if noise files are tarred

  • bg_orig_sample_rate – Original sampling rate of background noise audio

perturb(data)[source]
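
A hedged construction sketch; the manifest paths are placeholders, and tarred audio (the *_tar_filepaths arguments) is the faster option when available:

    from nemo.collections.asr.parts.perturb import RirAndNoisePerturbation

    perturber = RirAndNoisePerturbation(
        rir_manifest_path="rirs.json",              # placeholder manifest
        rir_prob=0.5,
        noise_manifest_paths=["fg_noise.json"],     # foreground noise
        min_snr_db=0, max_snr_db=50,
        bg_noise_manifest_paths=["bg_noise.json"],  # background noise
        bg_min_snr_db=10, bg_max_snr_db=50,
    )
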
class nemo.collections.asr.parts.perturb.TranscodePerturbation(rng=None)[source]

Bases: nemo.collections.asr.parts.perturb.Perturbation

Audio codec augmentation. This implementation uses sox to transcode audio with low rate audio codecs, so users need to make sure that the installed sox version supports the codecs used here (G711 and amr-nb).

Parameters

rng – Random number generator

perturb(data)[source]
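
Since the implementation shells out to sox, it may be worth verifying the local installation before training; a small check (the formats listed by --help vary by build):

    import shutil
    import subprocess

    assert shutil.which("sox"), "sox not found on PATH"
    # Inspect the supported file formats of the installed sox build.
    info = subprocess.run(["sox", "--help"], capture_output=True, text=True)
    print(info.stdout)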

Miscellaneous Classes

RNNT Decoding

class nemo.collections.asr.metrics.rnnt_wer.RNNTDecoding(decoding_cfg, decoder, joint, vocabulary)[source]

Bases: nemo.collections.asr.metrics.rnnt_wer.AbstractRNNTDecoding

Used for performing RNN-T auto-regressive decoding of the Decoder+Joint network given the encoder state.

Parameters
  • decoding_cfg –

    A dict-like object which contains the following key-value pairs (a configuration sketch follows this parameter list).

    strategy: str value which represents the type of decoding that can occur. Possible values are: greedy, greedy_batch (for greedy decoding); beam, tsd, alsd (for beam search decoding).

    compute_hypothesis_token_set: A bool flag which determines whether to compute a list of decoded tokens as well as the decoded string. Default is False in order to avoid double decoding unless required.

    The config may further contain the following sub-dictionaries:

    “greedy”:

      max_symbols: int, describing the maximum number of target tokens to decode per timestep during greedy decoding. Setting this to larger values allows longer sentences to be decoded, at the cost of increased execution time.

    “beam”:

      beam_size: int, defining the beam size for beam search. Must be >= 1. If beam_size == 1, a cached greedy search is performed, which might produce slightly different results compared to the greedy search above.

      score_norm: optional bool, whether to normalize the returned beam score in the hypotheses. Set to True by default.

      return_best_hypothesis: optional bool, whether to return just the best hypothesis or all of the hypotheses after beam search has concluded. Set to True by default.

      tsd_max_sym_exp: optional int, determines the number of symmetric expansions of the target symbols per timestep of the acoustic model. Larger values allow longer sentences to be decoded, at increased cost to execution time.

      alsd_max_target_len: optional int or float, determines the potential maximum target sequence length. If an integer is provided, sequences up to that maximum length can be decoded. If a float is provided, sequences up to int(alsd_max_target_len * seq_len) can be decoded, where seq_len is the length of the acoustic model output (T). NOTE: If a float is provided, it can be greater than 1! By default, a float of 2.0 is used so that a target sequence can be at most twice as long as the acoustic model output length T.

  • decoder – The Decoder/Prediction network module.

  • joint – The Joint network module.

  • vocabulary – The vocabulary (excluding the RNNT blank token) which will be used for decoding.
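
A minimal decoding_cfg sketch, assuming the decoder and joint modules and the model vocabulary are already built (those names are placeholders, not constructed here):

    from omegaconf import OmegaConf
    from nemo.collections.asr.metrics.rnnt_wer import RNNTDecoding

    decoding_cfg = OmegaConf.create({
        "strategy": "beam",                    # or: greedy, greedy_batch, tsd, alsd
        "compute_hypothesis_token_set": False,
        "greedy": {"max_symbols": 10},
        "beam": {
            "beam_size": 4,
            "score_norm": True,
            "return_best_hypothesis": True,
        },
    })
    # decoding = RNNTDecoding(decoding_cfg, decoder, joint, vocabulary)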

decode_ids_to_tokens(tokens: List[int])List[str][source]

Implemented by subclass in order to decode a token id list into a token list. A token list is the string representation of each token id.

Parameters

tokens – List of int representing the token ids.

Returns

A list of decoded tokens.

decode_tokens_to_str(tokens: List[int])str[source]

Implemented by subclass in order to decode a token list into a string.

Parameters

tokens – List of int representing the token ids.

Returns

A decoded string.

class nemo.collections.asr.metrics.rnnt_wer_bpe.RNNTBPEDecoding(decoding_cfg, decoder, joint, tokenizer: nemo.collections.common.tokenizers.tokenizer_spec.TokenizerSpec)[source]

Bases: nemo.collections.asr.metrics.rnnt_wer.AbstractRNNTDecoding

Used for performing RNN-T auto-regressive decoding of the Decoder+Joint network given the encoder state.

Parameters
  • decoding_cfg –

    A dict-like object which contains the following key-value pairs.

    strategy: str value which represents the type of decoding that can occur. Possible values are: greedy, greedy_batch (for greedy decoding); beam, tsd, alsd (for beam search decoding).

    compute_hypothesis_token_set: A bool flag which determines whether to compute a list of decoded tokens as well as the decoded string. Default is False in order to avoid double decoding unless required.

    The config may further contain the following sub-dictionaries:

    “greedy”:

      max_symbols: int, describing the maximum number of target tokens to decode per timestep during greedy decoding. Setting this to larger values allows longer sentences to be decoded, at the cost of increased execution time.

    “beam”:

      beam_size: int, defining the beam size for beam search. Must be >= 1. If beam_size == 1, a cached greedy search is performed, which might produce slightly different results compared to the greedy search above.

      score_norm: optional bool, whether to normalize the returned beam score in the hypotheses. Set to True by default.

      return_best_hypothesis: optional bool, whether to return just the best hypothesis or all of the hypotheses after beam search has concluded. Set to True by default.

      tsd_max_sym_exp: optional int, determines the number of symmetric expansions of the target symbols per timestep of the acoustic model. Larger values allow longer sentences to be decoded, at increased cost to execution time.

      alsd_max_target_len: optional int or float, determines the potential maximum target sequence length. If an integer is provided, sequences up to that maximum length can be decoded. If a float is provided, sequences up to int(alsd_max_target_len * seq_len) can be decoded, where seq_len is the length of the acoustic model output (T). NOTE: If a float is provided, it can be greater than 1! By default, a float of 2.0 is used so that a target sequence can be at most twice as long as the acoustic model output length T.

  • decoder – The Decoder/Prediction network module.

  • joint – The Joint network module.

  • tokenizer – The tokenizer which will be used for decoding.

decode_ids_to_tokens(tokens: List[int])List[str][source]

Implemented by subclass in order to decode a token id list into a token list. A token list is the string representation of each token id.

Parameters

tokens – List of int representing the token ids.

Returns

A list of decoded tokens.

decode_tokens_to_str(tokens: List[int])str[source]

Implemented by subclass in order to decode a token list into a string.

Parameters

tokens – List of int representing the token ids.

Returns

A decoded string.

class nemo.collections.asr.parts.rnnt_greedy_decoding.GreedyRNNTInfer(decoder_model: nemo.collections.asr.modules.rnnt_abstract.AbstractRNNTDecoder, joint_model: nemo.collections.asr.modules.rnnt_abstract.AbstractRNNTJoint, blank_index: int, max_symbols_per_step: Optional[int] = None)[source]

Bases: nemo.collections.asr.parts.rnnt_greedy_decoding._GreedyRNNTInfer

A greedy transducer decoder.

Sequence-level greedy decoding, performed auto-regressively.

Parameters
  • decoder_model – rnnt_utils.AbstractRNNTDecoder implementation.

  • joint_model – rnnt_utils.AbstractRNNTJoint implementation.

  • blank_index – int index of the blank token. Can be 0 or len(vocabulary).

  • max_symbols_per_step – Optional int. The maximum number of symbols that can be added to a sequence in a single time step; if set to None then there is no limit.

forward(encoder_output: torch.Tensor, encoded_lengths: torch.Tensor)[source]

Returns a list of hypotheses given an input batch of the encoder hidden embedding. Output tokens are generated auto-regressively.

Parameters
  • encoder_output – A tensor of size (batch, features, timesteps).

  • encoded_lengths – list of int representing the length of each output sequence.

Returns

A packed list containing a batch of sentences (Hypothesis objects), one per utterance.
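
A usage sketch; decoder, joint, and vocabulary are assumed to be pre-built RNNT components (they are not constructed here), and the tensor shapes are hypothetical:

    import torch
    from nemo.collections.asr.parts.rnnt_greedy_decoding import GreedyRNNTInfer

    # decoder: AbstractRNNTDecoder, joint: AbstractRNNTJoint (assumed available)
    infer = GreedyRNNTInfer(
        decoder_model=decoder,
        joint_model=joint,
        blank_index=len(vocabulary),   # blank may also be index 0
        max_symbols_per_step=10,
    )
    enc_out = torch.randn(2, 640, 100)     # (batch, features, timesteps)
    enc_len = torch.tensor([100, 80])
    hypotheses = infer(enc_out, enc_len)   # one hypothesis per utterance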

class nemo.collections.asr.parts.rnnt_greedy_decoding.GreedyBatchedRNNTInfer(decoder_model: nemo.collections.asr.modules.rnnt_abstract.AbstractRNNTDecoder, joint_model: nemo.collections.asr.modules.rnnt_abstract.AbstractRNNTJoint, blank_index: int, max_symbols_per_step: Optional[int] = None)[source]

Bases: nemo.collections.asr.parts.rnnt_greedy_decoding._GreedyRNNTInfer

A batch level greedy transducer decoder.

Batch-level greedy decoding, performed auto-regressively.

Parameters
  • decoder_model – rnnt_utils.AbstractRNNTDecoder implementation.

  • joint_model – rnnt_utils.AbstractRNNTJoint implementation.

  • blank_index – int index of the blank token. Can be 0 or len(vocabulary).

  • max_symbols_per_step – Optional int. The maximum number of symbols that can be added to a sequence in a single time step; if set to None then there is no limit.

forward(encoder_output: torch.Tensor, encoded_lengths: torch.Tensor)[source]

Returns a list of hypotheses given an input batch of the encoder hidden embedding. Output tokens are generated auto-regressively.

Parameters
  • encoder_output – A tensor of size (batch, features, timesteps).

  • encoded_lengths – list of int representing the length of each output sequence.

Returns

A packed list containing a batch of sentences (Hypothesis objects), one per utterance.

class nemo.collections.asr.parts.rnnt_beam_decoding.BeamRNNTInfer(decoder_model: nemo.collections.asr.modules.rnnt_abstract.AbstractRNNTDecoder, joint_model: nemo.collections.asr.modules.rnnt_abstract.AbstractRNNTJoint, beam_size: int, search_type: str = 'default', score_norm: bool = True, return_best_hypothesis: bool = True, tsd_max_sym_exp_per_step: Optional[int] = 50, alsd_max_target_len: Union[int, float] = 1.0, nsc_max_timesteps_expansion: int = 1, nsc_prefix_alpha: int = 1)[source]

Bases: nemo.core.classes.common.Typing

align_length_sync_decoding(h: torch.Tensor, encoded_lengths: torch.Tensor)List[nemo.collections.asr.parts.rnnt_utils.Hypothesis][source]

Alignment-length synchronous beam search implementation. Based on https://ieeexplore.ieee.org/document/9053040

Parameters

h – Encoded speech features (1, T_max, D_enc)

Returns

N-best decoding results

Return type

nbest_hyps

default_beam_search(x: torch.Tensor, encoded_lengths: torch.Tensor)[source]

Beam search implementation.

Parameters

x – Encoded speech features (1, T_max, D_enc)

Returns

N-best decoding results

Return type

nbest_hyps

greedy_search(h: torch.Tensor, encoded_lengths: torch.Tensor)[source]

Greedy search implementation for transducer. Generic case when beam size = 1. Results might differ slightly due to implementation details as compared to GreedyRNNTInfer and GreedyBatchedRNNTInfer.

Parameters

h – Encoded speech features (1, T_max, D_enc)

Returns

1-best decoding results

Return type

hyp

property input_types

Returns definitions of module input ports.

property output_types

Returns definitions of module output ports.

recombine_hypotheses(hypotheses: List[nemo.collections.asr.parts.rnnt_utils.Hypothesis])List[nemo.collections.asr.parts.rnnt_utils.Hypothesis][source]

Recombine hypotheses with equivalent output sequence.

Parameters

hypotheses (list) – list of hypotheses

Returns

list of recombined hypotheses

Return type

final (list)

sort_nbest(hyps: List[nemo.collections.asr.parts.rnnt_utils.Hypothesis])List[nemo.collections.asr.parts.rnnt_utils.Hypothesis][source]

Sort hypotheses by score or score given sequence length.

Parameters

hyps – list of hypotheses

Returns

sorted list of hypotheses

Return type

hyps

time_sync_decoding(h: torch.Tensor, encoded_lengths: torch.Tensor)List[nemo.collections.asr.parts.rnnt_utils.Hypothesis][source]

Time synchronous beam search implementation. Based on https://ieeexplore.ieee.org/document/9053040

Parameters

h – Encoded speech features (1, T_max, D_enc)

Returns

N-best decoding results

Return type

nbest_hyps
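
A hedged construction sketch for BeamRNNTInfer; as above, decoder and joint are assumed to be pre-built RNNT submodules and are not constructed here:

    from nemo.collections.asr.parts.rnnt_beam_decoding import BeamRNNTInfer

    beam_infer = BeamRNNTInfer(
        decoder_model=decoder,
        joint_model=joint,
        beam_size=4,
        search_type="tsd",            # selects time-synchronous decoding (see above)
        score_norm=True,
        return_best_hypothesis=True,
        tsd_max_sym_exp_per_step=50,
    )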

Hypotheses

class nemo.collections.asr.parts.rnnt_utils.Hypothesis(score: float, y_sequence: Union[List[int], torch.Tensor], dec_state: Optional[Union[List[List[torch.Tensor]], List[torch.Tensor]]] = None, y: Optional[List[torch.tensor]] = None, lm_state: Optional[Union[Dict[str, Any], List[Any]]] = None, lm_scores: Optional[torch.Tensor] = None, tokens: Optional[Union[List[int], torch.Tensor]] = None, text: Optional[str] = None, timestep: Union[List[int], torch.Tensor] = <factory>, length: int = 0)[source]

Bases: object

Hypothesis class for beam search algorithms.

Parameters
  • score – A float score obtained from an AbstractRNNTDecoder module’s score_hypothesis method.

  • y_sequence – Either a sequence of integer ids pointing to some vocabulary, or a packed torch.Tensor behaving in the same manner. dtype must be torch.Long in the latter case.

  • dec_state – A list (or list of list) of LSTM-RNN decoder states. Can be None.

  • y – (Unused) A list of torch.Tensors representing the list of hypotheses.

  • lm_state – (Unused) A dictionary state cache used by an external Language Model.

  • lm_scores – (Unused) Score of the external Language Model.

  • tokens – (Optional) List of decoded tokens.

  • text – Decoded transcript of the acoustic input.

  • timestep – (Optional) List of int timesteps where tokens were predicted.

  • length – (Optional) int which represents the length of the decoded tokens / text.
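
A minimal sketch of how a search algorithm might initialize and grow a Hypothesis:

    from nemo.collections.asr.parts.rnnt_utils import Hypothesis

    # An empty hypothesis: zero score, no emitted tokens yet.
    hyp = Hypothesis(score=0.0, y_sequence=[], dec_state=None, timestep=[], length=0)
    hyp.y_sequence.append(42)   # emit a (hypothetical) token id
    hyp.score += -1.5           # accumulate its log-probability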

class nemo.collections.asr.parts.rnnt_utils.NBestHypotheses(n_best_hypotheses: Optional[List[nemo.collections.asr.parts.rnnt_utils.Hypothesis]])[source]

Bases: object

List of N-best hypotheses.

Parameters

n_best_hypotheses – An optional list of Hypothesis objects.