NeMo ASR collection API#

Model Classes#

Modules#

Parts#

class nemo.collections.asr.parts.submodules.jasper.JasperBlock(*args: Any, **kwargs: Any)[source]#

Bases: Module, AdapterModuleMixin, AccessMixin

Constructs a single “Jasper” block. With modified parameters, also constructs other blocks for models such as QuartzNet and Citrinet.

  • For Jasper : separable flag should be False

  • For QuartzNet : separable flag should be True

  • For Citrinet : separable flag and se flag should be True

Note that the above are general distinctions; each model has intricate differences that extend over multiple such blocks.

For further information about the differences between models which use JasperBlock, please review the configs for ASR models found in the ASR examples directory.

Parameters
  • inplanes – Number of input channels.

  • planes – Number of output channels.

  • repeat – Number of repeated sub-blocks (R) for this block.

  • kernel_size – Convolution kernel size across all repeated sub-blocks.

  • kernel_size_factor – Floating point scale value that is multiplied with kernel size, then rounded down to nearest odd integer to compose the kernel size. Defaults to 1.0.

  • stride – Stride of the convolutional layers.

  • dilation – Integer which defines the dilation factor of the kernel. Note that when dilation > 1, stride must be equal to 1.

  • padding – String representing type of padding. Currently only supports “same” padding, which symmetrically pads the input tensor with zeros.

  • dropout – Floating point value that determines the percentage of the output that is zeroed out.

  • activation – String representing the activation function. Valid activation functions are: {“hardtanh”: nn.Hardtanh, “relu”: nn.ReLU, “selu”: nn.SELU, “swish”: Swish}. Defaults to “relu”.

  • residual – Bool that determines whether a residual branch should be added. All residual branches are constructed using a pointwise convolution kernel that may or may not perform strided convolution, depending on the parameter residual_mode.

  • groups – Number of groups for Grouped Convolutions. Defaults to 1.

  • separable – Bool flag that describes whether Time-Channel depthwise separable convolution should be constructed, or ordinary convolution should be constructed.

  • heads – Number of “heads” for the masked convolution. Defaults to -1, which disables it.

  • normalization – String that represents type of normalization performed. Can be one of “batch”, “group”, “instance” or “layer” to compute BatchNorm1D, GroupNorm1D, InstanceNorm or LayerNorm (which are special cases of GroupNorm1D).

  • norm_groups – Number of groups used for GroupNorm (if normalization == “group”).

  • residual_mode – String argument which describes whether the residual branch should be simply added (“add”) or should first stride, then add (“stride_add”). Required when performing stride on parallel branch as well as utilizing residual add.

  • residual_panes – Number of residual panes, used for Jasper-DR models. Please refer to the paper.

  • conv_mask – Bool flag which determines whether to utilize masked convolutions or not. In general, it should be set to True.

  • se – Bool flag that determines whether Squeeze-and-Excitation layer should be used.

  • se_reduction_ratio – Integer value, which determines to what extent the hidden dimension of the SE intermediate step should be reduced. Larger values reduce the number of parameters, but also limit the effectiveness of SE layers.

  • se_context_window – Integer value determining the number of timesteps that should be utilized in order to compute the averaged context window. Defaults to -1, which means it uses global context - such that all timesteps are averaged. If any positive integer is used, it will utilize limited context window of that size.

  • se_interpolation_mode – String used for interpolation mode of timestep dimension for SE blocks. Used only if context window is > 1. The modes available for resizing are: nearest, linear (3D-only), bilinear, area.

  • stride_last – Bool flag that determines whether all repeated blocks should stride at once, (stride of S^R when this flag is False) or just the last repeated block should stride (stride of S when this flag is True).

  • future_context

    Int value that determines how many “right” / “future” context frames will be utilized when calculating the output of the conv kernel. All calculations are done for odd kernel sizes only.

    By default, this is -1, which is recomputed as the symmetric padding case.

    When future_context >= 0, will compute the asymmetric padding as follows : (left context, right context) = [K - 1 - future_context, future_context]

    Determining an exact formula to limit future context is dependent on global layout of the model. As such, we provide both “local” and “global” guidelines below.

    Local context limit (should always be enforced):

      • future context should be <= half the kernel size for any given layer

      • future context > kernel size defaults to the symmetric kernel

      • future context of a layer = number of future frames * width of each frame (dependent on stride)

    Global context limit (should be carefully considered):

      • future context should be laid out in an ever-reducing pattern. Initial layers should restrict future context less than later layers, since shallow depth (and reduced stride) means each frame uses less future context.

      • beyond a certain point, future context should remain static for a given stride level. This is the upper bound of the amount of future context that can be provided to the model on a global scale.

      • future context is calculated (roughly) as (2 ^ stride) * (K // 2) future frames. This resultant value should be bounded by some global maximum amount of future audio (in ms).

    Note: In the special case where K < future_context, it is assumed that the kernel is too small to limit its future context, so symmetric padding is used instead.

    Note: There is no explicit limitation on the amount of future context used, as long as K > future_context constraint is maintained. This might lead to cases where future_context is more than half the actual kernel size K! In such cases, the conv layer is utilizing more of the future context than its current and past context to compute the output. While this is possible to do, it is not recommended and the layer will raise a warning to notify the user of such cases. It is advised to simply use symmetric padding for such cases.

    Example: Say we have a model that performs 8x stride and receives spectrogram frames with stride of 0.01s. Say we wish to upper bound future context to 80 ms.

    Layer ID | Kernel Size | Stride | Future Context | Global Context
    0        | K=5         | S=1    | FC=8           | GC = 2 * (2^0) = 2 * 0.01 s (special case: K < FC, so symmetric pad is used)
    1        | K=7         | S=1    | FC=3           | GC = 3 * (2^0) = 3 * 0.01 s (note that symmetric pad here uses 3 FC frames!)
    2        | K=11        | S=2    | FC=4           | GC = 4 * (2^1) = 8 * 0.01 s (note that symmetric pad here uses 5 FC frames!)
    3        | K=15        | S=1    | FC=4           | GC = 4 * (2^1) = 8 * 0.01 s (note that symmetric pad here uses 7 FC frames!)
    4        | K=21        | S=2    | FC=2           | GC = 2 * (2^2) = 8 * 0.01 s (note that symmetric pad here uses 10 FC frames!)
    5        | K=25        | S=2    | FC=1           | GC = 1 * (2^3) = 8 * 0.01 s (note that symmetric pad here uses 14 FC frames!)
    6        | K=29        | S=1    | FC=1           | GC = 1 * (2^3) = 8 * 0.01 s
    …

  • quantize – Bool flag that determines whether to quantize the convolutional blocks.

  • layer_idx (int, optional) – can be specified to allow layer output capture for InterCTC loss. Defaults to -1.

forward(input_: Tuple[List[torch.Tensor], Optional[torch.Tensor]]) Tuple[List[torch.Tensor], Optional[torch.Tensor]][source]#

Forward pass of the module.

Parameters

input_ – A tuple of two values: the preprocessed audio signal and the lengths of the audio signal. The audio signal is a tensor padded to shape [B, D, T], and the lengths are a torch vector of length B.

Returns

The output of the block after processing the input through repeat number of sub-blocks, as well as the lengths of the encoded audio after padding/striding.
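
A minimal usage sketch (parameter values are illustrative assumptions; the list-valued kernel_size / stride / dilation follow the config convention used by the ASR examples):

import torch
from nemo.collections.asr.parts.submodules.jasper import JasperBlock

# One QuartzNet-style block: separable convolutions with a residual branch.
block = JasperBlock(
    inplanes=64,        # input channels (D)
    planes=128,         # output channels
    repeat=5,           # R repeated sub-blocks
    kernel_size=[11],
    stride=[1],
    dilation=[1],
    separable=True,     # time-channel depthwise separable convolution
    residual=True,
    conv_mask=True,     # masked convolutions respect per-example lengths
)

features = torch.randn(4, 64, 200)                 # [B, D, T]
lengths = torch.full((4,), 200, dtype=torch.long)  # valid frames per example

# forward() takes a tuple: a list of input tensors (a list, to support
# dense-residual models) and the lengths vector.
outputs, out_lengths = block(([features], lengths))
print(outputs[-1].shape, out_lengths)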

Mixins#

class nemo.collections.asr.parts.mixins.interctc_mixin.InterCTCMixin[source]#

Bases: object

Adds utilities for computing interCTC loss from https://arxiv.org/abs/2102.03216.

To use, make sure the encoder accesses the interctc['capture_layers'] property in the AccessMixin and registers interctc/layer_output_X and interctc/layer_length_X for all layers from which we want to compute the loss. Additionally, specify the following config parameters to set up the loss:

interctc:
    # can use different values
    loss_weights: [0.3]
    apply_at_layers: [8]

Then call

  • self.setup_interctc(ctc_decoder_name, ctc_loss_name, ctc_wer_name) in the init method

  • self.add_interctc_losses after computing the regular loss (see the sketch after this list).

  • self.finalize_interctc_metrics(metrics, outputs, prefix="val_") in the multi_validation_epoch_end method.

  • self.finalize_interctc_metrics(metrics, outputs, prefix="test_") in the multi_test_epoch_end method.
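
As a rough sketch (the forward/loss attribute names and the surrounding LightningModule are assumptions standing in for a real CTC model), a training step might integrate these calls as follows:

# Hedged sketch of wiring interCTC loss into a CTC model's training step.
def training_step(self, batch, batch_idx):
    signal, signal_len, transcript, transcript_len = batch

    # Regular forward pass + CTC loss; attribute names are illustrative.
    log_probs, encoded_len, _ = self.forward(
        input_signal=signal, input_signal_length=signal_len
    )
    loss_value = self.loss(
        log_probs=log_probs,
        targets=transcript,
        input_lengths=encoded_len,
        target_lengths=transcript_len,
    )

    # Adds the weighted interCTC losses configured via `apply_at_layers` /
    # `loss_weights` and returns a dict of metrics ready for logging.
    loss_value, metrics = self.add_interctc_losses(
        loss_value, transcript, transcript_len, compute_wer=False
    )
    self.log_dict(metrics)
    return loss_value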

add_interctc_losses(loss_value: torch.Tensor, transcript: torch.Tensor, transcript_len: torch.Tensor, compute_wer: bool, compute_loss: bool = True, log_wer_num_denom: bool = False, log_prefix: str = '') Tuple[Optional[torch.Tensor], Dict][source]#

Adds interCTC losses if required.

Will also register loss/wer metrics in the returned dictionary.

Parameters
  • loss_value (torch.Tensor) – regular loss tensor (will add interCTC loss to it).

  • transcript (torch.Tensor) – current utterance transcript.

  • transcript_len (torch.Tensor) – current utterance transcript length.

  • compute_wer (bool) – whether to compute WER for the current utterance. Should typically be True for validation/test and only True for training if current batch WER should be logged.

  • compute_loss (bool) – whether to compute loss for the current utterance. Should always be True in training and almost always True in validation, unless all other losses are disabled as well. Defaults to True.

  • log_wer_num_denom (bool) – if True, will additionally log WER num/denom in the returned metrics dictionary. Should always be True for validation/test to allow correct metrics aggregation. Should always be False for training. Defaults to False.

  • log_prefix (str) – prefix added to all log values. Should be "" for training and "val_" for validation. Defaults to “”.

Returns

tuple of new loss tensor and dictionary with logged metrics.

Return type

tuple[Optional[torch.Tensor], Dict]

finalize_interctc_metrics(metrics: Dict, outputs: List[Dict], prefix: str)[source]#

Finalizes InterCTC WER and loss metrics for logging purposes.

Should be called inside multi_validation_epoch_end (with prefix="val_") or multi_test_epoch_end (with prefix="test_").

Note that metrics dictionary is going to be updated in-place.

get_captured_interctc_tensors() List[Tuple[torch.Tensor, torch.Tensor]][source]#

Returns a list of captured tensors from encoder: tuples of (output, length).

Will additionally apply ctc_decoder to the outputs.

get_interctc_param(param_name)[source]#

Either directly gets the parameter from self._interctc_params or calls getattr with the corresponding name.

is_interctc_enabled() bool[source]#

Returns whether interCTC loss is enabled.

set_interctc_enabled(enabled: bool)[source]#

Can be used to enable/disable InterCTC manually.

set_interctc_param(param_name, param_value)[source]#

Sets the parameter in the self._interctc_params dictionary.

Raises an error if trying to set decoder, loss or wer as those should always come from the main class.

setup_interctc(decoder_name, loss_name, wer_name)[source]#

Sets up all interctc-specific parameters and checks config consistency.

Caller has to specify names of attributes to perform CTC-specific WER, decoder and loss computation. They will be looked up in the class state with getattr.

The reason we store the names and look the objects up later is that those objects might change without the setup of this class being re-called. So we always want to look up the most up-to-date object instead of “caching” it here.

Datasets#

Character Encoding Datasets#

class nemo.collections.asr.data.audio_to_text.AudioToCharDataset(*args: Any, **kwargs: Any)[source]#

Bases: _AudioTextDataset

Dataset that loads tensors via a json file containing paths to audio files, transcripts, and durations (in seconds). Each new line is a different sample. Example below:

{"audio_filepath": "/path/to/audio.wav", "text_filepath": "/path/to/audio.txt", "duration": 23.147}
…
{"audio_filepath": "/path/to/audio.wav", "text": "the transcription", "offset": 301.75, "duration": 0.82, "utt": "utterance_id", "ctm_utt": "en_4156", "side": "A"}

Parameters
  • manifest_filepath – Path to manifest json as described above. Can be comma-separated paths.

  • labels – String containing all the possible characters to map to

  • sample_rate (int) – Sample rate to resample loaded audio to

  • int_values (bool) – If true, load samples as 32-bit integers. Defaults to False.

  • augmentor (nemo.collections.asr.parts.perturb.AudioAugmentor) – An AudioAugmentor object used to augment loaded audio

  • max_duration – If audio exceeds this length, do not include in dataset

  • min_duration – If audio is less than this length, do not include in dataset

  • max_utts – Limit number of utterances

  • blank_index – blank character index, default = -1

  • unk_index – unk_character index, default = -1

  • normalize – Whether to normalize transcript text. Defaults to True.

  • bos_id – Id of beginning of sequence symbol to append if not None

  • eos_id – Id of end of sequence symbol to append if not None

  • return_sample_id (bool) – whether to return the sample_id as a part of each sample

  • channel_selector (int | Iterable[int] | str) – select a single channel or a subset of channels from multi-channel audio. If set to ‘average’, it performs averaging across channels. Disabled if set to None. Defaults to None. Uses zero-based indexing.

property output_types: Optional[Dict[str, NeuralType]]#

Returns definitions of module output ports.
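
A minimal sketch, assuming a manifest.json in the format shown above (the path and label set are illustrative):

from nemo.collections.asr.data.audio_to_text import AudioToCharDataset

dataset = AudioToCharDataset(
    manifest_filepath="manifest.json",            # hypothetical manifest path
    labels=list("abcdefghijklmnopqrstuvwxyz' "),  # character vocabulary
    sample_rate=16000,
    max_duration=20.0,                            # drop clips longer than 20 s
)

# Each item is (audio, audio_len, token_ids, token_len).
audio, audio_len, tokens, tokens_len = dataset[0]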

class nemo.collections.asr.data.audio_to_text.TarredAudioToCharDataset(*args: Any, **kwargs: Any)[source]#

Bases: _TarredAudioToTextDataset

A similar Dataset to the AudioToCharDataset, but which loads tarred audio files.

Accepts a single comma-separated JSON manifest file (in the same style as for the AudioToCharDataset), as well as the path(s) to the tarball(s) containing the wav files. Each line of the manifest should contain the information for one audio file, including at least the transcript and name of the audio file within the tarball.

Valid formats for the audio_tar_filepaths argument include: (1) a single string that can be brace-expanded, e.g. ‘path/to/audio.tar’ or ‘path/to/audio_{1..100}.tar.gz’, or (2) a list of file paths that will not be brace-expanded, e.g. [‘audio_1.tar’, ‘audio_2.tar’, …].

See the WebDataset documentation for more information about accepted data and input formats.

If using multiple workers, the number of shards should be divisible by world_size to ensure an even split among workers. If it is not divisible, logging will give a warning but training will proceed. In addition, if using multiprocessing, each shard MUST HAVE THE SAME NUMBER OF ENTRIES after filtering is applied. We currently do not check for this, but your program may hang if the shards are uneven!

Notice that a few arguments are different from the AudioToCharDataset; for example, shuffle (bool) has been replaced by shuffle_n (int).

Additionally, please note that the len() of this DataLayer is assumed to be the length of the manifest after filtering. An incorrect manifest length may lead to some DataLoader issues down the line.

Parameters
  • audio_tar_filepaths – Either a list of audio tarball filepaths, or a string (can be brace-expandable).

  • manifest_filepath (str) – Path to the manifest.

  • labels (list) – List of characters that can be output by the ASR model. For Jasper, this is the 28-character set {a-z ‘}. The CTC blank symbol is automatically added later for models using CTC.

  • sample_rate (int) – Sample rate to resample loaded audio to

  • int_values (bool) – If true, load samples as 32-bit integers. Defaults to False.

  • augmentor (nemo.collections.asr.parts.perturb.AudioAugmentor) – An AudioAugmentor object used to augment loaded audio

  • shuffle_n (int) – How many samples to look ahead and load to be shuffled. See WebDataset documentation for more details. Defaults to 0.

  • min_duration (float) – Dataset parameter. All training files which have a duration less than min_duration are dropped. Note: Duration is read from the manifest JSON. Defaults to 0.1.

  • max_duration (float) – Dataset parameter. All training files which have a duration more than max_duration are dropped. Note: Duration is read from the manifest JSON. Defaults to None.

  • blank_index (int) – Blank character index, defaults to -1.

  • unk_index (int) – Unknown character index, defaults to -1.

  • normalize (bool) – Dataset parameter. Whether to use automatic text cleaning. It is highly recommended to manually clean text for best results. Defaults to True.

  • trim (bool) – Whether to trim silence from the beginning and end of the audio signal using librosa.effects.trim(). Defaults to False.

  • bos_id (id) – Dataset parameter. Beginning of string symbol id used for seq2seq models. Defaults to None.

  • eos_id (id) – Dataset parameter. End of string symbol id used for seq2seq models. Defaults to None.

  • pad_id (id) – Token used to pad when collating samples in batches. If this is None, pads using 0s. Defaults to None.

  • shard_strategy (str) –

    Tarred dataset shard distribution strategy chosen as a str value during ddp.

    • scatter: The default shard strategy applied by WebDataset, where each node gets a unique set of shards, which are permanently pre-allocated and never changed at runtime.

    • replicate: Optional shard strategy, where each node gets all of the set of shards available in the tarred dataset, which are permanently pre-allocated and never changed at runtime. The benefit of replication is that it allows each node to sample data points from the entire dataset independently of other nodes, and reduces dependence on value of shuffle_n.

      Warning

      The replicated strategy allows every node to sample the entire set of available tarfiles, and therefore more than one node may sample the same tarfile, and even sample the same data points! As such, there is no guarantee that all samples in the dataset will be sampled at least once during 1 epoch. The scattered strategy, on the other hand, on specific occasions (when the number of shards is not divisible by world_size), will not sample the entire dataset. For these reasons it is not advisable to use tarred datasets as validation or test datasets.

  • global_rank (int) – Worker rank, used for partitioning shards. Defaults to 0.

  • world_size (int) – Total number of processes, used for partitioning shards. Defaults to 0.

  • return_sample_id (bool) – whether to return the sample_id as a part of each sample
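
A hedged sketch with hypothetical shard and manifest paths; under DDP, the real global_rank and world_size must be passed so shards are partitioned correctly:

from nemo.collections.asr.data.audio_to_text import TarredAudioToCharDataset

dataset = TarredAudioToCharDataset(
    audio_tar_filepaths="audio_{0..3}.tar",   # brace-expandable shard pattern
    manifest_filepath="tarred_manifest.json",
    labels=list("abcdefghijklmnopqrstuvwxyz' "),
    sample_rate=16000,
    shuffle_n=128,               # shuffle buffer over samples within shards
    shard_strategy="scatter",    # unique shards per worker (see above)
    global_rank=0,
    world_size=1,
)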

Text-to-Text Datasets for Hybrid ASR-TTS models#

class nemo.collections.asr.data.text_to_text.TextToTextDataset(*args: Any, **kwargs: Any)[source]#

Bases: TextToTextDatasetBase, Dataset

Text-to-Text Map-style Dataset for hybrid ASR-TTS models

collate_fn(batch: List[Union[TextToTextItem, tuple]]) Union[TextToTextBatch, TextOrAudioToTextBatch, tuple][source]#

Collate function for the dataloader. Can accept a mixed batch of text-to-text items and audio-text items (typical for ASR).

class nemo.collections.asr.data.text_to_text.TextToTextIterableDataset(*args: Any, **kwargs: Any)[source]#

Bases: TextToTextDatasetBase, IterableDataset

Text-to-Text Iterable Dataset for hybrid ASR-TTS models. Only the part necessary for the current process is loaded and stored.

collate_fn(batch: List[Union[TextToTextItem, tuple]]) Union[TextToTextBatch, TextOrAudioToTextBatch, tuple][source]#

Collate function for the dataloader. Can accept a mixed batch of text-to-text items and audio-text items (typical for ASR).

Subword Encoding Datasets#

class nemo.collections.asr.data.audio_to_text.AudioToBPEDataset(*args: Any, **kwargs: Any)[source]#

Bases: _AudioTextDataset

Dataset that loads tensors via a json file containing paths to audio files, transcripts, and durations (in seconds). Each new line is a different sample. Example below:

{"audio_filepath": "/path/to/audio.wav", "text_filepath": "/path/to/audio.txt", "duration": 23.147}
…
{"audio_filepath": "/path/to/audio.wav", "text": "the transcription", "offset": 301.75, "duration": 0.82, "utt": "utterance_id", "ctm_utt": "en_4156", "side": "A"}

In practice, the dataset and manifest used for character encoding and byte pair encoding are exactly the same. The only difference lies in how the dataset tokenizes the text in the manifest.

Parameters
  • manifest_filepath – Path to manifest json as described above. Can be comma-separated paths.

  • tokenizer – A subclass of the Tokenizer wrapper found in the common collection, nemo.collections.common.tokenizers.TokenizerSpec. ASR Models support a subset of all available tokenizers.

  • sample_rate (int) – Sample rate to resample loaded audio to

  • int_values (bool) – If true, load samples as 32-bit integers. Defaults to False.

  • augmentor (nemo.collections.asr.parts.perturb.AudioAugmentor) – An AudioAugmentor object used to augment loaded audio

  • max_duration – If audio exceeds this length, do not include in dataset

  • min_duration – If audio is less than this length, do not include in dataset

  • max_utts – Limit number of utterances

  • trim – Whether to trim silence segments

  • use_start_end_token – Boolean which dictates whether to add [BOS] and [EOS] tokens to the beginning and end of the token sequence, respectively.

  • return_sample_id (bool) – whether to return the sample_id as a part of each sample

  • channel_selector (int | Iterable[int] | str) – select a single channel or a subset of channels from multi-channel audio. If set to ‘average’, it performs averaging across channels. Disabled if set to None. Defaults to None. Uses zero-based indexing.

property output_types: Optional[Dict[str, NeuralType]]#

Returns definitions of module output ports.
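
A minimal sketch assuming a trained SentencePiece model file (the paths are hypothetical):

from nemo.collections.common.tokenizers import SentencePieceTokenizer
from nemo.collections.asr.data.audio_to_text import AudioToBPEDataset

tokenizer = SentencePieceTokenizer(model_path="tokenizer.model")  # hypothetical
dataset = AudioToBPEDataset(
    manifest_filepath="manifest.json",
    tokenizer=tokenizer,
    sample_rate=16000,
    use_start_end_token=False,   # no [BOS]/[EOS] for pure CTC training
)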

class nemo.collections.asr.data.audio_to_text.TarredAudioToBPEDataset(*args: Any, **kwargs: Any)[source]#

Bases: _TarredAudioToTextDataset

A similar Dataset to the AudioToBPEDataset, but which loads tarred audio files.

Accepts a single comma-separated JSON manifest file (in the same style as for the AudioToBPEDataset), as well as the path(s) to the tarball(s) containing the wav files. Each line of the manifest should contain the information for one audio file, including at least the transcript and name of the audio file within the tarball.

Valid formats for the audio_tar_filepaths argument include: (1) a single string that can be brace-expanded, e.g. ‘path/to/audio.tar’ or ‘path/to/audio_{1..100}.tar.gz’, or (2) a list of file paths that will not be brace-expanded, e.g. [‘audio_1.tar’, ‘audio_2.tar’, …].

See the WebDataset documentation for more information about accepted data and input formats.

If using multiple workers, the number of shards should be divisible by world_size to ensure an even split among workers. If it is not divisible, logging will give a warning but training will proceed. In addition, if using multiprocessing, each shard MUST HAVE THE SAME NUMBER OF ENTRIES after filtering is applied. We currently do not check for this, but your program may hang if the shards are uneven!

Notice that a few arguments are different from the AudioToBPEDataset; for example, shuffle (bool) has been replaced by shuffle_n (int).

Additionally, please note that the len() of this DataLayer is assumed to be the length of the manifest after filtering. An incorrect manifest length may lead to some DataLoader issues down the line.

Parameters
  • audio_tar_filepaths – Either a list of audio tarball filepaths, or a string (can be brace-expandable).

  • manifest_filepath (str) – Path to the manifest.

  • tokenizer (TokenizerSpec) – Either a Word Piece Encoding tokenizer (BERT), or a Sentence Piece Encoding tokenizer (BPE). The CTC blank symbol is automatically added later for models using CTC.

  • sample_rate (int) – Sample rate to resample loaded audio to

  • int_values (bool) – If true, load samples as 32-bit integers. Defaults to False.

  • augmentor (nemo.collections.asr.parts.perturb.AudioAugmentor) – An AudioAugmentor object used to augment loaded audio

  • shuffle_n (int) – How many samples to look ahead and load to be shuffled. See WebDataset documentation for more details. Defaults to 0.

  • min_duration (float) – Dataset parameter. All training files which have a duration less than min_duration are dropped. Note: Duration is read from the manifest JSON. Defaults to 0.1.

  • max_duration (float) – Dataset parameter. All training files which have a duration more than max_duration are dropped. Note: Duration is read from the manifest JSON. Defaults to None.

  • trim (bool) – Whether to trim silence from the beginning and end of the audio signal using librosa.effects.trim(). Defaults to False.

  • use_start_end_token – Boolean which dictates whether to add [BOS] and [EOS] tokens to the beginning and end of the token sequence, respectively.

  • pad_id (id) – Token used to pad when collating samples in batches. If this is None, pads using 0s. Defaults to None.

  • shard_strategy (str) –

    Tarred dataset shard distribution strategy chosen as a str value during ddp.

    • scatter: The default shard strategy applied by WebDataset, where each node gets a unique set of shards, which are permanently pre-allocated and never changed at runtime.

    • replicate: Optional shard strategy, where each node gets all of the set of shards available in the tarred dataset, which are permanently pre-allocated and never changed at runtime. The benefit of replication is that it allows each node to sample data points from the entire dataset independently of other nodes, and reduces dependence on value of shuffle_n.

      Warning

      The replicated strategy allows every node to sample the entire set of available tarfiles, and therefore more than one node may sample the same tarfile, and even sample the same data points! As such, there is no guarantee that all samples in the dataset will be sampled at least once during 1 epoch. The scattered strategy, on the other hand, on specific occasions (when the number of shards is not divisible by world_size), will not sample the entire dataset. For these reasons it is not advisable to use tarred datasets as validation or test datasets.

  • global_rank (int) – Worker rank, used for partitioning shards. Defaults to 0.

  • world_size (int) – Total number of processes, used for partitioning shards. Defaults to 0.

  • return_sample_id (bool) – whether to return the sample_id as a part of each sample

Audio Preprocessors#

Audio Augmentors#

class nemo.collections.asr.parts.preprocessing.perturb.SpeedPerturbation(sr, resample_type, min_speed_rate=0.9, max_speed_rate=1.1, num_rates=5, rng=None)[source]#

Bases: Perturbation

Performs Speed Augmentation by re-sampling the data to a different sampling rate, which does not preserve pitch.

Note: This is a very slow operation for online augmentation. If space allows, it is preferable to pre-compute and save the files to augment the dataset.

Parameters
  • sr – Original sampling rate.

  • resample_type – Type of resampling operation that will be performed. For better speed using resampy’s fast resampling method, use resample_type=’kaiser_fast’. For high-quality resampling, set resample_type=’kaiser_best’. To use scipy.signal.resample, set resample_type=’fft’ or resample_type=’scipy’

  • min_speed_rate – Minimum sampling rate modifier.

  • max_speed_rate – Maximum sampling rate modifier.

  • num_rates – Number of discrete rates to allow. Can be a positive or negative integer. If a positive integer greater than 0 is provided, the range of speed rates will be discretized into num_rates values. If a negative integer or 0 is provided, the full range of speed rates will be sampled uniformly. Note: If a positive integer is provided and the resultant discretized range of rates contains the value ‘1.0’, then those samples with rate=1.0 will not be augmented at all and are simply skipped. This is to avoid unnecessary augmentation and increased computation time. The effective augmentation chance in such a case is prob * ((num_rates - 1) / num_rates) * 100%, where prob is the global probability of a sample being augmented.

  • rng – Random seed. Default is None

max_augmentation_length(length)[source]#
perturb(data)[source]#
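
A small sketch applying the perturbation to an AudioSegment (the wav path is hypothetical):

from nemo.collections.asr.parts.preprocessing.perturb import SpeedPerturbation
from nemo.collections.asr.parts.preprocessing.segment import AudioSegment

speed = SpeedPerturbation(
    sr=16000,
    resample_type='kaiser_fast',   # faster, lower-quality resampling
    min_speed_rate=0.9,
    max_speed_rate=1.1,
    num_rates=5,
)
segment = AudioSegment.from_file("sample.wav", target_sr=16000)
speed.perturb(segment)             # resamples the segment in place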
class nemo.collections.asr.parts.preprocessing.perturb.TimeStretchPerturbation(min_speed_rate=0.9, max_speed_rate=1.1, num_rates=5, n_fft=512, rng=None)[source]#

Bases: Perturbation

Time-stretch an audio series by a fixed rate while preserving pitch, based on [1, 2].

Note: This is a simplified implementation, intended primarily for reference and pedagogical purposes. It makes no attempt to handle transients, and is likely to produce audible artifacts.

References:
[1] Ellis, D. P. W. “A phase vocoder in Matlab.” Columbia University, 2002. http://www.ee.columbia.edu/~dpwe/resources/matlab/pvoc/
[2] librosa.effects.time_stretch: https://librosa.github.io/librosa/generated/librosa.effects.time_stretch.html

Parameters
  • min_speed_rate – Minimum sampling rate modifier.

  • max_speed_rate – Maximum sampling rate modifier.

  • num_rates – Number of discrete rates to allow. Can be a positive or negative integer. If a positive integer greater than 0 is provided, the range of speed rates will be discretized into num_rates values. If a negative integer or 0 is provided, the full range of speed rates will be sampled uniformly. Note: If a positive integer is provided and the resultant discretized range of rates contains the value ‘1.0’, then those samples with rate=1.0 will not be augmented at all and are simply skipped. This is to avoid unnecessary augmentation and increased computation time. The effective augmentation chance in such a case is prob * ((num_rates - 1) / num_rates) * 100%, where prob is the global probability of a sample being augmented.

  • n_fft – Number of fft filters to be computed.

  • rng – Random seed. Default is None

max_augmentation_length(length)[source]#
perturb(data)[source]#
class nemo.collections.asr.parts.preprocessing.perturb.GainPerturbation(min_gain_dbfs=-10, max_gain_dbfs=10, rng=None)[source]#

Bases: Perturbation

Applies random gain to the audio.

Parameters
  • min_gain_dbfs (float) – Min gain level in dB

  • max_gain_dbfs (float) – Max gain level in dB

  • rng (int) – Random seed. Default is None

perturb(data)[source]#
class nemo.collections.asr.parts.preprocessing.perturb.ImpulsePerturbation(manifest_path=None, audio_tar_filepaths=None, shuffle_n=128, normalize_impulse=False, shift_impulse=False, rng=None)[source]#

Bases: Perturbation

Convolves audio with a Room Impulse Response.

Parameters
  • manifest_path (list) – Manifest file for RIRs

  • audio_tar_filepaths (list) – Tar files, if RIR audio files are tarred

  • shuffle_n (int) – Shuffle parameter for shuffling buffered files from the tar files

  • normalize_impulse (bool) – Normalize impulse response to zero mean and amplitude 1

  • shift_impulse (bool) – Shift impulse response to adjust for delay at the beginning

  • rng (int) – Random seed. Default is None

perturb(data)[source]#
class nemo.collections.asr.parts.preprocessing.perturb.ShiftPerturbation(min_shift_ms=-5.0, max_shift_ms=5.0, rng=None)[source]#

Bases: Perturbation

Perturbs audio by shifting the audio in time by a random amount between min_shift_ms and max_shift_ms. The final length of the audio is kept unaltered by padding the audio with zeros.

Parameters
  • min_shift_ms (float) – Minimum time in milliseconds by which audio will be shifted

  • max_shift_ms (float) – Maximum time in milliseconds by which audio will be shifted

  • rng (int) – Random seed. Default is None

perturb(data)[source]#
class nemo.collections.asr.parts.preprocessing.perturb.NoisePerturbation(manifest_path=None, min_snr_db=10, max_snr_db=50, max_gain_db=300.0, rng=None, audio_tar_filepaths=None, shuffle_n=100, orig_sr=16000)[source]#

Bases: Perturbation

Perturbation that adds noise to input audio.

Parameters
  • manifest_path (str) – Manifest file with paths to noise files

  • min_snr_db (float) – Minimum SNR of audio after noise is added

  • max_snr_db (float) – Maximum SNR of audio after noise is added

  • max_gain_db (float) – Maximum gain that can be applied on the noise sample

  • audio_tar_filepaths (list) – Tar files, if noise audio files are tarred

  • shuffle_n (int) – Shuffle parameter for shuffling buffered files from the tar files

  • orig_sr (int) – Original sampling rate of the noise files

  • rng (int) – Random seed. Default is None

get_one_noise_sample(target_sr)[source]#
property orig_sr#
perturb(data, ref_mic=0)[source]#
Parameters
  • data (AudioSegment) – audio data

  • ref_mic (int) – reference mic index for scaling multi-channel audios

perturb_with_foreground_noise(data, noise, data_rms=None, max_noise_dur=2, max_additions=1, ref_mic=0)[source]#
Parameters
  • data (AudioSegment) – audio data

  • noise (AudioSegment) – noise data

  • data_rms (Union[float, List[float]]) – rms_db for the data input

  • max_noise_dur (float) – Max noise duration

  • max_additions (int) – Maximum number of times noise is added

  • ref_mic (int) – reference mic index for scaling multi-channel audios

perturb_with_input_noise(data, noise, data_rms=None, ref_mic=0)[source]#
Parameters
  • data (AudioSegment) – audio data

  • noise (AudioSegment) – noise data

  • data_rms (Union[float, List[float]]) – rms_db for the data input

  • ref_mic (int) – reference mic index for scaling multi-channel audios
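
A hedged sketch (the manifest and wav paths are hypothetical) that mixes a randomly chosen noise file into an utterance at 10-50 dB SNR:

from nemo.collections.asr.parts.preprocessing.perturb import NoisePerturbation
from nemo.collections.asr.parts.preprocessing.segment import AudioSegment

noise_aug = NoisePerturbation(
    manifest_path="noise_manifest.json",   # manifest of noise recordings
    min_snr_db=10,
    max_snr_db=50,
    orig_sr=16000,
)
segment = AudioSegment.from_file("speech.wav", target_sr=16000)
noise_aug.perturb(segment)                 # adds a randomly drawn noise sample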

class nemo.collections.asr.parts.preprocessing.perturb.WhiteNoisePerturbation(min_level=-90, max_level=-46, rng=None)[source]#

Bases: Perturbation

Perturbation that adds white noise to an audio file in the training dataset.

Parameters
  • min_level (int) – Minimum level in dB at which white noise should be added

  • max_level (int) – Maximum level in dB at which white noise should be added

  • rng (int) – Random seed. Default is None

perturb(data)[source]#
class nemo.collections.asr.parts.preprocessing.perturb.RirAndNoisePerturbation(rir_manifest_path=None, rir_prob=0.5, noise_manifest_paths=None, noise_prob=1.0, min_snr_db=0, max_snr_db=50, rir_tar_filepaths=None, rir_shuffle_n=100, noise_tar_filepaths=None, apply_noise_rir=False, orig_sample_rate=None, max_additions=5, max_duration=2.0, bg_noise_manifest_paths=None, bg_noise_prob=1.0, bg_min_snr_db=10, bg_max_snr_db=50, bg_noise_tar_filepaths=None, bg_orig_sample_rate=None, rng=None)[source]#

Bases: Perturbation

RIR augmentation with additive foreground and background noise. In this implementation audio data is augmented by first convolving the audio with a Room Impulse Response and then adding foreground noise and background noise at various SNRs. RIR, foreground and background noises should either be supplied with a manifest file or as tarred audio files (faster).

Different sets of noise audio files can be supplied based on the original sampling rate of the noise. This is useful while training a mixed sample rate model. For example, when training a mixed model with 8 kHz and 16 kHz audio with a target sampling rate of 16 kHz, one would want to augment 8 kHz data with 8 kHz noise rather than 16 kHz noise.

Parameters
  • rir_manifest_path – Manifest file for RIRs

  • rir_tar_filepaths – Tar files, if RIR audio files are tarred

  • rir_prob – Probability of applying a RIR

  • noise_manifest_paths – Foreground noise manifest path

  • min_snr_db – Min SNR for foreground noise

  • max_snr_db – Max SNR for foreground noise

  • noise_tar_filepaths – Tar files, if noise files are tarred

  • apply_noise_rir – Whether to convolve the foreground noise with a random RIR

  • orig_sample_rate – Original sampling rate of foreground noise audio

  • max_additions – Max number of times foreground noise is added to an utterance

  • max_duration – Max duration of foreground noise

  • bg_noise_manifest_paths – Background noise manifest path

  • bg_min_snr_db – Min SNR for background noise

  • bg_max_snr_db – Max SNR for background noise

  • bg_noise_tar_filepaths – Tar files, if noise files are tarred

  • bg_orig_sample_rate – Original sampling rate of background noise audio

  • rng – Random seed. Default is None

perturb(data)[source]#
class nemo.collections.asr.parts.preprocessing.perturb.TranscodePerturbation(codecs=None, rng=None)[source]#

Bases: Perturbation

Audio codec augmentation. This implementation uses sox to transcode audio with low rate audio codecs, so users need to make sure that the installed sox version supports the codecs used here (G711 and amr-nb).

Parameters
  • codecs (List[str]) – A list of codecs to be transcoded to. Default is None.

  • rng (int) – Random seed. Default is None.

perturb(data)[source]#
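
The perturbations above are typically composed into an AudioAugmentor and passed to the datasets via their augmentor argument. A sketch, assuming AudioAugmentor's (probability, perturbation) pair convention:

from nemo.collections.asr.parts.preprocessing.perturb import (
    AudioAugmentor,
    GainPerturbation,
    ShiftPerturbation,
    WhiteNoisePerturbation,
)

# Each entry pairs an application probability with a perturbation instance.
augmentor = AudioAugmentor(
    perturbations=[
        (0.5, GainPerturbation(min_gain_dbfs=-10, max_gain_dbfs=10)),
        (0.5, ShiftPerturbation(min_shift_ms=-5.0, max_shift_ms=5.0)),
        (0.3, WhiteNoisePerturbation(min_level=-90, max_level=-46)),
    ]
)
# `augmentor` can now be passed as the `augmentor` argument of
# AudioToCharDataset or AudioToBPEDataset.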

Miscellaneous Classes#

CTC Decoding#

class nemo.collections.asr.parts.submodules.ctc_decoding.CTCDecoding(decoding_cfg, vocabulary)[source]#

Bases: AbstractCTCDecoding

Used for performing CTC auto-regressive / non-auto-regressive decoding of the logprobs for character based models.

Parameters
  • decoding_cfg

    A dict-like object which contains the following key-value pairs.

    strategy: str value which represents the type of decoding that can occur. Possible values are:

      • greedy (for greedy decoding).

      • beam (for DeepSpeed KenLM based decoding).

    compute_timestamps: A bool flag which determines whether to compute the character/subword or word based timestamp mapping of the output log-probabilities to discrete intervals of timestamps. The timestamps will be available in the returned Hypothesis.timestep as a dictionary.

    ctc_timestamp_type: A str value which represents the type of timestamps that should be calculated. Can take the following values: “char” for character/subword time stamps, “word” for word level time stamps, and “all” (default) for both character level and word level time stamps.

    word_seperator: Str token representing the separator between words.

    preserve_alignments: Bool flag which preserves the history of logprobs generated during decoding (sample / batched). When set to true, the Hypothesis will contain the non-null value for logprobs. Here, logprobs is a torch.Tensor.

    confidence_cfg: A dict-like object which contains the following key-value pairs related to confidence scores. In order to obtain hypotheses with confidence scores, please utilize the ctc_decoder_predictions_tensor function with the preserve_frame_confidence flag set to True.

      preserve_frame_confidence: Bool flag which preserves the history of per-frame confidence scores generated during decoding. When set to true, the Hypothesis will contain the non-null value for frame_confidence. Here, frame_confidence is a List of floats.

      preserve_token_confidence: Bool flag which preserves the history of per-token confidence scores generated during greedy decoding (sample / batched). When set to true, the Hypothesis will contain the non-null value for token_confidence. Here, token_confidence is a List of floats whose length corresponds to the number of recognized tokens.

      preserve_word_confidence: Bool flag which preserves the history of per-word confidence scores generated during greedy decoding (sample / batched). When set to true, the Hypothesis will contain the non-null value for word_confidence. Here, word_confidence is a List of floats whose length corresponds to the number of recognized words.

      exclude_blank: Bool flag indicating that blank token confidence scores are to be excluded from the token_confidence.

      aggregation: Which aggregation type to use for collapsing per-token confidence into per-word confidence. Valid options are mean, min, max, prod.

      method_cfg: A dict-like object which contains the method name and settings to compute per-frame confidence scores.

        name: The method name (str). Supported values:

          • ’max_prob’ for using the maximum token probability as a confidence.

          • ’entropy’ for using a normalized entropy of a log-likelihood vector.

        entropy_type: Which type of entropy to use (str). Used if confidence_method_cfg.name is set to entropy. Supported values:

          • ’gibbs’ for the (standard) Gibbs entropy. If the alpha (α) is provided, the formula is: H_α = -sum_i((p^α_i)*log(p^α_i)). Note that for this entropy, alpha should comply with the following inequality: (log(V)+2-sqrt(log^2(V)+4))/(2*log(V)) <= α <= (1+log(V-1))/log(V-1), where V is the model vocabulary size.

          • ’tsallis’ for the Tsallis entropy with the Boltzmann constant one. The Tsallis entropy formula is: H_α = 1/(α-1)*(1-sum_i(p^α_i)), where α is a parameter. When α == 1, it works like the Gibbs entropy. More: https://en.wikipedia.org/wiki/Tsallis_entropy

          • ’renyi’ for the Rényi entropy. The Rényi entropy formula is: H_α = 1/(1-α)*log_2(sum_i(p^α_i)), where α is a parameter. When α == 1, it works like the Gibbs entropy. More: https://en.wikipedia.org/wiki/R%C3%A9nyi_entropy

        alpha: Power scale for logsoftmax (α for entropies). Here we restrict it to be > 0. When alpha equals one, scaling is not applied to ‘max_prob’, and any entropy type behaves like the Shannon entropy: H = -sum_i(p_i*log(p_i)).

        entropy_norm: A mapping of the entropy value to the interval [0, 1]. Supported values:

          • ’lin’ for using the linear mapping.

          • ’exp’ for using exponential mapping with linear shift.

    batch_dim_index: Index of the batch dimension of the targets and predictions parameters of the ctc_decoder_predictions_tensor methods. Can be either 0 or 1.

    The config may further contain the following sub-dictionaries:

    “greedy”:

      preserve_alignments: Same as above, overrides the above value.

      compute_timestamps: Same as above, overrides the above value.

      preserve_frame_confidence: Same as above, overrides the above value.

      confidence_method_cfg: Same as above, overrides confidence_cfg.method_cfg.

    “beam”:

      beam_size: int, defining the beam size for beam search. Must be >= 1. If beam_size == 1, will perform cached greedy search, which may give slightly different results compared to the greedy search above.

      return_best_hypothesis: optional bool, whether to return just the best hypothesis or all of the hypotheses after beam search has concluded. This flag is set by default.

      beam_alpha: float, the strength of the Language model on the final score of a token. final_score = acoustic_score + beam_alpha * lm_score + beam_beta * seq_length.

      beam_beta: float, the strength of the sequence length penalty on the final score of a token. final_score = acoustic_score + beam_alpha * lm_score + beam_beta * seq_length.

      kenlm_path: str, path to a KenLM ARPA or .binary file (depending on the strategy chosen). If the path is invalid (the file is not found at the path), a deferred error will be raised at the moment beam search is computed, so that users may update / change the decoding strategy to point to the correct file.

  • blank_id – The id of the CTC blank token.

decode_ids_to_tokens(tokens: List[int]) List[str][source]#

Implemented by subclass in order to decode a token id list into a token list. A token list is the string representation of each token id.

Parameters

tokens – List of int representing the token ids.

Returns

A list of decoded tokens.

decode_tokens_to_str(tokens: List[int]) str[source]#

Implemented by subclass in order to decode a token list into a string.

Parameters

tokens – List of int representing the token ids.

Returns

A decoded string.
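
A hedged sketch of greedy character decoding; it assumes the CTCDecodingConfig dataclass from the same module fills in the remaining defaults described above:

import torch
from nemo.collections.asr.parts.submodules.ctc_decoding import (
    CTCDecoding,
    CTCDecodingConfig,
)

vocabulary = list("abcdefghijklmnopqrstuvwxyz' ")
decoding = CTCDecoding(
    decoding_cfg=CTCDecodingConfig(strategy='greedy'),
    vocabulary=vocabulary,
)

# Random log-probabilities of shape [B, T, V+1]; blank is the last index.
log_probs = torch.randn(2, 50, len(vocabulary) + 1).log_softmax(dim=-1)
lengths = torch.tensor([50, 42])
texts, _ = decoding.ctc_decoder_predictions_tensor(
    log_probs, decoder_lengths=lengths
)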

class nemo.collections.asr.parts.submodules.ctc_decoding.CTCBPEDecoding(decoding_cfg, tokenizer: TokenizerSpec)[source]#

Bases: AbstractCTCDecoding

Used for performing CTC auto-regressive / non-auto-regressive decoding of the logprobs for subword based models.

Parameters
  • decoding_cfg

    A dict-like object which contains the following key-value pairs.

    strategy: str value which represents the type of decoding that can occur. Possible values are:

      • greedy (for greedy decoding).

      • beam (for DeepSpeed KenLM based decoding).

    compute_timestamps: A bool flag which determines whether to compute the character/subword or word based timestamp mapping of the output log-probabilities to discrete intervals of timestamps. The timestamps will be available in the returned Hypothesis.timestep as a dictionary.

    ctc_timestamp_type: A str value which represents the type of timestamps that should be calculated. Can take the following values: “char” for character/subword time stamps, “word” for word level time stamps, and “all” (default) for both character level and word level time stamps.

    word_seperator: Str token representing the separator between words.

    preserve_alignments: Bool flag which preserves the history of logprobs generated during decoding (sample / batched). When set to true, the Hypothesis will contain the non-null value for logprobs. Here, logprobs is a torch.Tensor.

    confidence_cfg: A dict-like object which contains the following key-value pairs related to confidence scores. In order to obtain hypotheses with confidence scores, please utilize the ctc_decoder_predictions_tensor function with the preserve_frame_confidence flag set to True.

      preserve_frame_confidence: Bool flag which preserves the history of per-frame confidence scores generated during decoding. When set to true, the Hypothesis will contain the non-null value for frame_confidence. Here, frame_confidence is a List of floats.

      preserve_token_confidence: Bool flag which preserves the history of per-token confidence scores generated during greedy decoding (sample / batched). When set to true, the Hypothesis will contain the non-null value for token_confidence. Here, token_confidence is a List of floats whose length corresponds to the number of recognized tokens.

      preserve_word_confidence: Bool flag which preserves the history of per-word confidence scores generated during greedy decoding (sample / batched). When set to true, the Hypothesis will contain the non-null value for word_confidence. Here, word_confidence is a List of floats whose length corresponds to the number of recognized words.

      exclude_blank: Bool flag indicating that blank token confidence scores are to be excluded from the token_confidence.

      aggregation: Which aggregation type to use for collapsing per-token confidence into per-word confidence. Valid options are mean, min, max, prod.

      method_cfg: A dict-like object which contains the method name and settings to compute per-frame confidence scores.

        name: The method name (str). Supported values:

          • ’max_prob’ for using the maximum token probability as a confidence.

          • ’entropy’ for using a normalized entropy of a log-likelihood vector.

        entropy_type: Which type of entropy to use (str). Used if confidence_method_cfg.name is set to entropy. Supported values:

          • ’gibbs’ for the (standard) Gibbs entropy. If the alpha (α) is provided, the formula is: H_α = -sum_i((p^α_i)*log(p^α_i)). Note that for this entropy, alpha should comply with the following inequality: (log(V)+2-sqrt(log^2(V)+4))/(2*log(V)) <= α <= (1+log(V-1))/log(V-1), where V is the model vocabulary size.

          • ’tsallis’ for the Tsallis entropy with the Boltzmann constant one. The Tsallis entropy formula is: H_α = 1/(α-1)*(1-sum_i(p^α_i)), where α is a parameter. When α == 1, it works like the Gibbs entropy. More: https://en.wikipedia.org/wiki/Tsallis_entropy

          • ’renyi’ for the Rényi entropy. The Rényi entropy formula is: H_α = 1/(1-α)*log_2(sum_i(p^α_i)), where α is a parameter. When α == 1, it works like the Gibbs entropy. More: https://en.wikipedia.org/wiki/R%C3%A9nyi_entropy

        alpha: Power scale for logsoftmax (α for entropies). Here we restrict it to be > 0. When alpha equals one, scaling is not applied to ‘max_prob’, and any entropy type behaves like the Shannon entropy: H = -sum_i(p_i*log(p_i)).

        entropy_norm: A mapping of the entropy value to the interval [0, 1]. Supported values:

          • ’lin’ for using the linear mapping.

          • ’exp’ for using exponential mapping with linear shift.

    batch_dim_index: Index of the batch dimension of the targets and predictions parameters of the ctc_decoder_predictions_tensor methods. Can be either 0 or 1.

    The config may further contain the following sub-dictionaries:

    “greedy”:

      preserve_alignments: Same as above, overrides the above value.

      compute_timestamps: Same as above, overrides the above value.

      preserve_frame_confidence: Same as above, overrides the above value.

      confidence_method_cfg: Same as above, overrides confidence_cfg.method_cfg.

    “beam”:

      beam_size: int, defining the beam size for beam search. Must be >= 1. If beam_size == 1, will perform cached greedy search, which may give slightly different results compared to the greedy search above.

      return_best_hypothesis: optional bool, whether to return just the best hypothesis or all of the hypotheses after beam search has concluded. This flag is set by default.

      beam_alpha: float, the strength of the Language model on the final score of a token. final_score = acoustic_score + beam_alpha * lm_score + beam_beta * seq_length.

      beam_beta: float, the strength of the sequence length penalty on the final score of a token. final_score = acoustic_score + beam_alpha * lm_score + beam_beta * seq_length.

      kenlm_path: str, path to a KenLM ARPA or .binary file (depending on the strategy chosen). If the path is invalid (the file is not found at the path), a deferred error will be raised at the moment beam search is computed, so that users may update / change the decoding strategy to point to the correct file.

  • tokenizer – NeMo tokenizer object, which inherits from TokenizerSpec.

decode_ids_to_tokens(tokens: List[int]) List[str][source]#

Implemented by subclass in order to decode a token id list into a token list. A token list is the string representation of each token id.

Parameters

tokens – List of int representing the token ids.

Returns

A list of decoded tokens.

decode_tokens_to_str(tokens: List[int]) str[source]#

Implemented by subclass in order to decode a token list into a string.

Parameters

tokens – List of int representing the token ids.

Returns

A decoded string.

class nemo.collections.asr.parts.submodules.ctc_greedy_decoding.GreedyCTCInfer(blank_id: int, preserve_alignments: bool = False, compute_timestamps: bool = False, preserve_frame_confidence: bool = False, confidence_method_cfg: Optional[omegaconf.DictConfig] = None)[source]#

Bases: Typing, ConfidenceMethodMixin

A greedy CTC decoder.

Provides a common abstraction for sample level and batch level greedy decoding.

Parameters
  • blank_id – int index of the blank token. Can be 0 or len(vocabulary).

  • preserve_alignments – Bool flag which preserves the history of logprobs generated during decoding (sample / batched). When set to true, the Hypothesis will contain the non-null value for logprobs in it. Here, logprobs is a torch.Tensors.

  • compute_timestamps – A bool flag, which determines whether to compute the character/subword, or word based timestamp mapping the output log-probabilities to discrite intervals of timestamps. The timestamps will be available in the returned Hypothesis.timestep as a dictionary.

  • preserve_frame_confidence – Bool flag which preserves the history of per-frame confidence scores generated during decoding. When set to true, the Hypothesis will contain the non-null value for frame_confidence in it. Here, frame_confidence is a List of floats.

  • confidence_method_cfg

    A dict-like object which contains the method name and settings to compute per-frame confidence scores.

    name: The method name (str).
    Supported values:
    • ’max_prob’ for using the maximum token probability as a confidence.

    • ’entropy’ for using a normalized entropy of a log-likelihood vector.

    entropy_type: Which type of entropy to use (str). Used if confidence_method_cfg.name is set to entropy.
    Supported values:
    • ’gibbs’ for the (standard) Gibbs entropy. If the alpha (α) is provided, the formula is: H_α = -sum_i((p^α_i)*log(p^α_i)). Note that for this entropy, alpha should comply with the following inequality: (log(V)+2-sqrt(log^2(V)+4))/(2*log(V)) <= α <= (1+log(V-1))/log(V-1), where V is the model vocabulary size.

    • ’tsallis’ for the Tsallis entropy with the Boltzmann constant one. The Tsallis entropy formula is: H_α = 1/(α-1)*(1-sum_i(p^α_i)), where α is a parameter. When α == 1, it works like the Gibbs entropy. More: https://en.wikipedia.org/wiki/Tsallis_entropy

    • ’renyi’ for the Rényi entropy. The Rényi entropy formula is: H_α = 1/(1-α)*log_2(sum_i(p^α_i)), where α is a parameter. When α == 1, it works like the Gibbs entropy. More: https://en.wikipedia.org/wiki/R%C3%A9nyi_entropy

    alpha: Power scale for logsoftmax (α for entropies). Here we restrict it to be > 0. When alpha equals one, scaling is not applied to ‘max_prob’, and any entropy type behaves like the Shannon entropy: H = -sum_i(p_i*log(p_i)).

    entropy_norm: A mapping of the entropy value to the interval [0,1].
    Supported values:
    • ’lin’ for using the linear mapping.

    • ’exp’ for using exponential mapping with linear shift.

forward(decoder_output: torch.Tensor, decoder_lengths: torch.Tensor)[source]#

Returns a list of hypotheses given an input batch of the encoder hidden embedding. Output tokens are generated auto-regressively.

Parameters
  • decoder_output – A tensor of size (batch, timesteps, features) or (batch, timesteps) (each timestep is a label).

  • decoder_lengths – list of int representing the length of each output sequence.

Returns

packed list containing batch number of sentences (Hypotheses).

property input_types#

Returns definitions of module input ports.

property output_types#

Returns definitions of module output ports.
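
A short sketch running batched greedy inference over random log-probabilities (shapes are illustrative; blank is taken as the last index):

import torch
from nemo.collections.asr.parts.submodules.ctc_greedy_decoding import GreedyCTCInfer

vocab_size = 28
greedy = GreedyCTCInfer(blank_id=vocab_size)   # blank == len(vocabulary)

log_probs = torch.randn(2, 50, vocab_size + 1).log_softmax(dim=-1)
lengths = torch.tensor([50, 42])

# forward() returns a packed tuple containing the list of Hypotheses.
(hypotheses,) = greedy(decoder_output=log_probs, decoder_lengths=lengths)
# y_sequence holds per-frame argmax ids; blank/repeat collapsing is done by
# the higher-level CTCDecoding classes.
print(hypotheses[0].y_sequence[:10])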

class nemo.collections.asr.parts.submodules.ctc_beam_decoding.BeamCTCInfer(blank_id: int, beam_size: int, search_type: str = 'default', return_best_hypothesis: bool = True, preserve_alignments: bool = False, compute_timestamps: bool = False, beam_alpha: float = 1.0, beam_beta: float = 0.0, kenlm_path: Optional[str] = None, flashlight_cfg: Optional[FlashlightConfig] = None, pyctcdecode_cfg: Optional[PyCTCDecodeConfig] = None)[source]#

Bases: AbstractBeamCTCInfer

A beam search CTC decoder.

Provides a common abstraction for sample level and batch level beam search decoding.

Parameters
  • blank_id – int index of the blank token. Can be 0 or len(vocabulary).

  • preserve_alignments – Bool flag which preserves the history of logprobs generated during decoding (sample / batched). When set to true, the Hypothesis will contain the non-null value for logprobs in it. Here, logprobs is a torch.Tensors.

  • compute_timestamps – A bool flag, which determines whether to compute the character/subword, or word based timestamp mapping the output log-probabilities to discrite intervals of timestamps. The timestamps will be available in the returned Hypothesis.timestep as a dictionary.

Open Seq2Seq Beam Search Algorithm (DeepSpeed)

Parameters
  • x – Tensor of shape [B, T, V+1], where B is the batch size, T is the maximum sequence length, and V is the vocabulary size. The tensor contains log-probabilities.

  • out_len – Tensor of shape [B], contains lengths of each sequence in the batch.

Returns

A list of NBestHypotheses objects, one for each sequence in the batch.

Flashlight Beam Search Algorithm. Should support Char and Subword models.

Parameters
  • x – Tensor of shape [B, T, V+1], where B is the batch size, T is the maximum sequence length, and V is the vocabulary size. The tensor contains log-probabilities.

  • out_len – Tensor of shape [B], contains lengths of each sequence in the batch.

Returns

A list of NBestHypotheses objects, one for each sequence in the batch.

forward(decoder_output: torch.Tensor, decoder_lengths: torch.Tensor) Tuple[List[Union[Hypothesis, NBestHypotheses]]][source]#

Returns a list of hypotheses given an input batch of encoder hidden embeddings. Output tokens are generated auto-regressively.

Parameters
  • decoder_output – A tensor of size (batch, timesteps, features).

  • decoder_lengths – list of ints representing the length of each output sequence.

Returns

A packed list of Hypothesis (or NBestHypotheses) objects, one per batch element.

set_decoding_type(decoding_type: str)[source]#

Sets the decoding type of the framework. Can support either char or subword models.

Parameters

decoding_type – Str corresponding to decoding type. Only supports “char” and “subword”.
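
A rough end-to-end usage sketch of this decoder follows. The vocabulary, sizes, and input tensors are placeholders, and the ‘default’ search additionally requires the optional OpenSeq2Seq beam search backend to be installed; treat this as an illustration of the documented forward contract, not a verified recipe.

    import torch
    from nemo.collections.asr.parts.submodules.ctc_beam_decoding import BeamCTCInfer

    V = 28  # placeholder vocabulary size; blank is assumed to be the last index
    decoder = BeamCTCInfer(blank_id=V, beam_size=4, search_type='default',
                           return_best_hypothesis=True)
    decoder.set_decoding_type('char')
    decoder.set_vocabulary([chr(ord('a') + i) for i in range(26)] + [' ', "'"])

    log_probs = torch.randn(2, 50, V + 1).log_softmax(dim=-1)  # [B, T, V+1] log-probabilities
    lengths = torch.tensor([50, 42])                           # [B] per-sequence lengths
    hypotheses = decoder(decoder_output=log_probs, decoder_lengths=lengths)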

RNNT Decoding#

Hypotheses#

class nemo.collections.asr.parts.utils.rnnt_utils.Hypothesis(score: float, y_sequence: ~typing.Union[~typing.List[int], torch.Tensor], text: ~typing.Optional[str] = None, dec_out: ~typing.Optional[~typing.List[torch.Tensor]] = None, dec_state: ~typing.Optional[~typing.Union[~typing.List[~typing.List[torch.Tensor]], ~typing.List[torch.Tensor]]] = None, timestep: ~typing.Union[~typing.List[int], torch.Tensor] = <factory>, alignments: ~typing.Optional[~typing.Union[~typing.List[int], ~typing.List[~typing.List[int]]]] = None, frame_confidence: ~typing.Optional[~typing.Union[~typing.List[float], ~typing.List[~typing.List[float]]]] = None, token_confidence: ~typing.Optional[~typing.List[float]] = None, word_confidence: ~typing.Optional[~typing.List[float]] = None, length: ~typing.Union[int, torch.Tensor] = 0, y: ~typing.Optional[~typing.List[torch.tensor]] = None, lm_state: ~typing.Optional[~typing.Union[~typing.Dict[str, ~typing.Any], ~typing.List[~typing.Any]]] = None, lm_scores: ~typing.Optional[torch.Tensor] = None, ngram_lm_state: ~typing.Optional[~typing.Union[~typing.Dict[str, ~typing.Any], ~typing.List[~typing.Any]]] = None, tokens: ~typing.Optional[~typing.Union[~typing.List[int], torch.Tensor]] = None, last_token: ~typing.Optional[torch.Tensor] = None)[source]#

Bases: object

Hypothesis class for beam search algorithms.

score: A float score obtained from an AbstractRNNTDecoder module’s score_hypothesis method.

y_sequence: Either a sequence of integer ids pointing to some vocabulary, or a packed torch.Tensor behaving in the same manner. dtype must be torch.Long in the latter case.

dec_state: A list (or list of lists) of LSTM-RNN decoder states. Can be None.

text: (Optional) A decoded string after processing via CTC / RNN-T decoding (removing the CTC / RNNT blank tokens, and optionally merging word-pieces). Should be used as the decoded string for Word Error Rate calculation.

timestep: (Optional) A list of integer indices representing at which index in the decoding process the token appeared. Should be of the same length as the number of non-blank tokens.

alignments: (Optional) Represents the CTC / RNNT token alignments as integer tokens along an axis of time T (for CTC) or Time x Target (T x U, for RNNT). For CTC, represented as a single list of integer indices. For RNNT, represented as a dangling list of lists of integer indices, where the outer list represents the Time dimension (T) and the inner list represents the Target dimension (U). The set of valid indices includes the CTC / RNNT blank token, in order to represent alignments.

frame_confidence: (Optional) Represents the CTC / RNNT per-frame confidence scores as token probabilities along an axis of time T (for CTC) or Time x Target (T x U, for RNNT). For CTC, represented as a single list of floats. For RNNT, represented as a dangling list of lists of floats, where the outer list represents the Time dimension (T) and the inner list represents the Target dimension (U).

token_confidence: (Optional) Represents the CTC / RNNT per-token confidence scores as token probabilities along an axis of Target U. Represented as a single list of floats.

word_confidence: (Optional) Represents the CTC / RNNT per-word confidence scores as token probabilities along an axis of Target U. Represented as a single list of floats.

length: Represents the length of the sequence (the original length without padding). Defaults to 0.

y: (Unused) A list of torch.Tensors representing the list of hypotheses.

lm_state: (Unused) A dictionary state cache used by an external Language Model.

lm_scores: (Unused) Score of the external Language Model.

ngram_lm_state: (Optional) State of the external n-gram Language Model.

tokens: (Optional) A list of decoded tokens (can be characters or word-pieces).

last_token: (Optional) A token or batch of tokens which was predicted in the last step.

class nemo.collections.asr.parts.utils.rnnt_utils.NBestHypotheses(n_best_hypotheses: Optional[List[Hypothesis]])[source]#

Bases: object

List of N best hypotheses
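
A minimal sketch of how these two containers fit together; the scores, token ids, and text below are fabricated for illustration.

    import torch
    from nemo.collections.asr.parts.utils.rnnt_utils import Hypothesis, NBestHypotheses

    # Two candidate transcripts for a single utterance, ranked by score.
    hyp_a = Hypothesis(score=-1.2, y_sequence=torch.tensor([5, 17, 3]), text="cat")
    hyp_b = Hypothesis(score=-2.7, y_sequence=torch.tensor([5, 17, 9]), text="cap")

    nbest = NBestHypotheses([hyp_a, hyp_b])
    best = max(nbest.n_best_hypotheses, key=lambda h: h.score)
    print(best.text, best.score)  # -> cat -1.2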

Adapter Networks#

class nemo.collections.asr.parts.submodules.adapters.multi_head_attention_adapter_module.MultiHeadAttentionAdapter(*args: Any, **kwargs: Any)[source]#

Bases: MultiHeadAttention, AdapterModuleUtil

Multi-Head Attention layer of Transformer.

Parameters
  • n_head (int) – number of heads.

  • n_feat (int) – size of the features.

  • dropout_rate (float) – dropout rate.

  • proj_dim (int) – Optional integer value for projection before computing attention. If None, there is no projection (equivalent to proj_dim = n_feat). If > 0, n_feat is projected to proj_dim before attention is computed. If < 0, proj_dim is set to n_head, so that each head has a projected dimension of 1.

forward(query, key, value, mask, pos_emb=None, cache=None)[source]#

Compute ‘Scaled Dot Product Attention’.

Parameters
  • query (torch.Tensor) – (batch, time1, size)

  • key (torch.Tensor) – (batch, time2, size)

  • value (torch.Tensor) – (batch, time2, size)

  • mask (torch.Tensor) – (batch, time1, time2)

  • cache (torch.Tensor) – (batch, time_cache, size)

Returns
  • output (torch.Tensor) – transformed value of shape (batch, time1, d_model), weighted by the query-key attention.

  • cache (torch.Tensor) – (batch, time_cache_next, size)

reset_parameters()[source]#
get_default_strategy_config() dataclass[source]#

Returns a default adapter module strategy.
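
A rough usage sketch of this adapter's forward contract. The tensor shapes follow the documentation above; the dimensions are placeholders, and mask=None simply disables masking in this illustration.

    import torch
    from nemo.collections.asr.parts.submodules.adapters.multi_head_attention_adapter_module import (
        MultiHeadAttentionAdapter,
    )

    # Placeholder dimensions: 4 heads over 256 features, projected down to 64 before attention.
    adapter = MultiHeadAttentionAdapter(n_head=4, n_feat=256, dropout_rate=0.1, proj_dim=64)

    x = torch.randn(2, 32, 256)                        # (batch, time1, n_feat) self-attention input
    out = adapter(query=x, key=x, value=x, mask=None)  # (batch, time1, n_feat)

    # How this output is composed with the module it adapts is governed by the
    # default strategy config (MHAResidualAddAdapterStrategyConfig).
    print(adapter.get_default_strategy_config())

Note that reset_parameters() initializes the adapter so that its initial contribution is near zero, letting the residual composition start from an approximate identity.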


class nemo.collections.asr.parts.submodules.adapters.multi_head_attention_adapter_module.RelPositionMultiHeadAttentionAdapter(*args: Any, **kwargs: Any)[source]#

Bases: RelPositionMultiHeadAttention, AdapterModuleUtil

Multi-Head Attention layer of Transformer-XL with support for relative positional encoding. Paper: https://arxiv.org/abs/1901.02860

Parameters
  • n_head (int) – number of heads.

  • n_feat (int) – size of the features.

  • dropout_rate (float) – dropout rate.

  • proj_dim (int) – Optional integer value for projection before computing attention. If None, there is no projection (equivalent to proj_dim = n_feat). If > 0, n_feat is projected to proj_dim before attention is computed. If < 0, proj_dim is set to n_head, so that each head has a projected dimension of 1.

  • adapter_strategy – By default, MHAResidualAddAdapterStrategyConfig. An adapter composition function object.

forward(query, key, value, mask, pos_emb, cache=None)[source]#

Compute ‘Scaled Dot Product Attention’ with relative positional encoding.

Parameters
  • query (torch.Tensor) – (batch, time1, size)

  • key (torch.Tensor) – (batch, time2, size)

  • value (torch.Tensor) – (batch, time2, size)

  • mask (torch.Tensor) – (batch, time1, time2)

  • pos_emb (torch.Tensor) – (batch, time1, size)

  • cache (torch.Tensor) – (batch, time_cache, size)

Returns
  • output (torch.Tensor) – transformed value of shape (batch, time1, d_model), weighted by the query-key attention.

  • cache_next (torch.Tensor) – (batch, time_cache_next, size)

reset_parameters()[source]#
get_default_strategy_config() dataclass[source]#

Returns a default adapter module strategy.


class nemo.collections.asr.parts.submodules.adapters.multi_head_attention_adapter_module.PositionalEncodingAdapter(*args: Any, **kwargs: Any)[source]#

Bases: PositionalEncoding, AdapterModuleUtil

Absolute positional embedding adapter.

Note

The absolute positional embedding value is added to the input tensor without a residual connection, so the input tensor is modified. If you only require the positional embedding, drop the returned x.

Parameters
  • d_model (int) – The input dimension of x.

  • max_len (int) – The max sequence length.

  • xscale (float) – The input scaling factor. Defaults to 1.0.

  • adapter_strategy (AbstractAdapterStrategy) – By default, ReturnResultAdapterStrategyConfig. An adapter composition function object. NOTE: Since this is a positional encoding, it will not add a residual !

get_default_strategy_config() dataclass[source]#

Returns a default adapter module strategy.


class nemo.collections.asr.parts.submodules.adapters.multi_head_attention_adapter_module.RelPositionalEncodingAdapter(*args: Any, **kwargs: Any)[source]#

Bases: RelPositionalEncoding, AdapterModuleUtil

Relative positional encoding for TransformerXL’s layers. See Appendix B in https://arxiv.org/abs/1901.02860

Note

The relative positional embedding value is not added to the input tensor. If the input itself needs updating, that must be done separately; if you only require the positional embedding, drop the returned x.

Parameters
  • d_model (int) – embedding dim

  • max_len (int) – maximum input length

  • xscale (bool) – whether to scale the input by sqrt(d_model)

  • adapter_strategy – By default, ReturnResultAdapterStrategyConfig. An adapter composition function object.

get_default_strategy_config() dataclass[source]#

Returns a default adapter module strategy.
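
In practice these adapter networks are rarely constructed by hand; they are registered onto an adapter-compatible model through the AdapterModuleMixin API using the matching *Config dataclasses. A sketch under those assumptions follows: the checkpoint name and dimensions are placeholders, and some models require module-scoped adapter names, so consult the NeMo adapters tutorial for the exact registration keys.

    from nemo.collections.asr.models import ASRModel
    from nemo.collections.asr.parts.submodules.adapters.multi_head_attention_adapter_module import (
        MultiHeadAttentionAdapterConfig,
    )

    model = ASRModel.from_pretrained("stt_en_conformer_ctc_small")  # placeholder checkpoint

    # n_feat must match the model's attention dimension (d_model).
    model.add_adapter(name="mha_adapter",
                      cfg=MultiHeadAttentionAdapterConfig(n_head=4, n_feat=176))
    model.set_enabled_adapters("mha_adapter", enabled=True)

    # Freeze the base model and train only the adapter weights.
    model.freeze()
    model.unfreeze_enabled_adapters()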

Adapter Strategies#

class nemo.collections.asr.parts.submodules.adapters.multi_head_attention_adapter_module.MHAResidualAddAdapterStrategy(stochastic_depth: float = 0.0, l2_lambda: float = 0.0)[source]#

Bases: ResidualAddAdapterStrategy

An implementation of residual addition of an adapter module with its input for the MHA Adapters.

forward(input: torch.Tensor, adapter: torch.nn.Module, *, module: AdapterModuleMixin)[source]#

A basic strategy, consisting of a residual connection over the input after the forward pass of the underlying adapter. Additional work is done to pack and unpack the dictionary of inputs and outputs.

Note: The value tensor is added to the output of the attention adapter as the residual connection.

Parameters
  • input – A dictionary of multiple input arguments for the adapter module: query, key, value (the original output tensor of the module, or the output of the previous adapter if more than one adapter is enabled), mask (the attention mask), and pos_emb (an optional positional embedding for relative encoding).

  • adapter – The adapter module that is currently required to perform the forward pass.

  • module – The calling module, in its entirety. It is a module that implements AdapterModuleMixin, therefore the strategy can access all other adapters in this module via module.adapter_layer.

Returns

The result tensor, after one of the active adapters has finished its forward pass.

compute_output(input: torch.Tensor, adapter: torch.nn.Module, *, module: AdapterModuleMixin) torch.Tensor[source]#

Compute the output of a single adapter to some input.

Parameters
  • input – Original output tensor of the module, or the output of the previous adapter (if more than one adapter is enabled).

  • adapter – The adapter module that is currently required to perform the forward pass.

  • module – The calling module, in its entirety. It is a module that implements AdapterModuleMixin, therefore the strategy can access all other adapters in this module via module.adapter_layer.

Returns

The result tensor, after one of the active adapters has finished its forward pass.
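
Schematically, the MHA variant differs from the plain ResidualAddAdapterStrategy only in which tensor carries the residual: the value tensor, as noted above. The following is an illustrative reduction of the forward logic, not the NeMo source; it omits the stochastic_depth and l2_lambda handling that the real strategy also performs.

    import torch

    def mha_residual_add(inputs: dict, adapter: torch.nn.Module) -> torch.Tensor:
        # Unpack the dictionary of MHA inputs and run the adapter's attention pass.
        adapter_out = adapter(
            query=inputs['query'],
            key=inputs['key'],
            value=inputs['value'],
            mask=inputs['mask'],
            pos_emb=inputs.get('pos_emb'),
        )
        # Residual connection: the value tensor is added to the adapter output.
        return inputs['value'] + adapter_out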