NeMo ASR collection API#

Model Classes#

Modules#

Parts#

class nemo.collections.asr.parts.submodules.jasper.JasperBlock(*args: Any, **kwargs: Any)[source]#

Bases: torch.nn.Module, nemo.core.classes.mixins.adapter_mixins.AdapterModuleMixin, nemo.core.classes.mixins.access_mixins.AccessMixin

Constructs a single “Jasper” block. With modified parameters, also constructs other blocks for models such as QuartzNet and Citrinet.

  • For Jasper : separable flag should be False

  • For QuartzNet : separable flag should be True

  • For Citrinet : separable flag and se flag should be True

Note that the above are general distinctions; each model has intricate differences that span multiple such blocks.

For further information about the differences between models which use JasperBlock, please review the configs for ASR models found in the ASR examples directory.

Parameters
  • inplanes – Number of input channels.

  • planes – Number of output channels.

  • repeat – Number of repeated sub-blocks (R) for this block.

  • kernel_size – Convolution kernel size across all repeated sub-blocks.

  • kernel_size_factor – Floating point scale value that is multiplied with kernel size, then rounded down to nearest odd integer to compose the kernel size. Defaults to 1.0.

  • stride – Stride of the convolutional layers.

  • dilation – Integer that defines the dilation factor of the kernel. Note that when dilation > 1, stride must be equal to 1.

  • padding – String representing type of padding. Currently only supports “same” padding, which symmetrically pads the input tensor with zeros.

  • dropout – Floating point value that determines the fraction of the output that is zeroed out.

  • activation – String representing activation functions. Valid activation functions are : {“hardtanh”: nn.Hardtanh, “relu”: nn.ReLU, “selu”: nn.SELU, “swish”: Swish}. Defaults to “relu”.

  • residual – Bool that determines whether a residual branch should be added. All residual branches are constructed using a pointwise convolution kernel that may or may not perform strided convolution, depending on the parameter residual_mode.

  • groups – Number of groups for Grouped Convolutions. Defaults to 1.

  • separable – Bool flag that describes whether Time-Channel depthwise separable convolution should be constructed, or ordinary convolution should be constructed.

  • heads – Number of “heads” for the masked convolution. Defaults to -1, which disables it.

  • normalization – String that represents type of normalization performed. Can be one of “batch”, “group”, “instance” or “layer” to compute BatchNorm1D, GroupNorm1D, InstanceNorm or LayerNorm (which are special cases of GroupNorm1D).

  • norm_groups – Number of groups used for GroupNorm (if normalization == “group”).

  • residual_mode – String argument which describes whether the residual branch should be simply added (“add”) or should first stride, then add (“stride_add”). Required when performing stride on parallel branch as well as utilizing residual add.

  • residual_panes – Number of residual panes, used for Jasper-DR models. Please refer to the paper.

  • conv_mask – Bool flag which determines whether to utilize masked convolutions or not. In general, it should be set to True.

  • se – Bool flag that determines whether Squeeze-and-Excitation layer should be used.

  • se_reduction_ratio – Integer value that determines to what extent the hidden dimension of the SE intermediate step should be reduced. Larger values reduce the number of parameters, but also limit the effectiveness of SE layers.

  • se_context_window – Integer value determining the number of timesteps that should be utilized in order to compute the averaged context window. Defaults to -1, which means it uses global context - such that all timesteps are averaged. If any positive integer is used, it will utilize limited context window of that size.

  • se_interpolation_mode – String used for interpolation mode of timestep dimension for SE blocks. Used only if context window is > 1. The modes available for resizing are: nearest, linear (3D-only), bilinear, area.

  • stride_last – Bool flag that determines whether all repeated blocks should stride at once, (stride of S^R when this flag is False) or just the last repeated block should stride (stride of S when this flag is True).

  • future_context

    Int value that determines how many “right” / “future” context frames will be utilized when calculating the output of the conv kernel. All calculations are done for odd kernel sizes only.

    By default, this is -1, which is recomputed as the symmetric padding case.

    When future_context >= 0, will compute the asymmetric padding as follows : (left context, right context) = [K - 1 - future_context, future_context]

    Determining an exact formula to limit future context is dependent on global layout of the model. As such, we provide both “local” and “global” guidelines below.

    Local context limit (should always be enforced):
    • future context should be <= half the kernel size for any given layer
    • future context > kernel size defaults to the symmetric kernel
    • future context of a layer = number of future frames * width of each frame (dependent on stride)

    Global context limit (should be carefully considered):
    • future context should be laid out in an ever-reducing pattern. Initial layers should restrict future context less than later layers, since shallow depth (and reduced stride) means each frame uses less future context.
    • Beyond a certain point, future context should remain static for a given stride level. This is the upper bound on the amount of future context that can be provided to the model on a global scale.
    • future context is calculated (roughly) as (2 ^ stride) * (K // 2) future frames. This resultant value should be bound by some global maximum amount of future audio (e.g. in ms).

    Note: In the special case where K < future_context, it is assumed that the kernel is too small to limit its future context, so symmetric padding is used instead.

    Note: There is no explicit limitation on the amount of future context used, as long as K > future_context constraint is maintained. This might lead to cases where future_context is more than half the actual kernel size K! In such cases, the conv layer is utilizing more of the future context than its current and past context to compute the output. While this is possible to do, it is not recommended and the layer will raise a warning to notify the user of such cases. It is advised to simply use symmetric padding for such cases.

    Example: Say we have a model that performs 8x stride and receives spectrogram frames with stride of 0.01s. Say we wish to upper bound future context to 80 ms.

    Layer ID | Kernel Size | Stride | Future Context | Global Context
    0        | K=5         | S=1    | FC=8           | GC = 2 * (2^0) frames = 2 * 0.01 s (special case: K < FC, so symmetric pad is used)
    1        | K=7         | S=1    | FC=3           | GC = 3 * (2^0) frames = 3 * 0.01 s (note that symmetric pad here uses 3 FC frames!)
    2        | K=11        | S=2    | FC=4           | GC = 4 * (2^1) frames = 8 * 0.01 s (note that symmetric pad here would use 5 FC frames!)
    3        | K=15        | S=1    | FC=4           | GC = 4 * (2^1) frames = 8 * 0.01 s (note that symmetric pad here would use 7 FC frames!)
    4        | K=21        | S=2    | FC=2           | GC = 2 * (2^2) frames = 8 * 0.01 s (note that symmetric pad here would use 10 FC frames!)
    5        | K=25        | S=2    | FC=1           | GC = 1 * (2^3) frames = 8 * 0.01 s (note that symmetric pad here would use 14 FC frames!)
    6        | K=29        | S=1    | FC=1           | GC = 1 * (2^3) frames = 8 * 0.01 s
    …

  • quantize – Bool flag whether to quantize the Convolutional blocks.
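The future_context padding rule can be sketched in plain Python. The helper below is illustrative only (it is not part of the NeMo API) and assumes an odd kernel size K, as the docstring requires:

```python
def asymmetric_padding(kernel_size: int, future_context: int):
    """Sketch of the future_context padding rule described above.

    Returns (left_pad, right_pad) for an odd kernel size K.
    Illustrative helper, not a NeMo function.
    """
    K = kernel_size
    # The default (-1) and the special case K < future_context both
    # fall back to symmetric "same" padding.
    if future_context < 0 or K < future_context:
        pad = (K - 1) // 2
        return pad, pad
    # Asymmetric case: (left, right) = (K - 1 - future_context, future_context)
    return K - 1 - future_context, future_context
```

For example, K=11 with future_context=4 yields (6, 4) frames of (left, right) padding, while K=5 with future_context=8 falls back to the symmetric (2, 2).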

forward(input_: Tuple[List[torch.Tensor], Optional[torch.Tensor]]) Tuple[List[torch.Tensor], Optional[torch.Tensor]][source]#

Forward pass of the module.

Parameters

input_ – The input is a tuple of two values - the preprocessed audio signal as well as the lengths of the audio signal. The audio signal is padded to the shape [B, D, T] and the lengths are a torch vector of length B.

Returns

The output of the block after processing the input through repeat number of sub-blocks, as well as the lengths of the encoded audio after padding/striding.

Mixins#

Datasets#

Character Encoding Datasets#

Subword Encoding Datasets#

Audio Preprocessors#

Audio Augmentors#

Miscellaneous Classes#

CTC Decoding#

class nemo.collections.asr.metrics.wer.CTCDecoding(decoding_cfg, vocabulary)[source]#

Bases: nemo.collections.asr.metrics.wer.AbstractCTCDecoding

Used for performing CTC auto-regressive / non-auto-regressive decoding of the logprobs for character-based models.

Parameters
  • decoding_cfg

    A dict-like object which contains the following key-value pairs.

    strategy: str value which represents the type of decoding that can occur.

    Possible values are: greedy (for greedy decoding), and beam (for DeepSpeed KenLM based decoding).

    compute_timestamps: A bool flag, which determines whether to compute the character/subword, or

    word based timestamp mapping the output log-probabilities to discrete intervals of timestamps. The timestamps will be available in the returned Hypothesis.timestep as a dictionary.

    ctc_timestamp_type: A str value, which represents the types of timestamps that should be calculated.

    Can take the following values - “char” for character/subword time stamps, “word” for word level time stamps and “all” (default), for both character level and word level time stamps.

    word_seperator: Str token representing the separator between words.

    preserve_alignments: Bool flag which preserves the history of logprobs generated during

    decoding (sample / batched). When set to true, the Hypothesis will contain a non-null value for logprobs. Here, logprobs is a torch.Tensor.

    confidence_cfg: A dict-like object which contains the following key-value pairs related to confidence

    scores. In order to obtain hypotheses with confidence scores, please utilize ctc_decoder_predictions_tensor function with the preserve_frame_confidence flag set to True.

    preserve_frame_confidence: Bool flag which preserves the history of per-frame confidence scores

    generated during decoding. When set to true, the Hypothesis will contain the non-null value for frame_confidence in it. Here, frame_confidence is a List of floats.

    preserve_token_confidence: Bool flag which preserves the history of per-token confidence scores

    generated during greedy decoding (sample / batched). When set to true, the Hypothesis will contain the non-null value for token_confidence in it. Here, token_confidence is a List of floats.

    The length of the list corresponds to the number of recognized tokens.

    preserve_word_confidence: Bool flag which preserves the history of per-word confidence scores

    generated during greedy decoding (sample / batched). When set to true, the Hypothesis will contain the non-null value for word_confidence in it. Here, word_confidence is a List of floats.

    The length of the list corresponds to the number of recognized words.

    exclude_blank: Bool flag indicating that blank token confidence scores are to be excluded

    from the token_confidence.

    aggregation: Which aggregation type to use for collapsing per-token confidence into per-word confidence.

    Valid options are mean, min, max, prod.

    method_cfg: A dict-like object which contains the method name and settings to compute per-frame

    confidence scores.

    name: The method name (str).
    Supported values:
    • ’max_prob’ for using the maximum token probability as a confidence.

    • ’entropy’ for using a normalized entropy of a log-likelihood vector.

    entropy_type: Which type of entropy to use (str).

    Used if confidence_method_cfg.name is set to entropy. Supported values:

    • ’gibbs’ for the (standard) Gibbs entropy. If the temperature α is provided,

      the formula is the following: H_α = -sum_i((p^α_i)*log(p^α_i)). Note that for this entropy, the temperature should comply the following inequality: 1/log(V) <= α <= -1/log(1-1/V) where V is the model vocabulary size.

    • ’tsallis’ for the Tsallis entropy with the Boltzmann constant one.

      Tsallis entropy formula is the following: H_α = 1/(α-1)*(1-sum_i(p^α_i)), where α is a parameter. When α == 1, it works like the Gibbs entropy. More: https://en.wikipedia.org/wiki/Tsallis_entropy

    • ’renui’ for the Rényi entropy.

      Rényi entropy formula is the following: H_α = 1/(1-α)*log_2(sum_i(p^α_i)), where α is a parameter. When α == 1, it works like the Gibbs entropy. More: https://en.wikipedia.org/wiki/R%C3%A9nyi_entropy

    temperature: Temperature scale for logsoftmax (α for entropies). Here we restrict it to be > 0.

    When the temperature equals one, scaling is not applied to ‘max_prob’, and any entropy type behaves like the Shannon entropy: H = -sum_i(p_i*log(p_i))

    entropy_norm: A mapping of the entropy value to the interval [0,1].
    Supported values:
    • ’lin’ for using the linear mapping.

    • ’exp’ for using exponential mapping with linear shift.

    batch_dim_index: Index of the batch dimension of targets and predictions parameters of

    ctc_decoder_predictions_tensor methods. Can be either 0 or 1.

    The config may further contain the following sub-dictionaries: “greedy”:

    preserve_alignments: Same as above, overrides above value. compute_timestamps: Same as above, overrides above value. preserve_frame_confidence: Same as above, overrides above value.

    ”beam”:
    beam_size: int, defining the beam size for beam search. Must be >= 1.

    If beam_size == 1, will perform cached greedy search. This might produce slightly different results compared to the greedy search above.

    return_best_hypothesis: optional bool, whether to return just the best hypothesis or all of the

    hypotheses after beam search has concluded. This flag is set by default.

    beam_alpha: float, the strength of the Language model on the final score of a token.

    final_score = acoustic_score + beam_alpha * lm_score + beam_beta * seq_length.

    beam_beta: float, the strength of the sequence length penalty on the final score of a token.

    final_score = acoustic_score + beam_alpha * lm_score + beam_beta * seq_length.

    kenlm_path: str, path to a KenLM ARPA or .binary file (depending on the strategy chosen).

    If the path is invalid (file is not found at path), will raise a deferred error at the moment of calculation of beam search, so that users may update / change the decoding strategy to point to the correct file.

  • blank_id – The id of the CTC blank token.
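The aggregation option above collapses per-token confidence scores into a per-word confidence. A minimal sketch of the four documented modes (illustrative only, not the NeMo implementation):

```python
from functools import reduce

# Illustrative sketch of the documented `aggregation` modes for collapsing
# per-token confidence scores into a per-word confidence score.
AGGREGATIONS = {
    "mean": lambda xs: sum(xs) / len(xs),
    "min": min,
    "max": max,
    "prod": lambda xs: reduce(lambda a, b: a * b, xs, 1.0),
}


def word_confidence(token_confidences, aggregation="min"):
    """Collapse one word's token confidences with the chosen aggregation."""
    return AGGREGATIONS[aggregation](token_confidences)
```

For a word whose tokens scored [0.5, 0.25], "mean" gives 0.375, "min" gives 0.25, and "prod" gives 0.125; "min" and "prod" are the more pessimistic choices.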

decode_ids_to_tokens(tokens: List[int]) List[str][source]#

Implemented by subclass in order to decode a token id list into a token list. A token list is the string representation of each token id.

Parameters

tokens – List of int representing the token ids.

Returns

A list of decoded tokens.

decode_tokens_to_str(tokens: List[int]) str[source]#

Implemented by subclass in order to decode a token list into a string.

Parameters

tokens – List of int representing the token ids.

Returns

A decoded string.

class nemo.collections.asr.metrics.wer_bpe.CTCBPEDecoding(decoding_cfg, tokenizer: nemo.collections.common.tokenizers.tokenizer_spec.TokenizerSpec)[source]#

Bases: nemo.collections.asr.metrics.wer.AbstractCTCDecoding

Used for performing CTC auto-regressive / non-auto-regressive decoding of the logprobs for subword-based models.

Parameters
  • decoding_cfg

    A dict-like object which contains the following key-value pairs.

    strategy: str value which represents the type of decoding that can occur.

    Possible values are: greedy (for greedy decoding), and beam (for DeepSpeed KenLM based decoding).

    compute_timestamps: A bool flag, which determines whether to compute the character/subword, or

    word based timestamp mapping the output log-probabilities to discrete intervals of timestamps. The timestamps will be available in the returned Hypothesis.timestep as a dictionary.

    ctc_timestamp_type: A str value, which represents the types of timestamps that should be calculated.

    Can take the following values - “char” for character/subword time stamps, “word” for word level time stamps and “all” (default), for both character level and word level time stamps.

    word_seperator: Str token representing the separator between words.

    preserve_alignments: Bool flag which preserves the history of logprobs generated during

    decoding (sample / batched). When set to true, the Hypothesis will contain a non-null value for logprobs. Here, logprobs is a torch.Tensor.

    confidence_cfg: A dict-like object which contains the following key-value pairs related to confidence

    scores. In order to obtain hypotheses with confidence scores, please utilize ctc_decoder_predictions_tensor function with the preserve_frame_confidence flag set to True.

    preserve_frame_confidence: Bool flag which preserves the history of per-frame confidence scores

    generated during decoding. When set to true, the Hypothesis will contain the non-null value for frame_confidence in it. Here, frame_confidence is a List of floats.

    preserve_token_confidence: Bool flag which preserves the history of per-token confidence scores

    generated during greedy decoding (sample / batched). When set to true, the Hypothesis will contain the non-null value for token_confidence in it. Here, token_confidence is a List of floats.

    The length of the list corresponds to the number of recognized tokens.

    preserve_word_confidence: Bool flag which preserves the history of per-word confidence scores

    generated during greedy decoding (sample / batched). When set to true, the Hypothesis will contain the non-null value for word_confidence in it. Here, word_confidence is a List of floats.

    The length of the list corresponds to the number of recognized words.

    exclude_blank: Bool flag indicating that blank token confidence scores are to be excluded

    from the token_confidence.

    aggregation: Which aggregation type to use for collapsing per-token confidence into per-word confidence.

    Valid options are mean, min, max, prod.

    method_cfg: A dict-like object which contains the method name and settings to compute per-frame

    confidence scores.

    name: The method name (str).
    Supported values:
    • ’max_prob’ for using the maximum token probability as a confidence.

    • ’entropy’ for using a normalized entropy of a log-likelihood vector.

    entropy_type: Which type of entropy to use (str).

    Used if confidence_method_cfg.name is set to entropy. Supported values:

    • ’gibbs’ for the (standard) Gibbs entropy. If the temperature α is provided,

      the formula is the following: H_α = -sum_i((p^α_i)*log(p^α_i)). Note that for this entropy, the temperature should comply the following inequality: 1/log(V) <= α <= -1/log(1-1/V) where V is the model vocabulary size.

    • ’tsallis’ for the Tsallis entropy with the Boltzmann constant one.

      Tsallis entropy formula is the following: H_α = 1/(α-1)*(1-sum_i(p^α_i)), where α is a parameter. When α == 1, it works like the Gibbs entropy. More: https://en.wikipedia.org/wiki/Tsallis_entropy

    • ’renui’ for the Rényi entropy.

      Rényi entropy formula is the following: H_α = 1/(1-α)*log_2(sum_i(p^α_i)), where α is a parameter. When α == 1, it works like the Gibbs entropy. More: https://en.wikipedia.org/wiki/R%C3%A9nyi_entropy

    temperature: Temperature scale for logsoftmax (α for entropies). Here we restrict it to be > 0.

    When the temperature equals one, scaling is not applied to ‘max_prob’, and any entropy type behaves like the Shannon entropy: H = -sum_i(p_i*log(p_i))

    entropy_norm: A mapping of the entropy value to the interval [0,1].
    Supported values:
    • ’lin’ for using the linear mapping.

    • ’exp’ for using exponential mapping with linear shift.

    batch_dim_index: Index of the batch dimension of targets and predictions parameters of

    ctc_decoder_predictions_tensor methods. Can be either 0 or 1.

    The config may further contain the following sub-dictionaries: “greedy”:

    preserve_alignments: Same as above, overrides above value. compute_timestamps: Same as above, overrides above value. preserve_frame_confidence: Same as above, overrides above value.

    ”beam”:
    beam_size: int, defining the beam size for beam search. Must be >= 1.

    If beam_size == 1, will perform cached greedy search. This might produce slightly different results compared to the greedy search above.

    return_best_hypothesis: optional bool, whether to return just the best hypothesis or all of the

    hypotheses after beam search has concluded. This flag is set by default.

    beam_alpha: float, the strength of the Language model on the final score of a token.

    final_score = acoustic_score + beam_alpha * lm_score + beam_beta * seq_length.

    beam_beta: float, the strength of the sequence length penalty on the final score of a token.

    final_score = acoustic_score + beam_alpha * lm_score + beam_beta * seq_length.

    kenlm_path: str, path to a KenLM ARPA or .binary file (depending on the strategy chosen).

    If the path is invalid (file is not found at path), will raise a deferred error at the moment of calculation of beam search, so that users may update / change the decoding strategy to point to the correct file.

  • tokenizer – NeMo tokenizer object, which inherits from TokenizerSpec.

decode_ids_to_tokens(tokens: List[int]) List[str][source]#

Implemented by subclass in order to decode a token id list into a token list. A token list is the string representation of each token id.

Parameters

tokens – List of int representing the token ids.

Returns

A list of decoded tokens.

decode_tokens_to_str(tokens: List[int]) str[source]#

Implemented by subclass in order to decode a token list into a string.

Parameters

tokens – List of int representing the token ids.

Returns

A decoded string.

class nemo.collections.asr.parts.submodules.ctc_greedy_decoding.GreedyCTCInfer(blank_id: int, preserve_alignments: bool = False, compute_timestamps: bool = False, preserve_frame_confidence: bool = False, confidence_method_cfg: Optional[omegaconf.DictConfig] = None)[source]#

Bases: nemo.core.classes.common.Typing, nemo.collections.asr.parts.utils.asr_confidence_utils.ConfidenceMeasureMixin

A greedy CTC decoder.

Provides a common abstraction for sample level and batch level greedy decoding.

Parameters
  • blank_index – int index of the blank token. Can be 0 or len(vocabulary).

  • preserve_alignments – Bool flag which preserves the history of logprobs generated during decoding (sample / batched). When set to true, the Hypothesis will contain a non-null value for logprobs. Here, logprobs is a torch.Tensor.

  • compute_timestamps – A bool flag, which determines whether to compute the character/subword, or word based timestamp mapping the output log-probabilities to discrete intervals of timestamps. The timestamps will be available in the returned Hypothesis.timestep as a dictionary.

  • preserve_frame_confidence – Bool flag which preserves the history of per-frame confidence scores generated during decoding. When set to true, the Hypothesis will contain the non-null value for frame_confidence in it. Here, frame_confidence is a List of floats.

  • confidence_method_cfg

    A dict-like object which contains the method name and settings to compute per-frame confidence scores.

    name: The method name (str).
    Supported values:
    • ’max_prob’ for using the maximum token probability as a confidence.

    • ’entropy’ for using a normalized entropy of a log-likelihood vector.

    entropy_type: Which type of entropy to use (str). Used if confidence_method_cfg.name is set to entropy.
    Supported values:
    • ’gibbs’ for the (standard) Gibbs entropy. If the temperature α is provided,

      the formula is the following: H_α = -sum_i((p^α_i)*log(p^α_i)). Note that for this entropy, the temperature should comply the following inequality: 1/log(V) <= α <= -1/log(1-1/V) where V is the model vocabulary size.

    • ’tsallis’ for the Tsallis entropy with the Boltzmann constant one.

      Tsallis entropy formula is the following: H_α = 1/(α-1)*(1-sum_i(p^α_i)), where α is a parameter. When α == 1, it works like the Gibbs entropy. More: https://en.wikipedia.org/wiki/Tsallis_entropy

    • ’renui’ for the Rényi entropy.

      Rényi entropy formula is the following: H_α = 1/(1-α)*log_2(sum_i(p^α_i)), where α is a parameter. When α == 1, it works like the Gibbs entropy. More: https://en.wikipedia.org/wiki/R%C3%A9nyi_entropy

    temperature: Temperature scale for logsoftmax (α for entropies). Here we restrict it to be > 0.

    When the temperature equals one, scaling is not applied to ‘max_prob’, and any entropy type behaves like the Shannon entropy: H = -sum_i(p_i*log(p_i))

    entropy_norm: A mapping of the entropy value to the interval [0,1].
    Supported values:
    • ’lin’ for using the linear mapping.

    • ’exp’ for using exponential mapping with linear shift.
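As a rough sketch of the ‘entropy’ method with the Gibbs entropy at temperature 1 and the ‘lin’ normalization, confidence can be taken as 1 - H / H_max with H_max = log(V). That exact mapping is an assumption for illustration; the actual NeMo implementation operates on batched log-probability tensors:

```python
import math


def entropy_confidence(probs):
    """Sketch: Gibbs entropy confidence at temperature 1 with a linear
    mapping to [0, 1]. `probs` is assumed to be a normalized probability
    vector over the vocabulary. Illustrative only, not the NeMo code.
    """
    V = len(probs)
    # Shannon/Gibbs entropy: H = -sum_i(p_i * log(p_i))
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    h_max = math.log(V)  # entropy of the uniform distribution
    return 1.0 - h / h_max
```

Under this mapping a one-hot distribution gives confidence 1.0 and a uniform distribution gives 0.0, matching the intent of normalizing the entropy to [0, 1].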

forward(decoder_output: torch.Tensor, decoder_lengths: torch.Tensor)[source]#

Returns a list of hypotheses given an input batch of the encoder hidden embedding. Output tokens are generated auto-regressively.

Parameters
  • decoder_output – A tensor of size (batch, timesteps, features) or (batch, timesteps) (each timestep is a label).

  • decoder_lengths – list of int representing the length of each output sequence.

Returns

packed list containing batch number of sentences (Hypotheses).

property input_types#

Returns definitions of module input ports.

property output_types#

Returns definitions of module output ports.
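The decoding rule GreedyCTCInfer applies can be sketched in a few lines of plain Python: take the per-frame argmax labels, merge consecutive repeats, then drop the blank token (illustrative only, not the NeMo implementation):

```python
def greedy_ctc_collapse(frame_labels, blank_id):
    """Collapse per-frame argmax labels with the standard CTC rule:
    merge consecutive repeated labels, then remove blanks.
    Illustrative sketch, not a NeMo function.
    """
    out, prev = [], None
    for label in frame_labels:
        # Keep a label only when it differs from the previous frame
        # and is not the blank token.
        if label != prev and label != blank_id:
            out.append(label)
        prev = label
    return out
```

For example, the frame labels [1, 1, 0, 1, 2, 2, 0] with blank_id=0 collapse to [1, 1, 2]: the blank between the two 1s keeps them as distinct tokens.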

class nemo.collections.asr.parts.submodules.ctc_beam_decoding.BeamCTCInfer(blank_id: int, beam_size: int, search_type: str = 'default', return_best_hypothesis: bool = True, preserve_alignments: bool = False, compute_timestamps: bool = False, beam_alpha: float = 1.0, beam_beta: float = 0.0, kenlm_path: Optional[str] = None, flashlight_cfg: Optional[nemo.collections.asr.parts.submodules.ctc_beam_decoding.FlashlightConfig] = None)[source]#

Bases: nemo.collections.asr.parts.submodules.ctc_beam_decoding.AbstractBeamCTCInfer

A beam search CTC decoder.

Provides a common abstraction for sample level and batch level beam search decoding.

Parameters
  • blank_index – int index of the blank token. Can be 0 or len(vocabulary).

  • preserve_alignments – Bool flag which preserves the history of logprobs generated during decoding (sample / batched). When set to true, the Hypothesis will contain a non-null value for logprobs. Here, logprobs is a torch.Tensor.

  • compute_timestamps – A bool flag, which determines whether to compute the character/subword, or word based timestamp mapping the output log-probabilities to discrete intervals of timestamps. The timestamps will be available in the returned Hypothesis.timestep as a dictionary.


forward(decoder_output: torch.Tensor, decoder_lengths: torch.Tensor) Tuple[List[Union[nemo.collections.asr.parts.utils.rnnt_utils.Hypothesis, nemo.collections.asr.parts.utils.rnnt_utils.NBestHypotheses]]][source]#

Returns a list of hypotheses given an input batch of the encoder hidden embedding. Output tokens are generated auto-regressively.

Parameters
  • decoder_output – A tensor of size (batch, timesteps, features).

  • decoder_lengths – list of int representing the length of each output sequence.

Returns

packed list containing batch number of sentences (Hypotheses).

set_decoding_type(decoding_type: str)[source]#

Sets the decoding type of the framework. Can support either char or subword models.

Parameters

decoding_type – Str corresponding to decoding type. Only supports “char” and “subword”.
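BeamCTCInfer's beam_alpha and beam_beta enter the hypothesis score through the linear combination documented for the decoding configs above. A trivial illustrative helper (not a NeMo API):

```python
def beam_final_score(acoustic_score, lm_score, seq_length,
                     beam_alpha=1.0, beam_beta=0.0):
    """The documented scoring rule:
    final_score = acoustic_score + beam_alpha * lm_score + beam_beta * seq_length
    Illustrative helper mirroring the formula; not a NeMo function.
    """
    return acoustic_score + beam_alpha * lm_score + beam_beta * seq_length
```

Raising beam_alpha weights the language model more heavily; a positive beam_beta counteracts the LM's bias toward short sequences by rewarding length.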

RNNT Decoding#

Hypotheses#

class nemo.collections.asr.parts.utils.rnnt_utils.Hypothesis(score: float, y_sequence: typing.Union[typing.List[int], torch.Tensor], text: typing.Optional[str] = None, dec_out: typing.Optional[typing.List[torch.Tensor]] = None, dec_state: typing.Optional[typing.Union[typing.List[typing.List[torch.Tensor]], typing.List[torch.Tensor]]] = None, timestep: typing.Union[typing.List[int], torch.Tensor] = <factory>, alignments: typing.Optional[typing.Union[typing.List[int], typing.List[typing.List[int]]]] = None, frame_confidence: typing.Optional[typing.Union[typing.List[float], typing.List[typing.List[float]]]] = None, token_confidence: typing.Optional[typing.List[float]] = None, word_confidence: typing.Optional[typing.List[float]] = None, length: typing.Union[int, torch.Tensor] = 0, y: typing.Optional[typing.List[torch.tensor]] = None, lm_state: typing.Optional[typing.Union[typing.Dict[str, typing.Any], typing.List[typing.Any]]] = None, lm_scores: typing.Optional[torch.Tensor] = None, tokens: typing.Optional[typing.Union[typing.List[int], torch.Tensor]] = None, last_token: typing.Optional[torch.Tensor] = None)[source]#

Bases: object

Hypothesis class for beam search algorithms.

score: A float score obtained from an AbstractRNNTDecoder module’s score_hypothesis method.

y_sequence: Either a sequence of integer ids pointing to some vocabulary, or a packed torch.Tensor

behaving in the same manner. dtype must be torch.Long in the latter case.

dec_state: A list (or list of list) of LSTM-RNN decoder states. Can be None.

text: (Optional) A decoded string after processing via CTC / RNN-T decoding (removing the CTC/RNNT

blank tokens, and optionally merging word-pieces). Should be used as decoded string for Word Error Rate calculation.

timestep: (Optional) A list of integer indices representing at which index in the decoding

process did the token appear. Should be of same length as the number of non-blank tokens.

alignments: (Optional) Represents the CTC / RNNT token alignments as integer tokens along an axis of

time T (for CTC) or Time x Target (TxU). For CTC, represented as a single list of integer indices. For RNNT, represented as a dangling list of list of integer indices. Outer list represents Time dimension (T), inner list represents Target dimension (U). The set of valid indices includes the CTC / RNNT blank token in order to represent alignments.

frame_confidence: (Optional) Represents the CTC / RNNT per-frame confidence scores as token probabilities

along an axis of time T (for CTC) or Time x Target (TxU). For CTC, represented as a single list of float indices. For RNNT, represented as a dangling list of list of float indices. Outer list represents Time dimension (T), inner list represents Target dimension (U).

token_confidence: (Optional) Represents the CTC / RNNT per-token confidence scores as token probabilities

along an axis of Target U. Represented as a single list of float indices.

word_confidence: (Optional) Represents the CTC / RNNT per-word confidence scores as token probabilities

along an axis of Target U. Represented as a single list of float indices.

length: Represents the length of the sequence (the original length without padding), otherwise

defaults to 0.

y: (Unused) A list of torch.Tensors representing the list of hypotheses.

lm_state: (Unused) A dictionary state cache used by an external Language Model.

lm_scores: (Unused) Score of the external Language Model.

tokens: (Optional) A list of decoded tokens (can be characters or word-pieces).

last_token (Optional): A token or batch of tokens which was predicted in the last step.

class nemo.collections.asr.parts.utils.rnnt_utils.NBestHypotheses(n_best_hypotheses: Optional[List[nemo.collections.asr.parts.utils.rnnt_utils.Hypothesis]])[source]#

Bases: object

List of N best hypotheses

Adapter Networks#

class nemo.collections.asr.parts.submodules.adapters.multi_head_attention_adapter_module.MultiHeadAttentionAdapter(*args: Any, **kwargs: Any)[source]#

Bases: nemo.collections.asr.parts.submodules.multi_head_attention.MultiHeadAttention, nemo.collections.common.parts.adapter_modules.AdapterModuleUtil

Multi-Head Attention layer of Transformer. :param n_head: number of heads :type n_head: int :param n_feat: size of the features :type n_feat: int :param dropout_rate: dropout rate :type dropout_rate: float :param proj_dim: Optional integer value for projection before computing attention.

If None, no projection is applied (equivalent to proj_dim = n_feat). If > 0, n_feat is projected to proj_dim before computing attention. If < 0, proj_dim is set to n_head, so that each head has a projected dimension of 1.

forward(query, key, value, mask, pos_emb=None, cache=None, cache_next=None)[source]#

Compute ‘Scaled Dot Product Attention’. :param query: (batch, time1, size) :type query: torch.Tensor :param key: (batch, time2, size) :type key: torch.Tensor :param value: (batch, time2, size) :type value: torch.Tensor :param mask: (batch, time1, time2) :type mask: torch.Tensor :param cache: (cache_nums, batch, time_cache, size) :type cache: torch.Tensor :param cache_next: (cache_nums, batch, time_cache_next, size) :type cache_next: torch.Tensor

Returns

transformed value (batch, time1, d_model) weighted by the query dot key attention

Return type

output (torch.Tensor)

reset_parameters()[source]#
get_default_strategy_config() dataclasses.dataclass[source]#

Returns a default adapter module strategy.
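The proj_dim rules documented above can be sketched as a small helper; the resolve_proj_dim function below is hypothetical and not part of NeMo, it only mirrors the docstring's three cases:

```python
from typing import Optional

def resolve_proj_dim(proj_dim: Optional[int], n_feat: int, n_head: int) -> int:
    """Resolve the effective projection dimension per the documented rules.

    Hypothetical helper mirroring the MultiHeadAttentionAdapter docstring:
    - None: no projection, equivalent to proj_dim = n_feat
    - > 0:  project n_feat to proj_dim before computing attention
    - < 0:  proj_dim equals n_head, i.e. a projected dimension of 1 per head
    """
    if proj_dim is None:
        return n_feat
    if proj_dim > 0:
        return proj_dim
    return n_head
```

For example, with n_feat=256 and n_head=4, passing proj_dim=None keeps the full 256 features, while any negative proj_dim collapses the projection to 4 dimensions (one per head).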


class nemo.collections.asr.parts.submodules.adapters.multi_head_attention_adapter_module.RelPositionMultiHeadAttentionAdapter(*args: Any, **kwargs: Any)[source]#

Bases: nemo.collections.asr.parts.submodules.multi_head_attention.RelPositionMultiHeadAttention, nemo.collections.common.parts.adapter_modules.AdapterModuleUtil

Multi-Head Attention layer of Transformer-XL with support of relative positional encoding. Paper: https://arxiv.org/abs/1901.02860 :param n_head: number of heads :type n_head: int :param n_feat: size of the features :type n_feat: int :param dropout_rate: dropout rate :type dropout_rate: float :param proj_dim: Optional integer value for projection before computing attention.

If None, no projection is applied (equivalent to proj_dim = n_feat). If > 0, n_feat is projected to proj_dim before computing attention. If < 0, proj_dim is set to n_head, so that each head has a projected dimension of 1.

Parameters

adapter_strategy – By default, MHAResidualAddAdapterStrategyConfig. An adapter composition function object.

forward(query, key, value, mask, pos_emb, cache=None, cache_next=None)[source]#

Compute ‘Scaled Dot Product Attention’ with rel. positional encoding. :param query: (batch, time1, size) :type query: torch.Tensor :param key: (batch, time2, size) :type key: torch.Tensor :param value: (batch, time2, size) :type value: torch.Tensor :param mask: (batch, time1, time2) :type mask: torch.Tensor :param pos_emb: (batch, time1, size) :type pos_emb: torch.Tensor :param cache: (cache_nums, batch, time_cache, size) :type cache: torch.Tensor :param cache_next: (cache_nums, batch, time_cache_next, size) :type cache_next: torch.Tensor

Returns

transformed value (batch, time1, d_model) weighted by the query dot key attention

Return type

output (torch.Tensor)

reset_parameters()[source]#
get_default_strategy_config() dataclasses.dataclass[source]#

Returns a default adapter module strategy.


class nemo.collections.asr.parts.submodules.adapters.multi_head_attention_adapter_module.PositionalEncodingAdapter(*args: Any, **kwargs: Any)[source]#

Bases: nemo.collections.asr.parts.submodules.multi_head_attention.PositionalEncoding, nemo.collections.common.parts.adapter_modules.AdapterModuleUtil

Absolute positional embedding adapter.

Note

The absolute positional embedding value is added to the input tensor without a residual connection, so the input itself is changed. If you only require the positional embedding, drop the returned x.

Parameters
  • d_model (int) – The input dimension of x.

  • max_len (int) – The max sequence length.

  • xscale (float) – The input scaling factor. Defaults to 1.0.

  • adapter_strategy (AbstractAdapterStrategy) – By default, ReturnResultAdapterStrategyConfig. An adapter composition function object. NOTE: Since this is a positional encoding, it will not add a residual !

get_default_strategy_config() dataclasses.dataclass[source]#

Returns a default adapter module strategy.
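As a minimal sketch of the encoding this adapter adds, the following computes a standard sinusoidal absolute positional encoding table (assuming the sin/cos formulation of Vaswani et al.; the actual NeMo module may differ in details such as the xscale input scaling):

```python
import math

def sinusoidal_pe(max_len: int, d_model: int) -> list:
    """Standard sinusoidal absolute positional encoding (sketch only).

    Returns a max_len x d_model table of floats; even feature indices
    use sin, odd indices use cos. The adapter then adds such a table
    to the (xscale-scaled) input tensor, with no residual branch.
    """
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe
```

At position 0 every sin entry is 0 and every cos entry is 1, which is a quick sanity check on any such table.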


class nemo.collections.asr.parts.submodules.adapters.multi_head_attention_adapter_module.RelPositionalEncodingAdapter(*args: Any, **kwargs: Any)[source]#

Bases: nemo.collections.asr.parts.submodules.multi_head_attention.RelPositionalEncoding, nemo.collections.common.parts.adapter_modules.AdapterModuleUtil

Relative positional encoding for TransformerXL’s layers See : Appendix B in https://arxiv.org/abs/1901.02860

Note

The relative positional embedding value is not added to the input tensor, so the input remains unchanged. If you only require the positional embedding, drop the returned x.

Parameters
  • d_model (int) – embedding dim

  • max_len (int) – maximum input length

  • xscale (bool) – whether to scale the input by sqrt(d_model)

  • adapter_strategy – By default, ReturnResultAdapterStrategyConfig. An adapter composition function object.

get_default_strategy_config() dataclasses.dataclass[source]#

Returns a default adapter module strategy.
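Assuming the Transformer-XL formulation cited above (Appendix B of arXiv:1901.02860), relative encodings cover signed offsets rather than absolute positions. A small illustrative helper (not part of NeMo) shows the offset range involved:

```python
def rel_position_range(seq_len: int) -> list:
    """Relative position offsets for Transformer-XL style encoding.

    For a sequence of length L, relative distances run from L-1 down
    to -(L-1), giving 2L-1 distinct offsets. Sketch only; the actual
    module precomputes sinusoidal embeddings for these offsets.
    """
    return list(range(seq_len - 1, -seq_len, -1))
```

For a length-3 sequence this yields offsets [2, 1, 0, -1, -2], i.e. 2L-1 = 5 entries.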

Adapter Strategies#

class nemo.collections.asr.parts.submodules.adapters.multi_head_attention_adapter_module.MHAResidualAddAdapterStrategy(stochastic_depth: float = 0.0, l2_lambda: float = 0.0)[source]#

Bases: nemo.core.classes.mixins.adapter_mixin_strategies.ResidualAddAdapterStrategy

An implementation of residual addition of an adapter module with its input for the MHA Adapters.

forward(input: torch.Tensor, adapter: torch.nn.Module, *, module: AdapterModuleMixin)[source]#

A basic strategy, comprising a residual connection over the input after the forward pass of the underlying adapter. Additional work is done to pack and unpack the dictionary of inputs and outputs.

Note: The value tensor is added to the output of the attention adapter as the residual connection.

Parameters
  • input – A dictionary of multiple input arguments for the adapter module:
    query, key, value: original output tensor of the module, or the output of the previous adapter (if more than one adapter is enabled).
    mask: attention mask.
    pos_emb: optional positional embedding for relative encoding.

  • adapter – The adapter module that is currently required to perform the forward pass.

  • module – The calling module, in its entirety. It is a module that implements AdapterModuleMixin, therefore the strategy can access all other adapters in this module via module.adapter_layer.

Returns

The result tensor, after one of the active adapters has finished its forward pass.

compute_output(input: torch.Tensor, adapter: torch.nn.Module, *, module: AdapterModuleMixin) torch.Tensor[source]#

Compute the output of a single adapter for a given input.

Parameters
  • input – Original output tensor of the module, or the output of the previous adapter (if more than one adapter is enabled).

  • adapter – The adapter module that is currently required to perform the forward pass.

  • module – The calling module, in its entirety. It is a module that implements AdapterModuleMixin, therefore the strategy can access all other adapters in this module via module.adapter_layer.

Returns

The result tensor, after one of the active adapters has finished its forward pass.
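The note above states that the value tensor is added to the adapter output as the residual. A minimal sketch of that composition (plain Python lists in place of torch tensors, and a hypothetical zero adapter standing in for a real adapter module):

```python
def mha_residual_add(inputs: dict, adapter) -> list:
    """Sketch of the documented MHA residual-add strategy.

    'inputs' packs query/key/value/mask as the strategy describes;
    'adapter' is any callable taking that dict. The original 'value'
    tensor is added back onto the adapter output as the residual.
    """
    adapter_out = adapter(inputs)
    value = inputs["value"]
    # elementwise residual addition of value onto the adapter output
    return [v + a for v, a in zip(value, adapter_out)]

# Hypothetical adapter for illustration: returns zeros, so the
# strategy output reduces to the residual 'value' alone.
def zero_adapter(inp):
    return [0.0] * len(inp["value"])
```

With the zero adapter, the strategy simply returns the value tensor unchanged, which is exactly the identity behavior a freshly initialized adapter is expected to approximate.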