NeMo ASR API#

Model Classes#

Modules#

class nemo.collections.asr.modules.ConvASREncoder(*args: Any, **kwargs: Any)#

Bases: NeuralModule, Exportable, AccessMixin

Convolutional encoder for ASR models. With this class you can implement JasperNet and QuartzNet models.

Based on these papers: https://arxiv.org/pdf/1904.03288.pdf https://arxiv.org/pdf/1910.10261.pdf

input_example(max_batch=1, max_dim=8192)#: Generates input examples for tracing etc. Returns a tuple of input examples.

property input_types#: Returns definitions of module input ports.

property output_types#: Returns definitions of module output ports.

update_max_sequence_length(seq_length: int, device)#: Finds the global maximum audio length across all nodes in distributed training and updates max_audio_length.

class nemo.collections.asr.modules.ConvASRDecoder(*args: Any, **kwargs: Any)#

Bases: NeuralModule, Exportable, AdapterModuleMixin

Simple ASR Decoder for use with CTC-based models such as JasperNet and QuartzNet

Based on these papers: https://arxiv.org/pdf/1904.03288.pdf https://arxiv.org/pdf/1910.10261.pdf https://arxiv.org/pdf/2005.04290.pdf

add_adapter(name: str, cfg: omegaconf.DictConfig)#

Add an Adapter module to this module.

Parameters:

name – A globally unique name for the adapter. Will be used to access, enable and disable adapters.
cfg – A DictConfig or Dataclass that contains at the bare minimum __target__ to instantiate a new Adapter module.

input_example(max_batch=1, max_dim=256)#: Generates input examples for tracing etc. Returns a tuple of input examples.

property input_types#: Define these to enable input neural type checks

property output_types#: Define these to enable output neural type checks

class nemo.collections.asr.modules.ConvASRDecoderClassification(*args: Any, **kwargs: Any)#

Bases: NeuralModule, Exportable

Simple ASR Decoder for use with classification models such as JasperNet and QuartzNet

Based on this paper: https://arxiv.org/pdf/2005.04290.pdf

input_example(max_batch=1, max_dim=256)#: Generates input examples for tracing etc. Returns a tuple of input examples.

property input_types#: Define these to enable input neural type checks

property output_types#: Define these to enable output neural type checks

class nemo.collections.asr.modules.SpeakerDecoder(*args: Any, **kwargs: Any)#

Bases: NeuralModule, Exportable

Speaker Decoder creates the final neural layers that map the outputs of the Jasper Encoder to the embedding layer, followed by a speaker-based softmax loss.

Parameters:

feat_in (int) – Number of channels being input to this module
num_classes (int) – Number of unique speakers in dataset
emb_sizes (list) – shapes of intermediate embedding layers (speaker embeddings are taken from the first of these layers). Defaults to [1024,1024]
pool_mode (str) – Pooling strategy type. Options are ‘xvector’ (mean and variance), ‘tap’ (temporal average pooling: just mean), and ‘attention’ (attention based pooling). Defaults to ‘xvector’.
init_mode (str) – Describes how neural network parameters are initialized. Options are [‘xavier_uniform’, ‘xavier_normal’, ‘kaiming_uniform’,’kaiming_normal’]. Defaults to “xavier_uniform”.
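
The difference between the ‘tap’ and ‘xvector’ pooling modes can be illustrated with a small pure-Python sketch. This is not the NeMo implementation: the function names and the [time][channel] input layout are hypothetical, and the use of population variance is an assumption (the library may use a different estimator or a standard deviation).

```python
from statistics import fmean, pvariance


def temporal_average_pool(frames):
    """'tap' style pooling: per-channel mean over time.

    frames is a nested list laid out as [time][channel] (hypothetical layout).
    """
    channels = list(zip(*frames))  # transpose to [channel][time]
    return [fmean(ch) for ch in channels]


def xvector_pool(frames):
    """'xvector' style pooling: per-channel means followed by per-channel variances."""
    channels = list(zip(*frames))
    means = [fmean(ch) for ch in channels]
    variances = [pvariance(ch) for ch in channels]  # assumption: population variance
    return means + variances


# Three time steps, two channels.
frames = [[1.0, 10.0], [3.0, 10.0], [5.0, 10.0]]
tap = temporal_average_pool(frames)
xvec = xvector_pool(frames)
```

Note that the statistics-based pooling doubles the feature dimension (mean plus variance per channel), while temporal average pooling keeps it unchanged.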

input_example(max_batch=1, max_dim=256)#: Generates input examples for tracing etc. Returns a tuple of input examples.

property input_types#: Define these to enable input neural type checks

property output_types#: Define these to enable output neural type checks

class nemo.collections.asr.modules.ConformerEncoder(*args: Any, **kwargs: Any)#

Bases: NeuralModule, StreamingEncoder, Exportable, AccessMixin

The encoder for ASR model of Conformer. Based on this paper: ‘Conformer: Convolution-augmented Transformer for Speech Recognition’ by Anmol Gulati et al. https://arxiv.org/abs/2005.08100

Parameters:

feat_in (int) – the size of feature channels
n_layers (int) – number of layers of ConformerBlock
d_model (int) – the hidden size of the model
feat_out (int) – the size of the output features Defaults to -1 (means feat_out is d_model)
subsampling (str) – the method of subsampling: choices = [‘vggnet’, ‘striding’, ‘dw-striding’, ‘stacking’, ‘stacking_norm’] Defaults to striding.
subsampling_factor (int) – the subsampling factor which should be power of 2 Defaults to 4.
subsampling_conv_chunking_factor (int) – optionally, force chunk inputs (helpful for large inputs) Should be power of 2, 1 (auto-chunking, default), or -1 (no chunking)
subsampling_conv_channels (int) – the size of the convolutions in the subsampling module Defaults to -1 which would set it to d_model.
reduction (str, Optional) – the method of reduction, choices=[‘pooling’, ‘striding’]. If no value is passed, then no reduction is performed and the model runs with the original 4x subsampling.
reduction_position (int, Optional) – the index of the layer to apply reduction. If -1, apply reduction at the end.
reduction_factor (int) – the reduction factor which should be either 1 or a power of 2 Defaults to 1.
ff_expansion_factor (int) – the expansion factor in feed forward layers Defaults to 4.
self_attention_model (str) –
the type of the attention layer and positional encoding.

’rel_pos’:
relative positional embedding and Transformer-XL

’rel_pos_local_attn’:
relative positional embedding and Transformer-XL with local attention using overlapping chunks. Attention context is determined by att_context_size parameter.

’abs_pos’:
absolute positional embedding and Transformer

Default is rel_pos.
pos_emb_max_len (int) – the maximum length of positional embeddings Defaults to 5000
n_heads (int) – number of heads in multi-headed attention layers Defaults to 4.
att_context_size (List[Union[List[int],int]]) – specifies the context sizes on each side. Each context size should be a list of two integers like [100, 100]. A list of context sizes like [[100,100], [100,50]] can also be passed. -1 means unlimited context. Defaults to [-1, -1]
att_context_probs (List[float]) – a list of probabilities of each one of the att_context_size when a list of them is passed. If not specified, uniform distribution is being used. Defaults to None
att_context_style (str) – ‘regular’ or ‘chunked_limited’. Defaults to ‘regular’
xscaling (bool) – enables scaling the inputs to the multi-headed attention layers by sqrt(d_model). Defaults to True.
untie_biases (bool) – whether to not share (untie) the bias weights between layers of Transformer-XL Defaults to True.
conv_kernel_size (int) – the size of the convolutions in the convolutional modules Defaults to 31.
conv_norm_type (str) – the type of the normalization in the convolutional modules Defaults to ‘batch_norm’.
conv_context_size (list) – it can be “causal” or a list of two integers such that conv_context_size[0]+conv_context_size[1]+1==conv_kernel_size. None means [(conv_kernel_size-1)//2, (conv_kernel_size-1)//2], and ‘causal’ means [(conv_kernel_size-1), 0]. Defaults to None.
conv_dual_mode (bool) – specifies if convolution should be dual mode when dual_offline mode is being used. When enabled, the left half of the convolution kernel would get masked in streaming cases. Defaults to False.
use_bias (bool) – Use bias in all Linear and Conv1d layers from each ConformerLayer to improve activation flow and stabilize training of huge models. Defaults to True.
dropout (float) – the dropout rate used in all layers except the attention layers Defaults to 0.1.
dropout_pre_encoder (float) – the dropout rate used before the encoder Defaults to 0.1.
dropout_emb (float) – the dropout rate used for the positional embeddings Defaults to 0.1.
dropout_att (float) – the dropout rate used for the attention layer Defaults to 0.0.
stochastic_depth_drop_prob (float) – if non-zero, will randomly drop layers during training. The higher this value, the more often layers are dropped. Defaults to 0.0.
stochastic_depth_mode (str) – can be either “linear” or “uniform”. If set to “uniform”, all layers have the same probability of drop. If set to “linear”, the drop probability grows linearly from 0 for the first layer to the desired value for the final layer. Defaults to “linear”.
stochastic_depth_start_layer (int) – starting layer for stochastic depth. All layers before this will never be dropped. Note that drop probability will be adjusted accordingly if mode is “linear” when start layer is > 1. Defaults to 1.
global_tokens (int) – number of tokens to be used for global attention. Only relevant if self_attention_model is ‘rel_pos_local_attn’. Defaults to 0.
global_tokens_spacing (int) – how far apart the global tokens are Defaults to 1.
global_attn_separate (bool) – whether the q, k, v layers used for global tokens should be separate. Defaults to False.
use_pytorch_sdpa (bool) – use torch sdpa instead of manual attention. Defaults to False.
use_pytorch_sdpa_backends (list[str]) – list of backend names to use in sdpa. None or empty list means all backends. e.g. [“MATH”] Defaults to None.
bypass_pre_encode (bool) – if True, skip the pre-encoder module; audio_signal must then contain pre-encoded embeddings instead of audio features. Defaults to False. bypass_pre_encode=True is used where frame-level, context-independent embeddings need to be saved or reused (e.g., speaker cache in streaming speaker diarization).
sync_max_audio_length (bool) – when true, performs NCCL all_reduce to allocate the same amount of memory for positional encoding buffers on all GPUs. Disabling this setting may help with deadlocks in certain scenarios such as model parallelism, or generally when this module is not being run on some GPUs as a part of the training step.
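
Two of the rules described in the parameter list, how conv_context_size resolves to a [left, right] pair and how stochastic-depth drop probabilities are spread over layers, can be sketched in plain Python. The function names are hypothetical, and the exact linear schedule (in particular how it is rescaled when stochastic_depth_start_layer is greater than 1) is an assumption based on the description above, not a copy of the library code.

```python
def resolve_conv_context(conv_kernel_size, conv_context_size=None):
    """Return the [left, right] context for a convolution kernel."""
    if conv_context_size is None:
        half = (conv_kernel_size - 1) // 2
        return [half, half]  # symmetric context
    if conv_context_size == "causal":
        return [conv_kernel_size - 1, 0]  # only past frames are visible
    left, right = conv_context_size
    if left + right + 1 != conv_kernel_size:
        raise ValueError("left + right + 1 must equal conv_kernel_size")
    return [left, right]


def layer_drop_probs(n_layers, drop_prob, mode="linear", start_layer=1):
    """Per-layer drop probabilities; layers before start_layer are never dropped."""
    probs = [0.0] * min(start_layer, n_layers)
    remaining = n_layers - len(probs)
    if mode == "linear":
        # Assumed schedule: grow linearly up to drop_prob at the final layer.
        probs += [drop_prob * i / remaining for i in range(1, remaining + 1)]
    elif mode == "uniform":
        probs += [drop_prob] * remaining
    else:
        raise ValueError("mode must be 'linear' or 'uniform'")
    return probs


symmetric = resolve_conv_context(31)
causal = resolve_conv_context(31, "causal")
linear = layer_drop_probs(5, 0.2, mode="linear")
uniform = layer_drop_probs(4, 0.1, mode="uniform")
```

With the default conv_kernel_size of 31, the symmetric case gives 15 frames of context on each side, and the causal case gives 30 frames on the left and none on the right.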

change_attention_model( self_attention_model: str | None = None, att_context_size: List[int] | None = None, update_config: bool = True, device: torch.device | None = None, )#

Update the self_attention_model which changes the positional encoding and attention layers.

Parameters:

self_attention_model (str) –
type of the attention layer and positional encoding

’rel_pos’:
relative positional embedding and Transformer-XL

’rel_pos_local_attn’:
relative positional embedding and Transformer-XL with local attention using overlapping windows. Attention context is determined by att_context_size parameter.

’abs_pos’:
absolute positional embedding and Transformer

If None is provided, the self_attention_model isn’t changed. Defaults to None.
att_context_size (List[int]) – List of 2 ints corresponding to left and right attention context sizes, or None to keep as it is. Defaults to None.
update_config (bool) – Whether to update the config or not with the new attention model. Defaults to True.
device (torch.device) – If provided, new layers will be moved to the device. Defaults to None.
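
As a usage illustration, the call below switches an encoder to local attention with a limited context. It is a minimal sketch: `encoder` is assumed to be an already constructed ConformerEncoder, the helper name `switch_to_local_attention` is hypothetical, and the context size of 128 in the comment is an arbitrary example value. Only the keyword arguments listed in the signature above are used.

```python
def switch_to_local_attention(encoder, left_context, right_context):
    """Switch an encoder to 'rel_pos_local_attn' with the given context sizes.

    `encoder` is expected to expose change_attention_model() with the
    keyword arguments documented above.
    """
    encoder.change_attention_model(
        self_attention_model="rel_pos_local_attn",
        att_context_size=[left_context, right_context],
        update_config=True,  # also record the change in the encoder config
    )


# Example call, assuming `encoder` already exists:
# switch_to_local_attention(encoder, 128, 128)
```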

change_subsampling_conv_chunking_factor( subsampling_conv_chunking_factor: int, )#

Updates the conv_chunking_factor (int). The default is 1 (auto). Set it to -1 (disabled) or to a specific value (a power of 2) if you run out of memory (OOM) in the conv subsampling layers.

Parameters:: subsampling_conv_chunking_factor (int)
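
The accepted values (-1, 1, or a power of 2) can be checked before calling the method. The helper below is a hypothetical sketch of such a check, not part of the NeMo API.

```python
def is_valid_chunking_factor(factor):
    """True for -1 (chunking disabled) or a positive power of 2 (1 means auto)."""
    if factor == -1:
        return True
    # A positive integer is a power of 2 when exactly one bit is set.
    return factor >= 1 and (factor & (factor - 1)) == 0


accepted = [f for f in (-1, 0, 1, 2, 3, 4, 6, 8) if is_valid_chunking_factor(f)]
```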

property disabled_deployment_input_names#: Implement this method to return a set of input names disabled for export

property disabled_deployment_output_names#: Implement this method to return a set of output names disabled for export

enable_pad_mask(on=True)#

Enables or disables the pad mask according to the boolean argument on.

Returns:: The current state of the pad mask.
Return type:: mask (bool)

forward( audio_signal, length, cache_last_channel=None, cache_last_time=None, cache_last_channel_len=None, bypass_pre_encode=False, )#

Forward function for the ConformerEncoder accepting an audio signal and its corresponding length. The audio_signal input supports two formats depending on the bypass_pre_encode boolean flag:

(1) bypass_pre_encode = False (default): audio_signal must be a tensor containing audio features. Shape: (batch, self._feat_in, n_frames)

(2) bypass_pre_encode = True: audio_signal must be a tensor containing pre-encoded embeddings. Shape: (batch, n_frame, self.d_model)
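
The two input layouts differ in where the time axis sits, which is easy to get wrong. The following sketch only computes the expected shape tuple for each mode; the function name and its arguments are hypothetical and stand in for the encoder attributes mentioned above.

```python
def expected_audio_signal_shape(batch, n_frames, feat_in, d_model, bypass_pre_encode=False):
    """Return the shape audio_signal is expected to have for forward()."""
    if bypass_pre_encode:
        # Pre-encoded embeddings: time is the second axis.
        return (batch, n_frames, d_model)
    # Audio features: time is the last axis.
    return (batch, feat_in, n_frames)


features_shape = expected_audio_signal_shape(4, 1000, feat_in=80, d_model=512)
embeddings_shape = expected_audio_signal_shape(4, 250, feat_in=80, d_model=512, bypass_pre_encode=True)
```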

forward_for_export( audio_signal, length, cache_last_channel=None, cache_last_time=None, cache_last_channel_len=None, )#: Forward function for model export. Please see forward() for more details.

forward_internal( audio_signal, length, cache_last_channel=None, cache_last_time=None, cache_last_channel_len=None, bypass_pre_encode=False, )#

The audio_signal input supports two formats depending on the bypass_pre_encode boolean flag:

(1) bypass_pre_encode = False (default): audio_signal must be a tensor containing audio features. Shape: (batch, self._feat_in, n_frames)

(2) bypass_pre_encode = True: audio_signal must be a tensor containing pre-encoded embeddings. Shape: (batch, n_frame, self.d_model)

bypass_pre_encode=True is used in cases where frame-level, context-independent embeddings need to be saved or reused (e.g., speaker cache in streaming speaker diarization).

input_example(max_batch=1, max_dim=256)#: Generates input examples for tracing etc. Returns a tuple of input examples.

property input_types#: Returns definitions of module input ports.

property input_types_for_export#: Returns definitions of module input ports.

property output_types#: Returns definitions of module output ports.

property output_types_for_export#: Returns definitions of module output ports.

set_default_att_context_size(att_context_size)#

Sets the default attention context size from att_context_size argument.

Parameters:: att_context_size (list) – The attention context size to be set.

set_max_audio_length(max_audio_length)#

Sets maximum input length. Pre-calculates internal seq_range mask.

Parameters:: max_audio_length (int) – New maximum sequence length.

setup_streaming_params( chunk_size: int | None = None, shift_size: int | None = None, left_chunks: int | None = None, att_context_size: list | None = None, max_context: int = 10000, )#

This function sets the needed values and parameters to perform streaming. The configuration would be stored in self.streaming_cfg. The streaming configuration is needed to simulate streaming inference.

Parameters:

chunk_size (int) – overrides the chunk size
shift_size (int) – overrides the shift size for chunks
left_chunks (int) – overrides the number of left chunks visible to each chunk
max_context (int) – the value used for the cache size of last_channel layers if left context is set to infinity (-1). Defaults to 10000.

streaming_post_process(rets, keep_all_outputs=True)#

Post-process the output of the forward function for streaming.

Parameters:

rets – The output of the forward function.
keep_all_outputs – Whether to keep all outputs.

update_max_seq_length(seq_length: int, device)#

Updates the maximum sequence length for the model.

Parameters:

seq_length (int) – New maximum sequence length.
device (torch.device) – Device to use for computations.

class nemo.collections.asr.modules.SqueezeformerEncoder(*args: Any, **kwargs: Any)#

Bases: NeuralModule, Exportable, AccessMixin

The encoder for ASR model of Squeezeformer. Based on this paper: ‘Squeezeformer: An Efficient Transformer for Automatic Speech Recognition’ by Sehoon Kim et al. https://arxiv.org/abs/2206.00888

Parameters:

feat_in (int) – the size of feature channels
n_layers (int) – number of layers of ConformerBlock
d_model (int) – the hidden size of the model
feat_out (int) – the size of the output features Defaults to -1 (means feat_out is d_model)
subsampling (str) – the method of subsampling, choices=[‘vggnet’, ‘striding’, ‘dw_striding’] Defaults to dw_striding.
subsampling_factor (int) – the subsampling factor which should be power of 2 Defaults to 4.
subsampling_conv_channels (int) – the size of the convolutions in the subsampling module Defaults to -1 which would set it to d_model.
ff_expansion_factor (int) – the expansion factor in feed forward layers Defaults to 4.
self_attention_model (str) – type of the attention layer and positional encoding ‘rel_pos’: relative positional embedding and Transformer-XL ‘abs_pos’: absolute positional embedding and Transformer default is rel_pos.
pos_emb_max_len (int) – the maximum length of positional embeddings Defaults to 5000
n_heads (int) – number of heads in multi-headed attention layers Defaults to 4.
xscaling (bool) – enables scaling the inputs to the multi-headed attention layers by sqrt(d_model) Defaults to True.
untie_biases (bool) – whether to not share (untie) the bias weights between layers of Transformer-XL Defaults to True.
conv_kernel_size (int) – the size of the convolutions in the convolutional modules Defaults to 31.
conv_norm_type (str) – the type of the normalization in the convolutional modules Defaults to ‘batch_norm’.
dropout (float) – the dropout rate used in all layers except the attention layers Defaults to 0.1.
dropout_emb (float) – the dropout rate used for the positional embeddings Defaults to 0.1.
dropout_att (float) – the dropout rate used for the attention layer Defaults to 0.0.
adaptive_scale (bool) – Whether to scale the inputs to each component by affine scale and bias layer. Or use a fixed scale=1 and bias=0.
time_reduce_idx (int) – Optional integer index of a layer where a time reduction operation will occur. All operations beyond this point will only occur at the reduced resolution.
time_recovery_idx (int) – Optional integer index of a layer where the time recovery operation will occur. All operations beyond this point will occur at the original resolution (resolution after primary downsampling). If no value is provided, assumed to be the last layer.

input_example(max_batch=1, max_dim=256)#: Generates input examples for tracing etc. Returns a tuple of input examples.

property input_types#: Returns definitions of module input ports.

make_pad_mask(max_audio_length, seq_lens)#: Make masking for padding.

property output_types#: Returns definitions of module output ports.

set_max_audio_length(max_audio_length)#: Sets maximum input length. Pre-calculates internal seq_range mask.

class nemo.collections.asr.modules.RNNEncoder(*args: Any, **kwargs: Any)#

Bases: NeuralModule, Exportable

The RNN-based encoder for ASR models. Followed the architecture suggested in the following paper: ‘STREAMING END-TO-END SPEECH RECOGNITION FOR MOBILE DEVICES’ by Yanzhang He et al. https://arxiv.org/pdf/1811.06621.pdf

Parameters:

feat_in (int) – the size of feature channels
n_layers (int) – number of layers of RNN
d_model (int) – the hidden size of the model
proj_size (int) – the size of the output projection after each RNN layer
rnn_type (str) – the type of the RNN layers, choices=[‘lstm’, ‘gru’, ‘rnn’]
bidirectional (bool) – specifies whether RNN layers should be bidirectional or not Defaults to True.
feat_out (int) – the size of the output features Defaults to -1 (means feat_out is d_model)
subsampling (str) – the method of subsampling, choices=[‘stacking’, ‘vggnet’, ‘striding’] Defaults to stacking.
subsampling_factor (int) – the subsampling factor Defaults to 4.
subsampling_conv_channels (int) – the size of the convolutions in the subsampling module for vggnet and striding Defaults to -1 which would set it to d_model.
dropout (float) – the dropout rate used between all layers Defaults to 0.2.

input_example()#: Generates input examples for tracing etc. Returns a tuple of input examples.

property input_types#: Returns definitions of module input ports.

property output_types#: Returns definitions of module output ports.

class nemo.collections.asr.modules.RNNTDecoder(*args: Any, **kwargs: Any)#

Bases: AbstractRNNTDecoder, Exportable, AdapterModuleMixin

A Recurrent Neural Network Transducer Decoder / Prediction Network (RNN-T Prediction Network). An RNN-T Decoder/Prediction network, comprised of a stateful LSTM model.

Parameters:

prednet –
A dict-like object which contains the following key-value pairs.

pred_hidden:
int specifying the hidden dimension of the prediction net.

pred_rnn_layers:
int specifying the number of rnn layers.

Optionally, it may also contain the following:

forget_gate_bias:
float, set by default to 1.0, which constructs a forget gate initialized to 1.0. Reference: [An Empirical Exploration of Recurrent Network Architectures](http://proceedings.mlr.press/v37/jozefowicz15.pdf)

t_max:
int value, set to None by default. If an int is specified, performs Chrono Initialization of the LSTM network, based on the maximum number of timesteps t_max expected during the course of training. Reference: [Can recurrent neural networks warp time?](https://openreview.net/forum?id=SJcKhk-Ab)

weights_init_scale:
Float scale of the weights after initialization. Setting to lower than one sometimes helps reduce variance between runs.

hidden_hidden_bias_scale:
Float scale for the hidden-to-hidden bias. Set to 0.0 for the default behaviour.

dropout:
float, set to 0.0 by default. Optional dropout applied at the end of the final LSTM RNN layer.
vocab_size – int, specifying the vocabulary size of the embedding layer of the Prediction network, excluding the RNNT blank token.
normalization_mode – Can be either None, ‘batch’ or ‘layer’. By default, is set to None. Defines the type of normalization applied to the RNN layer.
random_state_sampling – bool, set to False by default. When set, provides normal-distribution sampled state tensors instead of zero tensors during training. Reference: [Recognizing long-form speech using streaming end-to-end models](https://arxiv.org/abs/1910.11455)
blank_as_pad –
bool, set to True by default. When set, will add a token to the Embedding layer of this prediction network, and will treat this token as a pad token. In essence, the RNNT pad token will be treated as a pad token, and the embedding layer will return a zero tensor for this token.

It is set by default as it enables various batch optimizations required for batched beam search. Therefore, it is not recommended to disable this flag.

add_adapter(name: str, cfg: omegaconf.DictConfig)#

Add an Adapter module to this module.

Parameters:

name – A globally unique name for the adapter. Will be used to access, enable and disable adapters.
cfg – A DictConfig or Dataclass that contains at the bare minimum __target__ to instantiate a new Adapter module.

classmethod batch_aggregate_states_beam( src_states: tuple[torch.Tensor, torch.Tensor], batch_size: int, beam_size: int, indices: torch.Tensor, dst_states: tuple[torch.Tensor, torch.Tensor] | None = None, ) → tuple[torch.Tensor, torch.Tensor]#

Aggregates decoder states based on the given indices.

Parameters:

src_states (Tuple[torch.Tensor, torch.Tensor]) – source states of shape ([L x (batch_size * beam_size, H)], [L x (batch_size * beam_size, H)])
batch_size (int) – The size of the batch.
beam_size (int) – The size of the beam.
indices (torch.Tensor) – A tensor of shape (batch_size, beam_size) containing the indices in beam that map the source states to the destination states.
dst_states (Optional[Tuple[torch.Tensor, torch.Tensor]]) – If provided, the method updates these tensors in-place.

Return type:

Tuple[torch.Tensor, torch.Tensor]

Note

The indices tensor is expanded to match the shape of the source states during the gathering operation.

batch_concat_states( batch_states: List[List[torch.Tensor]], ) → List[torch.Tensor]#

Concatenate a batch of decoder state to a packed state.

Parameters:

batch_states (list) – batch of decoder states B x ([L x (H)], [L x (H)])

Returns:

decoder states: (L x B x H, L x B x H)

Return type:

(tuple)
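
The layout change from B x ([L x (H)], [L x (H)]) to (L x B x H, L x B x H) can be illustrated with nested Python lists standing in for tensors. This is only a sketch of the indexing involved, with a hypothetical function name; the real method operates on torch.Tensor objects.

```python
def concat_states(batch_states):
    """Pack B per-sample (h, c) states into a pair of [L][B][H] nested lists.

    batch_states[b] is a pair (h, c); h and c are lists of L vectors of size H.
    """
    num_layers = len(batch_states[0][0])
    packed = []
    for part in range(2):  # 0 selects the hidden state, 1 the cell state
        packed.append([
            [sample[part][layer] for sample in batch_states]
            for layer in range(num_layers)
        ])
    return tuple(packed)


# B=2 samples, L=3 layers, H=1; values encode (sample, layer) for easy checking.
batch_states = [
    ([[b * 10 + l] for l in range(3)], [[-(b * 10 + l)] for l in range(3)])
    for b in range(2)
]
h, c = concat_states(batch_states)
```

After packing, `h[layer][sample]` holds the hidden vector that was previously at `batch_states[sample][0][layer]`.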

batch_copy_states( old_states: List[torch.Tensor], new_states: List[torch.Tensor], ids: List[int], value: float | None = None, ) → List[torch.Tensor]#

Copy states from new state to old state at certain indices.

Parameters:

old_states (list) – packed decoder states (L x B x H, L x B x H)
new_states – packed decoder states (L x B x H, L x B x H)
ids (list) – List of indices to copy states at.
value (optional float) – If a value should be copied instead of a state slice, a float should be provided

Returns:

batch of decoder states with partial copy at ids (or a specific value).: (L x B x H, L x B x H)

batch_initialize_states( decoder_states: List[List[torch.Tensor]], ) → List[torch.Tensor]#

Creates a stacked decoder states to be passed to prediction network

Parameters:

decoder_states (list of list of list of torch.Tensor) –

list of decoder states [B, C, L, H]

B: Batch size.

C: e.g., for LSTM, this is 2: hidden and cell states

L: Number of layers in prediction RNN.

H: Dimensionality of the hidden state.

Returns:

batch of decoder states: [C x torch.Tensor[L x B x H]]

Return type:

batch_states (list of torch.Tensor)

classmethod batch_replace_states_all( src_states: Tuple[torch.Tensor, torch.Tensor], dst_states: Tuple[torch.Tensor, torch.Tensor], batch_size: int | None = None, )#: Replace states in dst_states with states from src_states

classmethod batch_replace_states_mask( src_states: Tuple[torch.Tensor, torch.Tensor], dst_states: Tuple[torch.Tensor, torch.Tensor], mask: torch.Tensor, other_src_states: Tuple[torch.Tensor, torch.Tensor] | None = None, )#

Replaces states in dst_states with states from src_states based on the given mask.

Parameters:

mask (torch.Tensor) – When True, selects values from src_states; otherwise keeps the existing values in dst_states, or takes them from other_src_states if provided.
src_states (Tuple[torch.Tensor, torch.Tensor]) – Values selected at indices where mask is True.
dst_states (Tuple[torch.Tensor, torch.Tensor]) – The output states.
other_src_states (Tuple[torch.Tensor, torch.Tensor], optional) – Values selected at indices where mask is False.

Note

This operation is performed without CPU-GPU synchronization by using torch.where.
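
The selection rule can be sketched with plain lists, one entry per batch element. This is an analogue for illustration only: the function name is hypothetical, and the actual method applies the same rule to tensor tuples with torch.where rather than a Python loop.

```python
def replace_by_mask(src, dst, mask, other=None):
    """Update dst in place: take src where mask is True, else other (or keep dst)."""
    fallback = dst if other is None else other
    for i, use_src in enumerate(mask):
        dst[i] = src[i] if use_src else fallback[i]
    return dst


kept = replace_by_mask([1, 2, 3], [0, 0, 0], [True, False, True])
swapped = replace_by_mask([1, 2, 3], [0, 0, 0], [True, False, True], other=[9, 9, 9])
```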

batch_score_hypothesis( hypotheses: List[Hypothesis], cache: Dict[Tuple[int], Any], ) → Tuple[List[torch.Tensor], List[List[torch.Tensor]]]#

Used for batched beam search algorithms. Similar to score_hypothesis method.

Parameters:

hypotheses – List of Hypotheses. Refer to rnnt_utils.Hypothesis.
cache – Dict which contains a cache to avoid duplicate computations.

Returns:

A tuple (batch_dec_out, batch_dec_states) such that batch_dec_out is a list of torch.Tensor [1, H] representing the prediction network outputs for the last tokens in the Hypotheses, and batch_dec_states is a list of list of RNN states, each of shape [L, B, H], represented as B x List[states].

batch_select_state( batch_states: List[torch.Tensor], idx: int, ) → List[List[torch.Tensor]]#

Get decoder state from batch of states, for given id.

Parameters:

batch_states (list) – batch of decoder states ([L x (B, H)], [L x (B, H)])
idx (int) – index to extract state from batch of states

Returns:

decoder states for given id: ([L x (1, H)], [L x (1, H)])

Return type:

(tuple)

classmethod batch_split_states( batch_states: tuple[torch.Tensor, torch.Tensor], ) → list[tuple[torch.Tensor, torch.Tensor]]#: Split states into a list of states. Useful for splitting the final state for converting results of the decoding algorithm to Hypothesis class.

classmethod batch_unsplit_states( batch_states: list[tuple[torch.Tensor, torch.Tensor]], device=None, dtype=None, ) → tuple[torch.Tensor, torch.Tensor]#

Concatenate a batch of decoder state to a packed state. Inverse of batch_split_states.

Parameters:

batch_states (list) – batch of decoder states B x ([L x (H)], [L x (H)])

Returns:

decoder states: (L x B x H, L x B x H)

Return type:

(tuple)

classmethod clone_state( state: tuple[torch.Tensor, torch.Tensor], ) → tuple[torch.Tensor, torch.Tensor]#: Return copy of the states

initialize_state( y: torch.Tensor, ) → Tuple[torch.Tensor, torch.Tensor]#

Initialize the state of the LSTM layers, with same dtype and device as input y. LSTM accepts a tuple of 2 tensors as a state.

Parameters:

y – A torch.Tensor whose device the generated states will be placed on.

Returns:

Tuple of 2 tensors, each of shape [L, B, H], where

L = Number of RNN layers

B = Batch size

H = Hidden size of RNN.

input_example(max_batch=1, max_dim=1)#: Generates input examples for tracing etc. Returns a tuple of input examples.

property input_types#: Returns definitions of module input ports.

mask_select_states( states: Tuple[torch.Tensor, torch.Tensor], mask: torch.Tensor, ) → Tuple[torch.Tensor, torch.Tensor]#

Returns states selected by the mask.

Parameters:

states – states for the batch
mask – boolean mask for selecting states; batch dimension should be the same as for states

Returns:: states filtered by mask

property output_types#: Returns definitions of module output ports.

predict( y: torch.Tensor | None = None, state: List[torch.Tensor] | None = None, add_sos: bool = True, batch_size: int | None = None, ) → Tuple[torch.Tensor, List[torch.Tensor]]#

Stateful prediction of scores and state for a (possibly null) tokenset. This method takes various cases into consideration:

- No token, no state: used for priming the RNN
- No token, state provided: used for blank token scoring
- Given token, states: used for scores + new states

Here:

B - batch size
U - label length
H - hidden dimension size of RNN
L - number of RNN layers

Parameters:

y – Optional torch tensor of shape [B, U] of dtype long which will be passed to the Embedding. If None, creates a zero tensor of shape [B, 1, H] which mimics the output of the pad token on the Embedding.
state – An optional list of states for the RNN. E.g., for LSTM, the state list has length 2. Each state must be a tensor of shape [L, B, H]. If None, and during training mode and random_state_sampling is set, will sample a normal distribution tensor of the above shape. Otherwise, None will be passed to the RNN.
add_sos – bool flag, whether a zero vector describing a “start of signal” token should be prepended to the above “y” tensor. When set, output size is (B, U + 1, H).
batch_size – An optional int, specifying the batch size of the y tensor. Can be inferred when y or state is provided. If both are None, then batch_size cannot be None.

Returns:

A tuple (g, hid) such that -

If add_sos is False:

g:
(B, U, H)

hid:
(h, c) where h is the final sequence hidden state and c is the final cell state:

h (tensor), shape (L, B, H)

c (tensor), shape (L, B, H)

If add_sos is True:

g:
(B, U + 1, H)

hid:
(h, c) where h is the final sequence hidden state and c is the final cell state:

h (tensor), shape (L, B, H)

c (tensor), shape (L, B, H)
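
The output shapes listed above, for the case where y is provided, can be summarised in a small helper. It is a hypothetical sketch that only computes shape tuples from B, U, H, and L and does not call the decoder.

```python
def predict_output_shapes(batch, label_len, hidden, num_layers, add_sos=True):
    """Return (g_shape, (h_shape, c_shape)) for a y tensor of shape [B, U]."""
    out_len = label_len + 1 if add_sos else label_len  # SOS adds one step
    g_shape = (batch, out_len, hidden)
    state_shape = (num_layers, batch, hidden)
    return g_shape, (state_shape, state_shape)


with_sos = predict_output_shapes(batch=8, label_len=20, hidden=640, num_layers=2)
without_sos = predict_output_shapes(batch=8, label_len=20, hidden=640, num_layers=2, add_sos=False)
```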

score_hypothesis( hypothesis: Hypothesis, cache: Dict[Tuple[int], Any], ) → Tuple[torch.Tensor, List[torch.Tensor], torch.Tensor]#

Similar to the predict() method, instead this method scores a Hypothesis during beam search. Hypothesis is a dataclass representing one hypothesis in a Beam Search.

Parameters:

hypothesis – Refer to rnnt_utils.Hypothesis.
cache – Dict which contains a cache to avoid duplicate computations.

Returns:

A tuple (y, states, lm_token) such that y is a torch.Tensor of shape [1, 1, H] representing the score of the last token in the Hypothesis, states is a list of RNN states, each of shape [L, 1, H], and lm_token is the final integer token of the hypothesis.

class nemo.collections.asr.modules.StatelessTransducerDecoder(*args: Any, **kwargs: Any)#

Bases: AbstractRNNTDecoder, Exportable

A Stateless Neural Network Transducer Decoder / Prediction Network. An RNN-T Decoder/Prediction stateless network that simply takes concatenation of embeddings of the history tokens as the output.

Parameters:

prednet –
A dict-like object which contains the following key-value pairs. pred_hidden: int specifying the hidden dimension of the prediction net.

dropout: float, set to 0.0 by default. Optional dropout applied at the end of the final LSTM RNN layer.
vocab_size – int, specifying the vocabulary size of the embedding layer of the Prediction network, excluding the RNNT blank token.
context_size – int, specifying the size of the history context used for this decoder.
normalization_mode – Can be either None, ‘layer’. By default, is set to None. Defines the type of normalization applied to the RNN layer.

batch_concat_states( batch_states: List[List[torch.Tensor]], ) → List[torch.Tensor]#

Concatenate a batch of decoder state to a packed state.

Parameters:

batch_states (list) – batch of decoder states B x [(C)]

Returns:

decoder states: [(B x C)]

Return type:

(tuple)

batch_copy_states( old_states: List[torch.Tensor], new_states: List[torch.Tensor], ids: List[int], value: float | None = None, ) → List[torch.Tensor]#

Copy states from new state to old state at certain indices.

Parameters:

old_states – packed decoder states single element list of (B x C)
new_states – packed decoder states single element list of (B x C)
ids (list) – List of indices to copy states at.
value (optional float) – If a value should be copied instead of a state slice, a float should be provided

Returns:

batch of decoder states with partial copy at ids (or a specific value). (B x C)
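
The copy semantics described above can be illustrated without tensors. The following is a minimal sketch, assuming the packed state is a single-element list holding a B x C nested list rather than a torch.Tensor; copy_states_at is a hypothetical helper written for illustration and is not part of NeMo.

```python
from typing import List, Optional

Packed = List[List[List[float]]]  # single-element list holding a B x C table


def copy_states_at(
    old_states: Packed,
    new_states: Packed,
    ids: List[int],
    value: Optional[float] = None,
) -> Packed:
    """Overwrite rows `ids` of old_states[0] with rows of new_states[0], or with a constant."""
    width = len(old_states[0][0])
    for i in ids:
        if value is None:
            # Copy the whole state slice for batch element i.
            old_states[0][i] = list(new_states[0][i])
        else:
            # Fill the row with the provided constant instead of a state slice.
            old_states[0][i] = [value] * width
    return old_states


old = [[[1, 1], [2, 2], [3, 3]]]
new = [[[7, 7], [8, 8], [9, 9]]]
copied = copy_states_at(old, new, ids=[0, 2])
filled = copy_states_at([[[1, 1], [2, 2]]], [[[5, 5], [6, 6]]], ids=[1], value=0.0)
```

Rows that are not listed in ids are left untouched, which is what "partial copy" refers to.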

batch_initialize_states( decoder_states: List[List[torch.Tensor]], )#

Creates stacked decoder states to be passed to the prediction network.

Parameters:

decoder_states (list of list of torch.Tensor) –

list of decoder states [B, 1, C]

B: Batch size.

C: Dimensionality of the hidden state.

Returns:

batch of decoder states [[B x C]]

Return type:

batch_states (list of torch.Tensor)

classmethod batch_replace_states_all( src_states: list[torch.Tensor], dst_states: list[torch.Tensor], batch_size: int | None = None, )#: Replace states in dst_states with states from src_states

classmethod batch_replace_states_mask( src_states: tuple[torch.Tensor, torch.Tensor] | list[torch.Tensor], dst_states: tuple[torch.Tensor, torch.Tensor] | list[torch.Tensor], mask: torch.Tensor, other_src_states: tuple[torch.Tensor, torch.Tensor] | list[torch.Tensor] | None = None, )#

Replaces states in dst_states with states from src_states based on the given mask.

Parameters:

mask (torch.Tensor) – When True, selects values from src_states; otherwise keeps the values already in dst_states, or takes them from other_src_states if provided.
src_states (tuple[torch.Tensor, torch.Tensor]) – Values selected at indices where mask is True.
dst_states (tuple[torch.Tensor, torch.Tensor], optional) – The output states.
other_src_states (tuple[torch.Tensor, torch.Tensor], optional) – Values selected at indices where mask is False.

Note

This operation is performed without CPU-GPU synchronization by using torch.where.
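
The masked replacement can be pictured as a row-wise select that writes into the destination. The sketch below is an illustration only: it assumes a per-batch-element boolean mask and uses nested lists in place of tensors, and replace_states_by_mask is a hypothetical name, not the NeMo classmethod.

```python
from typing import List, Optional

Table = List[List[int]]  # B x C state table standing in for a tensor


def replace_states_by_mask(
    src: Table, dst: Table, mask: List[bool], other_src: Optional[Table] = None
) -> Table:
    """Row-wise select written into dst: src where mask is True, fallback elsewhere."""
    # Without other_src, rows where the mask is False keep their dst values.
    fallback = dst if other_src is None else other_src
    for b, take_src in enumerate(mask):
        dst[b] = list(src[b]) if take_src else list(fallback[b])
    return dst


kept = replace_states_by_mask(
    [[1, 1], [2, 2], [3, 3]], [[0, 0], [0, 0], [0, 0]], [True, False, True]
)
swapped = replace_states_by_mask(
    [[1, 1], [2, 2], [3, 3]],
    [[0, 0], [0, 0], [0, 0]],
    [True, False, True],
    other_src=[[9, 9], [9, 9], [9, 9]],
)
```

On tensors, the same select can be expressed with torch.where, which avoids the Python-level branching shown here.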

batch_score_hypothesis( hypotheses: List[Hypothesis], cache: Dict[Tuple[int], Any], ) → Tuple[List[torch.Tensor], List[List[torch.Tensor]]]#

Used for batched beam search algorithms. Similar to score_hypothesis method.

Parameters:

hypotheses – List of Hypotheses. Refer to rnnt_utils.Hypothesis.
cache – Dict which contains a cache to avoid duplicate computations.

Returns:

A tuple (batch_dec_out, batch_dec_states) such that batch_dec_out is a list of torch.Tensor [1, H] representing the prediction network outputs for the last tokens in the Hypotheses, and batch_dec_states is a list of lists of RNN states, each of shape [L, B, H], represented as B x List[states].

batch_select_state( batch_states: List[torch.Tensor], idx: int, ) → List[List[torch.Tensor]]#

Get decoder state from batch of states, for given id.

Parameters:

batch_states (list) – batch of decoder states [(B, C)]
idx (int) – index to extract state from batch of states

Returns:

decoder states for given id: [(C)]

Return type:

(tuple)

classmethod batch_split_states( batch_states: list[torch.Tensor], ) → list[list[torch.Tensor]]#: Split states into a list of states. Useful for splitting the final state for converting results of the decoding algorithm to Hypothesis class.

classmethod batch_unsplit_states( batch_states: list[list[torch.Tensor]], device=None, dtype=None, ) → list[torch.Tensor]#: Concatenate a batch of decoder state to a packed state. Inverse of batch_split_states.

classmethod clone_state( state: list[torch.Tensor], ) → list[torch.Tensor]#: Return copy of the states

initialize_state( y: torch.Tensor, ) → List[torch.Tensor]#

Initialize the state of the RNN layers, with same dtype and device as input y.

Parameters:

y – A torch.Tensor whose device the generated states will be placed on.

Returns:

List of torch.Tensor, each of shape [L, B, H], where: L = Number of RNN layers B = Batch size H = Hidden size of RNN.

input_example(max_batch=1, max_dim=1)#: Generates input examples for tracing etc. :returns: A tuple of input examples.

property input_types#: Returns definitions of module input ports.

mask_select_states( states: List[torch.Tensor] | None, mask: torch.Tensor, ) → List[torch.Tensor] | None#

Return states by mask selection.

Parameters:

states – states for the batch
mask – boolean mask for selecting states; batch dimension should be the same as for states

Returns:

states filtered by mask

property output_types#: Returns definitions of module output ports.

predict( y: torch.Tensor | None = None, state: torch.Tensor | None = None, add_sos: bool = True, batch_size: int | None = None, ) → Tuple[torch.Tensor, List[torch.Tensor]]#

Stateful prediction of scores and state for a tokenset.

Here: B = batch size, U = label length, C = context size for the stateless decoder, D = total embedding size.

Parameters:

y – Optional torch tensor of shape [B, U] of dtype long which will be passed to the Embedding. If None, creates a zero tensor of shape [B, 1, D] which mimics output of pad-token on Embedding.
state – An optional one-element list of one tensor. The tensor is used to store previous context labels. The tensor uses type long and is of shape [B, C].
add_sos – bool flag, whether a zero vector describing a “start of signal” token should be prepended to the above “y” tensor. When set, output size is (B, U + 1, D).
batch_size – An optional int, specifying the batch size of the y tensor. Can be inferred if either y or state is provided; if both are None, batch_size must not be None.

Returns:

A tuple (g, state) such that -

If add_sos is False:

g:
(B, U, D)

state:
[(B, C)] storing the history context including the new words in y.

If add_sos is True:

g:
(B, U + 1, D)

state:
[(B, C)] storing the history context including the new words in y.
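
To make the role of the [(B, C)] state concrete, the sketch below tracks a fixed-size label history per batch element. It is a simplified illustration under stated assumptions: histories are left-padded with a pad id until C labels have been seen, and only the most recent C labels are kept. The function name update_context and the pad_id argument are hypothetical; the exact padding rule used by NeMo may differ.

```python
from typing import List, Optional


def update_context(
    state: Optional[List[List[int]]],
    y: List[List[int]],
    context_size: int,
    pad_id: int,
) -> List[List[int]]:
    """Return a B x C table holding the last C labels seen by each batch element."""
    new_state = []
    for b, labels in enumerate(y):
        previous = state[b] if state is not None else []
        # Append the new labels, then keep only the most recent C of them.
        history = (list(previous) + list(labels))[-context_size:]
        # Assumption: short histories are left-padded with pad_id.
        padding = [pad_id] * (context_size - len(history))
        new_state.append(padding + history)
    return new_state


first = update_context(None, [[5], [6]], context_size=2, pad_id=0)
second = update_context(first, [[7, 8], [9, 1]], context_size=2, pad_id=0)
```

Because the state is just the recent label history, it carries no recurrent hidden values, which is what makes the decoder stateless.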

score_hypothesis( hypothesis: Hypothesis, cache: Dict[Tuple[int], Any], ) → Tuple[torch.Tensor, List[torch.Tensor], torch.Tensor]#

Similar to the predict() method, except that this method scores a Hypothesis during beam search. Hypothesis is a dataclass representing one hypothesis in a beam search.

Parameters:

hypothesis – Refer to rnnt_utils.Hypothesis.
cache – Dict which contains a cache to avoid duplicate computations.

Returns:

A tuple (y, states, lm_token) such that y is a torch.Tensor of shape [1, 1, H] representing the score of the last token in the Hypothesis, states is a list of RNN states, each of shape [L, 1, H], and lm_token is the final integer token of the hypothesis.

class nemo.collections.asr.modules.RNNTJoint(*args: Any, **kwargs: Any)#

Bases: AbstractRNNTJoint, Exportable, AdapterModuleMixin

A Recurrent Neural Network Transducer Joint Network (RNN-T Joint Network). An RNN-T Joint network, comprised of a feedforward model.

Parameters:

jointnet –
A dict-like object which contains the following key-value pairs.
encoder_hidden: int specifying the hidden dimension of the encoder net.
pred_hidden: int specifying the hidden dimension of the prediction net.
joint_hidden: int specifying the hidden dimension of the joint net.
activation: Activation function used in the joint step. Can be one of [‘relu’, ‘tanh’, ‘sigmoid’].

Optionally, it may also contain the following: dropout: float, set to 0.0 by default. Optional dropout applied at the end of the joint net.
num_classes – int, specifying the vocabulary size that the joint network must predict, excluding the RNNT blank token.
vocabulary – Optional list of strings/tokens that comprise the vocabulary of the joint network. Unused and kept only for easy access for character based encoding RNNT models.
log_softmax – Optional bool, set to None by default. If set as None, will compute the log_softmax() based on the value provided.
preserve_memory –
Optional bool, set to False by default. If the model crashes due to the memory intensive joint step, one might try this flag to empty the tensor cache in pytorch.

Warning: This will make the forward-backward pass much slower than normal. It also might not fix the OOM if the GPU simply does not have enough memory to compute the joint.
fuse_loss_wer –
Optional bool, set to False by default.

Fuses the joint forward, loss forward and wer forward steps. In doing so, it trades off speed for memory conservation by creating sub-batches of the provided batch of inputs, and performs Joint forward, loss forward and wer forward (optional), all on sub-batches, then collates results to be exactly equal to results from the entire batch.

When this flag is set, prior to calling forward, the fields loss and wer (either one) must be set using the RNNTJoint.set_loss() or RNNTJoint.set_wer() methods.

Further, when this flag is set, the following argument fused_batch_size must be provided as a non negative integer. This value refers to the size of the sub-batch.

When the flag is set, the input and output signature of forward() of this method changes. Input - in addition to encoder_outputs (mandatory argument), the following arguments can be provided.
- decoder_outputs (optional). Required if loss computation is required.
- encoder_lengths (required)
- transcripts (optional). Required for wer calculation.
- transcript_lengths (optional). Required for wer calculation.
- compute_wer (bool, default false). Whether to compute WER or not for the fused batch.
Output - instead of the usual joint log prob tensor, the following results can be returned.
- loss (optional). Returned if decoder_outputs, transcripts and transcript_lengths are not None.
- wer_numerator + wer_denominator (optional). Returned if transcripts, transcript_lengths are provided
  and compute_wer is set.
fused_batch_size – Optional int, required if fuse_loss_wer flag is set. Determines the size of the sub-batches. Should be any value below the actual batch size per GPU.
masking_prob – Optional float, indicating the probability of masking out decoder output in the HAINAN (Hybrid Autoregressive Inference Transducer) model, described in https://arxiv.org/pdf/2410.02597. Defaults to -1.0, which runs the standard Joint network computation; if > 0, decoder output is masked out with the specified probability.
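
The sub-batching behind fuse_loss_wer can be sketched in a few lines. This is an illustration of the idea only, not the NeMo implementation: it assumes the per-sample losses are already available as plain floats, it requires a strictly positive sub-batch size, and the helper names fused_sub_batches and fused_mean_loss are hypothetical.

```python
from typing import List, Sequence, Tuple


def fused_sub_batches(batch_size: int, fused_batch_size: int) -> List[Tuple[int, int]]:
    """Return [begin, end) index ranges that cover the batch in sub-batches."""
    if fused_batch_size <= 0:
        # Assumption of this sketch: a sub-batch must hold at least one sample.
        raise ValueError("fused_batch_size must be positive")
    return [
        (begin, min(begin + fused_batch_size, batch_size))
        for begin in range(0, batch_size, fused_batch_size)
    ]


def fused_mean_loss(per_sample_loss: Sequence[float], fused_batch_size: int) -> float:
    """Accumulate the loss sub-batch by sub-batch, then normalize by the full batch size."""
    total = 0.0
    for begin, end in fused_sub_batches(len(per_sample_loss), fused_batch_size):
        total += sum(per_sample_loss[begin:end])
    return total / len(per_sample_loss)


ranges = fused_sub_batches(10, 4)
mean_loss = fused_mean_loss([1.0, 2.0, 3.0, 4.0, 5.0], fused_batch_size=2)
```

Accumulating sums and dividing once at the end is what keeps the collated result equal to the full-batch value even when the last sub-batch is smaller than the others.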

add_adapter(name: str, cfg: omegaconf.DictConfig)#

Add an Adapter module to this module.

Parameters:

name – A globally unique name for the adapter. Will be used to access, enable and disable adapters.
cfg – A DictConfig or Dataclass that contains at the bare minimum __target__ to instantiate a new Adapter module.

property disabled_deployment_input_names#: Implement this method to return a set of input names disabled for export

input_example(max_batch=1, max_dim=8192)#: Generates input examples for tracing etc. :returns: A tuple of input examples.

property input_types#: Returns definitions of module input ports.

joint_after_projection( f: torch.Tensor, g: torch.Tensor, ) → torch.Tensor#

Compute the joint step of the network after projection.

Here,
B = Batch size
T = Acoustic model timesteps
U = Target sequence length
H1, H2 = Hidden dimensions of the Encoder / Decoder respectively
H = Hidden dimension of the Joint hidden step.
V = Vocabulary size of the Decoder (excluding the RNNT blank token).

Note

The implementation of this model is slightly modified from the original paper. The original paper proposes the following steps:
(enc, dec) -> Expand + Concat + Sum [B, T, U, H1+H2] -> Forward through joint hidden [B, T, U, H] – *1
*1 -> Forward through joint final [B, T, U, V + 1].

We instead split the joint hidden into joint_hidden_enc and joint_hidden_dec and act as follows:
enc -> Forward through joint_hidden_enc -> Expand [B, T, 1, H] – *1
dec -> Forward through joint_hidden_dec -> Expand [B, 1, U, H] – *2
(*1, *2) -> Sum [B, T, U, H] -> Forward through joint final [B, T, U, V + 1].

Parameters:

f – Output of the Encoder model. A torch.Tensor of shape [B, T, H1]
g – Output of the Decoder model. A torch.Tensor of shape [B, U, H2]

Returns:

Logits / log softmaxed tensor of shape (B, T, U, V + 1).
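
The "Expand and Sum" step can be illustrated with nested lists standing in for tensors. The sketch below assumes the encoder and decoder outputs have already been projected to the joint hidden size H, and broadcast_sum is a hypothetical helper name; it shows only how a [B, T, H] input and a [B, U, H] input combine into [B, T, U, H].

```python
from typing import List

Tensor3 = List[List[List[float]]]        # [B, T, H] or [B, U, H]
Tensor4 = List[List[List[List[float]]]]  # [B, T, U, H]


def broadcast_sum(f_proj: Tensor3, g_proj: Tensor3) -> Tensor4:
    """Combine every encoder frame t with every decoder step u by element-wise addition."""
    return [
        [[[fh + gh for fh, gh in zip(f_t, g_u)] for g_u in g_b] for f_t in f_b]
        for f_b, g_b in zip(f_proj, g_proj)
    ]


f_proj = [[[1, 2], [3, 4]]]                  # B=1, T=2, H=2
g_proj = [[[10, 20], [30, 40], [50, 60]]]    # B=1, U=3, H=2
joint = broadcast_sum(f_proj, g_proj)
joint_shape = (len(joint), len(joint[0]), len(joint[0][0]), len(joint[0][0][0]))
```

Since a linear layer applied to a concatenation [f; g] equals the sum of separate linear maps of f and g, projecting first and summing afterwards yields the same hidden activations without materializing the [B, T, U, H1+H2] tensor.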

property output_types#: Returns definitions of module output ports.

project_encoder( encoder_output: torch.Tensor, ) → torch.Tensor#

Project the encoder output to the joint hidden dimension.

Parameters:: encoder_output – A torch.Tensor of shape [B, T, D]
Returns:: A torch.Tensor of shape [B, T, H]

project_prednet( prednet_output: torch.Tensor, ) → torch.Tensor#

Project the Prediction Network (Decoder) output to the joint hidden dimension.

Parameters:: prednet_output – A torch.Tensor of shape [B, U, D]
Returns:: A torch.Tensor of shape [B, U, H]

class nemo.collections.asr.modules.SampledRNNTJoint(*args: Any, **kwargs: Any)#

Bases: RNNTJoint

A Sampled Recurrent Neural Network Transducer Joint Network (RNN-T Joint Network). An RNN-T Joint network, comprised of a feedforward model, where the vocab size will be sampled instead of computing the full vocabulary joint.

Parameters:

jointnet –
A dict-like object which contains the following key-value pairs.
encoder_hidden: int specifying the hidden dimension of the encoder net.
pred_hidden: int specifying the hidden dimension of the prediction net.
joint_hidden: int specifying the hidden dimension of the joint net.
activation: Activation function used in the joint step. Can be one of [‘relu’, ‘tanh’, ‘sigmoid’].

Optionally, it may also contain the following: dropout: float, set to 0.0 by default. Optional dropout applied at the end of the joint net.
num_classes – int, specifying the vocabulary size that the joint network must predict, excluding the RNNT blank token.
n_samples – int, specifies the number of tokens to sample from the vocabulary space, excluding the RNNT blank token. If a given value is larger than the entire vocabulary size, then the full vocabulary will be used.
vocabulary – Optional list of strings/tokens that comprise the vocabulary of the joint network. Unused and kept only for easy access for character based encoding RNNT models.
log_softmax – Optional bool, set to None by default. If set as None, will compute the log_softmax() based on the value provided.
preserve_memory –
Optional bool, set to False by default. If the model crashes due to the memory intensive joint step, one might try this flag to empty the tensor cache in pytorch.

Warning: This will make the forward-backward pass much slower than normal. It also might not fix the OOM if the GPU simply does not have enough memory to compute the joint.
fuse_loss_wer –
Optional bool, set to False by default.

Fuses the joint forward, loss forward and wer forward steps. In doing so, it trades off speed for memory conservation by creating sub-batches of the provided batch of inputs, and performs Joint forward, loss forward and wer forward (optional), all on sub-batches, then collates results to be exactly equal to results from the entire batch.

When this flag is set, prior to calling forward, the fields loss and wer (either one) must be set using the RNNTJoint.set_loss() or RNNTJoint.set_wer() methods.

Further, when this flag is set, the following argument fused_batch_size must be provided as a non negative integer. This value refers to the size of the sub-batch.

When the flag is set, the input and output signature of forward() of this method changes. Input - in addition to encoder_outputs (mandatory argument), the following arguments can be provided.
- decoder_outputs (optional). Required if loss computation is required.
- encoder_lengths (required)
- transcripts (optional). Required for wer calculation.
- transcript_lengths (optional). Required for wer calculation.
- compute_wer (bool, default false). Whether to compute WER or not for the fused batch.
Output - instead of the usual joint log prob tensor, the following results can be returned.
- loss (optional). Returned if decoder_outputs, transcripts and transcript_lengths are not None.
- wer_numerator + wer_denominator (optional). Returned if transcripts, transcript_lengths are provided
  and compute_wer is set.
fused_batch_size – Optional int, required if fuse_loss_wer flag is set. Determines the size of the sub-batches. Should be any value below the actual batch size per GPU.

sampled_joint( f: torch.Tensor, g: torch.Tensor, transcript: torch.Tensor, transcript_lengths: torch.Tensor, ) → torch.Tensor#

Compute the sampled joint step of the network.

Reference: Memory-Efficient Training of RNN-Transducer with Sampled Softmax.

Here,
B = Batch size
T = Acoustic model timesteps
U = Target sequence length
H1, H2 = Hidden dimensions of the Encoder / Decoder respectively
H = Hidden dimension of the Joint hidden step.
V = Vocabulary size of the Decoder (excluding the RNNT blank token).
S = Sample size of vocabulary.

Note

The implementation of this joint model is slightly modified from the original paper. The original paper proposes the following steps:
(enc, dec) -> Expand + Concat + Sum [B, T, U, H1+H2] -> Forward through joint hidden [B, T, U, H] – *1
*1 -> Forward through joint final [B, T, U, V + 1].

We instead split the joint hidden into joint_hidden_enc and joint_hidden_dec and act as follows:
enc -> Forward through joint_hidden_enc -> Expand [B, T, 1, H] – *1
dec -> Forward through joint_hidden_dec -> Expand [B, 1, U, H] – *2
(*1, *2) -> Sum [B, T, U, H] -> Sample Vocab V_Pos (for target tokens) and V_Neg
-> (V_Neg is sampled not uniformly but as a random permutation of all vocab tokens, after which all tokens in Intersection(V_Pos, V_Neg) are eliminated to avoid duplication of loss)
-> Concat new Vocab V_Sampled = Union(V_Pos, V_Neg)
-> Forward partially through the joint final to create [B, T, U, V_Sampled]

Parameters:

f – Output of the Encoder model. A torch.Tensor of shape [B, T, H1]
g – Output of the Decoder model. A torch.Tensor of shape [B, U, H2]
transcript – Batch of transcripts. A torch.Tensor of shape [B, U]
transcript_lengths – Batch of lengths of the transcripts. A torch.Tensor of shape [B]

Returns:

Logits / log softmaxed tensor of shape (B, T, U, V + 1).
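
The vocabulary sampling step in the note above can be sketched with the standard library. This is a simplified illustration under stated assumptions: token ids are plain integers in range(vocab_size), the blank token is ignored, and sample_vocab is a hypothetical helper rather than the NeMo routine.

```python
import random
from typing import List, Sequence


def sample_vocab(
    target_tokens: Sequence[int], vocab_size: int, n_samples: int, rng: random.Random
) -> List[int]:
    """Build V_Sampled = Union(V_Pos, V_Neg) with no token appearing twice."""
    # V_Pos: every token that occurs in the target transcripts.
    v_pos = sorted(set(target_tokens))
    positives = set(v_pos)
    # V_Neg: the first n_samples tokens of a random permutation of the vocabulary,
    # minus anything already present in V_Pos.
    permutation = list(range(vocab_size))
    rng.shuffle(permutation)
    v_neg = [token for token in permutation[:n_samples] if token not in positives]
    return v_pos + v_neg


sampled = sample_vocab([3, 3, 7], vocab_size=10, n_samples=4, rng=random.Random(0))
full = sample_vocab([3], vocab_size=10, n_samples=99, rng=random.Random(1))
```

When n_samples is at least as large as the vocabulary, the permutation slice covers every token, so the sampled vocabulary falls back to the full one, matching the n_samples description above.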

Parts#

class nemo.collections.asr.parts.submodules.jasper.JasperBlock(*args: Any, **kwargs: Any)#

Bases: Module, AdapterModuleMixin, AccessMixin

Constructs a single “Jasper” block. With modified parameters, also constructs other blocks for models such as QuartzNet and Citrinet.

For Jasper : separable flag should be False
For QuartzNet : separable flag should be True
For Citrinet : separable flag and se flag should be True

Note that above are general distinctions, each model has intricate differences that expand over multiple such blocks.

For further information about the differences between models which use JasperBlock, please review the configs for ASR models found in the ASR examples directory.

Parameters:

inplanes – Number of input channels.
planes – Number of output channels.
repeat – Number of repeated sub-blocks (R) for this block.
kernel_size – Convolution kernel size across all repeated sub-blocks.
kernel_size_factor – Floating point scale value that is multiplied with kernel size, then rounded down to nearest odd integer to compose the kernel size. Defaults to 1.0.
stride – Stride of the convolutional layers.
dilation – Integer which defines the dilation factor of the kernel. Note that when dilation > 1, stride must be equal to 1.
padding – String representing type of padding. Currently only supports “same” padding, which symmetrically pads the input tensor with zeros.
dropout – Floating point value, determines the fraction of the output that is zeroed out.
activation – String representing activation functions. Valid activation functions are : {“hardtanh”: nn.Hardtanh, “relu”: nn.ReLU, “selu”: nn.SELU, “swish”: Swish}. Defaults to “relu”.
residual – Bool that determines whether a residual branch should be added or not. All residual branches are constructed using a pointwise convolution kernel, that may or may not perform strided convolution depending on the parameter residual_mode.
groups – Number of groups for Grouped Convolutions. Defaults to 1.
separable – Bool flag that describes whether Time-Channel depthwise separable convolution should be constructed, or ordinary convolution should be constructed.
heads – Number of “heads” for the masked convolution. Defaults to -1, which disables it.
normalization – String that represents type of normalization performed. Can be one of “batch”, “group”, “instance” or “layer” to compute BatchNorm1D, GroupNorm1D, InstanceNorm or LayerNorm (which are special cases of GroupNorm1D).
norm_groups – Number of groups used for GroupNorm (if normalization == “group”).
residual_mode – String argument which describes whether the residual branch should be simply added (“add”) or should first stride, then add (“stride_add”). Required when performing stride on parallel branch as well as utilizing residual add.
residual_panes – Number of residual panes, used for Jasper-DR models. Please refer to the paper.
conv_mask – Bool flag which determines whether to utilize masked convolutions or not. In general, it should be set to True.
se – Bool flag that determines whether Squeeze-and-Excitation layer should be used.
se_reduction_ratio – Integer value, which determines to what extent the hidden dimension of the SE intermediate step should be reduced. Larger values reduce the number of parameters, but also limit the effectiveness of SE layers.
se_context_window – Integer value determining the number of timesteps that should be utilized in order to compute the averaged context window. Defaults to -1, which means it uses global context - such that all timesteps are averaged. If any positive integer is used, it will utilize limited context window of that size.
se_interpolation_mode – String used for interpolation mode of timestep dimension for SE blocks. Used only if context window is > 1. The modes available for resizing are: nearest, linear (3D-only), bilinear, area.
stride_last – Bool flag that determines whether all repeated blocks should stride at once, (stride of S^R when this flag is False) or just the last repeated block should stride (stride of S when this flag is True).
future_context –
Int value that determines how many “right” / “future” context frames will be utilized when calculating the output of the conv kernel. All calculations are done for odd kernel sizes only.

By default, this is -1, which is recomputed as the symmetric padding case.

When future_context >= 0, will compute the asymmetric padding as follows : (left context, right context) = [K - 1 - future_context, future_context]

Determining an exact formula to limit future context is dependent on global layout of the model. As such, we provide both “local” and “global” guidelines below.

Local context limit (should always be enforced)
- future context should be <= half the kernel size for any given layer
- future context > kernel size defaults to symmetric kernel
- future context of layer = number of future frames * width of each frame (dependent on stride)

Global context limit (should be carefully considered)
- future context should be laid out in an ever reducing pattern. Initial layers should restrict future context less than later layers, since shallow depth (and reduced stride) means each frame uses less future context.
- Beyond a certain point, future context should remain static for a given stride level. This is the upper bound of the amount of future context that can be provided to the model on a global scale.
- future context is calculated (roughly) as (2 ^ stride) * (K // 2) number of future frames. This resultant value should be bound to some global maximum amount of future audio (in ms).

Note: In the special case where K < future_context, it is assumed that the kernel is too small to limit its future context, so symmetric padding is used instead.

Note: There is no explicit limitation on the amount of future context used, as long as K > future_context constraint is maintained. This might lead to cases where future_context is more than half the actual kernel size K! In such cases, the conv layer is utilizing more of the future context than its current and past context to compute the output. While this is possible to do, it is not recommended and the layer will raise a warning to notify the user of such cases. It is advised to simply use symmetric padding for such cases.

Example: Say we have a model that performs 8x stride and receives spectrogram frames with stride of 0.01 s. Say we wish to upper bound future context to 80 ms.

Layer ID, Kernel Size, Stride, Future Context, Global Context
0, K=5, S=1, FC=8, GC= 2 * (2^0) = 2 * 0.01 s (special case, K < FC so use symmetric pad)
1, K=7, S=1, FC=3, GC= 3 * (2^0) = 3 * 0.01 s (note that symmetric pad here uses 3 FC frames!)
2, K=11, S=2, FC=4, GC= 4 * (2^1) = 8 * 0.01 s (note that symmetric pad here uses 5 FC frames!)
3, K=15, S=1, FC=4, GC= 4 * (2^1) = 8 * 0.01 s (note that symmetric pad here uses 7 FC frames!)
4, K=21, S=2, FC=2, GC= 2 * (2^2) = 8 * 0.01 s (note that symmetric pad here uses 10 FC frames!)
5, K=25, S=2, FC=1, GC= 1 * (2^3) = 8 * 0.01 s (note that symmetric pad here uses 12 FC frames!)
6, K=29, S=1, FC=1, GC= 1 * (2^3) = 8 * 0.01 s
…
quantize – Bool flag whether to quantize the Convolutional blocks.
layer_idx (int, optional) – can be specified to allow layer output capture for InterCTC loss. Defaults to -1.
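
The padding rule described for future_context can be written out as a small function. This is a minimal sketch, assuming odd kernel sizes, a dilation of 1, and symmetric padding of K // 2 frames per side; conv_padding is a hypothetical helper, and treating K == future_context as the symmetric case is a choice made here to avoid a negative left context.

```python
from typing import Tuple


def conv_padding(kernel_size: int, future_context: int = -1) -> Tuple[int, int]:
    """Return (left context, right context) in frames for one conv layer."""
    symmetric = (kernel_size // 2, kernel_size // 2)
    if future_context < 0:
        # Default of -1: fall back to symmetric padding.
        return symmetric
    if kernel_size <= future_context:
        # Kernel too small to limit its future context: use symmetric padding.
        return symmetric
    return (kernel_size - 1 - future_context, future_context)


small_kernel = conv_padding(5, 8)    # K < future_context
limited = conv_padding(11, 4)        # asymmetric: more past than future
default = conv_padding(21)           # future_context left at -1
```

In every case the two sides add up to K - 1 frames, so the output length matches that of symmetric “same” padding; only the split between past and future changes.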

forward( input_: Tuple[List[torch.Tensor], torch.Tensor | None], ) → Tuple[List[torch.Tensor], torch.Tensor | None]#

Forward pass of the module.

Parameters:: input_ – The input is a tuple of two values - the preprocessed audio signal as well as the lengths of the audio signal. The audio signal is padded to the shape [B, D, T] and the lengths are a torch vector of length B.
Returns:: The output of the block after processing the input through repeat number of sub-blocks, as well as the lengths of the encoded audio after padding/striding.

Mixins#

class nemo.collections.asr.parts.mixins.mixins.ASRBPEMixin#

Bases: ABC

ASR BPE Mixin class that sets up a Tokenizer via a config

This mixin class adds the method _setup_tokenizer(…), which can be used by ASR models which depend on subword tokenization.

The _setup_tokenizer method adds the following attributes to the class -

tokenizer_cfg: The resolved config supplied to the tokenizer (with dir and type arguments).
tokenizer_dir: The directory path to the tokenizer vocabulary + additional metadata.
tokenizer_type: The type of the tokenizer. Currently supports bpe and wpe, as well as agg.
vocab_path: Resolved path to the vocabulary text file.

In addition to these variables, the method will also instantiate and preserve a tokenizer (subclass of TokenizerSpec) if successful, and assign it to self.tokenizer.

The mixin also supports aggregate tokenizers, which consist of ordinary, monolingual tokenizers. If a conversion between a monolingual and an aggregate tokenizer (or vice versa) is detected, all registered artifacts will be cleaned up.

save_tokenizers(directory: str)#

Save the model tokenizer(s) to the specified directory.

Parameters:: directory – The directory to save the tokenizer(s) to.

class nemo.collections.asr.parts.mixins.mixins.ASRModuleMixin#

Bases: ASRAdapterModelMixin

ASRModuleMixin is a mixin class added to ASR models in order to add methods that are specific to a particular instantiation of a module inside of an ASRModel.

Each method should first check that the module is present within the subclass, and support additional functionality if the corresponding module is present.

change_attention_model( self_attention_model: str | None = None, att_context_size: List[int] | None = None, update_config: bool = True, )#

Update the self_attention_model if function is available in encoder.

Parameters:

self_attention_model (str) –
type of the attention layer and positional encoding

’rel_pos’:
relative positional embedding and Transformer-XL

’rel_pos_local_attn’:
relative positional embedding and Transformer-XL with local attention using overlapping windows. Attention context is determined by att_context_size parameter.

’abs_pos’:
absolute positional embedding and Transformer

If None is provided, the self_attention_model isn’t changed. Defaults to None.
att_context_size (List[int]) – List of 2 ints corresponding to left and right attention context sizes, or None to keep as it is. Defaults to None.
update_config (bool) – Whether to update the config or not with the new attention model. Defaults to True.

change_conv_asr_se_context_window( context_window: int, update_config: bool = True, )#

Update the context window of the SqueezeExcitation module if the provided model contains an encoder which is an instance of ConvASREncoder.

Parameters:

context_window –
An integer representing the number of input timeframes that will be used to compute the context. Each timeframe corresponds to a single window stride of the STFT features.

Say the window_stride = 0.01s, then a context window of 128 represents 128 * 0.01 s of context to compute the Squeeze step.
update_config – Whether to update the config or not with the new context window.

change_subsampling_conv_chunking_factor( subsampling_conv_chunking_factor: int, update_config: bool = True, )#

Update the conv_chunking_factor (int) if the function is available in the encoder. Default is 1 (auto). Set it to -1 (disabled) or to a specific value (power of 2) if you run out of memory (OOM) in the conv subsampling layers.

Parameters:: subsampling_conv_chunking_factor (int)

conformer_stream_step( processed_signal: torch.Tensor, processed_signal_length: torch.Tensor | None = None, cache_last_channel: torch.Tensor | None = None, cache_last_time: torch.Tensor | None = None, cache_last_channel_len: torch.Tensor | None = None, keep_all_outputs: bool = True, previous_hypotheses: List[Hypothesis] | None = None, previous_pred_out: torch.Tensor | None = None, drop_extra_pre_encoded: int | None = None, return_transcription: bool = True, return_log_probs: bool = False, )#

It simulates a forward step with caching for streaming purposes. It supports the ASR models where their encoder supports streaming like Conformer. :param processed_signal: the input audio signals :param processed_signal_length: the length of the audios :param cache_last_channel: the cache tensor for last channel layers like MHA :param cache_last_channel_len: lengths for cache_last_channel :param cache_last_time: the cache tensor for last time layers like convolutions :param keep_all_outputs: if set to True, would not drop the extra outputs specified by encoder.streaming_cfg.valid_out_len :param previous_hypotheses: the hypotheses from the previous step for RNNT models :param previous_pred_out: the predicted outputs from the previous step for CTC models :param drop_extra_pre_encoded: number of steps to drop from the beginning of the outputs after the downsampling module. This can be used if extra paddings are added on the left side of the input. :param return_transcription: whether to decode and return the transcriptions. It can not get disabled for Transducer models. :param return_log_probs: whether to return the log probs, only valid for ctc model

Returns:

greedy_predictions – the greedy predictions from the decoder
all_hyp_or_transcribed_texts – the decoder hypotheses for Transducer models and the transcriptions for CTC models
cache_last_channel_next – the updated tensor cache for last channel layers to be used for next streaming step
cache_last_time_next – the updated tensor cache for last time layers to be used for next streaming step
cache_last_channel_next_len – the updated lengths for cache_last_channel
best_hyp – the best hypotheses for the Transducer models
log_probs – the logits tensor of current streaming chunk, only returned when return_log_probs=True
encoded_len – the length of the output log_probs + history chunk log_probs, only returned when return_log_probs=True
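The cache arguments and the matching *_next return values are meant to be threaded from one call to the next. The sketch below illustrates only that calling pattern. It uses a hypothetical stand-in, toy_stream_step, that keeps the last few frames as its cache; it does not call NeMo, and the real method takes and returns the tensors documented above.

```python
from typing import List, Tuple


def toy_stream_step(
    chunk: List[float], cache: List[float], cache_size: int = 4
) -> Tuple[List[float], List[float]]:
    """Hypothetical stand-in for one streaming step (not the NeMo method).

    For every frame in the chunk, emits the mean of that frame and up to
    cache_size frames before it, then returns the updated cache.
    """
    context = cache + chunk
    outputs = []
    for i in range(len(cache), len(context)):
        window = context[max(0, i - cache_size): i + 1]
        outputs.append(sum(window) / len(window))
    # Keep only the most recent frames for the next call.
    next_cache = context[-cache_size:] if cache_size > 0 else []
    return outputs, next_cache


def run_streaming(signal: List[float], chunk_size: int = 3, cache_size: int = 4) -> List[float]:
    """Feed the signal chunk by chunk, passing the cache from step to step."""
    cache: List[float] = []
    outputs: List[float] = []
    for start in range(0, len(signal), chunk_size):
        chunk = signal[start:start + chunk_size]
        step_out, cache = toy_stream_step(chunk, cache, cache_size)
        outputs.extend(step_out)
    return outputs
```

Because the cache carries the left context across chunk boundaries, the chunked run produces the same outputs as a single pass over the whole signal in this toy setting.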

transcribe_simulate_cache_aware_streaming( paths2audio_files: List[str], batch_size: int = 4, logprobs: bool = False, return_hypotheses: bool = False, online_normalization: bool = False, )#

Parameters:

paths2audio_files – (a list) of paths to audio files.
batch_size – (int) batch size to use during inference. A bigger batch size gives better throughput but uses more memory.
logprobs – (bool) pass True to get log probabilities instead of transcripts.
return_hypotheses – (bool) Whether to return hypotheses instead of text. Hypotheses allow postprocessing such as extracting timestamps or rescoring.
online_normalization – (bool) Perform normalization on the fly, per chunk.

Returns:

A list of transcriptions (or raw log probabilities if logprobs is True) in the same order as paths2audio_files

class nemo.collections.asr.parts.mixins.transcription.TranscriptionMixin#

Bases: ABC

An abstract class for transcribe-able models.

Creates a template function transcribe() that provides an interface to perform transcription of audio tensors or filepaths.

The following abstract methods must be implemented by the subclass:

_transcribe_input_manifest_processing():
Process the provided input arguments (filepaths only) and return a config dict for the dataloader. The dataloader should generally operate on NeMo manifests.

_setup_transcribe_dataloader():
Setup the dataloader for transcription. Receives the output from _transcribe_input_manifest_processing().

_transcribe_forward():
Implements the model’s custom forward pass to return outputs that are processed by _transcribe_output_processing().

_transcribe_output_processing():
Implements the post processing of the model’s outputs to return the results to the user. The result can be a list of objects, list of list of objects, tuple of objects, tuple of list of objects, or a dict of list of objects.
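The division of labor between the template function and the four hooks can be sketched as follows. ToyTranscriber and UppercaseTranscriber are hypothetical classes written for illustration, and the hook signatures are simplified assumptions; the real mixin passes additional arguments.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class ToyTranscriber(ABC):
    """Hypothetical mirror of the template pattern; not the NeMo class."""

    def transcribe(self, audio: List[str], batch_size: int = 4) -> List[Any]:
        # 1. Turn the user input into a dataloader config.
        config = self._transcribe_input_manifest_processing(audio, batch_size)
        # 2. Build the dataloader from that config.
        loader = self._setup_transcribe_dataloader(config)
        results: List[Any] = []
        for batch in loader:
            # 3. Model-specific forward pass.
            outputs = self._transcribe_forward(batch)
            # 4. Model-specific post-processing.
            results.extend(self._transcribe_output_processing(outputs))
        return results

    @abstractmethod
    def _transcribe_input_manifest_processing(self, audio: List[str], batch_size: int) -> Dict: ...

    @abstractmethod
    def _setup_transcribe_dataloader(self, config: Dict) -> List[List[str]]: ...

    @abstractmethod
    def _transcribe_forward(self, batch: List[str]) -> List[str]: ...

    @abstractmethod
    def _transcribe_output_processing(self, outputs: List[str]) -> List[str]: ...


class UppercaseTranscriber(ToyTranscriber):
    """Toy subclass: 'transcribes' a path by upper-casing its file name."""

    def _transcribe_input_manifest_processing(self, audio, batch_size):
        return {"paths": list(audio), "batch_size": batch_size}

    def _setup_transcribe_dataloader(self, config):
        paths, size = config["paths"], config["batch_size"]
        return [paths[i:i + size] for i in range(0, len(paths), size)]

    def _transcribe_forward(self, batch):
        return [path.rsplit("/", 1)[-1] for path in batch]

    def _transcribe_output_processing(self, outputs):
        return [name.upper() for name in outputs]
```

The subclass never overrides transcribe() itself; it only fills in the four steps, so the batching loop stays in one place.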

Template function that defines the execution strategy for transcribing audio.

Parameters:

audio – (a single or list) of paths to audio files or a np.ndarray audio array. Can also be a dataloader object that provides values that can be consumed by the model. Recommended length per file is between 5 and 25 seconds, but a file a few hours long can be passed if enough GPU memory is available.
batch_size – (int) batch size to use during inference. A bigger batch size gives better throughput but uses more memory.
return_hypotheses – (bool) Whether to return hypotheses instead of text. Hypotheses allow postprocessing such as extracting timestamps or rescoring.
num_workers – (int) number of workers for DataLoader
channel_selector (int | Iterable[int] | str) – select a single channel or a subset of channels from multi-channel audio. If set to ‘average’, it performs averaging across channels. Disabled if set to None. Defaults to None. Uses zero-based indexing.
augmentor – (DictConfig): Augment audio samples during transcription if augmentor is applied.
verbose – (bool) whether to display tqdm progress bar
timestamps – Optional(Bool): timestamps will be returned if set to True as part of hypothesis object (output.timestep[‘segment’]/output.timestep[‘word’]). Refer to Hypothesis class for more details. Default is None and would retain the previous state set by using self.change_decoding_strategy().
override_config – (Optional[TranscribeConfig]) override transcription config pre-defined by the user. Note: All other arguments in the function will be ignored if override_config is passed. Pass this argument as model.transcribe(audio, override_config=TranscribeConfig(…)).
**config_kwargs – (Optional[Dict]) additional arguments to override the default TranscribeConfig. Note: If override_config is passed, these arguments will be ignored.

Returns:

Output is defined by the subclass implementation of TranscriptionMixin._transcribe_output_processing(). It can be:

List[str/Hypothesis]

List[List[str/Hypothesis]]

Tuple[str/Hypothesis]

Tuple[List[str/Hypothesis]]

Dict[str, List[str/Hypothesis]]

transcribe_generator( audio, override_config: TranscribeConfig | None, )#: A generator version of transcribe function.

class nemo.collections.asr.parts.mixins.transcription.TranscribeConfig( batch_size: int = 4, return_hypotheses: bool = False, num_workers: int | None = None, channel_selector: int | Iterable[int] | str = None, augmentor: omegaconf.DictConfig | None = None, timestamps: bool | None = None, verbose: bool = True, partial_hypothesis: List[Any] | None = None, _internal: nemo.collections.asr.parts.mixins.transcription.InternalTranscribeConfig | None = None, )#: Bases: object
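The precedence rule stated for transcribe() (override_config wins, otherwise explicit arguments and **config_kwargs are merged into a default config) can be illustrated with a small stand-in. ToyTranscribeConfig and resolve_config are hypothetical names; only a subset of the fields listed above is mirrored, and the merging logic is an assumption rather than the library's code.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class ToyTranscribeConfig:
    """Hypothetical subset of the TranscribeConfig fields listed above."""
    batch_size: int = 4
    return_hypotheses: bool = False
    num_workers: Optional[int] = None
    verbose: bool = True


def resolve_config(
    batch_size: int = 4,
    return_hypotheses: bool = False,
    override_config: Optional[ToyTranscribeConfig] = None,
    **config_kwargs,
) -> ToyTranscribeConfig:
    """Mimic the documented precedence: override_config ignores everything else."""
    if override_config is not None:
        return override_config
    base = ToyTranscribeConfig(batch_size=batch_size, return_hypotheses=return_hypotheses)
    # dataclasses.replace raises TypeError for unknown field names.
    return replace(base, **config_kwargs)
```

With this sketch, resolve_config(batch_size=8, num_workers=2) yields a config with both values set, while passing override_config returns that object untouched.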

class nemo.collections.asr.parts.mixins.interctc_mixin.InterCTCMixin#

Bases: object

Adds utilities for computing interCTC loss from https://arxiv.org/abs/2102.03216.

To use, make sure encoder accesses interctc['capture_layers'] property in the AccessMixin and registers interctc/layer_output_X and interctc/layer_length_X for all layers that we want to get loss from. Additionally, specify the following config parameters to set up loss:

interctc:
    # can use different values
    loss_weights: [0.3]
    apply_at_layers: [8]

Then call

self.setup_interctc(ctc_decoder_name, ctc_loss_name, ctc_wer_name) in the init method

self.add_interctc_losses after computing regular loss.

self.finalize_interctc_metrics(metrics, outputs, prefix="val_") in the multi_validation_epoch_end method.

self.finalize_interctc_metrics(metrics, outputs, prefix="test_") in the multi_test_epoch_end method.

add_interctc_losses( loss_value: torch.Tensor, transcript: torch.Tensor, transcript_len: torch.Tensor, compute_wer: bool, compute_loss: bool = True, log_wer_num_denom: bool = False, log_prefix: str = '', ) → Tuple[torch.Tensor | None, Dict]#

Adds interCTC losses if required.

Will also register loss/wer metrics in the returned dictionary.

Parameters:

loss_value (torch.Tensor) – regular loss tensor (will add interCTC loss to it).
transcript (torch.Tensor) – current utterance transcript.
transcript_len (torch.Tensor) – current utterance transcript length.
compute_wer (bool) – whether to compute WER for the current utterance. Should typically be True for validation/test and only True for training if current batch WER should be logged.
compute_loss (bool) – whether to compute loss for the current utterance. Should always be True in training and almost always True in validation, unless all other losses are disabled as well. Defaults to True.
log_wer_num_denom (bool) – if True, will additionally log WER num/denom in the returned metrics dictionary. Should always be True for validation/test to allow correct metrics aggregation. Should always be False for training. Defaults to False.
log_prefix (str) – prefix added to all log values. Should be "" for training and "val_" for validation. Defaults to “”.

Returns:

tuple of new loss tensor and dictionary with logged metrics.

Return type:

tuple[Optional[torch.Tensor], Dict]
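As a rough illustration of how loss_weights and apply_at_layers from the config could combine with the regular loss, the sketch below uses plain floats. It assumes the main loss is weighted by 1 - sum(loss_weights) and that the two lists must have equal length; both are assumptions about the implementation, and the function name and metric keys are hypothetical.

```python
from typing import Dict, List, Tuple


def combine_interctc_losses(
    main_loss: float,
    layer_losses: Dict[int, float],
    apply_at_layers: List[int],
    loss_weights: List[float],
    log_prefix: str = "",
) -> Tuple[float, Dict[str, float]]:
    """Weighted sum of the main loss and the intermediate-layer CTC losses."""
    if len(apply_at_layers) != len(loss_weights):
        raise ValueError("apply_at_layers and loss_weights must have the same length")
    # Assumption: whatever weight is not given to intermediate layers goes to the main loss.
    main_weight = 1.0 - sum(loss_weights)
    if main_weight <= 0.0:
        raise ValueError("sum of loss_weights must be below 1.0")

    metrics = {f"{log_prefix}final_loss": main_loss}
    total = main_weight * main_loss
    for layer, weight in zip(apply_at_layers, loss_weights):
        total += weight * layer_losses[layer]
        metrics[f"{log_prefix}inter_ctc_loss_l{layer}"] = layer_losses[layer]
    return total, metrics
```

With the config shown earlier (loss_weights: [0.3], apply_at_layers: [8]), a main loss of 2.0 and a layer-8 loss of 4.0 give 0.7 * 2.0 + 0.3 * 4.0 under these assumptions.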

finalize_interctc_metrics( metrics: Dict, outputs: List[Dict], prefix: str, )#

Finalizes InterCTC WER and loss metrics for logging purposes.

Should be called inside multi_validation_epoch_end (with prefix="val_") or multi_test_epoch_end (with prefix="test_").

Note that metrics dictionary is going to be updated in-place.

get_captured_interctc_tensors() → List[Tuple[torch.Tensor, torch.Tensor]]#

Returns a list of captured tensors from encoder: tuples of (output, length).

Will additionally apply ctc_decoder to the outputs.

get_interctc_param(param_name)#: Either directly get parameter from self._interctc_params or call getattr with the corresponding name.

is_interctc_enabled() → bool#: Returns whether interCTC loss is enabled.

set_interctc_enabled(enabled: bool)#: Can be used to enable/disable InterCTC manually.

set_interctc_param(param_name, param_value)#

Sets the parameter in the self._interctc_params dictionary.

Raises an error if trying to set decoder, loss or wer as those should always come from the main class.

setup_interctc(decoder_name, loss_name, wer_name)#

Sets up all interctc-specific parameters and checks config consistency.

Caller has to specify names of attributes to perform CTC-specific WER, decoder and loss computation. They will be looked up in the class state with getattr.

We take the names and look the objects up later because those objects might change without the setup of this class being called again. This way we always look up the most up-to-date object instead of “caching” it here.
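The difference between storing an attribute name and caching the object can be shown with a toy class. NameLookupDemo and its attributes are hypothetical and exist only to illustrate the getattr-based design described here.

```python
class NameLookupDemo:
    """Hypothetical class contrasting lookup-by-name with a cached reference."""

    def __init__(self) -> None:
        self.decoder = "decoder-v1"
        self._params = {}
        self._cached_decoder = None

    def setup(self, decoder_name: str) -> None:
        # Store only the attribute name, not the object itself.
        self._params["decoder_name"] = decoder_name
        # For contrast: a cached reference that can go stale.
        self._cached_decoder = getattr(self, decoder_name)

    def current_decoder(self):
        # Resolved at call time, so it always reflects the latest attribute value.
        return getattr(self, self._params["decoder_name"])


demo = NameLookupDemo()
demo.setup("decoder")
demo.decoder = "decoder-v2"  # the decoder is replaced after setup
```

After the replacement, current_decoder() sees the new object while the cached reference still points at the old one.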

Datasets#

Character Encoding Datasets#

class nemo.collections.asr.data.audio_to_text.AudioToCharDataset(*args: Any, **kwargs: Any)#

Bases: _AudioTextDataset

Dataset that loads tensors via a json file containing paths to audio files, transcripts, and durations (in seconds). Each new line is a different sample. Example below:

    {"audio_filepath": "/path/to/audio.wav", "text_filepath": "/path/to/audio.txt", "duration": 23.147}
    …
    {"audio_filepath": "/path/to/audio.wav", "text": "the transcription", "offset": 301.75, "duration": 0.82, "utt": "utterance_id", "ctm_utt": "en_4156", "side": "A"}
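A manifest in this one-JSON-object-per-line layout can be produced and read with the standard library. The sketch below is not the dataset's own parser: write_manifest and read_manifest are hypothetical helpers, and the duration and utterance-count filtering only approximates what the min_duration, max_duration and max_utts parameters of this class describe.

```python
import json
import os
import tempfile
from typing import Dict, List, Optional


def write_manifest(path: str, entries: List[Dict]) -> None:
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for entry in entries:
            f.write(json.dumps(entry) + "\n")


def read_manifest(
    path: str,
    min_duration: Optional[float] = None,
    max_duration: Optional[float] = None,
    max_utts: int = 0,
) -> List[Dict]:
    """Read a manifest, dropping entries outside the duration bounds."""
    kept: List[Dict] = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            entry = json.loads(line)
            duration = entry["duration"]
            if min_duration is not None and duration < min_duration:
                continue
            if max_duration is not None and duration > max_duration:
                continue
            kept.append(entry)
            # Assumption: max_utts == 0 means "no limit".
            if max_utts > 0 and len(kept) >= max_utts:
                break
    return kept


entries = [
    {"audio_filepath": "/path/to/a.wav", "text": "hello world", "duration": 2.5},
    {"audio_filepath": "/path/to/b.wav", "text": "too long", "duration": 30.0},
    {"audio_filepath": "/path/to/c.wav", "text": "ok", "duration": 0.82},
]
with tempfile.TemporaryDirectory() as tmp:
    manifest_path = os.path.join(tmp, "manifest.json")
    write_manifest(manifest_path, entries)
    kept = read_manifest(manifest_path, min_duration=0.1, max_duration=20.0)
```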

Parameters:

manifest_filepath – Path to manifest json as described above. Can be comma-separated paths.
labels – String containing all the possible characters to map to
sample_rate (int) – Sample rate to resample loaded audio to
int_values (bool) – If true, load samples as 32-bit integers. Defaults to False.
augmentor (nemo.collections.asr.parts.perturb.AudioAugmentor) – An AudioAugmentor object used to augment loaded audio
max_duration – If audio exceeds this length, do not include in dataset
min_duration – If audio is less than this length, do not include in dataset
max_utts – Limit number of utterances
blank_index – blank character index, default = -1
unk_index – unk_character index, default = -1
normalize – whether to normalize transcript text. Defaults to True.
bos_id – Id of beginning of sequence symbol to append if not None
eos_id – Id of end of sequence symbol to append if not None
return_sample_id (bool) – whether to return the sample_id as a part of each sample
channel_selector (int | Iterable[int] | str) – select a single channel or a subset of channels from multi-channel audio. If set to ‘average’, it performs averaging across channels. Disabled if set to None. Defaults to None. Uses zero-based indexing.
manifest_parse_func – Optional function to parse manifest entries. Defaults to None.
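The channel_selector options described in the list above (an index, a subset of indices, ‘average’, or None) can be sketched on plain Python lists. select_channels is a hypothetical helper and represents multi-channel audio as one list of samples per channel, which is an assumption made for illustration.

```python
from typing import Iterable, List, Union

Channels = List[List[float]]  # one inner list of samples per channel


def select_channels(
    audio: Channels,
    channel_selector: Union[None, int, str, Iterable[int]] = None,
):
    """Apply a channel_selector using zero-based channel indices."""
    if channel_selector is None:
        # Disabled: return the audio unchanged.
        return audio
    if channel_selector == "average":
        n_channels = len(audio)
        return [sum(samples) / n_channels for samples in zip(*audio)]
    if isinstance(channel_selector, int):
        return audio[channel_selector]
    if isinstance(channel_selector, str):
        raise ValueError(f"unsupported channel_selector: {channel_selector!r}")
    # Any other iterable is treated as a subset of channel indices.
    return [audio[i] for i in channel_selector]
```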

property output_types: Dict[str, NeuralType] | None#: Returns definitions of module output ports.

class nemo.collections.asr.data.audio_to_text.TarredAudioToCharDataset(*args: Any, **kwargs: Any)#

Bases: decorator

A similar Dataset to the AudioToCharDataset, but which loads tarred audio files.

Accepts a single comma-separated JSON manifest file (in the same style as for the AudioToCharDataset), as well as the path(s) to the tarball(s) containing the wav files. Each line of the manifest should contain the information for one audio file, including at least the transcript and name of the audio file within the tarball.

Valid formats for the audio_tar_filepaths argument include: (1) a single string that can be brace-expanded, e.g. ‘path/to/audio.tar’ or ‘path/to/audio_{1..100}.tar.gz’, or (2) a list of file paths that will not be brace-expanded, e.g. [‘audio_1.tar’, ‘audio_2.tar’, …].
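The two accepted forms can be illustrated with a minimal expander. This sketch handles only ascending numeric ranges such as {1..100} and is a simplified stand-in, not the brace-expansion routine the dataset actually uses; expand_tar_paths is a hypothetical helper name.

```python
import re
from typing import List, Union

# Matches numeric ranges such as {1..100}.
_RANGE = re.compile(r"\{(\d+)\.\.(\d+)\}")


def expand_tar_paths(audio_tar_filepaths: Union[str, List[str]]) -> List[str]:
    """Expand a brace-range string; return a list input unchanged."""
    if not isinstance(audio_tar_filepaths, str):
        # Lists are taken literally and are not brace-expanded.
        return list(audio_tar_filepaths)
    match = _RANGE.search(audio_tar_filepaths)
    if match is None:
        return [audio_tar_filepaths]
    start, end = int(match.group(1)), int(match.group(2))
    head = audio_tar_filepaths[:match.start()]
    tail = audio_tar_filepaths[match.end():]
    expanded: List[str] = []
    for i in range(start, end + 1):
        # Recurse so that a second range in the tail is expanded as well.
        expanded.extend(expand_tar_paths(f"{head}{i}{tail}"))
    return expanded
```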

See the WebDataset documentation for more information about accepted data and input formats.

If using multiple workers, the number of shards should be divisible by world_size to ensure an even split among workers. If it is not divisible, a warning is logged but training proceeds. In addition, if using multiprocessing, each shard MUST HAVE THE SAME NUMBER OF ENTRIES after filtering is applied. We currently do not check for this, but your program may hang if the shards are uneven!

Notice that a few arguments are different from the AudioToCharDataset; for example, shuffle (bool) has been replaced by shuffle_n (int).
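Since a tarred dataset is read as a stream, shuffling is done through a look-ahead buffer rather than a global permutation, which is why an integer replaces the boolean. The generator below sketches a generic buffered shuffle under that assumption; it is not WebDataset's implementation, and buffered_shuffle and its seed argument are hypothetical.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def buffered_shuffle(stream: Iterable[T], shuffle_n: int, seed: int = 0) -> Iterator[T]:
    """Yield items from stream, shuffled within a buffer of shuffle_n items."""
    if shuffle_n <= 0:
        # No buffer: keep the original order.
        yield from stream
        return
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in stream:
        buffer.append(item)
        if len(buffer) >= shuffle_n:
            # Emit a random element of the buffer to make room for the next item.
            idx = rng.randrange(len(buffer))
            buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
            yield buffer.pop()
    # Flush what is left once the stream ends.
    rng.shuffle(buffer)
    yield from buffer
```

A larger shuffle_n mixes samples from further apart in the stream at the cost of holding more samples in memory.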

Additionally, please note that the len() of this DataLayer is assumed to be the length of the manifest after filtering. An incorrect manifest length may lead to some DataLoader issues down the line.

Parameters:

audio_tar_filepaths – Either a list of audio tarball filepaths, or a string (can be brace-expandable).
manifest_filepath (str) – Path to the manifest.
labels (list) – List of characters that can be output by the ASR model. For Jasper, this is the 28 character set {a-z ‘}. The CTC blank symbol is automatically added later for models using ctc.
sample_rate (int) – Sample rate to resample loaded audio to
int_values (bool) – If true, load samples as 32-bit integers. Defaults to False.
augmentor (nemo.collections.asr.parts.perturb.AudioAugmentor) – An AudioAugmentor object used to augment loaded audio
shuffle_n (int) – How many samples to look ahead and load to be shuffled. See WebDataset documentation for more details. Defaults to 0.
min_duration (float) – Dataset parameter. All training files which have a duration less than min_duration are dropped. Note: Duration is read from the manifest JSON. Defaults to 0.1.
max_duration (float) – Dataset parameter. All training files which have a duration more than max_duration are dropped. Note: Duration is read from the manifest JSON. Defaults to None.
blank_index (int) – Blank character index, defaults to -1.
unk_index (int) – Unknown character index, defaults to -1.
normalize (bool) – Dataset parameter. Whether to use automatic text cleaning. It is highly recommended to manually clean text for best results. Defaults to True.
trim (bool) – Whether to trim silence from the beginning and end of the audio signal using librosa.effects.trim(). Defaults to False.
bos_id (id) – Dataset parameter. Beginning of string symbol id used for seq2seq models. Defaults to None.
eos_id (id) – Dataset parameter. End of string symbol id used for seq2seq models. Defaults to None.
pad_id (id) – Token used to pad when collating samples in batches. If this is None, pads using 0s. Defaults to None.
shard_strategy (str) –
Tarred dataset shard distribution strategy chosen as a str value during ddp.
- scatter: The default shard strategy applied by WebDataset, where each node gets a unique set of shards, which are permanently pre-allocated and never changed at runtime.
- replicate: Optional shard strategy, where each node gets all of the set of shards available in the tarred dataset, which are permanently pre-allocated and never changed at runtime. The benefit of replication is that it allows each node to sample data points from the entire dataset independently of other nodes, and reduces dependence on value of shuffle_n.
  
  Warning
  
  Replicated strategy allows every node to sample the entire set of available tarfiles, and therefore more than one node may sample the same tarfile, and even sample the same data points! As such, there is no assured guarantee that all samples in the dataset will be sampled at least once during 1 epoch. Scattered strategy, on the other hand, on specific occasions (when the number of shards is not divisible with world_size), will not sample the entire dataset. For these reasons it is not advisable to use tarred datasets as validation or test datasets.
global_rank (int) – Worker rank, used for partitioning shards. Defaults to 0.
world_size (int) – Total number of processes, used for partitioning shards. Defaults to 0.
return_sample_id (bool) – whether to return the sample_id as a part of each sample
manifest_parse_func – Optional function to parse manifest entries. Defaults to None.
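The two shard_strategy values can be contrasted with a small partitioning sketch. It assumes that scatter hands each rank a contiguous, equally sized slice and leaves any remainder unused, which is consistent with the warning above but is an assumption about the implementation; shards_for_rank is a hypothetical helper and requires world_size of at least 1.

```python
from typing import List


def shards_for_rank(
    shards: List[str], shard_strategy: str, global_rank: int, world_size: int
) -> List[str]:
    """Return the shards a given rank would read under each strategy."""
    if world_size < 1:
        raise ValueError("world_size must be at least 1 in this sketch")
    if shard_strategy == "replicate":
        # Every rank sees the full set of shards.
        return list(shards)
    if shard_strategy != "scatter":
        raise ValueError(f"unknown shard_strategy: {shard_strategy!r}")
    # Assumption: equal contiguous slices; leftover shards are not assigned.
    per_rank = len(shards) // world_size
    begin = per_rank * global_rank
    return shards[begin:begin + per_rank]
```

With five shards and world_size=2, scatter assigns two shards to each rank and the fifth is never read, which matches the case the warning describes for shard counts not divisible by world_size.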

Text-to-Text Datasets for Hybrid ASR-TTS models#

class nemo.collections.asr.data.text_to_text.TextToTextDataset(*args: Any, **kwargs: Any)#

Bases: TextToTextDatasetBase, Dataset

Text-to-Text Map-style Dataset for hybrid ASR-TTS models

collate_fn( batch: List[TextToTextItem | tuple], ) → TextToTextBatch | TextOrAudioToTextBatch | tuple#: Collate function for the dataloader. Can accept a mixed batch of text-to-text items and audio-text items (typical for ASR).

class nemo.collections.asr.data.text_to_text.TextToTextIterableDataset(*args: Any, **kwargs: Any)#

Bases: TextToTextDatasetBase, IterableDataset

Text-to-Text Iterable Dataset for hybrid ASR-TTS models. Only the part necessary for the current process should be loaded and stored.

collate_fn( batch: List[TextToTextItem | tuple], ) → TextToTextBatch | TextOrAudioToTextBatch | tuple#: Collate function for the dataloader. Can accept a mixed batch of text-to-text items and audio-text items (typical for ASR).

Subword Encoding Datasets#

class nemo.collections.asr.data.audio_to_text.AudioToBPEDataset(*args: Any, **kwargs: Any)#

Bases: _AudioTextDataset

Dataset that loads tensors via a json file containing paths to audio files, transcripts, and durations (in seconds). Each new line is a different sample. Example below:

    {"audio_filepath": "/path/to/audio.wav", "text_filepath": "/path/to/audio.txt", "duration": 23.147}
    …
    {"audio_filepath": "/path/to/audio.wav", "text": "the transcription", "offset": 301.75, "duration": 0.82, "utt": "utterance_id", "ctm_utt": "en_4156", "side": "A"}

In practice, the dataset and manifest used for character encoding and byte pair encoding are exactly the same. The only difference lies in how the dataset tokenizes the text in the manifest.
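To make the tokenization difference concrete, the sketch below encodes the same text once per character and once with subword pieces. Both functions are hypothetical. The subword routine is a greedy longest-match over a hand-written toy vocabulary, used here as a stand-in for a trained tokenizer, and mapping unknown characters to unk_index is a simplification.

```python
from typing import Dict, List


def char_tokenize(text: str, labels: List[str], unk_index: int = -1) -> List[int]:
    """Map each character to its position in labels."""
    index = {ch: i for i, ch in enumerate(labels)}
    return [index.get(ch, unk_index) for ch in text]


def subword_tokenize(text: str, vocab: Dict[str, int], unk_id: int = 0) -> List[int]:
    """Greedy longest-match segmentation (stand-in for a trained tokenizer)."""
    ids: List[int] = []
    i = 0
    while i < len(text):
        # Try the longest remaining piece first, then shorter ones.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            # No piece matched: emit the unknown id and skip one character.
            ids.append(unk_id)
            i += 1
    return ids


labels = [" ", "a", "c", "t", "s"]  # toy character set
vocab = {"<unk>": 0, "cat": 1, "s": 2, " ": 3, "a": 4}  # toy subword vocabulary
text = "a cats"
char_ids = char_tokenize(text, labels)
subword_ids = subword_tokenize(text, vocab)
```

The manifest entry is identical in both cases; only the length and meaning of the resulting token sequence differ (six character ids versus four subword ids for this toy input).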

Parameters:

manifest_filepath – Path to manifest json as described above. Can be comma-separated paths.
tokenizer – A subclass of the Tokenizer wrapper found in the common collection, nemo.collections.common.tokenizers.TokenizerSpec. ASR Models support a subset of all available tokenizers.
sample_rate (int) – Sample rate to resample loaded audio to
int_values (bool) – If true, load samples as 32-bit integers. Defaults to False.
augmentor (nemo.collections.asr.parts.perturb.AudioAugmentor) – An AudioAugmentor object used to augment loaded audio
max_duration – If audio exceeds this length, do not include in dataset
min_duration – If audio is less than this length, do not include in dataset
max_utts – Limit number of utterances
trim – Whether to trim silence segments
use_start_end_token – Boolean which dictates whether to add [BOS] and [EOS] tokens to beginning and ending of speech respectively.
return_sample_id (bool) – whether to return the sample_id as a part of each sample
channel_selector (int | Iterable[int] | str) – select a single channel or a subset of channels from multi-channel audio. If set to ‘average’, it performs averaging across channels. Disabled if set to None. Defaults to None. Uses zero-based indexing.
manifest_parse_func – Optional function to parse manifest entries. Defaults to None.

property output_types: Dict[str, NeuralType] | None#: Returns definitions of module output ports.

class nemo.collections.asr.data.audio_to_text.TarredAudioToBPEDataset(*args: Any, **kwargs: Any)#

Bases: decorator

A similar Dataset to the AudioToBPEDataset, but which loads tarred audio files.

Accepts a single comma-separated JSON manifest file (in the same style as for the AudioToBPEDataset), as well as the path(s) to the tarball(s) containing the wav files. Each line of the manifest should contain the information for one audio file, including at least the transcript and name of the audio file within the tarball.

Valid formats for the audio_tar_filepaths argument include: (1) a single string that can be brace-expanded, e.g. ‘path/to/audio.tar’ or ‘path/to/audio_{1..100}.tar.gz’, or (2) a list of file paths that will not be brace-expanded, e.g. [‘audio_1.tar’, ‘audio_2.tar’, …].

See the WebDataset documentation for more information about accepted data and input formats.

If using multiple workers, the number of shards should be divisible by world_size to ensure an even split among workers. If it is not divisible, a warning is logged but training proceeds. In addition, if using multiprocessing, each shard MUST HAVE THE SAME NUMBER OF ENTRIES after filtering is applied. We currently do not check for this, but your program may hang if the shards are uneven!

Notice that a few arguments are different from the AudioToBPEDataset; for example, shuffle (bool) has been replaced by shuffle_n (int).

Additionally, please note that the len() of this DataLayer is assumed to be the length of the manifest after filtering. An incorrect manifest length may lead to some DataLoader issues down the line.

Parameters:

audio_tar_filepaths – Either a list of audio tarball filepaths, or a string (can be brace-expandable).
manifest_filepath (str) – Path to the manifest.
tokenizer (TokenizerSpec) – Either a Word Piece Encoding tokenizer (BERT), or a Sentence Piece Encoding tokenizer (BPE). The CTC blank symbol is automatically added later for models using ctc.
sample_rate (int) – Sample rate to resample loaded audio to
int_values (bool) – If true, load samples as 32-bit integers. Defaults to False.
augmentor (nemo.collections.asr.parts.perturb.AudioAugmentor) – An AudioAugmentor object used to augment loaded audio
shuffle_n (int) – How many samples to look ahead and load to be shuffled. See WebDataset documentation for more details. Defaults to 0.
min_duration (float) – Dataset parameter. All training files which have a duration less than min_duration are dropped. Note: Duration is read from the manifest JSON. Defaults to 0.1.
max_duration (float) – Dataset parameter. All training files which have a duration more than max_duration are dropped. Note: Duration is read from the manifest JSON. Defaults to None.
trim (bool) – Whether to trim silence from the beginning and end of the audio signal using librosa.effects.trim(). Defaults to False.
use_start_end_token – Boolean which dictates whether to add [BOS] and [EOS] tokens to beginning and ending of speech respectively.
pad_id (id) – Token used to pad when collating samples in batches. If this is None, pads using 0s. Defaults to None.
shard_strategy (str) –
Tarred dataset shard distribution strategy chosen as a str value during ddp.
- scatter: The default shard strategy applied by WebDataset, where each node gets a unique set of shards, which are permanently pre-allocated and never changed at runtime.
- replicate: Optional shard strategy, where each node gets all of the set of shards available in the tarred dataset, which are permanently pre-allocated and never changed at runtime. The benefit of replication is that it allows each node to sample data points from the entire dataset independently of other nodes, and reduces dependence on value of shuffle_n.
  
  Warning
  
  Replicated strategy allows every node to sample the entire set of available tarfiles, and therefore more than one node may sample the same tarfile, and even sample the same data points! As such, there is no assured guarantee that all samples in the dataset will be sampled at least once during 1 epoch. Scattered strategy, on the other hand, on specific occasions (when the number of shards is not divisible with world_size), will not sample the entire dataset. For these reasons it is not advisable to use tarred datasets as validation or test datasets.
global_rank (int) – Worker rank, used for partitioning shards. Defaults to 0.
world_size (int) – Total number of processes, used for partitioning shards. Defaults to 0.
return_sample_id (bool) – whether to return the sample_id as a part of each sample
manifest_parse_func – Optional function to parse manifest entries. Defaults to None.

Audio Preprocessors#

class nemo.collections.asr.modules.AudioToMelSpectrogramPreprocessor(*args: Any, **kwargs: Any)#

Bases: AudioPreprocessor, Exportable

Featurizer module that converts wavs to mel spectrograms.

Parameters:

sample_rate (int) – Sample rate of the input audio data. Defaults to 16000
window_size (float) – Size of window for fft in seconds. Defaults to 0.02
window_stride (float) – Stride of window for fft in seconds. Defaults to 0.01
n_window_size (int) – Size of window for fft in samples. Defaults to None. Use one of window_size or n_window_size.
n_window_stride (int) – Stride of window for fft in samples. Defaults to None. Use one of window_stride or n_window_stride.
window (str) – Windowing function for fft. Can be one of [‘hann’, ‘hamming’, ‘blackman’, ‘bartlett’]. Defaults to “hann”
normalize (str) – Can be one of [‘per_feature’, ‘all_features’]; all other options disable feature normalization. ‘all_features’ normalizes the entire spectrogram to be mean 0 with std 1. ‘per_feature’ normalizes per channel / freq instead. Defaults to “per_feature”
n_fft (int) – Length of FT window. If None, it uses the smallest power of 2 that is larger than n_window_size. Defaults to None
preemph (float) – Amount of pre emphasis to add to audio. Can be disabled by passing None. Defaults to 0.97
features (int) – Number of mel spectrogram freq bins to output. Defaults to 64
lowfreq (int) – Lower bound on mel basis in Hz. Defaults to 0
highfreq (int) – Upper bound on mel basis in Hz. Defaults to None
log (bool) – Log features. Defaults to True
log_zero_guard_type (str) – Need to avoid taking the log of zero. There are two options: “add” or “clamp”. Defaults to “add”.
log_zero_guard_value (float, or str) – Add or clamp requires the number to add with or clamp to. log_zero_guard_value can either be a float or “tiny” or “eps”. torch.finfo is used if “tiny” or “eps” is passed. Defaults to 2**-24.
dither (float) – Amount of white-noise dithering. Defaults to 1e-5
pad_to (int) – Ensures that the output size of the time dimension is a multiple of pad_to. Defaults to 16
frame_splicing (int) – Defaults to 1
exact_pad (bool) – If True, sets stft center to False and adds padding, such that num_frames = audio_length // hop_length. Defaults to False.
pad_value (float) – The value that shorter mels are padded with. Defaults to 0
mag_power (float) – The power that the linear spectrogram is raised to prior to multiplication with mel basis. Defaults to 2 for a power spec
rng – Random number generator
nb_augmentation_prob (float) – Probability with which narrowband augmentation would be applied to samples in the batch. Defaults to 0.0
nb_max_freq (int) – Frequency above which all frequencies will be masked for narrowband augmentation. Defaults to 4000
use_torchaudio – Whether to use the torchaudio implementation.
mel_norm – Normalization used for mel filterbank weights. Defaults to ‘slaney’ (area normalization)
stft_exact_pad – Deprecated argument, kept for compatibility with older checkpoints.
stft_conv – Deprecated argument, kept for compatibility with older checkpoints.
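Some of the parameter relationships above reduce to simple arithmetic, sketched here with the documented defaults. The helper names are hypothetical. Treating a window size that is already a power of two as its own n_fft, converting seconds to samples with int(), and the exact frame-count formula (the exact_pad=True case) are assumptions rather than confirmed library behavior.

```python
import math
from typing import Optional, Tuple


def frame_parameters(
    sample_rate: int = 16000,
    window_size: float = 0.02,
    window_stride: float = 0.01,
    n_fft: Optional[int] = None,
) -> Tuple[int, int, int]:
    """Convert window settings from seconds to samples and pick n_fft."""
    n_window_size = int(window_size * sample_rate)
    n_window_stride = int(window_stride * sample_rate)
    if n_fft is None:
        # Smallest power of two that is at least n_window_size.
        n_fft = 1 << (n_window_size - 1).bit_length()
    return n_window_size, n_window_stride, n_fft


def padded_num_frames(audio_length: int, n_window_stride: int, pad_to: int = 16) -> int:
    """Frame count for exact_pad=True, rounded up to a multiple of pad_to."""
    num_frames = audio_length // n_window_stride
    if pad_to > 0:
        num_frames = math.ceil(num_frames / pad_to) * pad_to
    return num_frames


def guarded_log(value: float, guard_type: str = "add", guard_value: float = 2 ** -24) -> float:
    """Take a log that never sees zero, using the 'add' or 'clamp' guard."""
    if guard_type == "add":
        return math.log(value + guard_value)
    if guard_type == "clamp":
        return math.log(max(value, guard_value))
    raise ValueError(f"unknown log_zero_guard_type: {guard_type!r}")
```

For 16 kHz audio with the defaults this gives a 320-sample window, a 160-sample stride and n_fft of 512; one second of audio then yields 100 frames, padded to 112.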

input_example( max_batch: int = 8, max_dim: int = 32000, min_length: int = 200, )#: Override this method if random inputs won’t work. Returns a tuple sample of valid input data.

property input_types#: Returns definitions of module input ports.

property output_types#

Returns definitions of module output ports.

processed_signal:
    0: AxisType(BatchTag)
    1: AxisType(MelSpectrogramSignalTag)
    2: AxisType(ProcessedTimeTag)
processed_length:
    0: AxisType(BatchTag)

classmethod restore_from(restore_path: str)#

Restores model instance (weights and configuration) from a .nemo file

Parameters:

restore_path – path to .nemo file from which model should be instantiated
override_config_path – path to a yaml config that will override the internal config file or an OmegaConf / DictConfig object representing the model config.
map_location – Optional torch.device() to map the instantiated model to a device. By default (None), it will select a GPU if available, falling back to CPU otherwise.
strict – Passed to load_state_dict. By default True
return_config – If set to true, will return just the underlying config of the restored model as an OmegaConf DictConfig object without instantiating the model.
trainer – An optional Trainer object, passed to the model constructor.
save_restore_connector – An optional SaveRestoreConnector object that defines the implementation of the restore_from() method.

save_to(save_path: str)#

Standardized method to save a tarfile containing the checkpoint, config, and any additional artifacts. Implemented via nemo.core.connectors.save_restore_connector.SaveRestoreConnector.save_to().

Parameters:

save_path – str, path to where the file should be saved.
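Because the saved file is described as a tarfile, its contents can be inspected with the standard tarfile module. The sketch below builds a toy archive and lists its members. The member names and the toy_model.nemo file name are hypothetical placeholders, and nothing here calls save_to() or restore_from().

```python
import io
import os
import tarfile
import tempfile
from typing import List


def list_archive_members(path: str) -> List[str]:
    """Return the sorted member names of a tar archive (any compression)."""
    with tarfile.open(path, "r:*") as tar:
        return sorted(tar.getnames())


def _add_bytes(tar: tarfile.TarFile, name: str, payload: bytes) -> None:
    """Add an in-memory payload to the archive under the given name."""
    info = tarfile.TarInfo(name=name)
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))


with tempfile.TemporaryDirectory() as tmp:
    archive_path = os.path.join(tmp, "toy_model.nemo")
    with tarfile.open(archive_path, "w") as tar:
        # Hypothetical member names standing in for the config and checkpoint.
        _add_bytes(tar, "model_config.yaml", b"sample_rate: 16000\n")
        _add_bytes(tar, "model_weights.ckpt", b"\x00\x01")
    members = list_archive_members(archive_path)
```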

class nemo.collections.asr.modules.AudioToMFCCPreprocessor(*args: Any, **kwargs: Any)#

Bases: AudioPreprocessor

Preprocessor that converts wavs to MFCCs. Uses torchaudio.transforms.MFCC.

Parameters:

sample_rate – The sample rate of the audio. Defaults to 16000.
window_size – Size of window for fft in seconds. Used to calculate the win_length arg for mel spectrogram. Defaults to 0.02
window_stride – Stride of window for fft in seconds. Used to calculate the hop_length arg for mel spect. Defaults to 0.01
n_window_size – Size of window for fft in samples. Defaults to None. Use one of window_size or n_window_size.
n_window_stride – Stride of window for fft in samples. Defaults to None. Use one of window_stride or n_window_stride.
window – Windowing function for fft. can be one of [‘hann’, ‘hamming’, ‘blackman’, ‘bartlett’, ‘none’, ‘null’]. Defaults to ‘hann’
n_fft – Length of FT window. If None, it uses the smallest power of 2 that is larger than n_window_size. Defaults to None
lowfreq (int) – Lower bound on mel basis in Hz. Defaults to 0
highfreq (int) – Upper bound on mel basis in Hz. Defaults to None
n_mels – Number of mel filterbanks. Defaults to 64
n_mfcc – Number of coefficients to retain. Defaults to 64
dct_type – Type of discrete cosine transform to use
norm – Type of norm to use
log – Whether to use log-mel spectrograms instead of db-scaled. Defaults to True.

property input_types#: Returns definitions of module input ports.

property output_types#: Returns definitions of module output ports.

classmethod restore_from(restore_path: str)#

Restores model instance (weights and configuration) from a .nemo file

Parameters:

restore_path – path to .nemo file from which model should be instantiated
override_config_path – path to a yaml config that will override the internal config file or an OmegaConf / DictConfig object representing the model config.
map_location – Optional torch.device() to map the instantiated model to a device. By default (None), it will select a GPU if available, falling back to CPU otherwise.
strict – Passed to load_state_dict. By default True
return_config – If set to true, will return just the underlying config of the restored model as an OmegaConf DictConfig object without instantiating the model.
trainer – An optional Trainer object, passed to the model constructor.
save_restore_connector – An optional SaveRestoreConnector object that defines the implementation of the restore_from() method.

save_to(save_path: str)#

Standardized method to save a tarfile containing the checkpoint, config, and any additional artifacts. Implemented via nemo.core.connectors.save_restore_connector.SaveRestoreConnector.save_to().

Parameters:

save_path – str, path to where the file should be saved.

Audio Augmentors#

class nemo.collections.asr.modules.SpectrogramAugmentation(*args: Any, **kwargs: Any)#

Bases: NeuralModule

Performs time and freq cuts in one of two ways. SpecAugment zeroes out vertical and horizontal sections as described in SpecAugment (https://arxiv.org/abs/1904.08779). Arguments for use with SpecAugment are freq_masks, time_masks, freq_width, and time_width. SpecCutout zeroes out rectangles as described in Cutout (https://arxiv.org/abs/1708.04552). Arguments for use with Cutout are rect_masks, rect_freq, and rect_time.

Parameters:

freq_masks (int) – how many frequency segments should be cut. Defaults to 0.
time_masks (int) – how many time segments should be cut. Defaults to 0.
freq_width (int) – maximum number of frequencies to be cut in one segment. Defaults to 10.
time_width (int) – maximum number of time steps to be cut in one segment. Defaults to 10.
rect_masks (int) – how many rectangular masks should be cut. Defaults to 0.
rect_freq (int) – maximum size of cut rectangles along the frequency dimension. Defaults to 5.
rect_time (int) – maximum size of cut rectangles along the time dimension. Defaults to 25.
use_numba_spec_augment – use numba code for Spectrogram augmentation
use_vectorized_spec_augment – use vectorized code for Spectrogram augmentation

property input_types#: Returns definitions of module input types

property output_types#: Returns definitions of module output types
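The SpecAugment-style masking described by freq_masks, time_masks, freq_width, and time_width can be sketched in plain Python. This is an illustrative sketch only, not the NeMo implementation: the function name spec_augment, the [freq][time] list layout, and the way widths and start positions are sampled are all assumptions.

```python
import random
from typing import List, Optional


def spec_augment(
    spec: List[List[float]],
    freq_masks: int = 0,
    time_masks: int = 0,
    freq_width: int = 10,
    time_width: int = 10,
    rng: Optional[random.Random] = None,
) -> List[List[float]]:
    """Zero out random frequency bands and time spans of a [freq][time] spectrogram."""
    rng = rng or random.Random()
    n_freq, n_time = len(spec), len(spec[0])
    out = [row[:] for row in spec]  # copy so the input is left untouched

    # Horizontal cuts: zero whole frequency rows.
    for _ in range(freq_masks):
        width = rng.randint(0, min(freq_width, n_freq))
        start = rng.randint(0, n_freq - width)
        for f in range(start, start + width):
            out[f] = [0.0] * n_time

    # Vertical cuts: zero the same time span in every row.
    for _ in range(time_masks):
        width = rng.randint(0, min(time_width, n_time))
        start = rng.randint(0, n_time - width)
        for row in out:
            row[start:start + width] = [0.0] * width
    return out
```

With the default values freq_masks=0 and time_masks=0 the sketch returns an unmodified copy, which mirrors the documented defaults.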

class nemo.collections.asr.modules.CropOrPadSpectrogramAugmentation(*args: Any, **kwargs: Any)#

Bases: NeuralModule

Pad or Crop the incoming Spectrogram to a certain shape.

Parameters:: audio_length (int) – the final number of timesteps that is required. The signal will be either padded or cropped temporally to this size.

property input_types#: Returns definitions of module input ports.

property output_types#: Returns definitions of module output ports.
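As a rough illustration of padding or cropping a spectrogram to audio_length time steps, consider the sketch below. It assumes cropping from the start and zero-padding on the right; the module itself may choose crop offsets and padding sides differently, and the name crop_or_pad is hypothetical.

```python
from typing import List


def crop_or_pad(spec: List[List[float]], audio_length: int) -> List[List[float]]:
    """Force every row of a [freq][time] spectrogram to exactly audio_length steps."""
    out = []
    for row in spec:
        if len(row) >= audio_length:
            out.append(row[:audio_length])  # crop (assumed: keep the beginning)
        else:
            out.append(row + [0.0] * (audio_length - len(row)))  # pad (assumed: zeros on the right)
    return out
```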

classmethod restore_from(restore_path: str)#

Restores model instance (weights and configuration) from a .nemo file

Parameters:

restore_path – path to .nemo file from which model should be instantiated
override_config_path – path to a yaml config that will override the internal config file or an OmegaConf / DictConfig object representing the model config.
map_location – Optional torch.device() to map the instantiated model to a device. By default (None), it will select a GPU if available, falling back to CPU otherwise.
strict – Passed to load_state_dict. By default True
return_config – If set to true, will return just the underlying config of the restored model as an OmegaConf DictConfig object without instantiating the model.
trainer – An optional Trainer object, passed to the model constructor.
save_restore_connector – An optional SaveRestoreConnector object that defines the implementation of the restore_from() method.

save_to(save_path: str)#

Standardized method to save a tarfile containing the checkpoint, config, and any additional artifacts. Implemented via nemo.core.connectors.save_restore_connector.SaveRestoreConnector.save_to().

Parameters:: save_path – str, path to where the file should be saved.

class nemo.collections.asr.parts.preprocessing.perturb.SpeedPerturbation( sr, resample_type, min_speed_rate=0.9, max_speed_rate=1.1, num_rates=5, rng=None, )#

Bases: Perturbation

Performs Speed Augmentation by re-sampling the data to a different sampling rate, which does not preserve pitch.

Note: This is a very slow operation for online augmentation. If space allows, it is preferable to pre-compute and save the files to augment the dataset.

Parameters:

sr – Original sampling rate.
resample_type – Type of resampling operation that will be performed. For better speed using resampy’s fast resampling method, use resample_type=’kaiser_fast’. For high-quality resampling, set resample_type=’kaiser_best’. To use scipy.signal.resample, set resample_type=’fft’ or resample_type=’scipy’
min_speed_rate – Minimum sampling rate modifier.
max_speed_rate – Maximum sampling rate modifier.
num_rates – Number of discrete rates to allow. Can be a positive or negative integer. If a positive integer greater than 0 is provided, the range of speed rates will be discretized into num_rates values. If a negative integer or 0 is provided, the full range of speed rates will be sampled uniformly. Note: If a positive integer is provided and the resultant discretized range of rates contains the value ‘1.0’, then samples with rate=1.0 will not be augmented at all and are simply skipped. This avoids unnecessary augmentation and increased computation time. The effective augmentation chance in such a case is prob * ((num_rates - 1) / num_rates) * 100%, where prob is the global probability of a sample being augmented.
rng – Random seed. Default is None
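The note on num_rates can be made concrete with a small calculation. The sketch below assumes the discretized rates are evenly spaced between min_speed_rate and max_speed_rate (endpoints included); the helper names are hypothetical and not part of NeMo.

```python
import math
from typing import List


def discretize_rates(min_rate: float, max_rate: float, num_rates: int) -> List[float]:
    """Evenly spaced speed rates, endpoints included (assumed spacing)."""
    if num_rates < 1:
        raise ValueError("num_rates must be positive for a discrete grid")
    if num_rates == 1:
        return [min_rate]
    step = (max_rate - min_rate) / (num_rates - 1)
    return [min_rate + i * step for i in range(num_rates)]


def effective_augmentation_chance(prob: float, min_rate: float, max_rate: float, num_rates: int) -> float:
    """Chance that a sample is actually changed, given global probability prob."""
    rates = discretize_rates(min_rate, max_rate, num_rates)
    if any(math.isclose(r, 1.0) for r in rates):
        # One of the num_rates values is a no-op, so only (num_rates - 1) of them augment.
        return prob * (num_rates - 1) / num_rates
    return prob
```

With the documented defaults (0.9, 1.1, 5 rates) the grid contains 1.0, so a global probability of 1.0 gives an effective chance of 0.8.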

class nemo.collections.asr.parts.preprocessing.perturb.TimeStretchPerturbation( min_speed_rate=0.9, max_speed_rate=1.1, num_rates=5, n_fft=512, rng=None, )#

Bases: Perturbation

Time-stretch an audio series by a fixed rate while preserving pitch, based on [1], [2].

Note: This is a simplified implementation, intended primarily for reference and pedagogical purposes. It makes no attempt to handle transients, and is likely to produce audible artifacts.

References

Parameters:

min_speed_rate – Minimum sampling rate modifier.
max_speed_rate – Maximum sampling rate modifier.
num_rates – Number of discrete rates to allow. Can be a positive or negative integer. If a positive integer greater than 0 is provided, the range of speed rates will be discretized into num_rates values. If a negative integer or 0 is provided, the full range of speed rates will be sampled uniformly. Note: If a positive integer is provided and the resultant discretized range of rates contains the value ‘1.0’, then samples with rate=1.0 will not be augmented at all and are simply skipped. This avoids unnecessary augmentation and increased computation time. The effective augmentation chance in such a case is prob * ((num_rates - 1) / num_rates) * 100%, where prob is the global probability of a sample being augmented.
n_fft – Number of fft filters to be computed.
rng – Random seed. Default is None

class nemo.collections.asr.parts.preprocessing.perturb.GainPerturbation(min_gain_dbfs=-10, max_gain_dbfs=10, rng=None)#

Bases: Perturbation

Applies random gain to the audio.

Parameters:

min_gain_dbfs (float) – Min gain level in dB
max_gain_dbfs (float) – Max gain level in dB
rng (int) – Random seed. Default is None
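A gain expressed in dB maps to a linear amplitude factor of 10^(gain/20). The sketch below illustrates that conversion on a plain list of samples; the function names are hypothetical, and drawing the gain uniformly between the two bounds is an assumption about how the range is sampled.

```python
import random
from typing import List, Optional


def apply_gain(samples: List[float], gain_db: float) -> List[float]:
    """Scale amplitudes by the linear factor corresponding to gain_db."""
    scale = 10.0 ** (gain_db / 20.0)
    return [s * scale for s in samples]


def random_gain(
    samples: List[float],
    min_gain_dbfs: float = -10.0,
    max_gain_dbfs: float = 10.0,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Apply a gain drawn uniformly (assumed) from [min_gain_dbfs, max_gain_dbfs]."""
    rng = rng or random.Random()
    return apply_gain(samples, rng.uniform(min_gain_dbfs, max_gain_dbfs))
```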

class nemo.collections.asr.parts.preprocessing.perturb.ImpulsePerturbation( manifest_path=None, audio_tar_filepaths=None, shuffle_n=128, normalize_impulse=False, shift_impulse=False, rng=None, )#

Bases: Perturbation

Convolves audio with a Room Impulse Response.

Parameters:

manifest_path (list) – Manifest file for RIRs
audio_tar_filepaths (list) – Tar files, if RIR audio files are tarred
shuffle_n (int) – Shuffle parameter for shuffling buffered files from the tar files
normalize_impulse (bool) – Normalize impulse response to zero mean and amplitude 1
shift_impulse (bool) – Shift impulse response to adjust for delay at the beginning
rng (int) – Random seed. Default is None

class nemo.collections.asr.parts.preprocessing.perturb.ShiftPerturbation(min_shift_ms=-5.0, max_shift_ms=5.0, rng=None)#

Bases: Perturbation

Perturbs audio by shifting the audio in time by a random amount between min_shift_ms and max_shift_ms. The final length of the audio is kept unaltered by padding the audio with zeros.

Parameters:

min_shift_ms (float) – Minimum time in milliseconds by which audio will be shifted
max_shift_ms (float) – Maximum time in milliseconds by which audio will be shifted
rng (int) – Random seed. Default is None
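Shifting in time while keeping the length fixed can be sketched as follows. The sign convention (a positive shift delays the audio), the handling of shifts longer than the signal (left unchanged here), and the name shift_audio are assumptions for illustration.

```python
from typing import List


def shift_audio(samples: List[float], shift_ms: float, sample_rate: int) -> List[float]:
    """Shift samples in time by shift_ms, padding with zeros so the length is unchanged."""
    n = len(samples)
    shift = int(shift_ms * sample_rate / 1000)
    if shift == 0 or abs(shift) >= n:
        return list(samples)  # assumed: skip shifts that would erase the whole signal
    if shift > 0:
        # Delay: zeros at the start, the tail of the signal is dropped.
        return [0.0] * shift + samples[: n - shift]
    # Advance: the head of the signal is dropped, zeros at the end.
    return samples[-shift:] + [0.0] * (-shift)
```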

class nemo.collections.asr.parts.preprocessing.perturb.NoisePerturbation( manifest_path=None, min_snr_db=10, max_snr_db=50, max_gain_db=300.0, rng=None, audio_tar_filepaths=None, shuffle_n=100, orig_sr=16000, )#

Bases: Perturbation

Perturbation that adds noise to input audio.

Parameters:

manifest_path (str) – Manifest file with paths to noise files
min_snr_db (float) – Minimum SNR of audio after noise is added
max_snr_db (float) – Maximum SNR of audio after noise is added
max_gain_db (float) – Maximum gain that can be applied on the noise sample
audio_tar_filepaths (list) – Tar files, if noise audio files are tarred
shuffle_n (int) – Shuffle parameter for shuffling buffered files from the tar files
orig_sr (int) – Original sampling rate of the noise files
rng (int) – Random seed. Default is None

perturb(data, ref_mic=0)#

Parameters:

data (AudioSegment) – audio data
ref_mic (int) – reference mic index for scaling multi-channel audios

perturb_with_foreground_noise( data, noise, data_rms=None, max_noise_dur=2, max_additions=1, ref_mic=0, )#

Parameters:

data (AudioSegment) – audio data
noise (AudioSegment) – noise data
data_rms (Union[float, List[float]]) – rms_db for data input
max_noise_dur (float) – max noise duration
max_additions (int) – number of times for adding noise
ref_mic (int) – reference mic index for scaling multi-channel audios

perturb_with_input_noise( data, noise, data_rms=None, ref_mic=0, )#

Parameters:

data (AudioSegment) – audio data
noise (AudioSegment) – noise data
data_rms (Union[float, List[float]]) – rms_db for data input
ref_mic (int) – reference mic index for scaling multi-channel audios
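The role of the SNR and max_gain_db parameters of NoisePerturbation can be illustrated with a single-channel mixing sketch. It assumes both signals are non-silent lists of equal length and that RMS level is measured as 10*log10 of the mean square; the helper names rms_db and mix_at_snr are hypothetical.

```python
import math
from typing import List


def rms_db(samples: List[float]) -> float:
    """RMS level in dB; assumes the signal is not all zeros."""
    mean_square = sum(s * s for s in samples) / len(samples)
    return 10.0 * math.log10(mean_square)


def mix_at_snr(data: List[float], noise: List[float], snr_db: float, max_gain_db: float = 300.0) -> List[float]:
    """Add noise to data so that the data-to-noise level difference equals snr_db."""
    # Gain needed to bring the noise to (data level - snr_db), capped at max_gain_db.
    noise_gain_db = min(rms_db(data) - rms_db(noise) - snr_db, max_gain_db)
    scale = 10.0 ** (noise_gain_db / 20.0)
    return [d + scale * n for d, n in zip(data, noise)]
```

In a full perturbation, snr_db would be drawn between min_snr_db and max_snr_db for each sample.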

class nemo.collections.asr.parts.preprocessing.perturb.WhiteNoisePerturbation(min_level=-90, max_level=-46, rng=None)#

Bases: Perturbation

Perturbation that adds white noise to an audio file in the training dataset.

Parameters:

min_level (int) – Minimum level in dB at which white noise should be added
max_level (int) – Maximum level in dB at which white noise should be added
rng (int) – Random seed. Default is None
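Adding white noise at a level given in dB can be sketched as scaling unit-variance Gaussian noise by 10^(level/20). The uniform draw of the level between min_level and max_level and the name add_white_noise are assumptions made for this illustration.

```python
import random
from typing import List, Optional


def add_white_noise(
    samples: List[float],
    min_level: float = -90.0,
    max_level: float = -46.0,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Add Gaussian white noise at a randomly chosen level (in dB)."""
    rng = rng or random.Random()
    level_db = rng.uniform(min_level, max_level)  # assumed sampling scheme
    scale = 10.0 ** (level_db / 20.0)
    return [s + rng.gauss(0.0, 1.0) * scale for s in samples]
```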

class nemo.collections.asr.parts.preprocessing.perturb.RirAndNoisePerturbation( rir_manifest_path=None, rir_prob=0.5, noise_manifest_paths=None, noise_prob=1.0, min_snr_db=0, max_snr_db=50, rir_tar_filepaths=None, rir_shuffle_n=100, noise_tar_filepaths=None, apply_noise_rir=False, orig_sample_rate=None, max_additions=5, max_duration=2.0, bg_noise_manifest_paths=None, bg_noise_prob=1.0, bg_min_snr_db=10, bg_max_snr_db=50, bg_noise_tar_filepaths=None, bg_orig_sample_rate=None, rng=None, )#

Bases: Perturbation

RIR augmentation with additive foreground and background noise. In this implementation audio data is augmented by first convolving the audio with a Room Impulse Response and then adding foreground noise and background noise at various SNRs. RIR, foreground and background noises should either be supplied with a manifest file or as tarred audio files (faster).

Different sets of noise audio files can be used depending on the original sampling rate of the noise. This is useful while training a mixed sample rate model. For example, when training a mixed model with 8 kHz and 16 kHz audio with a target sampling rate of 16 kHz, one would want to augment 8 kHz data with 8 kHz noise rather than 16 kHz noise.

Parameters:

rir_manifest_path – Manifest file for RIRs
rir_tar_filepaths – Tar files, if RIR audio files are tarred
rir_prob – Probability of applying a RIR
noise_manifest_paths – Foreground noise manifest path
min_snr_db – Min SNR for foreground noise
max_snr_db – Max SNR for foreground noise
noise_tar_filepaths – Tar files, if noise files are tarred
apply_noise_rir – Whether to convolve foreground noise with a random RIR
orig_sample_rate – Original sampling rate of foreground noise audio
max_additions – Max number of times foreground noise is added to an utterance
max_duration – Max duration of foreground noise
bg_noise_manifest_paths – Background noise manifest path
bg_min_snr_db – Min SNR for background noise
bg_max_snr_db – Max SNR for background noise
bg_noise_tar_filepaths – Tar files, if noise files are tarred
bg_orig_sample_rate – Original sampling rate of background noise audio
rng – Random seed. Default is None

class nemo.collections.asr.parts.preprocessing.perturb.TranscodePerturbation(codecs=None, rng=None)#

Bases: Perturbation

Audio codec augmentation. This implementation uses sox to transcode audio with low rate audio codecs, so users need to make sure that the installed sox version supports the codecs used here (G711 and amr-nb).

Parameters:

codecs (List[str]) – A list of codecs to be transcoded to. Default is None.
rng (int) – Random seed. Default is None.

Miscellaneous Classes#

CTC Decoding#

class nemo.collections.asr.parts.submodules.ctc_decoding.CTCDecoding(decoding_cfg, vocabulary)#

Bases: AbstractCTCDecoding

Used for performing CTC auto-regressive / non-auto-regressive decoding of the logprobs for character based models.

Parameters:

decoding_cfg –
A dict-like object which contains the following key-value pairs.
strategy:
str value which represents the type of decoding that can occur. Possible values are :
greedy (for greedy decoding).

beam (for DeepSpeech KenLM based decoding).
compute_timestamps:
A bool flag, which determines whether to compute the character/subword, or word based timestamp mapping the output log-probabilities to discrete intervals of timestamps. The timestamps will be available in the returned Hypothesis.timestep as a dictionary.

ctc_timestamp_type:
A str value, which represents the types of timestamps that should be calculated. Can take the following values - “char” for character/subword time stamps, “word” for word level time stamps and “all” (default), for both character level and word level time stamps.

word_seperator:
Str token representing the separator between words.

segment_seperators:
List containing tokens representing the separator(s) between segments.

segment_gap_threshold:
The threshold (in frames) that caps the gap between two words necessary for forming the segments.

preserve_alignments:
Bool flag which preserves the history of logprobs generated during decoding (sample / batched). When set to true, the Hypothesis will contain the non-null value for logprobs in it. Here, logprobs is a torch.Tensor.

confidence_cfg:
A dict-like object which contains the following key-value pairs related to confidence scores. In order to obtain hypotheses with confidence scores, please utilize ctc_decoder_predictions_tensor function with the preserve_frame_confidence flag set to True.
preserve_frame_confidence:
Bool flag which preserves the history of per-frame confidence scores generated during decoding. When set to true, the Hypothesis will contain the non-null value for frame_confidence in it. Here, frame_confidence is a List of floats.

preserve_token_confidence:
Bool flag which preserves the history of per-token confidence scores generated during greedy decoding (sample / batched). When set to true, the Hypothesis will contain the non-null value for token_confidence in it. Here, token_confidence is a List of floats.

The length of the list corresponds to the number of recognized tokens.

preserve_word_confidence:
Bool flag which preserves the history of per-word confidence scores generated during greedy decoding (sample / batched). When set to true, the Hypothesis will contain the non-null value for word_confidence in it. Here, word_confidence is a List of floats.

The length of the list corresponds to the number of recognized words.

exclude_blank:
Bool flag indicating that blank token confidence scores are to be excluded from the token_confidence.

aggregation:
Which aggregation type to use for collapsing per-token confidence into per-word confidence. Valid options are mean, min, max, prod.

tdt_include_duration:
Bool flag indicating that the duration confidence scores are to be calculated and attached to the regular frame confidence, making TDT frame confidence element a pair: (prediction_confidence, duration_confidence).

method_cfg:
A dict-like object which contains the method name and settings to compute per-frame confidence scores.

name:
The method name (str). Supported values:

’max_prob’ for using the maximum token probability as a confidence.

’entropy’ for using a normalized entropy of a log-likelihood vector.

entropy_type:
Which type of entropy to use (str). Used if confidence_method_cfg.name is set to entropy. Supported values:

’gibbs’ for the (standard) Gibbs entropy. If the alpha (α) is provided, the formula is the following: H_α = -sum_i((p^α_i)*log(p^α_i)). Note that for this entropy, the alpha should satisfy the following inequality: (log(V)+2-sqrt(log^2(V)+4))/(2*log(V)) <= α <= (1+log(V-1))/log(V-1), where V is the model vocabulary size.

’tsallis’ for the Tsallis entropy with the Boltzmann constant one.
Tsallis entropy formula is the following: H_α = 1/(α-1)*(1-sum_i(p^α_i)), where α is a parameter. When α == 1, it works like the Gibbs entropy. More: https://en.wikipedia.org/wiki/Tsallis_entropy

’renyi’ for the Rényi entropy.
Rényi entropy formula is the following: H_α = 1/(1-α)*log_2(sum_i(p^α_i)), where α is a parameter. When α == 1, it works like the Gibbs entropy. More: https://en.wikipedia.org/wiki/R%C3%A9nyi_entropy

alpha:
Power scale for logsoftmax (α for entropies). Here we restrict it to be > 0. When the alpha equals one, scaling is not applied to ‘max_prob’, and any entropy type behaves like the Shannon entropy: H = -sum_i(p_i*log(p_i))

entropy_norm:
A mapping of the entropy value to the interval [0,1]. Supported values:

’lin’ for using the linear mapping.

’exp’ for using exponential mapping with linear shift.
batch_dim_index:
Index of the batch dimension of targets and predictions parameters of ctc_decoder_predictions_tensor methods. Can be either 0 or 1.
The config may further contain the following sub-dictionaries:

”greedy”:
preserve_alignments: Same as above, overrides above value.
compute_timestamps: Same as above, overrides above value.
preserve_frame_confidence: Same as above, overrides above value.
confidence_method_cfg: Same as above, overrides confidence_cfg.method_cfg.

”beam”:

beam_size:
int, defining the beam size for beam search. Must be >= 1. If beam_size == 1, will perform cached greedy search. This may give slightly different results compared to the greedy search above.

return_best_hypothesis:
optional bool, whether to return just the best hypothesis or all of the hypotheses after beam search has concluded. This flag is set by default.

ngram_lm_alpha:
float, the strength of the Language model on the final score of a token. final_score = acoustic_score + ngram_lm_alpha * lm_score + beam_beta * seq_length.

beam_beta:
float, the strength of the sequence length penalty on the final score of a token. final_score = acoustic_score + ngram_lm_alpha * lm_score + beam_beta * seq_length.

ngram_lm_model:
str, path to a KenLM ARPA or .binary file (depending on the strategy chosen). If the path is invalid (file is not found at path), will raise a deferred error at the moment of calculation of beam search, so that users may update / change the decoding strategy to point to the correct file.
blank_id – The id of the CTC blank token.
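To make the nesting of the keys listed above easier to see, here is one possible decoding_cfg written as a plain Python dict. Every value is illustrative rather than a documented default, and the exact nesting is inferred from the descriptions above, so treat it as a sketch; in practice the config is a dict-like object such as an OmegaConf DictConfig.

```python
# Illustrative values only; key names follow the parameter list above.
decoding_cfg = {
    "strategy": "greedy",            # or "beam"
    "compute_timestamps": True,
    "ctc_timestamp_type": "all",     # "char", "word" or "all"
    "word_seperator": " ",           # key spelling as documented
    "preserve_alignments": False,
    "confidence_cfg": {
        "preserve_frame_confidence": False,
        "preserve_token_confidence": False,
        "preserve_word_confidence": False,
        "exclude_blank": True,
        "aggregation": "min",        # mean, min, max or prod
        "method_cfg": {
            "name": "entropy",       # or "max_prob"
            "entropy_type": "tsallis",
            "alpha": 0.5,
            "entropy_norm": "exp",   # or "lin"
        },
    },
    "batch_dim_index": 0,
    "greedy": {"preserve_alignments": False, "compute_timestamps": True},
    "beam": {
        "beam_size": 4,
        "return_best_hypothesis": True,
        "ngram_lm_alpha": 0.3,
        "beam_beta": 0.0,
        "ngram_lm_model": None,      # path to a KenLM file when using an LM
    },
}
```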

decode_ids_to_tokens( tokens: List[int], ) → List[str]#

Implemented by subclass in order to decode a token id list into a token list. A token list is the string representation of each token id.

Parameters:: tokens – List of int representing the token ids.
Returns:: A list of decoded tokens.

decode_tokens_to_str(tokens: List[int]) → str#

Implemented by subclass in order to decode a token list into a string.

Parameters:: tokens – List of int representing the token ids.
Returns:: A decoded string.

static get_words_offsets( char_offsets: List[Dict[str, str | float]], encoded_char_offsets: List[Dict[str, str | float]], word_delimiter_char: str = ' ', supported_punctuation: Set | None = None, ) → List[Dict[str, str | float]]#

Utility method which constructs word time stamps out of character time stamps.

References

This code is a port of the Hugging Face code for word time stamp construction.

Parameters:

char_offsets – A list of dictionaries, each containing “char”, “start_offset” and “end_offset”, where “char” is decoded with the tokenizer.
encoded_char_offsets – A list of dictionaries, each containing “char”, “start_offset” and “end_offset”, where “char” is the original id/ids from the hypotheses (not decoded with the tokenizer). As we are working with char-based models here, we are using the char_offsets to get the word offsets. encoded_char_offsets is passed for keeping the consistency with AbstractRNNTDecoding’s abstract method.
word_delimiter_char – Character token that represents the word delimiter. By default, “ “.
supported_punctuation – Set containing punctuation marks in the vocabulary.

Returns:

A list of dictionaries containing the word offsets. Each item contains “word”, “start_offset” and “end_offset”.
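The idea behind building word offsets from character offsets can be sketched as grouping characters between delimiters. This simplified version is not the ported Hugging Face logic: it ignores supported_punctuation and encoded_char_offsets, and the name words_from_char_offsets is hypothetical.

```python
from typing import Dict, List, Union

Offset = Dict[str, Union[str, float]]


def words_from_char_offsets(char_offsets: List[Offset], word_delimiter_char: str = " ") -> List[Offset]:
    """Merge consecutive non-delimiter characters into words with start/end offsets."""
    word_offsets: List[Offset] = []
    word, start, end = "", 0, 0
    for item in char_offsets:
        char = item["char"]
        if char == word_delimiter_char:
            if word:  # a delimiter closes the word collected so far
                word_offsets.append({"word": word, "start_offset": start, "end_offset": end})
                word = ""
            continue
        if not word:
            start = item["start_offset"]  # first character fixes the word start
        word += char
        end = item["end_offset"]          # last character fixes the word end
    if word:
        word_offsets.append({"word": word, "start_offset": start, "end_offset": end})
    return word_offsets
```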

class nemo.collections.asr.parts.submodules.ctc_decoding.CTCBPEDecoding( decoding_cfg, tokenizer: TokenizerSpec, )#

Bases: AbstractCTCDecoding

Used for performing CTC auto-regressive / non-auto-regressive decoding of the logprobs for subword based models.

Parameters:

decoding_cfg –
A dict-like object which contains the following key-value pairs.
strategy:
str value which represents the type of decoding that can occur. Possible values are :
greedy (for greedy decoding).

beam (for DeepSpeech KenLM based decoding).
compute_timestamps:
A bool flag, which determines whether to compute the character/subword, or word based timestamp mapping the output log-probabilities to discrete intervals of timestamps. The timestamps will be available in the returned Hypothesis.timestep as a dictionary.

ctc_timestamp_type:
A str value, which represents the types of timestamps that should be calculated. Can take the following values - “char” for character/subword time stamps, “word” for word level time stamps and “all” (default), for both character level and word level time stamps.

word_seperator:
Str token representing the separator between words.

preserve_alignments:
Bool flag which preserves the history of logprobs generated during decoding (sample / batched). When set to true, the Hypothesis will contain the non-null value for logprobs in it. Here, logprobs is a torch.Tensor.

confidence_cfg:
A dict-like object which contains the following key-value pairs related to confidence scores. In order to obtain hypotheses with confidence scores, please utilize ctc_decoder_predictions_tensor function with the preserve_frame_confidence flag set to True.
preserve_frame_confidence:
Bool flag which preserves the history of per-frame confidence scores generated during decoding. When set to true, the Hypothesis will contain the non-null value for frame_confidence in it. Here, frame_confidence is a List of floats.

preserve_token_confidence:
Bool flag which preserves the history of per-token confidence scores generated during greedy decoding (sample / batched). When set to true, the Hypothesis will contain the non-null value for token_confidence in it. Here, token_confidence is a List of floats.

The length of the list corresponds to the number of recognized tokens.

preserve_word_confidence:
Bool flag which preserves the history of per-word confidence scores generated during greedy decoding (sample / batched). When set to true, the Hypothesis will contain the non-null value for word_confidence in it. Here, word_confidence is a List of floats.

The length of the list corresponds to the number of recognized words.

exclude_blank:
Bool flag indicating that blank token confidence scores are to be excluded from the token_confidence.

aggregation:
Which aggregation type to use for collapsing per-token confidence into per-word confidence. Valid options are mean, min, max, prod.

tdt_include_duration:
Bool flag indicating that the duration confidence scores are to be calculated and attached to the regular frame confidence, making TDT frame confidence element a pair: (prediction_confidence, duration_confidence).

method_cfg:
A dict-like object which contains the method name and settings to compute per-frame confidence scores.

name:
The method name (str). Supported values:

’max_prob’ for using the maximum token probability as a confidence.

’entropy’ for using a normalized entropy of a log-likelihood vector.

entropy_type:
Which type of entropy to use (str). Used if confidence_method_cfg.name is set to entropy. Supported values:

’gibbs’ for the (standard) Gibbs entropy. If the alpha (α) is provided, the formula is the following: H_α = -sum_i((p^α_i)*log(p^α_i)). Note that for this entropy, the alpha should satisfy the following inequality: (log(V)+2-sqrt(log^2(V)+4))/(2*log(V)) <= α <= (1+log(V-1))/log(V-1), where V is the model vocabulary size.

’tsallis’ for the Tsallis entropy with the Boltzmann constant one.
Tsallis entropy formula is the following: H_α = 1/(α-1)*(1-sum_i(p^α_i)), where α is a parameter. When α == 1, it works like the Gibbs entropy. More: https://en.wikipedia.org/wiki/Tsallis_entropy

’renyi’ for the Rényi entropy.
Rényi entropy formula is the following: H_α = 1/(1-α)*log_2(sum_i(p^α_i)), where α is a parameter. When α == 1, it works like the Gibbs entropy. More: https://en.wikipedia.org/wiki/R%C3%A9nyi_entropy

alpha:
Power scale for logsoftmax (α for entropies). Here we restrict it to be > 0. When the alpha equals one, scaling is not applied to ‘max_prob’, and any entropy type behaves like the Shannon entropy: H = -sum_i(p_i*log(p_i))

entropy_norm:
A mapping of the entropy value to the interval [0,1]. Supported values:

’lin’ for using the linear mapping.

’exp’ for using exponential mapping with linear shift.
batch_dim_index:
Index of the batch dimension of targets and predictions parameters of ctc_decoder_predictions_tensor methods. Can be either 0 or 1.
The config may further contain the following sub-dictionaries:

”greedy”:
preserve_alignments: Same as above, overrides above value.
compute_timestamps: Same as above, overrides above value.
preserve_frame_confidence: Same as above, overrides above value.
confidence_method_cfg: Same as above, overrides confidence_cfg.method_cfg.

”beam”:

beam_size:
int, defining the beam size for beam search. Must be >= 1. If beam_size == 1, will perform cached greedy search. This may give slightly different results compared to the greedy search above.

return_best_hypothesis:
optional bool, whether to return just the best hypothesis or all of the hypotheses after beam search has concluded. This flag is set by default.

ngram_lm_alpha:
float, the strength of the Language model on the final score of a token. final_score = acoustic_score + ngram_lm_alpha * lm_score + beam_beta * seq_length.

beam_beta:
float, the strength of the sequence length penalty on the final score of a token. final_score = acoustic_score + ngram_lm_alpha * lm_score + beam_beta * seq_length.

ngram_lm_model:
str, path to a KenLM ARPA or .binary file (depending on the strategy chosen). If the path is invalid (file is not found at path), will raise a deferred error at the moment of calculation of beam search, so that users may update / change the decoding strategy to point to the correct file.
tokenizer – NeMo tokenizer object, which inherits from TokenizerSpec.
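The entropy formulas quoted under entropy_type can be evaluated directly on a probability vector. The sketch below implements the Tsallis and Rényi formulas as written above, plus the Shannon entropy they reduce to at α = 1. The linear mapping to [0,1] shown here (1 - H / log(V)) is an assumption about what 'lin' normalization means, and all function names are hypothetical.

```python
import math
from typing import Sequence


def shannon_entropy(p: Sequence[float]) -> float:
    """H = -sum_i(p_i * log(p_i)), in nats."""
    return -sum(x * math.log(x) for x in p if x > 0)


def tsallis_entropy(p: Sequence[float], alpha: float) -> float:
    """H_alpha = 1/(alpha-1) * (1 - sum_i(p_i^alpha)); Shannon entropy at alpha == 1."""
    if math.isclose(alpha, 1.0):
        return shannon_entropy(p)
    return (1.0 - sum(x ** alpha for x in p)) / (alpha - 1.0)


def renyi_entropy(p: Sequence[float], alpha: float) -> float:
    """H_alpha = 1/(1-alpha) * log_2(sum_i(p_i^alpha)); Shannon entropy (in bits) at alpha == 1."""
    if math.isclose(alpha, 1.0):
        return shannon_entropy(p) / math.log(2)
    return math.log2(sum(x ** alpha for x in p)) / (1.0 - alpha)


def linear_entropy_confidence(p: Sequence[float]) -> float:
    """Assumed 'lin' mapping: 1 for a one-hot vector, 0 for a uniform one."""
    return 1.0 - shannon_entropy(p) / math.log(len(p))
```

A frame whose probability mass sits on a single token gets confidence 1, while a uniform distribution over the vocabulary gets confidence 0.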

decode_ids_to_tokens( tokens: List[int], ) → List[str]#

Implemented by subclass in order to decode a token id list into a token list. A token list is the string representation of each token id.

Parameters:: tokens – List of int representing the token ids.
Returns:: A list of decoded tokens.

decode_tokens_to_str(tokens: List[int]) → str#

Implemented by subclass in order to decode a token list into a string.

Parameters:: tokens – List of int representing the token ids.
Returns:: A decoded string.

static define_tokenizer_type( vocabulary: List[str], ) → str#: Define the tokenizer type based on the vocabulary.

static define_word_start_condition( tokenizer_type: str, word_delimiter_char: str, ) → Callable[[str, str], bool]#: Define the word start condition based on the tokenizer type and word delimiter character.

get_words_offsets( char_offsets: List[Dict[str, str | float]], encoded_char_offsets: List[Dict[str, str | float]], word_delimiter_char: str = ' ', supported_punctuation: Set | None = None, ) → List[Dict[str, str | float]]#

Utility method which constructs word time stamps out of sub-word time stamps.

Note: Only supports SentencePiece-based tokenizers.

Parameters:

char_offsets – A list of dictionaries, each containing “char”, “start_offset” and “end_offset”, where “char” is decoded with the tokenizer.
encoded_char_offsets – A list of dictionaries, each containing “char”, “start_offset” and “end_offset”, where “char” is the original id/ids from the hypotheses (not decoded with the tokenizer). This is needed for subword tokenization models.
word_delimiter_char – Character token that represents the word delimiter. By default, “ “.
supported_punctuation – Set containing punctuation marks in the vocabulary.

Returns:

A list of dictionaries containing the word offsets. Each item contains “word”, “start_offset” and “end_offset”.
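For SentencePiece-style vocabularies, a word typically starts at a token carrying the "▁" prefix. The sketch below merges sub-word offsets into word offsets on that basis. It is a simplified illustration, not the library's method: punctuation handling and encoded_char_offsets are ignored, and the name words_from_subword_offsets is hypothetical.

```python
from typing import Dict, List, Union

Offset = Dict[str, Union[str, float]]


def words_from_subword_offsets(token_offsets: List[Offset], word_start_marker: str = "▁") -> List[Offset]:
    """Merge sub-word tokens into words; a token with the marker prefix opens a new word."""
    words: List[Offset] = []
    for item in token_offsets:
        token = str(item["char"])
        if token.startswith(word_start_marker) or not words:
            words.append({"word": "", "start_offset": item["start_offset"], "end_offset": item["end_offset"]})
            if token.startswith(word_start_marker):
                token = token[len(word_start_marker):]  # drop the marker itself
        words[-1]["word"] += token
        words[-1]["end_offset"] = item["end_offset"]
    # A bare marker token with nothing after it leaves an empty word; drop those.
    return [w for w in words if w["word"]]
```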

class nemo.collections.asr.parts.submodules.ctc_greedy_decoding.GreedyCTCInfer( blank_id: int, preserve_alignments: bool = False, compute_timestamps: bool = False, preserve_frame_confidence: bool = False, confidence_method_cfg: omegaconf.DictConfig | None = None, )#

Bases: Typing, ConfidenceMethodMixin

A greedy CTC decoder.

Provides a common abstraction for sample level and batch level greedy decoding.

Parameters:

blank_id – int index of the blank token. Can be 0 or len(vocabulary).
preserve_alignments – Bool flag which preserves the history of logprobs generated during decoding (sample / batched). When set to true, the Hypothesis will contain the non-null value for logprobs in it. Here, logprobs is a torch.Tensor.
compute_timestamps – A bool flag, which determines whether to compute the character/subword, or word based timestamp mapping the output log-probabilities to discrete intervals of timestamps. The timestamps will be available in the returned Hypothesis.timestep as a dictionary.
preserve_frame_confidence – Bool flag which preserves the history of per-frame confidence scores generated during decoding. When set to true, the Hypothesis will contain the non-null value for frame_confidence in it. Here, frame_confidence is a List of floats.
confidence_method_cfg –
A dict-like object which contains the method name and settings to compute per-frame confidence scores.
name: The method name (str).
Supported values:
’max_prob’ for using the maximum token probability as a confidence.

’entropy’ for using a normalized entropy of a log-likelihood vector.
entropy_type: Which type of entropy to use (str). Used if confidence_method_cfg.name is set to entropy.
Supported values:
’gibbs’ for the (standard) Gibbs entropy. If the alpha (α) is provided, the formula is the following: H_α = -sum_i((p^α_i)*log(p^α_i)). Note that for this entropy, the alpha should satisfy the following inequality: (log(V)+2-sqrt(log^2(V)+4))/(2*log(V)) <= α <= (1+log(V-1))/log(V-1), where V is the model vocabulary size.

’tsallis’ for the Tsallis entropy with the Boltzmann constant one.
Tsallis entropy formula is the following: H_α = 1/(α-1)*(1-sum_i(p^α_i)), where α is a parameter. When α == 1, it works like the Gibbs entropy. More: https://en.wikipedia.org/wiki/Tsallis_entropy

’renyi’ for the Rényi entropy.
Rényi entropy formula is the following: H_α = 1/(1-α)*log_2(sum_i(p^α_i)), where α is a parameter. When α == 1, it works like the Gibbs entropy. More: https://en.wikipedia.org/wiki/R%C3%A9nyi_entropy
alpha: Power scale for logsoftmax (α for entropies). Here we restrict it to be > 0.
When the alpha equals one, scaling is not applied to ‘max_prob’, and any entropy type behaves like the Shannon entropy: H = -sum_i(p_i*log(p_i))

entropy_norm: A mapping of the entropy value to the interval [0,1].
Supported values:
’lin’ for using the linear mapping.

’exp’ for using exponential mapping with linear shift.

forward( decoder_output: torch.Tensor, decoder_lengths: torch.Tensor | None, )#

Returns a list of hypotheses given an input batch of the encoder hidden embedding. Output token is generated auto-regressively.

Parameters:

decoder_output – A tensor of size (batch, timesteps, features) or (batch, timesteps) (each timestep is a label).
decoder_lengths – list of int representing the length of each output sequence.

Returns:

packed list containing batch number of sentences (Hypotheses).

property input_types#: Returns definitions of module input ports.

property output_types#: Returns definitions of module output ports.
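Greedy CTC decoding in general takes the most likely label per frame, collapses consecutive repeats, and drops blanks. The sketch below shows that standard procedure for a single sequence given as a list of per-frame log-probability lists; it is an illustration of the technique and not GreedyCTCInfer itself, and the name greedy_ctc_decode is hypothetical.

```python
from typing import List


def greedy_ctc_decode(log_probs: List[List[float]], blank_id: int) -> List[int]:
    """Best-path CTC decoding for one sequence of shape [timesteps][vocab + blank]."""
    # 1. Pick the most likely label in every frame.
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in log_probs]
    # 2. Collapse repeated labels and 3. remove blanks.
    decoded: List[int] = []
    previous = blank_id
    for label in best:
        if label != previous and label != blank_id:
            decoded.append(label)
        previous = label
    return decoded
```

A blank between two identical labels keeps both of them, which is how CTC represents repeated characters.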

class nemo.collections.asr.parts.submodules.ctc_beam_decoding.BeamCTCInfer( blank_id: int, beam_size: int, search_type: str = 'default', return_best_hypothesis: bool = True, preserve_alignments: bool = False, compute_timestamps: bool = False, ngram_lm_alpha: float = 0.3, beam_beta: float = 0.0, ngram_lm_model: str | None = None, flashlight_cfg: FlashlightConfig | None = None, pyctcdecode_cfg: PyCTCDecodeConfig | None = None, )#

Bases: AbstractBeamCTCInfer

A beam CTC decoder.

Provides a common abstraction for sample level and batch level beam decoding.

Parameters:

blank_id – int index of the blank token. Can be 0 or len(vocabulary).
preserve_alignments – Bool flag which preserves the history of logprobs generated during decoding (sample / batched). When set to true, the Hypothesis will contain the non-null value for logprobs in it. Here, logprobs is a torch.Tensor.
compute_timestamps – A bool flag which determines whether to compute the character/subword or word based timestamps, mapping the output log-probabilities to discrete intervals of timestamps. The timestamps will be available in the returned Hypothesis.timestep as a dictionary.

default_beam_search( x: torch.Tensor, out_len: torch.Tensor, ) → List[Hypothesis | NBestHypotheses]#

OpenSeq2Seq Beam Search Algorithm (DeepSpeech)

Parameters:

x – Tensor of shape [B, T, V+1], where B is the batch size, T is the maximum sequence length, and V is the vocabulary size. The tensor contains log-probabilities.
out_len – Tensor of shape [B], contains lengths of each sequence in the batch.

Returns:

A list of NBestHypotheses objects, one for each sequence in the batch.

flashlight_beam_search( x: torch.Tensor, out_len: torch.Tensor, ) → List[Hypothesis | NBestHypotheses]#

Flashlight Beam Search Algorithm. Should support Char and Subword models.

Parameters:

x – Tensor of shape [B, T, V+1], where B is the batch size, T is the maximum sequence length, and V is the vocabulary size. The tensor contains log-probabilities.
out_len – Tensor of shape [B], contains lengths of each sequence in the batch.

Returns:

A list of NBestHypotheses objects, one for each sequence in the batch.
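Both beam search methods above take log-probabilities of shape [B, T, V+1]. As a rough illustration of what a CTC beam search does with a single [T, V+1] slice, the following is a textbook CTC prefix beam search in plain Python. It is not the OpenSeq2Seq or Flashlight implementation and uses no language model; the names `ctc_prefix_beam_search` and `log_add` are hypothetical helpers for this sketch.

```python
import math
from collections import defaultdict

NEG_INF = -math.inf


def log_add(*values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    top = max(values)
    if top == NEG_INF:
        return NEG_INF
    return top + math.log(sum(math.exp(v - top) for v in values))


def ctc_prefix_beam_search(log_probs, beam_size, blank_id):
    """log_probs: T rows of V+1 log-probabilities for one utterance."""
    # prefix -> (log-prob of paths ending in blank, log-prob of paths ending in a label)
    beams = {(): (0.0, NEG_INF)}
    for frame in log_probs:
        candidates = defaultdict(lambda: (NEG_INF, NEG_INF))
        for prefix, (p_blank, p_label) in beams.items():
            for token, lp in enumerate(frame):
                if token == blank_id:
                    # Blank keeps the prefix unchanged.
                    b, nb = candidates[prefix]
                    candidates[prefix] = (log_add(b, p_blank + lp, p_label + lp), nb)
                    continue
                extended = prefix + (token,)
                b, nb = candidates[extended]
                if prefix and prefix[-1] == token:
                    # A repeated label only extends the prefix after a blank...
                    candidates[extended] = (b, log_add(nb, p_blank + lp))
                    # ...otherwise it collapses into the same prefix.
                    b, nb = candidates[prefix]
                    candidates[prefix] = (b, log_add(nb, p_label + lp))
                else:
                    candidates[extended] = (b, log_add(nb, p_blank + lp, p_label + lp))
        ranked = sorted(candidates.items(), key=lambda kv: log_add(*kv[1]), reverse=True)
        beams = dict(ranked[:beam_size])  # keep the best prefixes, best first
    return [(list(prefix), log_add(*scores)) for prefix, scores in beams.items()]


# Two frames, one real label (id 0) and blank (id 1), so V+1 == 2.
frames = [[math.log(0.4), math.log(0.6)]] * 2
results = ctc_prefix_beam_search(frames, beam_size=2, blank_id=1)
best_tokens, best_score = results[0]
# Greedy decoding picks blank in both frames (empty output), while summing
# over alignments makes the single-label prefix [0] the most probable.
```

The toy input is chosen so that greedy and beam decoding disagree, which is the usual motivation for running a beam search over CTC outputs.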

forward( decoder_output: torch.Tensor, decoder_lengths: torch.Tensor, ) → Tuple[List[Hypothesis | NBestHypotheses]]#

Returns a list of hypotheses given an input batch of the encoder hidden embedding. Output token is generated auto-regressively.

Parameters:

decoder_output – A tensor of size (batch, timesteps, features).
decoder_lengths – list of int representing the length of each output sequence.

Returns:

packed list containing batch number of sentences (Hypotheses).

set_decoding_type(decoding_type: str)#

Sets the decoding type of the framework. Can support either char or subword models.

Parameters:: decoding_type – Str corresponding to decoding type. Only supports “char” and “subword”.

RNNT Decoding#

class nemo.collections.asr.parts.submodules.rnnt_decoding.RNNTDecoding(decoding_cfg, decoder, joint, vocabulary)#

Bases: AbstractRNNTDecoding

Used for performing RNN-T auto-regressive decoding of the Decoder+Joint network given the encoder state.

Parameters:

decoding_cfg –
A dict-like object which contains the following key-value pairs.
strategy:
str value which represents the type of decoding that can occur. Possible values are :
- greedy, greedy_batch (for greedy decoding).
- beam, tsd, alsd (for beam search decoding).
compute_hypothesis_token_set: A bool flag, which determines whether to compute a list of decoded
tokens as well as the decoded string. Default is False in order to avoid double decoding unless required.

preserve_alignments: Bool flag which preserves the history of logprobs generated during
decoding (sample / batched). When set to true, the Hypothesis will contain the non-null value for alignments in it. Here, alignments is a List of List of Tuple(Tensor (of length V + 1), Tensor(scalar, label after argmax)).

In order to obtain this hypothesis, please utilize rnnt_decoder_predictions_tensor function with the return_hypotheses flag set to True.

The length of the list corresponds to the Acoustic Length (T). Each value in the list (Ti) is a torch.Tensor (U), representing 1 or more targets from a vocabulary. U is the number of target tokens for the current timestep Ti.

confidence_cfg: A dict-like object which contains the following key-value pairs related to confidence
scores. In order to obtain hypotheses with confidence scores, please utilize rnnt_decoder_predictions_tensor function with the preserve_frame_confidence flag set to True.
preserve_frame_confidence: Bool flag which preserves the history of per-frame confidence scores
generated during decoding (sample / batched). When set to true, the Hypothesis will contain the non-null value for frame_confidence in it. Here, frame_confidence is a List of List of floats.

The length of the list corresponds to the Acoustic Length (T). Each value in the list (Ti) is a torch.Tensor (U), representing 1 or more confidence scores. U is the number of target tokens for the current timestep Ti.

preserve_token_confidence: Bool flag which preserves the history of per-token confidence scores
generated during greedy decoding (sample / batched). When set to true, the Hypothesis will contain the non-null value for token_confidence in it. Here, token_confidence is a List of floats.

The length of the list corresponds to the number of recognized tokens.

preserve_word_confidence: Bool flag which preserves the history of per-word confidence scores
generated during greedy decoding (sample / batched). When set to true, the Hypothesis will contain the non-null value for word_confidence in it. Here, word_confidence is a List of floats.

The length of the list corresponds to the number of recognized words.

exclude_blank: Bool flag indicating that blank token confidence scores are to be excluded
from the token_confidence.

aggregation: Which aggregation type to use for collapsing per-token confidence into per-word
confidence. Valid options are mean, min, max, prod.

tdt_include_duration: Bool flag indicating that the duration confidence scores are to be calculated and
attached to the regular frame confidence, making TDT frame confidence element a pair: (prediction_confidence, duration_confidence).

method_cfg: A dict-like object which contains the method name and settings to compute per-frame
confidence scores.

name:
The method name (str). Supported values:

’max_prob’ for using the maximum token probability as a confidence.

’entropy’ for using a normalized entropy of a log-likelihood vector.

entropy_type:
Which type of entropy to use (str). Used if confidence_method_cfg.name is set to entropy. Supported values:

’gibbs’ for the (standard) Gibbs entropy. If the alpha (α) is provided,
the formula is the following: H_α = -sum_i((p^α_i)*log(p^α_i)). Note that for this entropy, alpha should satisfy the following inequality: (log(V)+2-sqrt(log^2(V)+4))/(2*log(V)) <= α <= (1+log(V-1))/log(V-1), where V is the model vocabulary size.

’tsallis’ for the Tsallis entropy with the Boltzmann constant one.
Tsallis entropy formula is the following: H_α = 1/(α-1)*(1-sum_i(p^α_i)), where α is a parameter. When α == 1, it works like the Gibbs entropy. More: https://en.wikipedia.org/wiki/Tsallis_entropy

’renyi’ for the Rényi entropy.
Rényi entropy formula is the following: H_α = 1/(1-α)*log_2(sum_i(p^α_i)), where α is a parameter. When α == 1, it works like the Gibbs entropy. More: https://en.wikipedia.org/wiki/R%C3%A9nyi_entropy

alpha:
Power scale for logsoftmax (α for entropies). Here we restrict it to be > 0. When the alpha equals one, scaling is not applied to ‘max_prob’, and any entropy type behaves like the Shannon entropy: H = -sum_i(p_i*log(p_i))

entropy_norm:
A mapping of the entropy value to the interval [0,1]. Supported values:

’lin’ for using the linear mapping.

’exp’ for using exponential mapping with linear shift.

The config may further contain the following sub-dictionaries:

”greedy”:

max_symbols: int, describing the maximum number of target tokens to decode per
timestep during greedy decoding. Setting to larger values allows longer sentences to be decoded, at the cost of increased execution time.

preserve_frame_confidence: Same as above, overrides above value.

confidence_method_cfg: Same as above, overrides confidence_cfg.method_cfg.

”beam”:

beam_size: int, defining the beam size for beam search. Must be >= 1.
If beam_size == 1, cached greedy search is performed. This may produce slightly different results from the greedy search above.

score_norm: optional bool, whether to normalize the returned beam score in the hypotheses.
Set to True by default.

return_best_hypothesis: optional bool, whether to return just the best hypothesis or all of the
hypotheses after beam search has concluded. This flag is set by default.

tsd_max_sym_exp: optional int, determines number of symmetric expansions of the target symbols
per timestep of the acoustic model. Larger values will allow longer sentences to be decoded, at increased cost to execution time.

alsd_max_target_len: optional int or float, determines the potential maximum target sequence
length. If an integer is provided, it can decode sequences of that particular maximum length. If a float is provided, it can decode sequences of int(alsd_max_target_len * seq_len), where seq_len is the length of the acoustic model output (T).

NOTE:
If a float is provided, it can be greater than 1! By default, a float of 2.0 is used so that a target sequence can be at most twice as long as the acoustic model output length T.

maes_num_steps: Number of adaptive steps to take. From the paper, 2 steps is generally sufficient,
and can be reduced to 1 to improve decoding speed while sacrificing some accuracy. int > 0.

maes_prefix_alpha: Maximum prefix length in prefix search. Must be an integer; keeping it at 1 is advised in order to reduce expensive beam search cost later. int >= 0.

maes_expansion_beta: Maximum number of prefix expansions allowed, in addition to the beam size.
Effectively, the number of hypotheses = beam_size + maes_expansion_beta. Must be an int >= 0, and affects the speed of inference since large values will perform a large beam search in the next step.

maes_expansion_gamma: Float pruning threshold used in the prune-by-value step when computing the
expansions. The default (2.3) is selected from the paper. It performs a comparison (max_log_prob - gamma <= log_prob[v]) where v is all vocabulary indices in the Vocab set and max_log_prob is the “most” likely token to be predicted. Gamma therefore provides a margin of additional tokens which can be potential candidates for expansion apart from the “most likely” candidate. Lower values will reduce the number of expansions (by increasing pruning-by-value, thereby improving speed but hurting accuracy). Higher values will increase the number of expansions (by reducing pruning-by-value, thereby reducing speed but potentially improving accuracy). This is a hyper parameter to be experimentally tuned on a validation set.

softmax_temperature: Scales the logits of the joint prior to computing log_softmax.
decoder – The Decoder/Prediction network module.
joint – The Joint network module.
vocabulary – The vocabulary (excluding the RNNT blank token) which will be used for decoding.
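To make the nesting of decoding_cfg easier to see, here is a sketch of the keys described above laid out as a plain Python dict. The text only requires a dict-like object, so a plain dict is used here to keep the example dependency-free. Values such as `max_symbols: 10`, `beam_size: 4` and `alpha: 0.5` are illustrative choices, not documented defaults; `check_decoding_cfg` is a hypothetical helper that only checks the value constraints stated in this section.

```python
GREEDY_STRATEGIES = {"greedy", "greedy_batch"}
BEAM_STRATEGIES = {"beam", "tsd", "alsd"}
AGGREGATIONS = {"mean", "min", "max", "prod"}

decoding_cfg = {
    "strategy": "greedy_batch",
    "compute_hypothesis_token_set": False,
    "preserve_alignments": False,
    "confidence_cfg": {
        "preserve_frame_confidence": True,
        "preserve_token_confidence": True,
        "preserve_word_confidence": True,
        "exclude_blank": True,
        "aggregation": "min",
        "tdt_include_duration": False,
        "method_cfg": {
            "name": "entropy",
            "entropy_type": "tsallis",
            "alpha": 0.5,           # illustrative; must be > 0
            "entropy_norm": "exp",
        },
    },
    "greedy": {"max_symbols": 10},  # illustrative value
    "beam": {
        "beam_size": 4,             # illustrative; must be >= 1
        "score_norm": True,
        "return_best_hypothesis": True,
        "alsd_max_target_len": 2.0,
        "maes_num_steps": 2,
        "maes_prefix_alpha": 1,
        "maes_expansion_beta": 2,   # illustrative; must be >= 0
        "maes_expansion_gamma": 2.3,
    },
}


def check_decoding_cfg(cfg):
    """Return a list of violations of the constraints stated in the docs."""
    problems = []
    if cfg.get("strategy") not in GREEDY_STRATEGIES | BEAM_STRATEGIES:
        problems.append("unknown strategy")
    confidence = cfg.get("confidence_cfg", {})
    if "aggregation" in confidence and confidence["aggregation"] not in AGGREGATIONS:
        problems.append("unknown aggregation")
    method = confidence.get("method_cfg", {})
    if "alpha" in method and not method["alpha"] > 0:
        problems.append("alpha must be > 0")
    beam = cfg.get("beam", {})
    if "beam_size" in beam and beam["beam_size"] < 1:
        problems.append("beam_size must be >= 1")
    return problems


problems = check_decoding_cfg(decoding_cfg)
```

Keys under "greedy" and "beam" are only consulted for the matching strategy, so both sub-dictionaries can coexist in one config as shown.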

decode_ids_to_langs( tokens: List[int], ) → List[str]#

Decode a token id list into language ID (LID) list.

Parameters:: tokens – List of int representing the token ids.
Returns:: A list of decoded LIDS.

decode_ids_to_tokens( tokens: List[int], ) → List[str]#

Implemented by subclass in order to decode a token id list into a token list. A token list is the string representation of each token id.

Parameters:: tokens – List of int representing the token ids.
Returns:: A list of decoded tokens.

decode_tokens_to_lang(tokens: List[int]) → str#

Compute the most likely language ID (LID) string given the tokens.

Parameters:: tokens – List of int representing the token ids.
Returns:: A decoded LID string.

decode_tokens_to_str(tokens: List[int]) → str#

Implemented by subclass in order to decode a token list into a string.

Parameters:: tokens – List of int representing the token ids.
Returns:: A decoded string.

static get_words_offsets( char_offsets: List[Dict[str, str | float]], encoded_char_offsets: List[Dict[str, str | float]], word_delimiter_char: str = ' ', supported_punctuation: Set | None = None, ) → List[Dict[str, str | float]]#

Utility method which constructs word time stamps out of character time stamps.

References

This code is a port of the Hugging Face code for word time stamp construction.

Parameters:

char_offsets – A list of dictionaries, each containing “char”, “start_offset” and “end_offset”, where “char” is decoded with the tokenizer.
encoded_char_offsets – A list of dictionaries, each containing “char”, “start_offset” and “end_offset”, where “char” is the original id/ids from the hypotheses (not decoded with the tokenizer). As we are working with char-based models here, we are using the char_offsets to get the word offsets. encoded_char_offsets is passed for keeping the consistency with AbstractRNNTDecoding’s abstract method.
word_delimiter_char – Character token that represents the word delimiter. By default, “ “.
supported_punctuation – Set containing punctuation marks in the vocabulary.

Returns:

A list of dictionaries containing the word offsets. Each item contains “word”, “start_offset” and “end_offset”.
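The idea of building word time stamps from character time stamps can be sketched in a few lines. This is a simplified illustration rather than the method's actual code: it ignores supported_punctuation and encoded_char_offsets, and `words_from_char_offsets` is a hypothetical name.

```python
def words_from_char_offsets(char_offsets, word_delimiter_char=" "):
    """Group consecutive non-delimiter characters into words with offsets."""
    word_offsets = []
    chars, start, end = [], None, None

    def flush():
        # Emit the word collected so far, if any.
        if chars:
            word_offsets.append(
                {"word": "".join(chars), "start_offset": start, "end_offset": end}
            )

    for item in char_offsets:
        if item["char"] == word_delimiter_char:
            flush()
            chars, start, end = [], None, None
            continue
        if not chars:
            start = item["start_offset"]  # first character fixes the word start
        chars.append(item["char"])
        end = item["end_offset"]          # last character fixes the word end
    flush()
    return word_offsets


char_offsets = [
    {"char": "h", "start_offset": 0, "end_offset": 1},
    {"char": "i", "start_offset": 1, "end_offset": 2},
    {"char": " ", "start_offset": 2, "end_offset": 3},
    {"char": "y", "start_offset": 3, "end_offset": 4},
    {"char": "o", "start_offset": 4, "end_offset": 5},
]
word_offsets = words_from_char_offsets(char_offsets)
```

Each word inherits the start offset of its first character and the end offset of its last one; the delimiter's own frames belong to neither word.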

class nemo.collections.asr.parts.submodules.rnnt_decoding.RNNTBPEDecoding( decoding_cfg, decoder, joint, tokenizer: TokenizerSpec, )#

Bases: AbstractRNNTDecoding

Used for performing RNN-T auto-regressive decoding of the Decoder+Joint network given the encoder state.

Parameters:

decoding_cfg –
A dict-like object which contains the following key-value pairs.
strategy:
str value which represents the type of decoding that can occur. Possible values are :
- greedy, greedy_batch (for greedy decoding).
- beam, tsd, alsd (for beam search decoding).
compute_hypothesis_token_set: A bool flag, which determines whether to compute a list of decoded
tokens as well as the decoded string. Default is False in order to avoid double decoding unless required.

preserve_alignments: Bool flag which preserves the history of logprobs generated during
decoding (sample / batched). When set to true, the Hypothesis will contain the non-null value for alignments in it. Here, alignments is a List of List of Tuple(Tensor (of length V + 1), Tensor(scalar, label after argmax)).

In order to obtain this hypothesis, please utilize rnnt_decoder_predictions_tensor function with the return_hypotheses flag set to True.

The length of the list corresponds to the Acoustic Length (T). Each value in the list (Ti) is a torch.Tensor (U), representing 1 or more targets from a vocabulary. U is the number of target tokens for the current timestep Ti.

compute_timestamps: A bool flag, which determines whether to compute the character/subword, or
word based timestamp mapping the output log-probabilities to discrete intervals of timestamps. The timestamps will be available in the returned Hypothesis.timestep as a dictionary.

compute_langs: A bool flag which enables computing language ID (LID) information per token,
word, and for the entire sample (most likely language ID). The LIDs will be available in the returned Hypothesis object as a dictionary.

rnnt_timestamp_type: A str value, which represents the types of timestamps that should be calculated.
Can take the following values - “char” for character/subword time stamps, “word” for word level time stamps and “all” (default), for both character level and word level time stamps.
word_seperator: Str token representing the separator between words.

segment_seperators: List containing tokens representing the separator(s) between segments.
segment_gap_threshold: The threshold (in frames) that caps the gap between two words necessary for forming
the segments.

preserve_frame_confidence: Bool flag which preserves the history of per-frame confidence scores
generated during decoding (sample / batched). When set to true, the Hypothesis will contain the non-null value for frame_confidence in it. Here, frame_confidence is a List of List of floats.

confidence_cfg: A dict-like object which contains the following key-value pairs related to confidence
scores. In order to obtain hypotheses with confidence scores, please utilize rnnt_decoder_predictions_tensor function with the preserve_frame_confidence flag set to True.
preserve_frame_confidence: Bool flag which preserves the history of per-frame confidence scores
generated during decoding (sample / batched). When set to true, the Hypothesis will contain the non-null value for frame_confidence in it. Here, frame_confidence is a List of List of floats.

The length of the list corresponds to the Acoustic Length (T). Each value in the list (Ti) is a torch.Tensor (U), representing 1 or more confidence scores. U is the number of target tokens for the current timestep Ti.

preserve_token_confidence: Bool flag which preserves the history of per-token confidence scores
generated during greedy decoding (sample / batched). When set to true, the Hypothesis will contain the non-null value for token_confidence in it. Here, token_confidence is a List of floats.

The length of the list corresponds to the number of recognized tokens.

preserve_word_confidence: Bool flag which preserves the history of per-word confidence scores
generated during greedy decoding (sample / batched). When set to true, the Hypothesis will contain the non-null value for word_confidence in it. Here, word_confidence is a List of floats.

The length of the list corresponds to the number of recognized words.

exclude_blank: Bool flag indicating that blank token confidence scores are to be excluded
from the token_confidence.

aggregation: Which aggregation type to use for collapsing per-token confidence into per-word
confidence. Valid options are mean, min, max, prod.

tdt_include_duration: Bool flag indicating that the duration confidence scores are to be calculated and
attached to the regular frame confidence, making TDT frame confidence element a pair: (prediction_confidence, duration_confidence).

method_cfg: A dict-like object which contains the method name and settings to compute per-frame
confidence scores.

name:
The method name (str). Supported values:

’max_prob’ for using the maximum token probability as a confidence.

’entropy’ for using a normalized entropy of a log-likelihood vector.

entropy_type: Which type of entropy to use (str).
Used if confidence_method_cfg.name is set to entropy. Supported values:

’gibbs’ for the (standard) Gibbs entropy. If the alpha (α) is provided,
the formula is the following: H_α = -sum_i((p^α_i)*log(p^α_i)). Note that for this entropy, alpha should satisfy the following inequality: (log(V)+2-sqrt(log^2(V)+4))/(2*log(V)) <= α <= (1+log(V-1))/log(V-1), where V is the model vocabulary size.

’tsallis’ for the Tsallis entropy with the Boltzmann constant one.
Tsallis entropy formula is the following: H_α = 1/(α-1)*(1-sum_i(p^α_i)), where α is a parameter. When α == 1, it works like the Gibbs entropy. More: https://en.wikipedia.org/wiki/Tsallis_entropy

’renyi’ for the Rényi entropy.
Rényi entropy formula is the following: H_α = 1/(1-α)*log_2(sum_i(p^α_i)), where α is a parameter. When α == 1, it works like the Gibbs entropy. More: https://en.wikipedia.org/wiki/R%C3%A9nyi_entropy

alpha: Power scale for logsoftmax (α for entropies). Here we restrict it to be > 0.
When the alpha equals one, scaling is not applied to ‘max_prob’, and any entropy type behaves like the Shannon entropy: H = -sum_i(p_i*log(p_i))

entropy_norm: A mapping of the entropy value to the interval [0,1].
Supported values:

’lin’ for using the linear mapping.

’exp’ for using exponential mapping with linear shift.

The config may further contain the following sub-dictionaries:

”greedy”:

max_symbols: int, describing the maximum number of target tokens to decode per
timestep during greedy decoding. Setting to larger values allows longer sentences to be decoded, at the cost of increased execution time.

preserve_frame_confidence: Same as above, overrides above value.

confidence_method_cfg: Same as above, overrides confidence_cfg.method_cfg.

”beam”:

beam_size: int, defining the beam size for beam search. Must be >= 1.
If beam_size == 1, cached greedy search is performed. This may produce slightly different results from the greedy search above.

score_norm: optional bool, whether to normalize the returned beam score in the hypotheses.
Set to True by default.

return_best_hypothesis: optional bool, whether to return just the best hypothesis or all of the
hypotheses after beam search has concluded.

tsd_max_sym_exp: optional int, determines number of symmetric expansions of the target symbols
per timestep of the acoustic model. Larger values will allow longer sentences to be decoded, at increased cost to execution time.

alsd_max_target_len: optional int or float, determines the potential maximum target sequence
length. If an integer is provided, it can decode sequences of that particular maximum length. If a float is provided, it can decode sequences of int(alsd_max_target_len * seq_len), where seq_len is the length of the acoustic model output (T).

NOTE:
If a float is provided, it can be greater than 1! By default, a float of 2.0 is used so that a target sequence can be at most twice as long as the acoustic model output length T.

maes_num_steps: Number of adaptive steps to take. From the paper, 2 steps is generally sufficient,
and can be reduced to 1 to improve decoding speed while sacrificing some accuracy. int > 0.

maes_prefix_alpha: Maximum prefix length in prefix search. Must be an integer; keeping it at 1 is advised
in order to reduce expensive beam search cost later. int >= 0.

maes_expansion_beta: Maximum number of prefix expansions allowed, in addition to the beam size.
Effectively, the number of hypotheses = beam_size + maes_expansion_beta. Must be an int >= 0, and affects the speed of inference since large values will perform a large beam search in the next step.

maes_expansion_gamma: Float pruning threshold used in the prune-by-value step when computing the
expansions. The default (2.3) is selected from the paper. It performs a comparison (max_log_prob - gamma <= log_prob[v]) where v is all vocabulary indices in the Vocab set and max_log_prob is the “most” likely token to be predicted. Gamma therefore provides a margin of additional tokens which can be potential candidates for expansion apart from the “most likely” candidate. Lower values will reduce the number of expansions (by increasing pruning-by-value, thereby improving speed but hurting accuracy). Higher values will increase the number of expansions (by reducing pruning-by-value, thereby reducing speed but potentially improving accuracy). This is a hyper parameter to be experimentally tuned on a validation set.

softmax_temperature: Scales the logits of the joint prior to computing log_softmax.
decoder – The Decoder/Prediction network module.
joint – The Joint network module.
tokenizer – The tokenizer which will be used for decoding.
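Two of the numeric rules in the "beam" sub-dictionary above are easy to check by hand: the int-or-float handling of alsd_max_target_len and the prune-by-value comparison controlled by maes_expansion_gamma. The sketch below restates both in plain Python; the helper names `alsd_max_len` and `prune_by_value` are hypothetical, and the toy probabilities are made up.

```python
import math


def alsd_max_len(alsd_max_target_len, seq_len):
    """Maximum target length: absolute if int, relative to seq_len (T) if float."""
    if isinstance(alsd_max_target_len, float):
        return int(alsd_max_target_len * seq_len)
    return alsd_max_target_len


def prune_by_value(log_probs, gamma=2.3):
    """Keep vocabulary indices v with max_log_prob - gamma <= log_prob[v]."""
    max_log_prob = max(log_probs)
    return [v for v, lp in enumerate(log_probs) if max_log_prob - gamma <= lp]


# Toy distribution over four tokens.
log_probs = [math.log(p) for p in (0.6, 0.3, 0.05, 0.05)]

kept_default = prune_by_value(log_probs)           # gamma = 2.3
kept_tight = prune_by_value(log_probs, gamma=0.5)  # lower gamma, more pruning
kept_loose = prune_by_value(log_probs, gamma=3.0)  # higher gamma, less pruning
```

With this toy input, lowering gamma leaves only the top token as an expansion candidate, and raising it admits all four, matching the speed versus accuracy trade-off described for maes_expansion_gamma.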

decode_hypothesis( hypotheses_list: List[Hypothesis], ) → List[Hypothesis | NBestHypotheses]#

Decode a list of hypotheses into a list of strings. Overrides the super() method, optionally adding language information.

Parameters:: hypotheses_list – List of Hypothesis.
Returns:: A list of strings.

decode_ids_to_langs( tokens: List[int], ) → List[str]#

Decode a token id list into language ID (LID) list.

Parameters:: tokens – List of int representing the token ids.
Returns:: A list of decoded LIDS.

decode_ids_to_tokens( tokens: List[int], ) → List[str]#

Implemented by subclass in order to decode a token id list into a token list. A token list is the string representation of each token id.

Parameters:: tokens – List of int representing the token ids.
Returns:: A list of decoded tokens.

decode_tokens_to_lang( tokens: List[int], ) → str#

Compute the most likely language ID (LID) string given the tokens.

Parameters:: tokens – List of int representing the token ids.
Returns:: A decoded LID string.

decode_tokens_to_str(tokens: List[int]) → str#

Implemented by subclass in order to decode a token list into a string.

Parameters:: tokens – List of int representing the token ids.
Returns:: A decoded string.

static define_tokenizer_type( vocabulary: List[str], ) → str#: Define the tokenizer type based on the vocabulary.

static define_word_start_condition( tokenizer_type: str, word_delimiter_char: str, ) → Callable[[str, str], bool]#: Define the word start condition based on the tokenizer type and word delimiter character.

get_words_offsets( char_offsets: List[Dict[str, str | float]], encoded_char_offsets: List[Dict[str, str | float]], word_delimiter_char: str = ' ', supported_punctuation: Set | None = None, ) → List[Dict[str, str | float]]#

Utility method which constructs word time stamps out of sub-word time stamps.

Note: Only supports SentencePiece-based tokenizers!

Parameters:

char_offsets – A list of dictionaries, each containing “char”, “start_offset” and “end_offset”, where “char” is decoded with the tokenizer.
encoded_char_offsets – A list of dictionaries, each containing “char”, “start_offset” and “end_offset”, where “char” is the original id/ids from the hypotheses (not decoded with the tokenizer). This is needed for subword tokenization models.
word_delimiter_char – Character token that represents the word delimiter. By default, “ “.
supported_punctuation – Set containing punctuation marks in the vocabulary.

Returns:

A list of dictionaries containing the word offsets. Each item contains “word”, “start_offset” and “end_offset”.
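For subword models the word boundary is carried by the pieces themselves rather than by a separate delimiter token. The sketch below assumes SentencePiece-style pieces, where a piece starting with the "▁" (U+2581) marker begins a new word. It is a simplified illustration, not this method's code: it works on piece strings only, ignores supported_punctuation, and `words_from_subword_offsets` is a hypothetical name.

```python
WORD_START = "\u2581"  # SentencePiece word-boundary marker


def words_from_subword_offsets(piece_offsets):
    """Merge consecutive subword pieces into words with start/end offsets."""
    words = []
    for item in piece_offsets:
        piece = item["char"]
        if piece.startswith(WORD_START) or not words:
            # A marked piece (or the very first piece) opens a new word.
            words.append(
                {
                    "word": piece.lstrip(WORD_START),
                    "start_offset": item["start_offset"],
                    "end_offset": item["end_offset"],
                }
            )
        else:
            # A continuation piece extends the current word.
            words[-1]["word"] += piece
            words[-1]["end_offset"] = item["end_offset"]
    return words


piece_offsets = [
    {"char": "\u2581he", "start_offset": 0, "end_offset": 2},
    {"char": "llo", "start_offset": 2, "end_offset": 4},
    {"char": "\u2581world", "start_offset": 5, "end_offset": 8},
]
subword_words = words_from_subword_offsets(piece_offsets)
```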

class nemo.collections.asr.parts.submodules.rnnt_greedy_decoding.GreedyRNNTInfer( decoder_model: AbstractRNNTDecoder, joint_model: AbstractRNNTJoint, blank_index: int, max_symbols_per_step: int | None = None, preserve_alignments: bool = False, preserve_frame_confidence: bool = False, confidence_method_cfg: omegaconf.DictConfig | None = None, )#

Bases: _GreedyRNNTInfer

A greedy transducer decoder.

Sequence level greedy decoding, performed auto-regressively.

Parameters:

decoder_model – rnnt_utils.AbstractRNNTDecoder implementation.
joint_model – rnnt_utils.AbstractRNNTJoint implementation.
blank_index – int index of the blank token. Can be 0 or len(vocabulary).
max_symbols_per_step – Optional int. The maximum number of symbols that can be added to a sequence in a single time step; if set to None then there is no limit.
preserve_alignments –
Bool flag which preserves the history of alignments generated during greedy decoding (sample / batched). When set to true, the Hypothesis will contain the non-null value for alignments in it. Here, alignments is a List of List of Tuple(Tensor (of length V + 1), Tensor(scalar, label after argmax)).

The length of the list corresponds to the Acoustic Length (T). Each value in the list (Ti) is a torch.Tensor (U), representing 1 or more targets from a vocabulary. U is the number of target tokens for the current timestep Ti.
preserve_frame_confidence –
Bool flag which preserves the history of per-frame confidence scores generated during greedy decoding (sample / batched). When set to true, the Hypothesis will contain the non-null value for frame_confidence in it. Here, frame_confidence is a List of List of floats.

The length of the list corresponds to the Acoustic Length (T). Each value in the list (Ti) is a torch.Tensor (U), representing 1 or more confidence scores. U is the number of target tokens for the current timestep Ti.
confidence_method_cfg –
A dict-like object which contains the method name and settings to compute per-frame confidence scores.
name: The method name (str).
Supported values:
’max_prob’ for using the maximum token probability as a confidence.

’entropy’ for using a normalized entropy of a log-likelihood vector.
entropy_type: Which type of entropy to use (str). Used if confidence_method_cfg.name is set to entropy.
Supported values:
’gibbs’ for the (standard) Gibbs entropy. If the alpha (α) is provided,
the formula is the following: H_α = -sum_i((p^α_i)*log(p^α_i)). Note that for this entropy, alpha should satisfy the following inequality: (log(V)+2-sqrt(log^2(V)+4))/(2*log(V)) <= α <= (1+log(V-1))/log(V-1), where V is the model vocabulary size.

’tsallis’ for the Tsallis entropy with the Boltzmann constant one.
Tsallis entropy formula is the following: H_α = 1/(α-1)*(1-sum_i(p^α_i)), where α is a parameter. When α == 1, it works like the Gibbs entropy. More: https://en.wikipedia.org/wiki/Tsallis_entropy

’renyi’ for the Rényi entropy.
Rényi entropy formula is the following: H_α = 1/(1-α)*log_2(sum_i(p^α_i)), where α is a parameter. When α == 1, it works like the Gibbs entropy. More: https://en.wikipedia.org/wiki/R%C3%A9nyi_entropy
alpha: Power scale for logsoftmax (α for entropies). Here we restrict it to be > 0.
When the alpha equals one, scaling is not applied to ‘max_prob’, and any entropy type behaves like the Shannon entropy: H = -sum_i(p_i*log(p_i))

entropy_norm: A mapping of the entropy value to the interval [0,1].
Supported values:
’lin’ for using the linear mapping.

’exp’ for using exponential mapping with linear shift.
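The admissible range of alpha for the 'gibbs' entropy depends on the vocabulary size V through the inequality quoted above. A small helper makes the range concrete. This sketch assumes `log` in the inequality is the natural logarithm and that V > 2 so that log(V-1) is positive; `gibbs_alpha_bounds` is a hypothetical name.

```python
import math


def gibbs_alpha_bounds(vocab_size):
    """Return (lower, upper) bounds on alpha for the Gibbs entropy."""
    if vocab_size <= 2:
        raise ValueError("the bounds need V > 2 so that log(V-1) > 0")
    log_v = math.log(vocab_size)
    # (log(V) + 2 - sqrt(log^2(V) + 4)) / (2 * log(V))
    lower = (log_v + 2 - math.sqrt(log_v ** 2 + 4)) / (2 * log_v)
    log_vm1 = math.log(vocab_size - 1)
    # (1 + log(V-1)) / log(V-1)
    upper = (1 + log_vm1) / log_vm1
    return lower, upper


def is_valid_gibbs_alpha(alpha, vocab_size):
    lower, upper = gibbs_alpha_bounds(vocab_size)
    return lower <= alpha <= upper


lower, upper = gibbs_alpha_bounds(1024)  # e.g. a 1024-token vocabulary
```

For a 1024-token vocabulary the range comfortably contains alpha = 1, and the upper bound sits only slightly above 1, so large alpha values fall outside it.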

forward( encoder_output: torch.Tensor, encoded_lengths: torch.Tensor, partial_hypotheses: List[Hypothesis] | None = None, )#

Returns a list of hypotheses given an input batch of the encoder hidden embedding. Output token is generated auto-regressively.

Parameters:

encoder_output – A tensor of size (batch, features, timesteps).
encoded_lengths – list of int representing the length of each output sequence.

Returns:

packed list containing batch number of sentences (Hypotheses).
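The control flow of sequence-level greedy transducer decoding, including the effect of max_symbols_per_step, can be sketched without any neural network. In the toy below, `joint_step` is a hypothetical stand-in for the combined Decoder+Joint call, and each "frame" is just a lookup table from the previous label to the label that frame wants to emit next; none of this mirrors the class's real tensors or internals.

```python
BLANK = 3  # toy vocabulary {0, 1, 2} plus blank at index len(vocabulary)


def toy_joint(frame, last_label):
    """Hypothetical joint: score the label this frame scripts after last_label."""
    target = frame.get(last_label, BLANK)
    return [0.0 if i == target else -10.0 for i in range(BLANK + 1)]


def greedy_transducer_decode(frames, joint_step, blank_index, max_symbols_per_step=None):
    hypothesis = []
    last_label = None  # stands in for the prediction network state
    for frame in frames:
        emitted = 0
        # Keep emitting labels for this frame until blank or the per-step cap.
        while max_symbols_per_step is None or emitted < max_symbols_per_step:
            log_probs = joint_step(frame, last_label)
            label = max(range(len(log_probs)), key=log_probs.__getitem__)
            if label == blank_index:
                break  # blank: advance to the next frame
            hypothesis.append(label)
            last_label = label
            emitted += 1
    return hypothesis


# Frame 0 emits 0 then 1, frame 1 emits nothing, frame 2 emits 2 after a 1.
frames = [{None: 0, 0: 1}, {}, {1: 2}]
unlimited = greedy_transducer_decode(frames, toy_joint, BLANK)
capped = greedy_transducer_decode(frames, toy_joint, BLANK, max_symbols_per_step=1)
```

With the cap set to 1 the second label of frame 0 is never emitted, which changes the prediction history and therefore everything decoded afterwards; this is why a small cap can truncate outputs.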

class nemo.collections.asr.parts.submodules.rnnt_greedy_decoding.GreedyBatchedRNNTInfer( decoder_model: AbstractRNNTDecoder, joint_model: AbstractRNNTJoint, blank_index: int, max_symbols_per_step: int | None = None, preserve_alignments: bool = False, preserve_frame_confidence: bool = False, confidence_method_cfg: omegaconf.DictConfig | None = None, loop_labels: bool = True, use_cuda_graph_decoder: bool = True, ngram_lm_model: str | Path | None = None, ngram_lm_alpha: float = 0.0, )#

Bases: _GreedyRNNTInfer, WithOptionalCudaGraphs

A batch level greedy transducer decoder.

Batch level greedy decoding, performed auto-regressively.

Parameters:

decoder_model – rnnt_utils.AbstractRNNTDecoder implementation.
joint_model – rnnt_utils.AbstractRNNTJoint implementation.
blank_index – int index of the blank token. Can be 0 or len(vocabulary).
max_symbols_per_step – Optional int. The maximum number of symbols that can be added to a sequence in a single time step; if set to None then there is no limit.
preserve_alignments –
Bool flag which preserves the history of alignments generated during greedy decoding (sample / batched). When set to true, the Hypothesis will contain the non-null value for alignments in it. Here, alignments is a List of List of Tuple(Tensor (of length V + 1), Tensor(scalar, label after argmax)).

The length of the list corresponds to the Acoustic Length (T). Each value in the list (Ti) is a torch.Tensor (U), representing 1 or more targets from a vocabulary. U is the number of target tokens for the current timestep Ti.
preserve_frame_confidence –
Bool flag which preserves the history of per-frame confidence scores generated during greedy decoding (sample / batched). When set to true, the Hypothesis will contain the non-null value for frame_confidence in it. Here, frame_confidence is a List of List of floats.

The length of the list corresponds to the Acoustic Length (T). Each value in the list (Ti) is a torch.Tensor (U), representing 1 or more confidence scores. U is the number of target tokens for the current timestep Ti.
confidence_method_cfg –
A dict-like object which contains the method name and settings to compute per-frame confidence scores.
name: The method name (str).
Supported values:
’max_prob’ for using the maximum token probability as a confidence.

’entropy’ for using a normalized entropy of a log-likelihood vector.
entropy_type: Which type of entropy to use (str). Used if confidence_method_cfg.name is set to entropy.
Supported values:
’gibbs’ for the (standard) Gibbs entropy. If the alpha (α) is provided,
the formula is the following: H_α = -sum_i((p^α_i)*log(p^α_i)). Note that for this entropy, alpha should satisfy the following inequality: (log(V)+2-sqrt(log^2(V)+4))/(2*log(V)) <= α <= (1+log(V-1))/log(V-1), where V is the model vocabulary size.

’tsallis’ for the Tsallis entropy with the Boltzmann constant one.
Tsallis entropy formula is the following: H_α = 1/(α-1)*(1-sum_i(p^α_i)), where α is a parameter. When α == 1, it works like the Gibbs entropy. More: https://en.wikipedia.org/wiki/Tsallis_entropy

’renyi’ for the Rényi entropy.
Rényi entropy formula is the following: H_α = 1/(1-α)*log_2(sum_i(p^α_i)), where α is a parameter. When α == 1, it works like the Gibbs entropy. More: https://en.wikipedia.org/wiki/R%C3%A9nyi_entropy
alpha: Power scale for logsoftmax (α for entropies). Here we restrict it to be > 0.
When the alpha equals one, scaling is not applied to ‘max_prob’, and any entropy type behaves like the Shannon entropy: H = -sum_i(p_i*log(p_i))

entropy_norm: A mapping of the entropy value to the interval [0,1].
Supported values:
’lin’ for using the linear mapping.

’exp’ for using exponential mapping with linear shift.
loop_labels – Switching between decoding algorithms. Both algorithms produce equivalent results. loop_labels=True (default) algorithm is faster (especially for large batches) but can use a bit more memory (negligible overhead compared to the amount of memory used by the encoder). loop_labels=False is an implementation of a traditional decoding algorithm, which iterates over frames (encoder output vectors), and in the inner loop, decodes labels for the current frame one by one, stopping when <blank> is found. loop_labels=True iterates over labels, on each step finding the next non-blank label (evaluating Joint multiple times in inner loop); It uses a minimal possible amount of calls to prediction network (with maximum possible batch size), which makes it especially useful for scaling the prediction network.
use_cuda_graph_decoder – if CUDA graphs should be enabled for decoding (currently recommended only for inference)
ngram_lm_model – optional n-gram language model (LM) file to use for decoding
ngram_lm_alpha – LM weight
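
The frame-by-frame algorithm described for loop_labels=False can be illustrated with a minimal pure-Python sketch. This is not NeMo's implementation: toy_joint and its SCRIPT table are hypothetical stand-ins for the prediction and joint networks, and BLANK = 0 is an assumed blank index. The sketch only shows the control flow and the effect of max_symbols_per_step.

```python
from typing import Callable, List, Optional

BLANK = 0  # assumed blank index for this toy example


def greedy_frame_loop(
    num_frames: int,
    joint: Callable[[int, List[int]], int],
    max_symbols_per_step: Optional[int] = None,
) -> List[int]:
    """Iterate over frames; on each frame emit labels until blank is predicted."""
    tokens: List[int] = []
    for t in range(num_frames):
        emitted = 0
        while max_symbols_per_step is None or emitted < max_symbols_per_step:
            label = joint(t, tokens)  # argmax of the joint output in a real model
            if label == BLANK:
                break  # blank found: move on to the next frame
            tokens.append(label)
            emitted += 1
    return tokens


# Hypothetical script: frame index -> labels that frame "wants" to emit, in order.
SCRIPT = {0: [5], 1: [], 2: [7, 8, 9]}


def toy_joint(t: int, tokens: List[int]) -> int:
    # Number of labels already emitted on earlier frames.
    before = sum(len(SCRIPT[i]) for i in range(t))
    pos = len(tokens) - before
    wanted = SCRIPT[t]
    return wanted[pos] if pos < len(wanted) else BLANK


unlimited = greedy_frame_loop(3, toy_joint)
capped = greedy_frame_loop(3, toy_joint, max_symbols_per_step=2)
print(unlimited, capped)
```

With the cap set to 2, the third label of frame 2 is dropped because the inner loop stops before blank is reached.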

disable_cuda_graphs()#: Disable CUDA graphs (e.g., for decoding in training)

forward( encoder_output: torch.Tensor, encoded_lengths: torch.Tensor, partial_hypotheses: List[Hypothesis] | None = None, )#

Returns a list of hypotheses given an input batch of the encoder hidden embedding. Output tokens are generated auto-regressively.

Parameters:

encoder_output – A tensor of size (batch, features, timesteps).
encoded_lengths – list of int representing the length of each output sequence.

Returns:

packed list containing batch number of sentences (Hypotheses).

maybe_enable_cuda_graphs()#: Enable CUDA graphs (if allowed)

class nemo.collections.asr.parts.submodules.rnnt_beam_decoding.BeamRNNTInfer( decoder_model: AbstractRNNTDecoder, joint_model: AbstractRNNTJoint, beam_size: int, search_type: str = 'default', score_norm: bool = True, return_best_hypothesis: bool = True, tsd_max_sym_exp_per_step: int | None = 50, alsd_max_target_len: int | float = 1.0, nsc_max_timesteps_expansion: int = 1, nsc_prefix_alpha: int = 1, maes_num_steps: int = 2, maes_prefix_alpha: int = 1, maes_expansion_gamma: float = 2.3, maes_expansion_beta: int = 2, language_model: Dict[str, Any] | None = None, softmax_temperature: float = 1.0, preserve_alignments: bool = False, ngram_lm_model: str | None = None, ngram_lm_alpha: float = 0.0, hat_subtract_ilm: bool = False, hat_ilm_weight: float = 0.0, max_symbols_per_step: int | None = None, blank_lm_score_mode: str | None = 'no_score', pruning_mode: str | None = 'early', allow_cuda_graphs: bool = False, )#

Bases: Typing

Beam search implementation ported from the ESPnet implementation (espnet/espnet).

Sequence level beam decoding or batched-beam decoding, performed auto-regressively depending on the search type chosen.

Parameters:

decoder_model – rnnt_utils.AbstractRNNTDecoder implementation.
joint_model – rnnt_utils.AbstractRNNTJoint implementation.
beam_size –
number of beams for beam search. Must be a positive integer >= 1. If beam size is 1, defaults to stateful greedy search. This greedy search might result in slightly different results than the greedy results obtained by GreedyRNNTInfer due to implementation differences.

For accurate greedy results, please use GreedyRNNTInfer or GreedyBatchedRNNTInfer.
search_type –
str representing the type of beam search to perform. Must be one of [‘default’, ‘tsd’, ‘alsd’, ‘maes’]. ‘nsc’ is currently not supported.

Algorithm used:

default - basic beam search strategy. Larger beams generally result in better decoding,
however the time required for the search also grows steadily.

tsd - time synchronous decoding. Please refer to the paper:
[Alignment-Length Synchronous Decoding for RNN Transducer] (https://ieeexplore.ieee.org/document/9053040) for details on the algorithm implemented.

Time synchronous decoding (TSD) execution time grows by the factor T * max_symmetric_expansions. For longer sequences, T is greater, and can therefore take a long time for beams to obtain good results. This also requires greater memory to execute.

alsd - alignment-length synchronous decoding. Please refer to the paper:
[Alignment-Length Synchronous Decoding for RNN Transducer] (https://ieeexplore.ieee.org/document/9053040) for details on the algorithm implemented.

Alignment-length synchronous decoding (ALSD) execution time is faster than TSD, with growth factor of T + U_max, where U_max is the maximum target length expected during execution.

Generally, T + U_max < T * max_symmetric_expansions. However, ALSD beams are non-unique, therefore it is required to use larger beam sizes to achieve the same (or close to the same) decoding accuracy as TSD.

For a given decoding accuracy, it is possible to attain faster decoding via ALSD than TSD.

maes = modified adaptive expansion search. Please refer to the paper:
[Accelerating RNN Transducer Inference via Adaptive Expansion Search] (https://ieeexplore.ieee.org/document/9250505)

Modified Adaptive Expansion Search (mAES) execution time is adaptive w.r.t. the number of expansions (for tokens) required per timestep. The number of expansions can usually be constrained to 1 or 2, and in most cases 2 is sufficient.

This beam search technique can possibly obtain superior WER while sacrificing some evaluation time.
score_norm – bool, whether to normalize the scores of the log probabilities.
return_best_hypothesis – bool, decides whether to return a single hypothesis (the best out of N), or return all N hypotheses (sorted with best score first). The container class changes based on this flag: when set to True (default), returns a single Hypothesis; when set to False, returns an NBestHypotheses container, which contains a list of Hypothesis objects.
The following arguments are specific to the chosen search_type:
tsd_max_sym_exp_per_step – Used for search_type=tsd. The maximum symmetric expansions allowed per timestep during beam search. Larger values should be used to attempt decoding of longer sequences, but this in turn increases execution time and memory usage.
alsd_max_target_len – Used for search_type=alsd. The maximum expected target sequence length during beam search. Larger values allow decoding of longer sequences at the expense of execution time and memory.
The following two flags are placeholders and unused until the nsc implementation is stabilized:
nsc_max_timesteps_expansion – Unused int.
nsc_prefix_alpha – Unused int.
mAES flags:
maes_num_steps – Number of adaptive steps to take. From the paper, 2 steps is generally sufficient. int > 1.
maes_prefix_alpha – Maximum prefix length in prefix search. Must be an integer, and is advised to keep this as 1 in order to reduce expensive beam search cost later. int >= 0.
maes_expansion_beta – Maximum number of prefix expansions allowed, in addition to the beam size. Effectively, the number of hypothesis = beam_size + maes_expansion_beta. Must be an int >= 0, and affects the speed of inference since large values will perform large beam search in the next step.
maes_expansion_gamma – Float pruning threshold used in the prune-by-value step when computing the expansions. The default (2.3) is selected from the paper. It performs a comparison (max_log_prob - gamma <= log_prob[v]) where v is all vocabulary indices in the Vocab set and max_log_prob is the “most” likely token to be predicted. Gamma therefore provides a margin of additional tokens which can be potential candidates for expansion apart from the “most likely” candidate. Lower values will reduce the number of expansions (by increasing pruning-by-value, thereby improving speed but hurting accuracy). Higher values will increase the number of expansions (by reducing pruning-by-value, thereby reducing speed but potentially improving accuracy). This is a hyper parameter to be experimentally tuned on a validation set.
softmax_temperature – Scales the logits of the joint prior to computing log_softmax.
preserve_alignments –
Bool flag which preserves the history of alignments generated during beam decoding (sample). When set to true, the Hypothesis will contain the non-null value for alignments in it. Here, alignments is a List of List of Tensor (of length V + 1)

The length of the list corresponds to the Acoustic Length (T). Each value in the list (Ti) is a torch.Tensor (U), representing 1 or more targets from a vocabulary. U is the number of target tokens for the current timestep Ti.

NOTE: preserve_alignments is an invalid argument for any search_type other than basic beam search.
ngram_lm_model – str The path to the N-gram LM
ngram_lm_alpha – float Alpha weight of N-gram LM
tokens_type – str Tokenization type [‘subword’, ‘char’]
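
The growth factors quoted above for tsd (T * max_symmetric_expansions) and alsd (T + U_max) can be compared with simple arithmetic. The sketch below uses hypothetical values for T and U_max. It counts search steps implied by those factors only and is not a runtime measurement.

```python
def tsd_steps(num_frames: int, max_sym_exp: int) -> int:
    """Growth factor quoted for time synchronous decoding: T * max expansions."""
    return num_frames * max_sym_exp


def alsd_steps(num_frames: int, u_max: int) -> int:
    """Growth factor quoted for alignment-length synchronous decoding: T + U_max."""
    return num_frames + u_max


T = 200            # hypothetical number of encoder frames
max_sym_exp = 50   # default tsd_max_sym_exp_per_step from the signature
u_max = 200        # hypothetical maximum target length

tsd = tsd_steps(T, max_sym_exp)
alsd = alsd_steps(T, u_max)
print(f"TSD ~ {tsd} steps, ALSD ~ {alsd} steps")
```

Because ALSD beams are non-unique, a larger beam_size may be needed to match TSD accuracy, which offsets part of this gap in practice.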

align_length_sync_decoding( h: torch.Tensor, encoded_lengths: torch.Tensor, partial_hypotheses: Hypothesis | None = None, ) → List[Hypothesis]#

Alignment-length synchronous beam search implementation. Based on https://ieeexplore.ieee.org/document/9053040

Parameters:: h – Encoded speech features (1, T_max, D_enc)
Returns:: N-best decoding results
Return type:: nbest_hyps

compute_ngram_score( current_lm_state: kenlm.State, label: int, ) → Tuple[float, kenlm.State]#: Score computation for kenlm ngram language model.

default_beam_search( h: torch.Tensor, encoded_lengths: torch.Tensor, partial_hypotheses: Hypothesis | None = None, ) → List[Hypothesis]#

Beam search implementation.

Parameters:: h – Encoded speech features (1, T_max, D_enc)
Returns:: N-best decoding results
Return type:: nbest_hyps

greedy_search( h: torch.Tensor, encoded_lengths: torch.Tensor, partial_hypotheses: Hypothesis | None = None, ) → List[Hypothesis]#

Greedy search implementation for transducer. Generic case when beam size = 1. Results might differ slightly due to implementation details as compared to GreedyRNNTInfer and GreedyBatchedRNNTInfer.

Parameters:: h – Encoded speech features (1, T_max, D_enc)
Returns:: 1-best decoding results
Return type:: hyp

property input_types#: Returns definitions of module input ports.

modified_adaptive_expansion_search( h: torch.Tensor, encoded_lengths: torch.Tensor, partial_hypotheses: Hypothesis | None = None, ) → List[Hypothesis]#

Based on/modified from https://ieeexplore.ieee.org/document/9250505

Parameters:: h – Encoded speech features (1, T_max, D_enc)
Returns:: N-best decoding results
Return type:: nbest_hyps

property output_types#: Returns definitions of module output ports.

prefix_search( hypotheses: List[Hypothesis], enc_out: torch.Tensor, prefix_alpha: int, ) → List[Hypothesis]#: Prefix search for NSC and mAES strategies. Based on https://arxiv.org/pdf/1211.3711.pdf

recombine_hypotheses( hypotheses: List[Hypothesis], ) → List[Hypothesis]#

Recombine hypotheses with equivalent output sequence.

Parameters:: hypotheses (list) – list of hypotheses
Returns:: list of recombined hypotheses
Return type:: final (list)
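
A minimal sketch of the recombination idea, assuming that scores are log-probabilities and that hypotheses with the same output sequence are merged by adding their probabilities (a log-sum-exp of the scores). The Hyp dataclass and the recombine helper are hypothetical stand-ins for Hypothesis and the method above; the actual implementation may differ in detail.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Hyp:  # hypothetical stand-in for Hypothesis
    score: float           # log-probability
    y_sequence: List[int]


def logaddexp(a: float, b: float) -> float:
    """Numerically stable log(exp(a) + exp(b))."""
    hi, lo = max(a, b), min(a, b)
    return hi + math.log1p(math.exp(lo - hi))


def recombine(hyps: List[Hyp]) -> List[Hyp]:
    """Merge hypotheses that share the same token sequence."""
    merged: Dict[Tuple[int, ...], Hyp] = {}
    for h in hyps:
        key = tuple(h.y_sequence)
        if key in merged:
            merged[key].score = logaddexp(merged[key].score, h.score)
        else:
            merged[key] = Hyp(h.score, list(h.y_sequence))
    return list(merged.values())


hyps = [
    Hyp(math.log(0.2), [1, 2]),
    Hyp(math.log(0.1), [1, 2]),  # same tokens reached through another alignment
    Hyp(math.log(0.3), [3]),
]
final = recombine(hyps)
```

Here the two [1, 2] hypotheses collapse into one whose probability is 0.2 + 0.1.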

resolve_joint_output( enc_out: torch.Tensor, dec_out: torch.Tensor, ) → Tuple[torch.Tensor, torch.Tensor]#: Resolve output types for RNNT and HAT joint models

set_decoding_type(decoding_type: str)#: Sets the decoding type. See train_kenlm.py in scripts/asr_language_modeling/ for the background.

Parameters:: decoding_type – decoding type

sort_nbest( hyps: List[Hypothesis], ) → List[Hypothesis]#

Sort hypotheses by score or score given sequence length.

Parameters:: hyps – list of hypotheses
Returns:: sorted list of hypotheses
Return type:: hyps
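
The difference between sorting "by score" and "by score given sequence length" can be sketched as follows. This assumes length normalization means dividing the log-probability by the number of tokens. Hyp and the score_norm argument of this helper are hypothetical illustrations, not the library's own code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hyp:  # hypothetical stand-in for Hypothesis
    score: float           # log-probability (negative)
    y_sequence: List[int]


def sort_nbest(hyps: List[Hyp], score_norm: bool = True) -> List[Hyp]:
    """Best hypothesis first; optionally normalize the score by sequence length."""
    if score_norm:
        # max(..., 1) guards against division by zero for empty sequences.
        return sorted(
            hyps, key=lambda h: h.score / max(len(h.y_sequence), 1), reverse=True
        )
    return sorted(hyps, key=lambda h: h.score, reverse=True)


shorter = Hyp(score=-3.0, y_sequence=[4, 2])        # -1.5 per token
longer = Hyp(score=-4.0, y_sequence=[4, 2, 9, 9])   # -1.0 per token

by_raw = sort_nbest([shorter, longer], score_norm=False)
by_norm = sort_nbest([shorter, longer], score_norm=True)
```

Raw scores favor the shorter hypothesis, while normalized scores favor the longer one, which is why normalization matters when beams hold sequences of different lengths.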

time_sync_decoding( h: torch.Tensor, encoded_lengths: torch.Tensor, partial_hypotheses: Hypothesis | None = None, ) → List[Hypothesis]#

Time synchronous beam search implementation. Based on https://ieeexplore.ieee.org/document/9053040

Parameters:: h – Encoded speech features (1, T_max, D_enc)
Returns:: N-best decoding results
Return type:: nbest_hyps

class nemo.collections.asr.parts.submodules.rnnt_beam_decoding.BeamBatchedRNNTInfer( decoder_model: AbstractRNNTDecoder, joint_model: AbstractRNNTJoint, blank_index: int, beam_size: int, search_type: str = 'malsd_batch', score_norm: bool = True, maes_num_steps: int | None = 2, maes_expansion_gamma: float | None = 2.3, maes_expansion_beta: int | None = 2, max_symbols_per_step: int | None = 10, preserve_alignments: bool = False, ngram_lm_model: str | Path | None = None, ngram_lm_alpha: float = 0.0, blank_lm_score_mode: str | BlankLMScoreMode | None = BlankLMScoreMode.LM_WEIGHTED_FULL, pruning_mode: str | PruningMode | None = PruningMode.LATE, allow_cuda_graphs: bool | None = True, return_best_hypothesis: str | None = True, )#

Bases: Typing, ConfidenceMethodMixin

forward( encoder_output: torch.Tensor, encoded_lengths: torch.Tensor, partial_hypotheses: list[Hypothesis] | None = None, ) → Tuple[list[Hypothesis] | List[NBestHypotheses]]#

Returns a list of hypotheses given an input batch of the encoder hidden embedding. Output tokens are generated auto-regressively.

Parameters:

encoder_output – A tensor of size (batch, features, timesteps).
encoded_lengths – list of int representing the length of each output sequence.

Returns:

Tuple of a list of hypotheses for each batch. Each hypothesis contains the decoded sequence, timestamps and associated scores. The format of the returned hypotheses depends on the return_best_hypothesis attribute:

If return_best_hypothesis is True, returns the best hypothesis for each batch.

Otherwise, returns the N-best hypotheses for each batch.

Return type:

Tuple[list[Hypothesis] | List[NBestHypotheses]]

property input_types#: Returns definitions of module input ports.

property output_types#: Returns definitions of module output ports.

TDT Decoding#

class nemo.collections.asr.parts.submodules.rnnt_greedy_decoding.GreedyTDTInfer( decoder_model: AbstractRNNTDecoder, joint_model: AbstractRNNTJoint, blank_index: int, durations: list, max_symbols_per_step: int | None = None, preserve_alignments: bool = False, preserve_frame_confidence: bool = False, include_duration: bool = False, include_duration_confidence: bool = False, confidence_method_cfg: omegaconf.DictConfig | None = None, )#

Bases: _GreedyRNNTInfer

A greedy TDT decoder.

Sequence level greedy decoding, performed auto-regressively.

Parameters:

decoder_model – rnnt_utils.AbstractRNNTDecoder implementation.
joint_model – rnnt_utils.AbstractRNNTJoint implementation.
blank_index – int index of the blank token. Must be len(vocabulary) for TDT models.
durations – a list containing durations for TDT.
max_symbols_per_step – Optional int. The maximum number of symbols that can be added to a sequence in a single time step; if set to None then there is no limit.
preserve_alignments – Bool flag which preserves the history of alignments generated during greedy decoding (sample / batched). When set to true, the Hypothesis will contain the non-null value for alignments in it. Here, alignments is a List of List of Tuple(Tensor (of length V + 1 + num-big-blanks), Tensor(scalar, label after argmax)). The length of the list corresponds to the Acoustic Length (T). Each value in the list (Ti) is a torch.Tensor (U), representing 1 or more targets from a vocabulary. U is the number of target tokens for the current timestep Ti.
preserve_frame_confidence – Bool flag which preserves the history of per-frame confidence scores generated during greedy decoding (sample / batched). When set to true, the Hypothesis will contain the non-null value for frame_confidence in it. Here, frame_confidence is a List of List of floats. The length of the list corresponds to the Acoustic Length (T). Each value in the list (Ti) is a torch.Tensor (U), representing 1 or more confidence scores. U is the number of target tokens for the current timestep Ti.
include_duration – Bool flag, which determines whether predicted durations for each token need to be included in the Hypothesis object. Defaults to False.
include_duration_confidence – Bool flag indicating that the duration confidence scores are to be calculated and attached to the regular frame confidence, making TDT frame confidence element a pair: (prediction_confidence, duration_confidence).
confidence_method_cfg –
A dict-like object which contains the method name and settings to compute per-frame confidence scores.
name: The method name (str).
Supported values:
’max_prob’ for using the maximum token probability as a confidence.

’entropy’ for using a normalized entropy of a log-likelihood vector.
entropy_type: Which type of entropy to use (str). Used if confidence_method_cfg.name is set to entropy.
Supported values:
’gibbs’ for the (standard) Gibbs entropy. If the alpha (α) is provided,
the formula is the following: H_α = -sum_i((p^α_i)*log(p^α_i)). Note that for this entropy, alpha should satisfy the following inequality: (log(V)+2-sqrt(log^2(V)+4))/(2*log(V)) <= α <= (1+log(V-1))/log(V-1), where V is the model vocabulary size.

’tsallis’ for the Tsallis entropy with the Boltzmann constant one.
Tsallis entropy formula is the following: H_α = 1/(α-1)*(1-sum_i(p^α_i)), where α is a parameter. When α == 1, it works like the Gibbs entropy. More: https://en.wikipedia.org/wiki/Tsallis_entropy

’renyi’ for the Rényi entropy.
Rényi entropy formula is the following: H_α = 1/(1-α)*log_2(sum_i(p^α_i)), where α is a parameter. When α == 1, it works like the Gibbs entropy. More: https://en.wikipedia.org/wiki/R%C3%A9nyi_entropy
alpha: Power scale for logsoftmax (α for entropies). Here we restrict it to be > 0.
When the alpha equals one, scaling is not applied to ‘max_prob’, and any entropy type behaves like the Shannon entropy: H = -sum_i(p_i*log(p_i))

entropy_norm: A mapping of the entropy value to the interval [0,1].
Supported values:
’lin’ for using the linear mapping.

’exp’ for using exponential mapping with linear shift.
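
The entropy formulas listed for confidence_method_cfg can be checked numerically. The sketch below implements them literally as written above, in pure Python, on a hypothetical three-token distribution. It does not reproduce NeMo's normalization (entropy_norm) and is only meant to illustrate the formulas and the Gibbs alpha interval.

```python
import math
from typing import List, Tuple


def shannon(p: List[float]) -> float:
    """H = -sum_i(p_i * log(p_i))."""
    return -sum(x * math.log(x) for x in p if x > 0)


def gibbs(p: List[float], alpha: float) -> float:
    """H_a = -sum_i(p_i^a * log(p_i^a))."""
    return -sum((x ** alpha) * math.log(x ** alpha) for x in p if x > 0)


def tsallis(p: List[float], alpha: float) -> float:
    """H_a = 1/(a-1) * (1 - sum_i(p_i^a)); the formula is undefined at a == 1."""
    return (1.0 - sum(x ** alpha for x in p)) / (alpha - 1.0)


def renyi(p: List[float], alpha: float) -> float:
    """H_a = 1/(1-a) * log_2(sum_i(p_i^a)); the formula is undefined at a == 1."""
    return math.log2(sum(x ** alpha for x in p)) / (1.0 - alpha)


def gibbs_alpha_bounds(vocab_size: int) -> Tuple[float, float]:
    """Interval for alpha quoted above for the Gibbs entropy."""
    log_v = math.log(vocab_size)
    lower = (log_v + 2 - math.sqrt(log_v ** 2 + 4)) / (2 * log_v)
    log_v1 = math.log(vocab_size - 1)
    upper = (1 + log_v1) / log_v1
    return lower, upper


p = [0.7, 0.2, 0.1]             # hypothetical token probabilities
max_prob_confidence = max(p)    # the 'max_prob' method
h = shannon(p)
lower, upper = gibbs_alpha_bounds(1024)  # hypothetical vocabulary size
print(f"Shannon: {h:.4f}, alpha interval: [{lower:.3f}, {upper:.3f}]")
```

As alpha approaches 1, the Tsallis value approaches the Shannon entropy in nats. The Rényi formula as written uses log_2, so its limit is the Shannon entropy expressed in bits (the value in nats divided by ln 2).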

forward( encoder_output: torch.Tensor, encoded_lengths: torch.Tensor, partial_hypotheses: List[Hypothesis] | None = None, )#

Returns a list of hypotheses given an input batch of the encoder hidden embedding. Output tokens are generated auto-regressively.

Parameters:

encoder_output – A tensor of size (batch, features, timesteps).
encoded_lengths – list of int representing the length of each output sequence.

Returns:: packed list containing batch number of sentences (Hypotheses).
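
To illustrate how the durations argument affects greedy TDT decoding, here is a minimal pure-Python sketch in which each step predicts a label together with a duration index, and the time pointer advances by the selected duration. The toy_joint function, its SCRIPT table, and the rule that forces one frame of progress on a zero-duration blank or when max_symbols_per_step is reached are assumptions made for this example, not a description of NeMo's exact code.

```python
from typing import Callable, Dict, List, Optional, Tuple

BLANK = 4  # assumed: blank index == len(vocabulary) for a 4-token toy vocabulary


def greedy_tdt(
    num_frames: int,
    durations: List[int],
    joint: Callable[[int, List[int]], Tuple[int, int]],
    max_symbols_per_step: Optional[int] = None,
) -> Tuple[List[int], List[int]]:
    """Return (tokens, frame index at which each token was emitted)."""
    tokens: List[int] = []
    timestamps: List[int] = []
    t = 0
    symbols_at_t = 0
    while t < num_frames:
        label, dur_idx = joint(t, tokens)
        skip = durations[dur_idx]
        if label != BLANK:
            tokens.append(label)
            timestamps.append(t)
            symbols_at_t += 1
        # Assumption: guarantee progress on a zero-duration blank or at the cap.
        cap_hit = max_symbols_per_step is not None and symbols_at_t >= max_symbols_per_step
        if skip == 0 and (label == BLANK or cap_hit):
            skip = 1
        if skip > 0:
            t += skip
            symbols_at_t = 0
    return tokens, timestamps


# Hypothetical script keyed by (frame, number of tokens so far) -> (label, duration index).
SCRIPT: Dict[Tuple[int, int], Tuple[int, int]] = {
    (0, 0): (1, 0),      # emit 1 with duration 0: stay on frame 0
    (0, 1): (2, 2),      # emit 2 with duration 2: jump to frame 2
    (2, 2): (BLANK, 1),  # blank with duration 1: move to frame 3
    (3, 2): (3, 3),      # emit 3 with duration 4: past the last frame
}


def toy_joint(t: int, tokens: List[int]) -> Tuple[int, int]:
    return SCRIPT[(t, len(tokens))]


tokens, timestamps = greedy_tdt(6, [0, 1, 2, 4], toy_joint)
print(tokens, timestamps)
```

Only four joint evaluations are needed for six frames here, since frames 1, 4 and 5 are skipped by the predicted durations.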

class nemo.collections.asr.parts.submodules.rnnt_greedy_decoding.GreedyBatchedTDTInfer( decoder_model: AbstractRNNTDecoder, joint_model: AbstractRNNTJoint, blank_index: int, durations: List[int], max_symbols_per_step: int | None = None, preserve_alignments: bool = False, preserve_frame_confidence: bool = False, include_duration: bool = False, include_duration_confidence: bool = False, confidence_method_cfg: omegaconf.DictConfig | None = None, use_cuda_graph_decoder: bool = True, ngram_lm_model: str | Path | None = None, ngram_lm_alpha: float = 0.0, )#

Bases: _GreedyRNNTInfer, WithOptionalCudaGraphs

A batch level greedy TDT decoder.

Batch level greedy decoding, performed auto-regressively.

Parameters:

decoder_model – rnnt_utils.AbstractRNNTDecoder implementation.
joint_model – rnnt_utils.AbstractRNNTJoint implementation.
blank_index – int index of the blank token. Must be len(vocabulary) for TDT models.
durations – a list containing durations.
max_symbols_per_step – Optional int. The maximum number of symbols that can be added to a sequence in a single time step; if set to None then there is no limit.
preserve_alignments – Bool flag which preserves the history of alignments generated during greedy decoding (sample / batched). When set to true, the Hypothesis will contain the non-null value for alignments in it. Here, alignments is a List of List of Tuple(Tensor (of length V + 1 + num-big-blanks), Tensor(scalar, label after argmax)). The length of the list corresponds to the Acoustic Length (T). Each value in the list (Ti) is a torch.Tensor (U), representing 1 or more targets from a vocabulary. U is the number of target tokens for the current timestep Ti.
preserve_frame_confidence – Bool flag which preserves the history of per-frame confidence scores generated during greedy decoding (sample / batched). When set to true, the Hypothesis will contain the non-null value for frame_confidence in it. Here, frame_confidence is a List of List of floats. The length of the list corresponds to the Acoustic Length (T). Each value in the list (Ti) is a torch.Tensor (U), representing 1 or more confidence scores. U is the number of target tokens for the current timestep Ti.
include_duration – Bool flag, which determines whether predicted durations for each token need to be included in the Hypothesis object. Defaults to False.
include_duration_confidence – Bool flag indicating that the duration confidence scores are to be calculated and attached to the regular frame confidence, making TDT frame confidence element a pair: (prediction_confidence, duration_confidence).
confidence_method_cfg –
A dict-like object which contains the method name and settings to compute per-frame confidence scores.
name: The method name (str).
Supported values:
’max_prob’ for using the maximum token probability as a confidence.

’entropy’ for using a normalized entropy of a log-likelihood vector.
entropy_type: Which type of entropy to use (str). Used if confidence_method_cfg.name is set to entropy.
Supported values:
’gibbs’ for the (standard) Gibbs entropy. If the alpha (α) is provided,
the formula is the following: H_α = -sum_i((p^α_i)*log(p^α_i)). Note that for this entropy, alpha should satisfy the following inequality: (log(V)+2-sqrt(log^2(V)+4))/(2*log(V)) <= α <= (1+log(V-1))/log(V-1), where V is the model vocabulary size.

’tsallis’ for the Tsallis entropy with the Boltzmann constant one.
Tsallis entropy formula is the following: H_α = 1/(α-1)*(1-sum_i(p^α_i)), where α is a parameter. When α == 1, it works like the Gibbs entropy. More: https://en.wikipedia.org/wiki/Tsallis_entropy

’renyi’ for the Rényi entropy.
Rényi entropy formula is the following: H_α = 1/(1-α)*log_2(sum_i(p^α_i)), where α is a parameter. When α == 1, it works like the Gibbs entropy. More: https://en.wikipedia.org/wiki/R%C3%A9nyi_entropy
alpha: Power scale for logsoftmax (α for entropies). Here we restrict it to be > 0.
When the alpha equals one, scaling is not applied to ‘max_prob’, and any entropy type behaves like the Shannon entropy: H = -sum_i(p_i*log(p_i))

entropy_norm: A mapping of the entropy value to the interval [0,1].
Supported values:
’lin’ for using the linear mapping.

’exp’ for using exponential mapping with linear shift.
use_cuda_graph_decoder – if CUDA graphs should be enabled for decoding (currently recommended only for inference)
ngram_lm_model – optional n-gram language model (LM) file to use for decoding
ngram_lm_alpha – LM weight

disable_cuda_graphs()#: Disable CUDA graphs (e.g., for decoding in training)

forward( encoder_output: torch.Tensor, encoded_lengths: torch.Tensor, partial_hypotheses: List[Hypothesis] | None = None, )#

Returns a list of hypotheses given an input batch of the encoder hidden embedding. Output tokens are generated auto-regressively.

Parameters:

encoder_output – A tensor of size (batch, features, timesteps).
encoded_lengths – list of int representing the length of each output sequence.

Returns:: packed list containing batch number of sentences (Hypotheses).

maybe_enable_cuda_graphs()#: Enable CUDA graphs (if allowed)

class nemo.collections.asr.parts.submodules.tdt_beam_decoding.BeamTDTInfer( decoder_model: AbstractRNNTDecoder, joint_model: AbstractRNNTJoint, durations: list, beam_size: int, search_type: str = 'default', score_norm: bool = True, return_best_hypothesis: bool = True, maes_num_steps: int = 2, maes_prefix_alpha: int = 1, maes_expansion_gamma: float = 2.3, maes_expansion_beta: int = 2, softmax_temperature: float = 1.0, preserve_alignments: bool = False, ngram_lm_model: str | None = None, ngram_lm_alpha: float = 0.3, max_symbols_per_step: int | None = None, blank_lm_score_mode: str | None = 'no_score', pruning_mode: str | None = 'early', allow_cuda_graphs: bool = False, )#

Bases: Typing

Beam search implementation for Token-and-Duration Transducer (TDT) models.

Sequence level beam decoding or batched-beam decoding, performed auto-regressively depending on the search type chosen.

Parameters:

decoder_model – rnnt_utils.AbstractRNNTDecoder implementation.
joint_model – rnnt_utils.AbstractRNNTJoint implementation.
durations – list of duration values from TDT model.
beam_size – number of beams for beam search. Must be a positive integer >= 1. If beam size is 1, defaults to stateful greedy search. For accurate greedy results, please use GreedyRNNTInfer or GreedyBatchedRNNTInfer.
search_type –
str representing the type of beam search to perform. Must be one of [‘default’, ‘maes’].

Algorithm used:

default - basic beam search strategy. Larger beams generally result in better decoding,
however the time required for the search also grows steadily.

maes = modified adaptive expansion search. Please refer to the paper:
[Accelerating RNN Transducer Inference via Adaptive Expansion Search] (https://ieeexplore.ieee.org/document/9250505)

Modified Adaptive Expansion Search (mAES) execution time is adaptive w.r.t. the number of expansions (for tokens) required per timestep. The number of expansions can usually be constrained to 1 or 2, and in most cases 2 is sufficient.

This beam search technique can possibly obtain superior WER while sacrificing some evaluation time.
score_norm – bool, whether to normalize the scores of the log probabilities.
return_best_hypothesis – bool, decides whether to return a single hypothesis (the best out of N), or return all N hypotheses (sorted with best score first). The container class changes based on this flag: when set to True (default), returns a single Hypothesis; when set to False, returns an NBestHypotheses container, which contains a list of Hypothesis objects.
The following arguments are specific to the chosen search_type:
mAES flags:
maes_num_steps – Number of adaptive steps to take. From the paper, 2 steps is generally sufficient. int > 1.
maes_prefix_alpha – Maximum prefix length in prefix search. Must be an integer, and is advised to keep this as 1 in order to reduce expensive beam search cost later. int >= 0.
maes_expansion_beta – Maximum number of prefix expansions allowed, in addition to the beam size. Effectively, the number of hypothesis = beam_size + maes_expansion_beta. Must be an int >= 0, and affects the speed of inference since large values will perform large beam search in the next step.
maes_expansion_gamma – Float pruning threshold used in the prune-by-value step when computing the expansions. The default (2.3) is selected from the paper. It performs a comparison (max_log_prob - gamma <= log_prob[v]) where v is all vocabulary indices in the Vocab set and max_log_prob is the “most” likely token to be predicted. Gamma therefore provides a margin of additional tokens which can be potential candidates for expansion apart from the “most likely” candidate. Lower values will reduce the number of expansions (by increasing pruning-by-value, thereby improving speed but hurting accuracy). Higher values will increase the number of expansions (by reducing pruning-by-value, thereby reducing speed but potentially improving accuracy). This is a hyper parameter to be experimentally tuned on a validation set.
softmax_temperature – Scales the logits of the joint prior to computing log_softmax.
preserve_alignments –
Bool flag which preserves the history of alignments generated during beam decoding (sample). When set to true, the Hypothesis will contain the non-null value for alignments in it. Here, alignments is a List of List of Tensor (of length V + 1)

The length of the list corresponds to the Acoustic Length (T). Each value in the list (Ti) is a torch.Tensor (U), representing 1 or more targets from a vocabulary. U is the number of target tokens for the current timestep Ti.

NOTE: preserve_alignments is an invalid argument for any search_type other than basic beam search.
ngram_lm_model – str The path to the N-gram LM.
ngram_lm_alpha – float Alpha weight of N-gram LM.
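
The prune-by-value comparison described for maes_expansion_gamma (max_log_prob - gamma <= log_prob[v]) can be illustrated in isolation. The helper prune_by_value and the four-token distribution below are hypothetical; the sketch covers only the comparison itself, not the surrounding beam expansion logic.

```python
import math
from typing import List


def prune_by_value(log_probs: List[float], gamma: float) -> List[int]:
    """Return vocabulary indices v with max_log_prob - gamma <= log_probs[v]."""
    max_log_prob = max(log_probs)
    return [v for v, lp in enumerate(log_probs) if max_log_prob - gamma <= lp]


# Hypothetical token distribution for a single expansion step.
log_probs = [math.log(x) for x in (0.6, 0.3, 0.05, 0.05)]

kept_default = prune_by_value(log_probs, gamma=2.3)  # default value
kept_strict = prune_by_value(log_probs, gamma=0.5)   # lower gamma: more pruning
kept_loose = prune_by_value(log_probs, gamma=3.0)    # higher gamma: less pruning
print(kept_default, kept_strict, kept_loose)
```

Since 2.3 is close to ln(10), the default keeps roughly those tokens whose probability is at least one tenth of the most likely token's probability.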

compute_ngram_score( current_lm_state: kenlm.State, label: int, ) → Tuple[float, kenlm.State]#

Computes the score for KenLM Ngram language model.

Parameters:

current_lm_state – current state of the KenLM language model.
label – next label.

Returns:

score for label.

Return type:

lm_score

default_beam_search( encoder_outputs: torch.Tensor, encoded_lengths: torch.Tensor, partial_hypotheses: Hypothesis | None = None, ) → List[Hypothesis]#

Default Beam search implementation for TDT models.

Parameters:

encoder_outputs – encoder outputs (batch, features, timesteps).
encoded_lengths – lengths of the encoder outputs.
partial_hypotheses – partial hypotheses.

Returns:

N-best decoding results

Return type:

nbest_hyps

property input_types#: Returns definitions of module input ports.

merge_duplicate_hypotheses(hypotheses)#

Merges hypotheses with identical token sequences and lengths. The combined hypothesis’s probability is the sum of the probabilities of all duplicates. Duplicate hypotheses occur when two consecutive blank tokens are predicted and their duration values sum up to the same number.

Parameters:: hypotheses – list of hypotheses.
Returns:: list of hypotheses without duplicates.
Return type:: hypotheses
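
A minimal sketch of the merge described above, assuming scores are log-probabilities and that "length" refers to the frame a hypothesis has reached. The Hyp dataclass, its field names, and merge_duplicates are hypothetical stand-ins used only to show the summation of duplicate probabilities.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Hyp:  # hypothetical stand-in for Hypothesis
    score: float           # log-probability
    y_sequence: List[int]
    last_frame: int        # frame reached so far (assumed meaning of "length")


def merge_duplicates(hyps: List[Hyp]) -> List[Hyp]:
    """Sum the probabilities of hypotheses with equal tokens and equal last_frame."""
    merged: Dict[Tuple[Tuple[int, ...], int], Hyp] = {}
    for h in hyps:
        key = (tuple(h.y_sequence), h.last_frame)
        if key in merged:
            kept = merged[key]
            hi, lo = max(kept.score, h.score), min(kept.score, h.score)
            kept.score = hi + math.log1p(math.exp(lo - hi))  # log(exp(a) + exp(b))
        else:
            merged[key] = Hyp(h.score, list(h.y_sequence), h.last_frame)
    return list(merged.values())


candidates = [
    Hyp(math.log(0.10), [7], last_frame=5),  # e.g. blanks with durations 1 then 2
    Hyp(math.log(0.05), [7], last_frame=5),  # e.g. blanks with durations 2 then 1
    Hyp(math.log(0.20), [7], last_frame=4),  # same tokens, different frame: kept apart
]
unique = merge_duplicates(candidates)
```

The first two candidates merge into one hypothesis with probability 0.15, while the third stays separate because it ends on a different frame.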

modified_adaptive_expansion_search( encoder_outputs: torch.Tensor, encoded_lengths: torch.Tensor, partial_hypotheses: Hypothesis | None = None, ) → List[Hypothesis]#

Modified Adaptive Expansion Search algorithm for TDT models. Based on/modified from https://ieeexplore.ieee.org/document/9250505. Supports N-gram language model shallow fusion.

Parameters:

encoder_outputs – encoder outputs (batch, features, timesteps).
encoded_lengths – lengths of the encoder outputs.
partial_hypotheses – partial hypotheses.

Returns:

N-best decoding results

Return type:

nbest_hyps

property output_types#: Returns definitions of module output ports.

prefix_search( hypotheses: List[Hypothesis], encoder_output: torch.Tensor, prefix_alpha: int, ) → List[Hypothesis]#

Performs a prefix search and updates the scores of the hypotheses in place. Based on https://arxiv.org/pdf/1211.3711.pdf.

Parameters:

hypotheses – a list of hypotheses sorted by the length from the longest to the shortest.
encoder_output – encoder output.
prefix_alpha – maximum allowable length difference between hypothesis and a prefix.

Returns:

list of hypotheses with updated scores.

Return type:

hypotheses

set_decoding_type(decoding_type: str)#: Sets the decoding type. See train_kenlm.py in scripts/asr_language_modeling/ for the background.

Parameters:: decoding_type – decoding type

sort_nbest( hyps: List[Hypothesis], ) → List[Hypothesis]#

Sort hypotheses by score or score given sequence length.

Parameters:: hyps – list of hypotheses
Returns:: sorted list of hypotheses
Return type:: hyps
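A minimal sketch of the two orderings, assuming that "score given sequence length" means the score divided by the number of decoded tokens and that higher scores are better. Hypotheses are represented here as hypothetical (score, token ids) tuples rather than Hypothesis objects.

```python
from typing import List, Tuple

ScoredSeq = Tuple[float, List[int]]  # (log-score, token ids), illustrative only


def sort_nbest_sketch(hyps: List[ScoredSeq], score_norm: bool) -> List[ScoredSeq]:
    """Sort best-first, optionally normalizing the score by sequence length."""
    if score_norm:
        # Guard against empty sequences to avoid division by zero.
        return sorted(hyps, key=lambda h: h[0] / max(len(h[1]), 1), reverse=True)
    return sorted(hyps, key=lambda h: h[0], reverse=True)
```

Length normalization keeps longer hypotheses from being penalized merely for accumulating more negative log-probability terms.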

class nemo.collections.asr.parts.submodules.tdt_beam_decoding.BeamBatchedTDTInfer( decoder_model: AbstractRNNTDecoder, joint_model: AbstractRNNTJoint, durations: list, blank_index: int, beam_size: int, search_type: str = 'malsd_batch', score_norm: bool = True, max_symbols_per_step: int | None = None, preserve_alignments: bool = False, ngram_lm_model: str | Path | None = None, ngram_lm_alpha: float = 0.0, blank_lm_score_mode: str | BlankLMScoreMode | None = BlankLMScoreMode.NO_SCORE, pruning_mode: str | PruningMode | None = PruningMode.EARLY, allow_cuda_graphs: bool | None = True, return_best_hypothesis: str | None = True, )#

Bases: Typing, ConfidenceMethodMixin

forward( encoder_output: torch.Tensor, encoded_lengths: torch.Tensor, partial_hypotheses: list[Hypothesis] | None = None, ) → Tuple[list[Hypothesis] | List[NBestHypotheses]]#

Returns a list of hypotheses given an input batch of the encoder hidden embedding. Output tokens are generated auto-regressively.

Parameters:

encoder_output – A tensor of size (batch, features, timesteps).
encoded_lengths – list of int representing the length of each output sequence.

Returns:

Tuple of a list of hypotheses for each batch. Each hypothesis contains the decoded sequence, timestamps and associated scores. The format of the returned hypotheses depends on the return_best_hypothesis attribute:

If return_best_hypothesis is True, returns the best hypothesis for each batch.

Otherwise, returns the N-best hypotheses for each batch.

Return type:

Tuple[list[Hypothesis] | List[NBestHypotheses]]

property input_types#: Returns definitions of module input ports.

property output_types#: Returns definitions of module output ports.

Hypotheses#

class nemo.collections.asr.parts.utils.rnnt_utils.Hypothesis( score: float, y_sequence: ~typing.List[int] | torch.Tensor, text: str | None = None, dec_out: ~typing.List[torch.Tensor] | None = None, dec_state: ~typing.List[~typing.List[torch.Tensor]] | ~typing.List[torch.Tensor] | None = None, timestamp: ~typing.List[int] | torch.Tensor = <factory>, alignments: ~typing.List[int] | ~typing.List[~typing.List[int]] | None = None, frame_confidence: ~typing.List[float] | ~typing.List[~typing.List[float]] | None = None, token_confidence: ~typing.List[float] | None = None, word_confidence: ~typing.List[float] | None = None, length: int | torch.Tensor = 0, y: ~typing.List[torch.tensor] | None = None, lm_state: ~typing.Dict[str, ~typing.Any] | ~typing.List[~typing.Any] | None = None, lm_scores: torch.Tensor | None = None, ngram_lm_state: ~typing.Dict[str, ~typing.Any] | ~typing.List[~typing.Any] | None = None, tokens: ~typing.List[int] | torch.Tensor | None = None, last_token: torch.Tensor | None = None, token_duration: torch.Tensor | None = None, last_frame: int | None = None, )#

Bases: object

Hypothesis class for beam search algorithms.

score: A float score obtained from an AbstractRNNTDecoder module’s score_hypothesis method.

y_sequence: Either a sequence of integer ids pointing to some vocabulary, or a packed torch.Tensor behaving in the same manner. dtype must be torch.Long in the latter case.

dec_state: A list (or list of list) of LSTM-RNN decoder states. Can be None.

text: (Optional) A decoded string after processing via CTC / RNN-T decoding (removing the CTC/RNNT blank tokens, and optionally merging word-pieces). Should be used as decoded string for Word Error Rate calculation.
timestamp: (Optional) A list of integer indices representing at which index in the decoding process did the token appear. Should be of same length as the number of non-blank tokens.
alignments: (Optional) Represents the CTC / RNNT token alignments as integer tokens along an axis of time T (for CTC) or Time x Target (TxU). For CTC, represented as a single list of integer indices. For RNNT, represented as a dangling list of list of integer indices. Outer list represents Time dimension (T), inner list represents Target dimension (U). The set of valid indices includes the CTC / RNNT blank token in order to represent alignments.
frame_confidence: (Optional) Represents the CTC / RNNT per-frame confidence scores as token probabilities along an axis of time T (for CTC) or Time x Target (TxU). For CTC, represented as a single list of floats. For RNNT, represented as a dangling list of list of floats. Outer list represents Time dimension (T), inner list represents Target dimension (U).
token_confidence: (Optional) Represents the CTC / RNNT per-token confidence scores as token probabilities along an axis of Target U. Represented as a single list of floats.
word_confidence: (Optional) Represents the CTC / RNNT per-word confidence scores as token probabilities along an axis of Target U. Represented as a single list of floats.
length: Represents the length of the sequence (the original length without padding), otherwise defaults to 0.

y: (Unused) A list of torch.Tensors representing the list of hypotheses.

lm_state: (Unused) A dictionary state cache used by an external Language Model.

lm_scores: (Unused) Score of the external Language Model.

ngram_lm_state: (Optional) State of the external n-gram Language Model.

tokens: (Optional) A list of decoded tokens (can be characters or word-pieces).

last_token: (Optional) A token or batch of tokens which was predicted in the last step.

last_frame: (Optional) Index of the last decoding step at which the hypothesis was updated, including blank token predictions.
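To illustrate how a CTC-style alignment relates to the decoded tokens and timestamps described above, the sketch below applies the standard CTC collapse rule (merge repeated tokens, then drop blanks) and records the frame at which each surviving token first appears. The function name and the choice of "first frame" as the timestamp are assumptions made for this example, not NeMo's decoding code.

```python
from typing import List, Tuple


def collapse_ctc_alignment(
    alignment: List[int], blank_id: int
) -> Tuple[List[int], List[int]]:
    """Collapse a per-frame CTC alignment into tokens and their frame indices."""
    tokens: List[int] = []
    timestamps: List[int] = []
    prev = blank_id
    for frame, tok in enumerate(alignment):
        # Emit a token only when it is non-blank and differs from the previous frame.
        if tok != blank_id and tok != prev:
            tokens.append(tok)
            timestamps.append(frame)
        prev = tok
    return tokens, timestamps
```

The two returned lists always have the same length, which matches the requirement that timestamp has one entry per non-blank token.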

class nemo.collections.asr.parts.utils.rnnt_utils.NBestHypotheses( n_best_hypotheses: List[Hypothesis] | None, )#

Bases: object

List of N best hypotheses

Adapter Networks#

class nemo.collections.asr.parts.submodules.adapters.multi_head_attention_adapter_module.MultiHeadAttentionAdapter(*args: Any, **kwargs: Any)#

Bases: MultiHeadAttention, AdapterModuleUtil

Multi-Head Attention layer of Transformer.

Parameters:

n_head (int) – number of heads
n_feat (int) – size of the features
dropout_rate (float) – dropout rate
proj_dim – Optional integer value for projection before computing attention. If None, then there is no projection (equivalent to proj_dim = n_feat). If > 0, then will project the n_feat to proj_dim before calculating attention. If <0, then will equal n_head, so that each head has a projected dimension of 1.
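The three documented cases for proj_dim can be summarized as a small helper. This is a sketch under stated assumptions: `resolve_proj_dim` and `per_head_dim` are hypothetical names, the per-head size is taken to be the projected size divided by n_head, and a value of 0 (which the description does not cover) is rejected purely as a choice of this example.

```python
from typing import Optional


def resolve_proj_dim(n_feat: int, n_head: int, proj_dim: Optional[int]) -> int:
    """Return the feature size that attention would be computed over."""
    if proj_dim is None:
        return n_feat   # no projection, equivalent to proj_dim = n_feat
    if proj_dim < 0:
        return n_head   # each head ends up with a projected dimension of 1
    if proj_dim == 0:
        # Not covered by the documentation; rejecting it is this sketch's choice.
        raise ValueError("proj_dim == 0 is not handled in this sketch")
    return proj_dim     # project n_feat down (or up) to proj_dim


def per_head_dim(n_feat: int, n_head: int, proj_dim: Optional[int]) -> int:
    """Per-head size, assuming the projected size is split evenly across heads."""
    return resolve_proj_dim(n_feat, n_head, proj_dim) // n_head
```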

forward( query, key, value, mask, pos_emb=None, cache=None, )#

Compute ‘Scaled Dot Product Attention’.

Parameters:

query (torch.Tensor) – (batch, time1, size)
key (torch.Tensor) – (batch, time2, size)
value (torch.Tensor) – (batch, time2, size)
mask (torch.Tensor) – (batch, time1, time2)
cache (torch.Tensor) – (batch, time_cache, size)

Returns:

output (torch.Tensor) – transformed value (batch, time1, d_model) weighted by the query dot key attention
cache (torch.Tensor) – (batch, time_cache_next, size)
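For reference, scaled dot-product attention itself can be written without any dependencies. The sketch below handles a single head and a single batch element, and leaves out the linear projections, dropout, pos_emb and cache. It assumes a mask entry of True hides that key position from the query, which is an assumption about the mask convention rather than something stated above.

```python
import math
from typing import List

Matrix = List[List[float]]


def scaled_dot_product_attention(
    query: Matrix, key: Matrix, value: Matrix, mask: List[List[bool]]
) -> Matrix:
    """Single-head attention: softmax(Q K^T / sqrt(d)) V, with masked keys ignored.

    Shapes: query (time1, d), key (time2, d), value (time2, d_v), mask (time1, time2).
    Each query row must see at least one unmasked key.
    """
    d_k = len(query[0])
    output = []
    for q, mask_row in zip(query, mask):
        scores = [
            -math.inf if hidden else sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k)
            for k, hidden in zip(key, mask_row)
        ]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        output.append(
            [
                sum(w * v[c] for w, v in zip(weights, value)) / total
                for c in range(len(value[0]))
            ]
        )
    return output
```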

get_default_strategy_config() → dataclass#: Returns a default adapter module strategy.

class nemo.collections.asr.parts.submodules.adapters.multi_head_attention_adapter_module.RelPositionMultiHeadAttentionAdapter(*args: Any, **kwargs: Any)#

Bases: RelPositionMultiHeadAttention, AdapterModuleUtil

Multi-Head Attention layer of Transformer-XL with support of relative positional encoding. Paper: https://arxiv.org/abs/1901.02860

Parameters:

n_head (int) – number of heads
n_feat (int) – size of the features
dropout_rate (float) – dropout rate
proj_dim (int, optional) – Optional integer value for projection before computing attention. If None, then there is no projection (equivalent to proj_dim = n_feat). If > 0, then will project the n_feat to proj_dim before calculating attention. If <0, then will equal n_head, so that each head has a projected dimension of 1.
adapter_strategy – By default, MHAResidualAddAdapterStrategyConfig. An adapter composition function object.

forward( query, key, value, mask, pos_emb, cache=None, )#

Compute ‘Scaled Dot Product Attention’ with rel. positional encoding.

Parameters:

query (torch.Tensor) – (batch, time1, size)
key (torch.Tensor) – (batch, time2, size)
value (torch.Tensor) – (batch, time2, size)
mask (torch.Tensor) – (batch, time1, time2)
pos_emb (torch.Tensor) – (batch, time1, size)
cache (torch.Tensor) – (batch, time_cache, size)

Returns:

output (torch.Tensor) – transformed value (batch, time1, d_model) weighted by the query dot key attention
cache_next (torch.Tensor) – (batch, time_cache_next, size)

get_default_strategy_config() → dataclass#: Returns a default adapter module strategy.

class nemo.collections.asr.parts.submodules.adapters.multi_head_attention_adapter_module.PositionalEncodingAdapter(*args: Any, **kwargs: Any)#

Bases: PositionalEncoding, AdapterModuleUtil

Absolute positional embedding adapter.

Note

The absolute positional embedding value is added to the input tensor without a residual connection. The input is therefore changed; if you only require the positional embedding, drop the returned x.

Parameters:

d_model (int) – The input dimension of x.
max_len (int) – The max sequence length.
xscale (float) – The input scaling factor. Defaults to 1.0.
adapter_strategy (AbstractAdapterStrategy) – By default, ReturnResultAdapterStrategyConfig. An adapter composition function object. NOTE: Since this is a positional encoding, it will not add a residual.
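The sketch below illustrates what "added to the input tensor" means for a single sequence. It assumes the standard sinusoidal encoding from the Transformer literature and assumes the input is multiplied by xscale before the embedding is added. Neither detail is stated in the description above, and the function names are hypothetical.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def sinusoidal_table(length: int, d_model: int) -> Matrix:
    """Standard sinusoidal table: sin on even channels, cos on odd channels."""
    table = []
    for pos in range(length):
        row = []
        for i in range(d_model):
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def add_absolute_positions(x: Matrix, xscale: float = 1.0) -> Tuple[Matrix, Matrix]:
    """Return (x * xscale + pos_emb, pos_emb) for one (time, d_model) sequence."""
    pos_emb = sinusoidal_table(len(x), len(x[0]))
    out = [
        [xv * xscale + pv for xv, pv in zip(x_row, p_row)]
        for x_row, p_row in zip(x, pos_emb)
    ]
    return out, pos_emb
```

A caller that only needs the embedding can keep the second element of the returned pair and discard the first.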

get_default_strategy_config() → dataclass#: Returns a default adapter module strategy.

class nemo.collections.asr.parts.submodules.adapters.multi_head_attention_adapter_module.RelPositionalEncodingAdapter(*args: Any, **kwargs: Any)#

Bases: RelPositionalEncoding, AdapterModuleUtil

Relative positional encoding for TransformerXL’s layers. See Appendix B in https://arxiv.org/abs/1901.02860

Note

The relative positional embedding value is not added to the input tensor, so the returned x carries no positional information. If you only require the positional embedding, drop the returned x.

Parameters:

d_model (int) – embedding dim
max_len (int) – maximum input length
xscale (bool) – whether to scale the input by sqrt(d_model)
adapter_strategy – By default, ReturnResultAdapterStrategyConfig. An adapter composition function object.

get_default_strategy_config() → dataclass#: Returns a default adapter module strategy.

Adapter Strategies#

class nemo.collections.asr.parts.submodules.adapters.multi_head_attention_adapter_module.MHAResidualAddAdapterStrategy( stochastic_depth: float = 0.0, l2_lambda: float = 0.0, )#

Bases: ResidualAddAdapterStrategy

An implementation of residual addition of an adapter module with its input for the MHA Adapters.

forward( input: dict, adapter: torch.nn.Module, *, module: AdapterModuleMixin, )#

A basic strategy, comprising a residual connection over the input, after the forward pass by the underlying adapter. Additional work is done to pack and unpack the dictionary of inputs and outputs.

Note: The value tensor is added to the output of the attention adapter as the residual connection.

Parameters:

input –
A dictionary of multiple input arguments for the adapter module.

query, key, value: Original output tensor of the module, or the output of the previous adapter (if more than one adapter is enabled).

mask: Attention mask.

pos_emb: Optional positional embedding for relative encoding.
adapter – The adapter module that is currently required to perform the forward pass.
module – The calling module, in its entirety. It is a module that implements AdapterModuleMixin, therefore the strategy can access all other adapters in this module via module.adapter_layer.

Returns:

The result tensor, after one of the active adapters has finished its forward passes.
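The pack-and-unpack step and the residual over value can be sketched as follows. Tensors are stood in for by flat lists of floats, `mha_residual_add` is a hypothetical name, and the stochastic_depth and l2_lambda options are left out, so this is an illustration of the idea rather than the strategy's actual code.

```python
from typing import Any, Callable, Dict, List

Vector = List[float]


def mha_residual_add(inputs: Dict[str, Any], adapter: Callable[..., Vector]) -> Vector:
    """Run the adapter on the unpacked inputs and add `value` as the residual."""
    adapter_out = adapter(
        inputs["query"],
        inputs["key"],
        inputs["value"],
        inputs["mask"],
        inputs.get("pos_emb"),  # optional, only used for relative encoding
    )
    # The residual connection uses the value tensor, not the query.
    return [v + o for v, o in zip(inputs["value"], adapter_out)]
```

Using value as the residual keeps the output aligned with the tensor that attention transforms, even when query and key come from a different source.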

compute_output( input: torch.Tensor, adapter: torch.nn.Module, *, module: AdapterModuleMixin, ) → torch.Tensor#

Compute the output of a single adapter to some input.

Parameters:

input – Original output tensor of the module, or the output of the previous adapter (if more than one adapter is enabled).
adapter – The adapter module that is currently required to perform the forward pass.
module – The calling module, in its entirety. It is a module that implements AdapterModuleMixin, therefore the strategy can access all other adapters in this module via module.adapter_layer.

Returns:

The result tensor, after one of the active adapters has finished its forward passes.