NeMo ASR Configuration Files

This section describes the NeMo configuration file setup that is specific to models in the ASR collection. For general information about how to set up and run experiments that is common to all NeMo models (e.g. Experiment Manager and PyTorch Lightning trainer parameters), see the NeMo Models section.

The model section of the NeMo ASR configuration files generally requires information about the dataset(s) being used, the preprocessor for audio files, parameters for any augmentation being performed, as well as the model architecture specification. The sections on this page cover each of these in more detail.

Example configuration files for all of the NeMo ASR scripts can be found in the <NeMo_git_root>/examples/asr/conf directory.

Dataset Configuration

Training, validation, and test parameters are specified using the train_ds, validation_ds, and test_ds sections in the configuration file, respectively. Depending on the task, there may be arguments specifying the sample rate of the audio files, the vocabulary of the dataset (for character prediction), whether or not to shuffle the dataset, and so on. You may also decide to leave fields such as the manifest_filepath blank, to be specified via the command-line at runtime.

Any initialization parameter that is accepted for the Dataset class used in the experiment can be set in the config file. Refer to the Datasets section of the API for a list of Datasets and their respective parameters.

An example ASR train and validation configuration should look similar to the following:

# Specified at the beginning of the config file
labels: &labels [" ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m",
         "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"]

model:
  train_ds:
    manifest_filepath: ???
    sample_rate: 16000
    labels: *labels   # Uses the labels above
    batch_size: 32
    trim_silence: True
    max_duration: 16.7
    shuffle: True
    is_tarred: False  # If set to true, uses the tarred version of the Dataset
    tarred_audio_filepaths: null      # Not used if is_tarred is false
    tarred_shard_strategy: "scatter"  # Not used if is_tarred is false
    num_workers: 8
    pin_memory: true

  validation_ds:
    manifest_filepath: ???
    sample_rate: 16000
    labels: *labels   # Uses the labels above
    batch_size: 32
    shuffle: False    # No need to shuffle the validation data
    num_workers: 8
    pin_memory: true

Preprocessor Configuration

If you are loading audio files for your experiment, you will likely want to use a preprocessor to convert from the raw audio signal to features (e.g. mel-spectrogram or MFCC). The preprocessor section of the config specifies the audio preprocessor to be used via the _target_ field, as well as any initialization parameters for that preprocessor.

An example of specifying a preprocessor is as follows:

model:
  ...
  preprocessor:
    # _target_ is the audio preprocessor module you want to use
    _target_: nemo.collections.asr.modules.AudioToMelSpectrogramPreprocessor
    normalize: "per_feature"
    window_size: 0.02
    ...
    # Other parameters for the preprocessor

Refer to the Audio Preprocessors API section for the preprocessor options, expected arguments, and defaults.
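For reference, a slightly fuller mel-spectrogram preprocessor block is sketched below. The parameter names belong to AudioToMelSpectrogramPreprocessor, but the values shown are illustrative assumptions rather than authoritative defaults, so verify them against the API section:

model:
  ...
  preprocessor:
    _target_: nemo.collections.asr.modules.AudioToMelSpectrogramPreprocessor
    sample_rate: 16000     # Should match the sample rate used in the dataset configs
    normalize: "per_feature"
    window_size: 0.02      # 20 ms analysis window
    window_stride: 0.01    # 10 ms hop between windows
    window: "hann"
    features: 64           # Number of mel bands; should match feat_in of the encoder
    n_fft: 512
    dither: 0.00001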

Augmentation Configurations

There are a few on-the-fly spectrogram augmentation options for NeMo ASR, which can be specified by the configuration file using a spec_augment section.

For example, there are options for Cutout and SpecAugment available via the SpectrogramAugmentation module.

The following example sets up both Cutout (via the rect_* parameters) and SpecAugment (via the freq_* and time_* parameters).

model:
  ...
  spec_augment:
    _target_: nemo.collections.asr.modules.SpectrogramAugmentation
    # Cutout parameters
    rect_masks: 5   # Number of rectangles to cut from any given spectrogram
    rect_freq: 50   # Max cut of size 50 along the frequency dimension
    rect_time: 120  # Max cut of size 120 along the time dimension
    # SpecAugment parameters
    freq_masks: 2   # Cut two frequency bands
    freq_width: 15  # ... of width 15 at maximum
    time_masks: 5   # Cut out five time bands
    time_width: 25  # ... of width 25 at maximum

You can use Cutout, frequency/time SpecAugment, any combination of the two, or neither.
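For instance, a configuration that applies only the frequency/time SpecAugment masking might look like the following sketch; here Cutout is disabled by leaving rect_masks at 0 (this assumes the module's Cutout behavior is off when rect_masks is 0):

model:
  ...
  spec_augment:
    _target_: nemo.collections.asr.modules.SpectrogramAugmentation
    rect_masks: 0   # No Cutout rectangles
    freq_masks: 2   # Cut two frequency bands of width 15 at maximum
    freq_width: 15
    time_masks: 5   # Cut five time bands of width 25 at maximum
    time_width: 25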

With NeMo ASR, you can also add augmentation pipelines that simulate various kinds of noise added to audio in the channel. Augmentors in a pipeline are applied to the audio data read in the data layer. Online augmentors can be specified in the config file using an augmentor section in train_ds. The following example adds an augmentation pipeline that first adds white noise to an audio sample with a probability of 0.5, at a level randomly picked between -50 dB and -10 dB, and then passes the resulting samples through a room impulse response randomly picked from the manifest file provided for impulse augmentation in the config file.

model:
  ...
  train_ds:
    ...
    augmentor:
      white_noise:
        prob: 0.5
        min_level: -50
        max_level: -10
      impulse:
        prob: 0.3
        manifest_path: /path/to/impulse_manifest.json

Refer to the Audio Augmentors API section for more details.

Tokenizer Configurations

Some models utilize sub-word encoding via an external tokenizer instead of explicitly defining their vocabulary.

For such models, a tokenizer section is added to the model config. ASR models currently support two types of custom tokenizers:

  • Google Sentencepiece tokenizers (tokenizer type of bpe in the config)

  • HuggingFace WordPiece tokenizers (tokenizer type of wpe in the config)

In order to build custom tokenizers, refer to the ASR_with_Subword_Tokenization notebook available in the ASR tutorials directory.

The following example sets up a SentencePiece Tokenizer at a path specified by the user:

model:
  ...
  tokenizer:
    dir: "<path to the directory that contains the custom tokenizer files>"
    type: "bpe"  # can be "bpe" or "wpe"

For models which utilize sub-word tokenization, we share the decoder module (ConvASRDecoder) with character tokenization models. All parameters are shared, but for models which utilize sub-word encoding, there are minor differences when setting up the config. For such models, the tokenizer is utilized to fill in the missing information when the model is constructed automatically.

For example, a decoder config corresponding to a sub-word tokenization model should look similar to the following:

model:
  ...
  decoder:
    _target_: nemo.collections.asr.modules.ConvASRDecoder
    feat_in: *enc_final
    num_classes: -1  # filled with vocabulary size from tokenizer at runtime
    vocabulary: []  # filled with vocabulary from tokenizer at runtime

Model Architecture Configurations

Each configuration file should describe the model architecture being used for the experiment. Models in the NeMo ASR collection need an encoder section and a decoder section, with the _target_ field specifying the module to use for each.
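At a high level, the architecture portion of the config therefore has the following shape. This is only a skeleton; the _target_ values shown are the modules used in the examples later in this section, and the elided parameters are architecture-specific:

model:
  ...
  encoder:
    _target_: nemo.collections.asr.modules.ConvASREncoder
    ...   # Encoder-specific parameters
  decoder:
    _target_: nemo.collections.asr.modules.ConvASRDecoder
    ...   # Decoder-specific parameters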

The following parameters in the model section are shared among most of the ASR models:

  • log_prediction (bool): Whether a random sample should be printed in the output at each step, along with its predicted transcript.

  • ctc_reduction (string): The reduction type of the CTC loss. Defaults to mean_batch, which averages over the length of each sample and then over the batch. Supported values: none, mean_batch, mean, sum.
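As a rough sketch, these shared flags sit directly under the model section; whether each one appears (and its default) depends on the particular example config:

model:
  ...
  log_prediction: true         # Print a random sample and its predicted transcript at each step
  ctc_reduction: "mean_batch"  # Reduction type used for the CTC loss
  ...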

The following sections go into more detail about the specific configurations of each model architecture.

For more information about the ASR models, refer to the Models section.

Jasper and QuartzNet

The Jasper and QuartzNet models are very similar, and as such the components in their configs are very similar as well.

Both architectures use the ConvASREncoder for the encoder, with parameters detailed below. The encoder parameters include details about the Jasper/QuartzNet [BxR] encoder architecture, including how many blocks to use (B), how many times to repeat each sub-block (R), and the convolution parameters for each block.

The number of blocks B is determined by the number of list elements under jasper, minus the one prologue and two epilogue blocks. The number of sub-blocks R is set by the repeat parameter.

To use QuartzNet (which uses more compact time-channel separable convolutions) instead of Jasper, add separable: true to all but the last block in the architecture. The block list keeps the field name jasper for both architectures.

  • feat_in (int): The number of input features. Should be equal to features in the preprocessor parameters.

  • activation (string): Which activation function to use in the encoder. Supported values: hardtanh, relu, selu, swish.

  • conv_mask (bool): Whether to use masked convolutions in the encoder. Defaults to true.

  • jasper (list): A list of blocks that specifies your encoder architecture. Each entry in this list represents one block in the architecture and contains the parameters for that block, including convolution parameters, dropout, and the number of times the block is repeated. Refer to the Jasper and QuartzNet papers for details about specific model configurations.

A QuartzNet 15x5 (fifteen blocks, each sub-block repeated five times) encoder configuration should look similar to the following example:

# Specified at the beginning of the file for convenience
n_mels: &n_mels 64    # Used for both the preprocessor and encoder as number of input features
repeat: &repeat 5     # R=5
dropout: &dropout 0.0
separable: &separable true  # Set to true for QN. Set to false for Jasper.

model:
  ...
  encoder:
    _target_: nemo.collections.asr.modules.ConvASREncoder
    feat_in: *n_mels  # Should match "features" in the preprocessor.
    activation: relu
    conv_mask: true

    jasper:   # This field name should be "jasper" for both types of models.

    # Prologue block
    - dilation: [1]
      dropout: *dropout
      filters: 256
      kernel: [33]
      repeat: 1   # Prologue block is not repeated.
      residual: false
      separable: *separable
      stride: [2]

    # Block 1
    - dilation: [1]
      dropout: *dropout
      filters: 256
      kernel: [33]
      repeat: *repeat
      residual: true
      separable: *separable
      stride: [1]

    ... # Entries for blocks 2~14

    # Block 15
    - dilation: [1]
      dropout: *dropout
      filters: 512
      kernel: [75]
      repeat: *repeat
      residual: true
      separable: *separable
      stride: [1]

    # Two epilogue blocks
    - dilation: [2]
      dropout: *dropout
      filters: 512
      kernel: [87]
      repeat: 1   # Epilogue blocks are not repeated
      residual: false
      separable: *separable
      stride: [1]

    - dilation: [1]
      dropout: *dropout
      filters: &enc_filters 1024
      kernel: [1]
      repeat: 1   # Epilogue blocks are not repeated
      residual: false
      stride: [1]

Both Jasper and QuartzNet use the ConvASRDecoder as the decoder. The decoder parameters are detailed in the following table.

For example, a decoder config corresponding to the encoder above should look similar to the following:

model:
  ...
  decoder:
    _target_: nemo.collections.asr.modules.ConvASRDecoder
    feat_in: *enc_filters
    vocabulary: *labels
    num_classes: 28   # Length of the vocabulary list

Citrinet

The Citrinet and QuartzNet models are very similar, and as such the components in their configs are very similar as well. Citrinet utilizes Squeeze-and-Excitation, as well as sub-word tokenization, in contrast to QuartzNet. Depending on the dataset, we utilize different tokenizers: for LibriSpeech, we utilize the HuggingFace WordPiece tokenizer, and for all other datasets we utilize the Google SentencePiece tokenizer, usually with the unigram tokenizer type.

Both architectures use the ConvASREncoder for the encoder, with parameters detailed above. The encoder parameters include details about the Citrinet-C encoder architecture, including how many filters are used per channel (C). The Citrinet-C configuration is a shortform notation for Citrinet-21x5xC, such that B = 21 and R = 5 are the default and should generally not be changed.

To use Citrinet instead of QuartzNet, refer to the citrinet_512.yaml configuration found inside the examples/asr/conf/citrinet directory. Citrinet is primarily composed of the same JasperBlock as Jasper or QuartzNet.

While the configs for Citrinet and QuartzNet are similar, we note the additional flags used for Citrinet below. Refer to the JasperBlock documentation for the meaning of these arguments.

A Citrinet-512 config should look similar to the following:

model:
  ...
  # Specify some defaults across the entire model
  model_defaults:
    repeat: 5
    dropout: 0.1
    separable: true
    se: true
    se_context_size: -1
  ...
  encoder:
    _target_: nemo.collections.asr.modules.ConvASREncoder
    feat_in: *n_mels  # Should match "features" in the preprocessor.
    activation: relu
    conv_mask: true

    jasper:   # This field name should be "jasper" for the JasperBlock (which constructs Citrinet).

    # Prologue block
    - filters: 512
      repeat: 1
      kernel: [5]
      stride: [1]
      dilation: [1]
      dropout: 0.0
      residual: false
      separable: ${model.model_defaults.separable}
      se: ${model.model_defaults.se}
      se_context_size: ${model.model_defaults.se_context_size}

    # Block 1
    - filters: 512
      repeat: ${model.model_defaults.repeat}
      kernel: [11]
      stride: [2]
      dilation: [1]
      dropout: ${model.model_defaults.dropout}
      residual: true
      separable: ${model.model_defaults.separable}
      se: ${model.model_defaults.se}
      se_context_size: ${model.model_defaults.se_context_size}
      stride_last: true
      residual_mode: "stride_add"

    ... # Entries for blocks 2~21

    # Block 22
    - filters: 512
      repeat: ${model.model_defaults.repeat}
      kernel: [39]
      stride: [1]
      dilation: [1]
      dropout: ${model.model_defaults.dropout}
      residual: true
      separable: ${model.model_defaults.separable}
      se: ${model.model_defaults.se}
      se_context_size: ${model.model_defaults.se_context_size}

    # Epilogue block

    - filters: &enc_final 640
      repeat: 1
      kernel: [41]
      stride: [1]
      dilation: [1]
      dropout: 0.0
      residual: false
      separable: ${model.model_defaults.separable}
      se: ${model.model_defaults.se}
      se_context_size: ${model.model_defaults.se_context_size}

As mentioned above, Citrinet uses the ConvASRDecoder as its decoder, similar to QuartzNet. Only the decoder configuration needs to change slightly, since Citrinet utilizes sub-word tokenization.
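For example, a Citrinet decoder config following the sub-word pattern from the Tokenizer Configurations section above would look similar to the following (the *enc_final alias refers to the 640 filters of the epilogue block in the encoder example):

model:
  ...
  decoder:
    _target_: nemo.collections.asr.modules.ConvASRDecoder
    feat_in: *enc_final   # 640, from the epilogue block above
    num_classes: -1       # Filled with the vocabulary size from the tokenizer at runtime
    vocabulary: []        # Filled with the vocabulary from the tokenizer at runtime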

Note

The following information is relevant to any of the above models that implements its encoder as a ConvASREncoder and utilizes the SqueezeExcite mechanism.

The SqueezeExcite block within a ConvASREncoder network can be modified to utilize a different context window after the model has been instantiated (even after the model has been trained), so as to evaluate the model with limited context. This can be achieved using the change_conv_asr_se_context_window() method:

# Here, model can be any model that has a `ConvASREncoder` as its encoder and utilizes `SqueezeExcite` blocks.
# `context_window`: an integer number of timeframes (each corresponding to one window stride) to use as context.
# `update_config`: bool flag that determines whether the model's config should be updated to reflect the new context window.

# Here, we specify that 128 timeframes with a 0.01s window stride should be the context window.
# This is equivalent to a 128 * 0.01s = 1.28s context window for `SqueezeExcite`.
model.change_conv_asr_se_context_window(context_window=128, update_config=True)

Conformer-CTC

The config files for the Conformer-CTC model with character-based encoding and sub-word encoding can be found at <NeMo_git_root>/examples/asr/conf/conformer/conformer_ctc_char.yaml and <NeMo_git_root>/examples/asr/conf/conformer/conformer_ctc_bpe.yaml, respectively. Some components of the Conformer-CTC configs include the following:

  • train_ds, validation_ds, and test_ds

  • optimizer (optim)

  • augmentation (spec_augment)

  • decoder

  • trainer

  • exp_manager

These components are similar to those of other ASR models, such as QuartzNet. There should be a tokenizer section where you can specify the tokenizer if you want to use sub-word encoding instead of character-based encoding.

The encoder section includes the details of the Conformer-CTC encoder architecture. You can find more information in the config files and in nemo.collections.asr.modules.ConformerEncoder.
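As a rough, abbreviated sketch, an encoder section for Conformer-CTC has the following shape. The parameter names come from ConformerEncoder, but the values here are illustrative assumptions; refer to the shipped conformer_ctc_char.yaml and conformer_ctc_bpe.yaml files for the exact settings of each model size:

model:
  ...
  encoder:
    _target_: nemo.collections.asr.modules.ConformerEncoder
    feat_in: ${model.preprocessor.features}
    n_layers: 16            # Number of Conformer blocks
    d_model: 256            # Hidden size of the model
    subsampling: striding   # Downsampling applied to the input features
    subsampling_factor: 4
    self_attention_model: rel_pos
    n_heads: 4
    conv_kernel_size: 31
    dropout: 0.1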