NeMo SSL Configuration Files

This page covers NeMo configuration file setup that is specific to models in the Speech Self-Supervised Pre-training collection. For general information about how to set up and run experiments that is common to all NeMo models (e.g. experiment manager and PyTorch Lightning trainer parameters), see the NeMo Models page.

Dataset Configuration

Dataset configuration for self-supervised models is mostly the same as for standard ASR training, covered in the ASR configuration documentation. The main difference is that, in order to compute the contrastive loss, we need to mask an equivalent number of patches for all utterances in a batch. This means that we want to avoid durations varying too much within a single batch. There are several ways to achieve this in NeMo:

1) The simplest way is to use the min_duration parameter in the dataset config, which discards all utterances shorter than the specified duration. This is a viable option if removing these utterances will not significantly reduce the total number of hours in your dataset.

2) If your dataset contains many long utterances (longer than ~16 seconds) with varying length, then you may instead want to use the random_segment perturbation, which will sample segments of a certain length from the full sample at runtime (samples below the provided segment length will be padded). You can enable this by adding the following to your dataset config:

augmentor:
  random_segment:
    prob: 1.0
    duration_sec: 16 # specify the duration you want

3) You can also use bucketing to ensure similar utterance lengths within batches. See Bucketing documentation.
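To make the behavior described in option 2 concrete, here is a minimal Python sketch of random-segment sampling with padding. It is an illustration only, not NeMo's implementation: the function name `random_segment` and its arguments are hypothetical, and padding short samples with trailing zeros is an assumption.

```python
import random


def random_segment(samples, sample_rate, duration_sec, rng=random):
    """Return a segment of exactly duration_sec seconds (hypothetical helper).

    Longer inputs are cropped at a random offset; shorter inputs are
    zero-padded at the end (the padding position is an assumption).
    """
    target_len = int(duration_sec * sample_rate)
    if len(samples) <= target_len:
        # Short sample: keep it whole and pad up to the target length.
        return list(samples) + [0.0] * (target_len - len(samples))
    # Long sample: pick a random start so the segment fits entirely.
    start = rng.randrange(len(samples) - target_len + 1)
    return list(samples[start:start + target_len])


# Usage: every output has the same length, regardless of input length.
rng = random.Random(0)
short_audio = [0.1] * 8000      # 0.5 s at 16 kHz
long_audio = [0.1] * 48000      # 3 s at 16 kHz
print(len(random_segment(short_audio, 16000, 1.0, rng)))  # 16000
print(len(random_segment(long_audio, 16000, 1.0, rng)))   # 16000
```

Because every utterance leaves this step with the same length, the number of maskable patches is the same for all items in a batch.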

An example of SSL train and validation configuration should look similar to the following:

model:
  train_ds:
    manifest_filepath: ???
    sample_rate: ${model.sample_rate}
    batch_size: 16 # you may increase batch_size if your memory allows
    shuffle: true
    num_workers: 8
    pin_memory: false
    use_start_end_token: true
    trim_silence: false
    max_duration: 16.7
    min_duration: 8.0
    # tarred datasets
    is_tarred: false
    tarred_audio_filepaths: null
    shuffle_n: 2048
    # bucketing params
    bucketing_strategy: "synced_randomized"
    bucketing_batch_size: null

  validation_ds:
    manifest_filepath: ???
    sample_rate: ${model.sample_rate}
    batch_size: 16 # you may increase batch_size if your memory allows
    shuffle: false
    num_workers: 8
    pin_memory: true
    use_start_end_token: false
    min_duration: 8.0
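Before settling on values such as min_duration: 8.0 and max_duration: 16.7, it can help to check how many hours of audio the filter would keep. The following is a rough sketch, assuming a NeMo-style manifest with one JSON object per line and a `duration` field in seconds. The function name `hours_retained` is hypothetical, and treating both bounds as inclusive is an assumption.

```python
import json


def hours_retained(manifest_lines, min_duration, max_duration=None):
    """Return (kept_hours, total_hours) for a duration filter (hypothetical helper)."""
    total = 0.0
    kept = 0.0
    for line in manifest_lines:
        line = line.strip()
        if not line:
            continue
        duration = float(json.loads(line)["duration"])
        total += duration
        too_short = duration < min_duration
        too_long = max_duration is not None and duration > max_duration
        if not (too_short or too_long):
            kept += duration
    return kept / 3600.0, total / 3600.0


# Usage with an in-memory manifest; a real one would be read from a file.
example_manifest = [
    '{"audio_filepath": "a.wav", "duration": 4.0}',
    '{"audio_filepath": "b.wav", "duration": 10.0}',
    '{"audio_filepath": "c.wav", "duration": 16.0}',
    '{"audio_filepath": "d.wav", "duration": 20.0}',
]
kept_h, total_h = hours_retained(example_manifest, min_duration=8.0, max_duration=16.7)
print(f"kept {kept_h / total_h:.0%} of the audio")
```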

Preprocessor Configuration

The preprocessor computes MFCC or mel spectrogram features that are given as inputs to the model. For details on how to write this section, refer to Preprocessor Configuration.

Augmentation Configurations

For self-supervised pre-training, we recommend using the MaskedPatchAugmentation class for spectrogram masking. This augmentation divides utterances into fixed size patches, and then masks a fixed amount/fraction of them. You can also add freq_masks and freq_width to apply masking to frequency bands.

If you are using contrastive loss with negatives sampled only from masked steps in the same utterance, make sure that the total number of masked steps in each utterance is large enough for the number of sampled negatives. For example, if you are using 4x stride and want to sample 100 negatives, then you will need more than 400 masked steps. With the default patch_size of 48, this means setting mask_patches to at least 9 (9 × 48 = 432 masked steps). When masking a fraction of the patches instead of a fixed number, make sure that the minimum duration of your samples is large enough for the number of negatives to sample.

spec_augment:
  _target_: nemo.collections.asr.modules.MaskedPatchAugmentation
  patch_size: 48 # size of a single patch
  mask_patches: 0.5 # fraction of patches to mask (can be fixed int amount instead)
  freq_masks: 3 # Cut three frequency bands
  freq_width: 20 # ... of width 20 at maximum
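The arithmetic behind the mask_patches guidance above can be sketched as follows. This is a back-of-the-envelope helper, not part of NeMo: both function names are hypothetical, the duration estimate assumes a 10 ms spectrogram hop (an assumption, set through `hop_sec`), and it ignores rounding to whole patches.

```python
def min_mask_patches(encoder_stride, num_negatives, patch_size):
    """Smallest fixed patch count giving more than stride * negatives masked steps."""
    required_steps = encoder_stride * num_negatives
    # Strictly more than required_steps masked spectrogram steps are needed.
    return required_steps // patch_size + 1


def min_duration_for_fraction(encoder_stride, num_negatives, mask_fraction, hop_sec=0.01):
    """Rough utterance duration (seconds) at which a masked fraction covers the requirement."""
    required_steps = encoder_stride * num_negatives
    total_steps = required_steps / mask_fraction
    return total_steps * hop_sec


print(min_mask_patches(encoder_stride=4, num_negatives=100, patch_size=48))  # 9
print(min_duration_for_fraction(4, 100, mask_fraction=0.5))
```

Under these assumptions, masking half of the patches with 4x stride and 100 negatives calls for utterances of roughly 8 seconds, which is in line with the min_duration value used in the dataset example.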

Model Architecture Configurations

Each configuration file should describe the model architecture being used for the experiment. For self-supervised pre-training, we will typically train the encoder of the model and then re-use it for fine-tuning, so the encoder can be configured in the same way as you would for an ASR model. Note that any ASR model encoder can be used with any of the available pre-training methods, though, given the same model sizes, we find the best downstream results when using Conformer.

Unlike the encoders, the decoders and corresponding losses will be specific to the self-supervised pre-training, and are small enough that you can discard them when transferring the model to fine-tuning.

The most basic pre-training method is to have the model solve a contrastive task (this is the approach used in wav2vec 2.0 [SSL-MODELS1]). For an encoder with stride 4x, we can define the corresponding decoder and loss configs in the following way.

decoder_out: 128

decoder:
  _target_: nemo.collections.asr.modules.ConvASRDecoderReconstruction
  feat_in: ${model.encoder.d_model}
  feat_hidden: 128
  feat_out: ${model.decoder_out}
  stride_layers: 0
  # if loss.combine_time_steps is less than the encoder stride, then a corresponding amount of stride_layers needs to
  # be added to the decoder (here stride and combine_time_steps are both 4)
  non_stride_layers: 0

loss:
  _target_: nemo.collections.asr.losses.ContrastiveLoss
  in_dim: ${model.preprocessor.features}
  proj_dim: ${model.decoder_out}
  combine_time_steps: 4 # how many spectrogram time steps are used for one target/representation for contrastive task
  quantized_targets: true # should quantizer or linear layer be used
  codebook_size: 300 # size of a single codebook for quantizer
  num_groups: 2 # number of codebooks to use for quantizer
  num_negatives: 100 # number of sampled negatives for each target
  sample_from_same_utterance_only: true # should negatives be sampled only from the same utterance
  sample_from_non_masked: false # should negatives be sampled from non-masked steps

Note that in the above example we combine 4 steps from the input spectrogram into a single “token” for the loss, which corresponds to the encoder stride 4x. We might want to use different values for combine_time_steps and the encoder stride. In that case, we will need to add stride layers to the decoder to match the strides. We can use the following example config for a Citrinet encoder with stride 8x. In order to go from stride 8x to 4x, we use a single stride layer in the decoder with stride_transpose set to true.

decoder:
  _target_: nemo.collections.asr.modules.ConvASRDecoderReconstruction
  feat_in: ${model.model_defaults.enc_final}
  feat_hidden: 128
  feat_out: ${model.model_defaults.decoder_out_channels}
  stride_layers: 1
  #if loss.combine_time_steps is less than the encoder stride, then a corresponding amount of stride_layers needs to
  #be added to the decoder (here stride is 8 and combine_time_steps is 4, so 1 stride layer is added)
  non_stride_layers: 0
  stride_transpose: true # whether to use transposed convolution for stride layers or not

loss:
  _target_: nemo.collections.asr.losses.ContrastiveLoss
  in_dim: *n_mels
  proj_dim: ${model.model_defaults.decoder_out_channels}
  combine_time_steps: 4 #how many spectrogram time steps are used for one target/representation for contrastive task
  quantized_targets: false #should quantizer or linear layer be used
  sample_from_same_utterance_only: true #should negatives be sampled only from the same utterance
  sample_from_non_masked: false #should negatives be sampled from non-masked steps
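The relationship between the encoder stride, combine_time_steps and stride_layers in the two examples above can be sketched in a few lines of Python. This is an illustration under one explicit assumption: each stride layer changes the time resolution by a factor of 2. The helper name `required_stride_layers` is hypothetical.

```python
def required_stride_layers(encoder_stride, combine_time_steps, factor_per_layer=2):
    """Number of decoder stride layers needed to match the loss resolution.

    Assumes each stride layer rescales time by factor_per_layer (assumed to be 2).
    """
    if encoder_stride % combine_time_steps != 0:
        raise ValueError("encoder stride must be a multiple of combine_time_steps")
    ratio = encoder_stride // combine_time_steps
    layers = 0
    while ratio > 1:
        if ratio % factor_per_layer != 0:
            raise ValueError("stride ratio is not a power of the per-layer factor")
        ratio //= factor_per_layer
        layers += 1
    return layers


print(required_stride_layers(4, 4))  # 0, as in the first example
print(required_stride_layers(8, 4))  # 1, as in the Citrinet example
```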

It can be beneficial to combine contrastive loss with other losses, such as a masked language modeling (mlm) loss (an approach similar to W2V-Bert [SSL-MODELS2]). To do this, instead of specifying a single decoder and loss in the config, we can specify a loss_list, which can contain any number of decoder-loss pairs. For each pair, we specify a separate named sub-config, which contains the following fields:

  1. decoder - The decoder config, specifying a target class and parameters.

  2. loss - The corresponding loss config, specifying a target class and parameters.

  3. loss_alpha - A multiplier on this loss (1.0 by default).

  4. targets_from_loss - This parameter specifies which contrastive loss we should extract labels from. It is necessary for any loss which requires labels, if labels aren’t present in your manifest.

  5. transpose_encoded - This parameter is used to optionally transpose the encoded features before passing them into this loss.

  6. start_step - The training step at which we should start using this decoder+loss.

  7. output_from_layer - This parameter can be used to specify the name of the layer that we should extract encoded features from to pass into this decoder. If it’s not specified or set to null, the final encoder layer is used.
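As a reading aid for loss_alpha and start_step, the sketch below shows one way the per-entry losses could be combined into a single training loss. It is an assumption about the combination rule (a weighted sum, with an entry active once the global step reaches its start_step), not a copy of NeMo's code, and the function name `combine_losses` is hypothetical.

```python
def combine_losses(loss_values, loss_list_cfg, global_step):
    """Weighted sum of the losses that are active at global_step (hypothetical helper).

    loss_values:   {entry name: scalar loss value for this step}
    loss_list_cfg: {entry name: {"loss_alpha": float, "start_step": int, ...}}
    """
    total = 0.0
    for name, value in loss_values.items():
        cfg = loss_list_cfg.get(name, {})
        # Entries are skipped until training reaches their start_step (default 0).
        if global_step < cfg.get("start_step", 0):
            continue
        # loss_alpha defaults to 1.0, as stated in the field list above.
        total += cfg.get("loss_alpha", 1.0) * value
    return total


cfg = {"contrastive": {}, "mlm": {"loss_alpha": 1000.0, "start_step": 100}}
values = {"contrastive": 2.0, "mlm": 0.003}
print(combine_losses(values, cfg, global_step=0))    # only the contrastive term
print(combine_losses(values, cfg, global_step=100))  # both terms
```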

The following is an example of a loss_list for a combination of contrastive+mlm losses, where the mlm loss uses targets from the quantization module of the contrastive loss.

decoder_out: 128

loss_list:
  contrastive:
    decoder:
      _target_: nemo.collections.asr.modules.ConvASRDecoderReconstruction
      feat_in: ${model.encoder.d_model}
      feat_hidden: 128
      # features in hidden layer of decoder
      feat_out: ${model.decoder_out}
      stride_layers: 0
      # if loss.combine_time_steps is less than the encoder stride, then a corresponding amount of stride_layers needs to
      # be added to the decoder (here stride and combine_time_steps are both 4)
      non_stride_layers: 0
    loss:
      _target_: nemo.collections.asr.losses.ContrastiveLoss
      in_dim: ${model.preprocessor.features}
      proj_dim: ${model.decoder_out}
      combine_time_steps: 4 # how many spectrogram time steps are used for one target/representation for contrastive task
      quantized_targets: true # should quantizer or linear layer be used
      # (quantizer is required to extract pseudo-labels for other losses)
      codebook_size: 300
      num_groups: 2
      sample_from_same_utterance_only: true # should negatives be sampled only from the same utterance
      sample_from_non_masked: false # should negatives be sampled from non-masked steps

  mlm:
    decoder:
      _target_: nemo.collections.asr.modules.ConvASRDecoder
      feat_in: ${model.encoder.d_model}
      num_classes: 90000
      # set this to be equal to codebook_size^num_groups in the contrastive loss (300^2 = 90000 here)
    loss:
      _target_: nemo.collections.asr.losses.MLMLoss
      combine_time_steps: 4
    targets_from_loss: "contrastive"
    # since this loss requires targets, we can either get them from a manifest or from a quantized contrastive loss
    loss_alpha: 1000.
    # multiplier applied to this loss relative to others
    transpose_encoded: false
    # transposing input may be necessary depending on which layer is used as input to decoder
    start_step: 0
    # determines what global step this loss starts being used at;
    # this can be set to a higher number if your training is long enough,
    # which may increase early training stability
    output_from_layer: null
    # if we wanted to use outputs from non-final encoder layer as input to this decoder,
    # the layer name should be specified here
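The num_classes value in the mlm decoder follows from the quantizer settings: with num_groups codebooks of codebook_size entries each, there are codebook_size^num_groups possible combinations. The sketch below computes that count and shows one possible way to fold per-group indices into a single label. The mixed-radix mapping is an assumption for illustration only, and both function names are hypothetical.

```python
def num_pseudo_labels(codebook_size, num_groups):
    """Number of distinct labels a product quantizer can emit."""
    return codebook_size ** num_groups


def combined_label(group_indices, codebook_size):
    """Fold one index per codebook into a single id (assumed mixed-radix mapping)."""
    label = 0
    for index in group_indices:
        if not 0 <= index < codebook_size:
            raise ValueError("index outside codebook range")
        label = label * codebook_size + index
    return label


print(num_pseudo_labels(300, 2))        # 90000
print(num_pseudo_labels(24, 2))         # 576
print(combined_label([299, 299], 300))  # 89999, the largest id
```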

We can also use other losses that require labels instead of mlm, such as ctc or rnnt loss. Since these losses, unlike mlm, don’t require the targets to be directly aligned with the steps, we may also want to set the reduce_ids parameter of the contrastive loss to true, which converts any run of consecutive identical ids into a single occurrence of that id.
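The effect of reduce_ids on a target sequence can be illustrated with a short Python sketch. It only mirrors the behavior described above (collapsing runs of identical ids) and is not taken from NeMo; the function name `reduce_consecutive_ids` is hypothetical.

```python
from itertools import groupby


def reduce_consecutive_ids(ids):
    """Collapse each run of identical consecutive ids into one occurrence."""
    return [key for key, _run in groupby(ids)]


print(reduce_consecutive_ids([7, 7, 7, 3, 3, 7, 12]))  # [7, 3, 7, 12]
```

Note that an id can still appear more than once in the result if its occurrences are separated by other ids.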

An example of a loss_list consisting of contrastive+ctc loss can look like this:

decoder_out: 128

loss_list:
  contr:
    decoder:
      _target_: nemo.collections.asr.modules.ConvASRDecoderReconstruction
      feat_in: ${model.encoder.d_model}
      feat_hidden: 128
      feat_out: ${model.decoder_out}
      stride_layers: 0
      non_stride_layers: 0
    loss:
      _target_: nemo.collections.asr.losses.ContrastiveLoss
      in_dim: ${model.preprocessor.features}
      proj_dim: ${model.decoder_out}
      combine_time_steps: 4
      quantized_targets: true
      codebook_size: 300
      num_groups: 2
      sample_from_same_utterance_only: true
      sample_from_non_masked: false
      reduce_ids: true

  ctc:
    decoder:
      _target_: nemo.collections.asr.modules.ConvASRDecoder
      feat_in: ${model.encoder.d_model}
      num_classes: 90000
    loss:
      _target_: nemo.collections.asr.losses.CTCLossForSSL
      num_classes: 90000
    targets_from_loss: "contr"
    start_step: 3000

An example of contrastive+rnnt can look like this:

decoder_out: 128

loss_list:
  contr:
    decoder:
      _target_: nemo.collections.asr.modules.ConvASRDecoderReconstruction
      feat_in: ${model.encoder.d_model}
      feat_hidden: 128
      feat_out: ${model.decoder_out}
      stride_layers: 0
      non_stride_layers: 0
    loss:
      _target_: nemo.collections.asr.losses.ContrastiveLoss
      in_dim: ${model.preprocessor.features}
      proj_dim: ${model.decoder_out}
      combine_time_steps: 4
      quantized_targets: true
      codebook_size: 24
      sample_from_same_utterance_only: true
      sample_from_non_masked: false
      reduce_ids: true

  rnnt:
    decoder:
      _target_: nemo.collections.asr.modules.RNNTDecoderJointSSL
      decoder:
        _target_: nemo.collections.asr.modules.RNNTDecoder
        normalization_mode: null # Currently only null is supported for export.
        random_state_sampling: false # Random state sampling: https://arxiv.org/pdf/1910.11455.pdf
        blank_as_pad: true # This flag must be set in order to support exporting of RNNT models + efficient inference.
        vocab_size: 576
        prednet:
          pred_hidden: 640
          pred_rnn_layers: 1
          t_max: null
          dropout: 0.1
      joint:
        _target_: nemo.collections.asr.modules.RNNTJoint
        log_softmax: null  # 'null' would set it automatically according to CPU/GPU device
        preserve_memory: false  # dramatically slows down training, but might preserve some memory
        experimental_fuse_loss_wer: false
        jointnet:
          encoder_hidden: 512
          pred_hidden: 640
          joint_hidden: 640
          activation: "relu"
          dropout: 0.1
        num_classes: 576
    loss:
      _target_: nemo.collections.asr.losses.RNNTLossForSSL
      num_classes: 576
    targets_from_loss: "contr"
    start_step: 1000

We can also use multiple losses that take features from different intermediate layers of the encoder as input [SSL-MODELS3]. In the following config example, we use contrastive loss plus three different mlm losses, which use encoder outputs from the final, 6th and 12th layers (layer names are zero-indexed, so the 6th layer is layers.5).

decoder_out: 128

loss_list:
  contr:
    decoder:
      _target_: nemo.collections.asr.modules.ConvASRDecoderReconstruction
      feat_in: ${model.encoder.d_model}
      feat_hidden: 128
      feat_out: ${model.decoder_out}
      stride_layers: 0
      non_stride_layers: 0
    loss:
      _target_: nemo.collections.asr.losses.ContrastiveLoss
      in_dim: ${model.preprocessor.features}
      proj_dim: ${model.decoder_out}
      combine_time_steps: 4
      quantized_targets: true
      codebook_size: 300
      sample_from_same_utterance_only: true
      sample_from_non_masked: false
    loss_alpha: 5.

  mlm:
    decoder:
      _target_: nemo.collections.asr.modules.ConvASRDecoder
      feat_in: ${model.encoder.d_model}
      num_classes: 90000
    loss:
      _target_: nemo.collections.asr.losses.MLMLoss
      combine_time_steps: 4
    targets_from_loss: "contr"
    loss_alpha: 1000.

  mlm_2:
    decoder:
      _target_: nemo.collections.asr.modules.ConvASRDecoder
      feat_in: ${model.encoder.d_model}
      num_classes: 90000
    loss:
      _target_: nemo.collections.asr.losses.MLMLoss
      combine_time_steps: 4
    targets_from_loss: "contr"
    loss_alpha: 300.
    output_from_layer: "layers.5"
    transpose_encoded: true

  mlm_3:
    decoder:
      _target_: nemo.collections.asr.modules.ConvASRDecoder
      feat_in: ${model.encoder.d_model}
      num_classes: 90000
    loss:
      _target_: nemo.collections.asr.losses.MLMLoss
      combine_time_steps: 4
    targets_from_loss: "contr"
    loss_alpha: 300.
    output_from_layer: "layers.11"
    transpose_encoded: true
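With several entries in a loss_list, it is easy for targets_from_loss or num_classes to drift out of sync with the contrastive loss settings. The following is a small consistency check over a loss_list that has already been loaded into a Python dict. It is a sketch, not a NeMo utility: the function name `check_loss_list` is hypothetical, it only looks at decoder.num_classes, and it assumes that num_groups defaults to 2 and quantized_targets defaults to false when omitted (the examples above that omit num_groups are consistent with a default of 2).

```python
def check_loss_list(loss_list, default_num_groups=2):
    """Return a list of human-readable problems found in a loss_list dict."""
    problems = []
    for name, cfg in loss_list.items():
        source = cfg.get("targets_from_loss")
        if source is None:
            continue  # this entry does not borrow targets from another loss
        if source not in loss_list:
            problems.append(f"{name}: targets_from_loss '{source}' is not in loss_list")
            continue
        src_loss = loss_list[source].get("loss", {})
        if not src_loss.get("quantized_targets", False):
            problems.append(f"{name}: '{source}' does not set quantized_targets")
            continue
        if "codebook_size" not in src_loss:
            continue  # nothing to compare against
        groups = src_loss.get("num_groups", default_num_groups)
        expected = src_loss["codebook_size"] ** groups
        actual = cfg.get("decoder", {}).get("num_classes")
        if actual is not None and actual != expected:
            problems.append(f"{name}: num_classes is {actual}, expected {expected}")
    return problems


example = {
    "contr": {"loss": {"quantized_targets": True, "codebook_size": 300}},
    "mlm_2": {
        "decoder": {"num_classes": 90000},
        "targets_from_loss": "contr",
        "output_from_layer": "layers.5",
    },
}
print(check_loss_list(example))  # []
```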

References

SSL-MODELS1

Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. Wav2vec 2.0: a framework for self-supervised learning of speech representations. 2020. URL: https://arxiv.org/abs/2006.11477, doi:10.48550/ARXIV.2006.11477.

SSL-MODELS2

Yu-An Chung, Yu Zhang, Wei Han, Chung-Cheng Chiu, James Qin, Ruoming Pang, and Yonghui Wu. W2v-bert: combining contrastive learning and masked language modeling for self-supervised speech pre-training. 2021. URL: https://arxiv.org/abs/2108.06209, doi:10.48550/ARXIV.2108.06209.

SSL-MODELS3

Chengyi Wang, Yu Wu, Sanyuan Chen, Shujie Liu, Jinyu Li, Yao Qian, and Zhenglu Yang. Self-supervised learning for speech recognition with intermediate layer supervision. 2021. URL: https://arxiv.org/abs/2112.08778, doi:10.48550/ARXIV.2112.08778.