NeMo SSL Configuration Files

This page covers NeMo configuration file setup that is specific to models in the Speech Self-Supervised Pre-training collection. For general information about how to set up and run experiments that is common to all NeMo models (e.g. experiment manager and PyTorch Lightning trainer parameters), see the NeMo Models page.

Dataset configuration for self-supervised models is mostly the same as for standard ASR training, which is covered in the ASR documentation. The main difference is that, in order to compute the contrastive loss, we need to mask an equivalent number of patches for all utterances in a batch. This means that we want to avoid durations varying too much within a single batch. There are several ways to achieve this in NeMo:

1) The simplest way is to use the min_duration parameter in the dataset config, which will simply discard all utterances below the specified length. This is a viable option if removing these utterances will not significantly impact the total amount of hours of your dataset.

2) If your dataset contains many long utterances (longer than ~16 seconds) with varying length, then you may instead want to use the random_segment perturbation, which will sample segments of a certain length from the full sample at runtime (samples below the provided segment length will be padded). You can enable this by adding the following to your dataset config:

```yaml
augmentor:
  random_segment:
    prob: 1.0
    duration_sec: 16  # specify the duration you want
```

3) You can also use bucketing to ensure similar utterance lengths within batches. See Bucketing documentation.
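To judge whether option 1 is viable, it can help to estimate how much audio a given min_duration would discard. The following is a minimal sketch, not part of NeMo: it assumes a JSON-lines manifest in which each entry carries a `duration` field in seconds, and the helper name `hours_kept` and the inclusive boundary handling are hypothetical choices made for illustration.

```python
import json


def hours_kept(manifest_lines, min_duration, max_duration=None):
    """Return (kept_hours, total_hours) for JSON-lines manifest entries.

    Assumes every entry has a "duration" field expressed in seconds.
    """
    kept = 0.0
    total = 0.0
    for line in manifest_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        duration = float(json.loads(line)["duration"])
        total += duration
        # Boundary handling (inclusive here) is an assumption of this sketch.
        long_enough = duration >= min_duration
        short_enough = max_duration is None or duration <= max_duration
        if long_enough and short_enough:
            kept += duration
    return kept / 3600.0, total / 3600.0


# Hypothetical manifest entries used only for demonstration.
sample = [
    '{"audio_filepath": "a.wav", "duration": 3.0}',
    '{"audio_filepath": "b.wav", "duration": 9.0}',
    '{"audio_filepath": "c.wav", "duration": 12.0}',
]
kept, total = hours_kept(sample, min_duration=8.0)
print(f"kept {kept / total:.1%} of the audio")
```

For a real manifest, the lines can be read with `open(path)` and passed in directly. If the kept fraction is low, the random_segment perturbation or bucketing may be a better fit.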

An example of SSL train and validation configuration should look similar to the following:

```yaml
model:
  train_ds:
    manifest_filepath: ???
    sample_rate: ${model.sample_rate}
    batch_size: 16 # you may increase batch_size if your memory allows
    shuffle: true
    num_workers: 8
    pin_memory: false
    use_start_end_token: true
    trim_silence: false
    max_duration: 16.7
    min_duration: 8.0
    # tarred datasets
    is_tarred: false
    tarred_audio_filepaths: null
    shuffle_n: 2048
    # bucketing params
    bucketing_strategy: "synced_randomized"
    bucketing_batch_size: null

  validation_ds:
    manifest_filepath: ???
    sample_rate: ${model.sample_rate}
    batch_size: 16 # you may increase batch_size if your memory allows
    shuffle: false
    num_workers: 8
    pin_memory: true
    use_start_end_token: false
    min_duration: 8.0
```

The preprocessor computes the MFCC or mel spectrogram features that are given as inputs to the model. For details on how to write this section, refer to Preprocessor Configuration.

For self-supervised pre-training, we recommend using the MaskedPatchAugmentation class for spectrogram masking. This augmentation divides utterances into fixed size patches, and then masks a fixed amount/fraction of them. You can also add freq_masks and freq_width to apply masking to frequency bands.

If you are using contrastive loss with negatives sampled only from masked steps in the same utterance, make sure that the total number of masked steps in each utterance is large enough for the number of sampled negatives. For example, if you are using 4x stride and want to sample 100 negatives, then you will need more than 400 masked steps. If you are using the default patch_size of 48, then this means you will need to set mask_patches to at least 9. When using a fraction of the total number of patches instead of a fixed number, you will need to make sure that the minimum duration of your samples is large enough for the number of negatives to sample.

```yaml
spec_augment:
  _target_: nemo.collections.asr.modules.MaskedPatchAugmentation
  patch_size: 48 # size of a single patch
  mask_patches: 0.5 # fraction of patches to mask (can be fixed int amount instead)
  freq_masks: 3 # Cut three frequency bands
  freq_width: 20 # ... of width 20 at maximum
```
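The masking arithmetic described above (4x stride, 100 negatives, patch_size of 48) can be checked with a few lines of Python. This is only a sketch of the calculation. The helper names are hypothetical, and the second helper assumes that a fractional mask_patches masks roughly that fraction of the patches in an utterance, without modelling how NeMo rounds.

```python
import math


def min_mask_patches(num_negatives, stride, patch_size):
    """Smallest fixed mask_patches giving strictly more than
    num_negatives * stride masked spectrogram steps."""
    required_steps = num_negatives * stride + 1  # "more than" => +1
    return math.ceil(required_steps / patch_size)


def min_total_patches(num_negatives, stride, patch_size, mask_fraction):
    """Smallest number of patches an utterance needs when mask_patches
    is a fraction (assumes about mask_fraction of patches get masked)."""
    needed = min_mask_patches(num_negatives, stride, patch_size)
    return math.ceil(needed / mask_fraction)


patches = min_mask_patches(num_negatives=100, stride=4, patch_size=48)
print(patches)  # 9, matching the example in the text

total = min_total_patches(100, 4, 48, mask_fraction=0.5)
print(total, "patches, i.e.", total * 48, "spectrogram steps")
```

With a fractional mask_patches, the second number gives a rough lower bound on utterance length in spectrogram steps, which can be compared against the min_duration in the dataset config.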

Each configuration file should describe the model architecture being used for the experiment. For self-supervised pre-training, we will typically train the encoder of the model and then re-use it for fine-tuning, so the encoder can be configured in the same way as you would for an ASR model. Note that any ASR model encoder can be used with any of the available pre-training methods, though, given the same model sizes, we find the best downstream results when using Conformer.

Unlike the encoders, the decoders and corresponding losses will be specific to the self-supervised pre-training, and are small enough that you can discard them when transferring the model to fine-tuning.

The most basic pre-training method we can use is to have the model solve a contrastive task (this is the approach used in wav2vec 2.0 [SSL-MODELS1]). For an encoder with stride 4x, we can define the corresponding decoder and loss configs in the following way:

```yaml
decoder_out: 128

decoder:
  _target_: nemo.collections.asr.modules.ConvASRDecoderReconstruction
  feat_in: ${model.encoder.d_model}
  feat_hidden: 128
  feat_out: ${model.decoder_out}
  stride_layers: 0
  # if loss.combine_time_steps is less than the encoder stride, then a corresponding amount of stride_layers needs to
  # be added to the decoder (here stride and combine_time_steps are both 4)
  non_stride_layers: 0

loss:
  _target_: nemo.collections.asr.losses.ContrastiveLoss
  in_dim: ${model.preprocessor.features}
  proj_dim: ${model.decoder_out}
  combine_time_steps: 4 # how many spectrogram time steps are used for one target/representation for contrastive task
  quantized_targets: true # should quantizer or linear layer be used
  codebook_size: 300 # size of a single codebook for quantizer
  num_groups: 2 # number of codebooks to use for quantizer
  num_negatives: 100 # number of sampled negatives for each target
  sample_from_same_utterance_only: true # should negatives be sampled only from the same utterance
  sample_from_non_masked: false # should negatives be sampled from non-masked steps
```

Note that in the above example we combine 4 steps from the input spectrogram into a single “token” for the loss, which corresponds to the encoder stride 4x. We might want to use different values for “combine_time_steps” and encoder stride. In that case, we will need to add stride layers to decoders to match the strides. We can use the following example config for a Citrinet encoder with stride 8x. In order to go from stride 8x to 4x, we use a single stride_layer in the decoder with stride_transpose set to True.

```yaml
decoder:
  _target_: nemo.collections.asr.modules.ConvASRDecoderReconstruction
  feat_in: ${model.model_defaults.enc_final}
  feat_hidden: 128
  feat_out: ${model.model_defaults.decoder_out_channels}
  stride_layers: 1
  # if loss.combine_time_steps is less than the encoder stride, then a corresponding amount of stride_layers needs to
  # be added to the decoder (here stride is 8 and combine_time_steps is 4, so 1 stride layer is added)
  non_stride_layers: 0
  stride_transpose: true # whether to use transposed convolution for stride layers or not

loss:
  _target_: nemo.collections.asr.losses.ContrastiveLoss
  in_dim: *n_mels
  proj_dim: ${model.model_defaults.decoder_out_channels}
  combine_time_steps: 4 # how many spectrogram time steps are used for one target/representation for contrastive task
  quantized_targets: false # should quantizer or linear layer be used
  sample_from_same_utterance_only: true # should negatives be sampled only from the same utterance
  sample_from_non_masked: false # should negatives be sampled from non-masked steps
```
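The relationship between encoder stride, combine_time_steps, and stride_layers in the two configs above can be written as a small helper. This is a sketch under one stated assumption: each stride layer changes the time resolution by a factor of 2, which is consistent with both examples (4x to 4x needs 0 layers, 8x to 4x needs 1) but is not confirmed here for other settings. The function name is hypothetical.

```python
def required_stride_layers(encoder_stride, combine_time_steps, stride_per_layer=2):
    """Number of decoder stride layers needed to go from the encoder stride
    to combine_time_steps, assuming each layer changes the resolution by
    stride_per_layer (assumed to be 2 in this sketch)."""
    if encoder_stride % combine_time_steps != 0:
        raise ValueError("encoder stride must be a multiple of combine_time_steps")
    ratio = encoder_stride // combine_time_steps
    layers = 0
    while ratio > 1:
        if ratio % stride_per_layer != 0:
            raise ValueError("stride ratio is not a power of stride_per_layer")
        ratio //= stride_per_layer
        layers += 1
    return layers


print(required_stride_layers(4, 4))  # stride 4x encoder, combine_time_steps 4
print(required_stride_layers(8, 4))  # Citrinet-style stride 8x encoder
```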

It can be beneficial to combine contrastive loss with other losses, such as a masked language modeling (mlm) loss (an approach similar to W2V-Bert [SSL-MODELS2]). To do this, instead of specifying a single decoder and loss in the config, we can specify a loss_list, which can contain any number of decoder-loss pairs. For each decoder-loss pair, we specify a separate named sub-config, which contains the following fields:

  1. decoder - The decoder config, specifying a target class and parameters.

  2. loss - The corresponding loss config, specifying a target class and parameters.

  3. loss_alpha - A multiplier on this loss (1.0 by default).

  4. targets_from_loss - This parameter specifies which contrastive loss we should extract labels from. It is necessary for any loss which requires labels, if labels aren’t present in your manifest.

  5. transpose_encoded - This parameter is used to optionally transpose the encoded features before passing them into this loss.

  6. start_step - The training step at which we should start using this decoder+loss.

  7. output_from_layer - This parameter can be used to specify the name of the layer that we should extract encoded features from to pass into this decoder. If it’s not specified or set to null, the final encoder layer is used.
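To make the roles of loss_alpha and start_step concrete, the following sketch combines per-loss values into a single training loss. It is an illustration of the semantics described in the list above, not NeMo's actual implementation: the function name, the plain-dict config, the default start_step of 0, and the choice to activate a loss once the step is greater than or equal to start_step are all assumptions.

```python
def combine_losses(loss_values, loss_list_cfg, global_step):
    """Weighted sum of the losses that are active at global_step.

    loss_values:   {name: scalar loss value}
    loss_list_cfg: {name: {"loss_alpha": float, "start_step": int}}
    """
    total = 0.0
    for name, value in loss_values.items():
        cfg = loss_list_cfg.get(name, {})
        # Skip losses that have not been switched on yet (assumed inclusive).
        if global_step < cfg.get("start_step", 0):
            continue
        # loss_alpha defaults to 1.0, as stated in the field list.
        total += cfg.get("loss_alpha", 1.0) * value
    return total


# Hypothetical values used only for demonstration.
cfg = {"contr": {}, "ctc": {"loss_alpha": 0.5, "start_step": 3000}}
values = {"contr": 2.0, "ctc": 4.0}
print(combine_losses(values, cfg, global_step=100))   # only "contr" is active
print(combine_losses(values, cfg, global_step=3000))  # both losses contribute
```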

The following is an example of a loss_list for a combination of contrastive+mlm losses, where the mlm loss uses targets from the quantization module of the contrastive loss.

```yaml
decoder_out: 128

loss_list:
  contrastive:
    decoder:
      _target_: nemo.collections.asr.modules.ConvASRDecoderReconstruction
      feat_in: ${model.encoder.d_model}
      feat_hidden: 128
      # features in hidden layer of decoder
      feat_out: ${model.decoder_out}
      stride_layers: 0
      # if loss.combine_time_steps is less than the encoder stride, then a corresponding amount of stride_layers needs to
      # be added to the decoder (here stride and combine_time_steps are both 4)
      non_stride_layers: 0
    loss:
      _target_: nemo.collections.asr.losses.ContrastiveLoss
      in_dim: ${model.preprocessor.features}
      proj_dim: ${model.decoder_out}
      combine_time_steps: 4 # how many spectrogram time steps are used for one target/representation for contrastive task
      quantized_targets: true # should quantizer or linear layer be used
      # (quantizer is required to extract pseudo-labels for other losses)
      codebook_size: 300
      num_groups: 2
      sample_from_same_utterance_only: true # should negatives be sampled only from the same utterance
      sample_from_non_masked: false # should negatives be sampled from non-masked steps

  mlm:
    decoder:
      _target_: nemo.collections.asr.modules.ConvASRDecoder
      feat_in: ${model.encoder.d_model}
      num_classes: 90000
      # set this to be equal to codebook_size^groups in the contrastive loss
    loss:
      _target_: nemo.collections.asr.losses.MLMLoss
      combine_time_steps: 4
    targets_from_loss: "contrastive"
    # since this loss requires targets, we can either get them from a manifest or from a quantized contrastive loss
    loss_alpha: 1000.
    # multiplier applied to this loss relative to others
    transpose_encoded: false
    # transposing input may be necessary depending on which layer is used as input to decoder
    start_step: 0
    # determines what global step this loss starts being used at;
    # this can be set to a higher number if your training is long enough,
    # which may increase early training stability
    output_from_layer: null
    # if we wanted to use outputs from non-final encoder layer as input to this decoder,
    # the layer name should be specified here
```
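The comment on num_classes in the config above says it should equal codebook_size^groups. A quick check of that arithmetic, using a hypothetical helper name:

```python
def num_pseudo_label_classes(codebook_size, num_groups):
    """Number of distinct pseudo-labels the quantizer can produce:
    one entry is chosen from each of num_groups codebooks."""
    return codebook_size ** num_groups


print(num_pseudo_label_classes(300, 2))  # value used for num_classes above
```

Note that the class count grows quickly with the codebook size, so the decoder's output layer for the label-based loss can become large. A smaller codebook keeps it manageable: for instance, a codebook_size of 24 with two groups gives 576 classes.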

Instead of mlm, we can also use other losses that require labels, such as ctc or rnnt loss. Since these losses, unlike mlm, don’t require the targets to be directly aligned with the time steps, we may also want to set the reduce_ids parameter of the contrastive loss to true, which converts any sequence of consecutive identical ids into a single occurrence of that id.
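The effect of collapsing consecutive identical ids can be illustrated in plain Python. This sketch only demonstrates the transformation described above on a list of integers. It is not NeMo's implementation, and the function name simply mirrors the config parameter.

```python
from itertools import groupby


def reduce_ids(ids):
    """Collapse each run of consecutive identical ids into one id."""
    return [key for key, _run in groupby(ids)]


frame_level_ids = [7, 7, 7, 2, 2, 7, 5, 5]
print(reduce_ids(frame_level_ids))  # [7, 2, 7, 5]
```

Only adjacent repeats are merged, so an id that reappears later in the sequence (7 in this example) is kept as a separate target.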

An example of a loss_list consisting of contrastive+ctc loss can look like this:

```yaml
decoder_out: 128

loss_list:
  contr:
    decoder:
      _target_: nemo.collections.asr.modules.ConvASRDecoderReconstruction
      feat_in: ${model.encoder.d_model}
      feat_hidden: 128
      feat_out: ${model.decoder_out}
      stride_layers: 0
      non_stride_layers: 0
    loss:
      _target_: nemo.collections.asr.losses.ContrastiveLoss
      in_dim: ${model.preprocessor.features}
      proj_dim: ${model.decoder_out}
      combine_time_steps: 4
      quantized_targets: true
      codebook_size: 300
      num_groups: 2
      sample_from_same_utterance_only: true
      sample_from_non_masked: false
      reduce_ids: true

  ctc:
    decoder:
      _target_: nemo.collections.asr.modules.ConvASRDecoder
      feat_in: ${model.encoder.d_model}
      num_classes: 90000
    loss:
      _target_: nemo.collections.asr.losses.CTCLossForSSL
      num_classes: 90000
    targets_from_loss: "contr"
    start_step: 3000
```

An example of contrastive+rnnt can look like this:

```yaml
decoder_out: 128

loss_list:
  contr:
    decoder:
      _target_: nemo.collections.asr.modules.ConvASRDecoderReconstruction
      feat_in: ${model.encoder.d_model}
      feat_hidden: 128
      feat_out: ${model.decoder_out}
      stride_layers: 0
      non_stride_layers: 0
    loss:
      _target_: nemo.collections.asr.losses.ContrastiveLoss
      in_dim: ${model.preprocessor.features}
      proj_dim: ${model.decoder_out}
      combine_time_steps: 4
      quantized_targets: true
      codebook_size: 24
      sample_from_same_utterance_only: true
      sample_from_non_masked: false
      reduce_ids: true

  rnnt:
    decoder:
      _target_: nemo.collections.asr.modules.RNNTDecoderJointSSL
      decoder:
        _target_: nemo.collections.asr.modules.RNNTDecoder
        normalization_mode: null # Currently only null is supported for export.
        random_state_sampling: false # Random state sampling: https://arxiv.org/pdf/1910.11455.pdf
        blank_as_pad: true # This flag must be set in order to support exporting of RNNT models + efficient inference.
        vocab_size: 576
        prednet:
          pred_hidden: 640
          pred_rnn_layers: 1
          t_max: null
          dropout: 0.1
      joint:
        _target_: nemo.collections.asr.modules.RNNTJoint
        log_softmax: null # 'null' would set it automatically according to CPU/GPU device
        preserve_memory: false # dramatically slows down training, but might preserve some memory
        experimental_fuse_loss_wer: false
        jointnet:
          encoder_hidden: 512
          pred_hidden: 640
          joint_hidden: 640
          activation: "relu"
          dropout: 0.1
        num_classes: 576
    loss:
      _target_: nemo.collections.asr.losses.RNNTLossForSSL
      num_classes: 576
    targets_from_loss: "contr"
    start_step: 1000
```

We can also use multiple losses that take features from different intermediate layers of the encoder as input [SSL-MODELS3]. In the following config example, we use a contrastive loss plus three different mlm losses, which use encoder outputs from the 6th, 12th, and final layers, respectively. Note that the layer names are zero-indexed, so the 6th layer is "layers.5" and the 12th is "layers.11".

```yaml
decoder_out: 128

loss_list:
  contr:
    decoder:
      _target_: nemo.collections.asr.modules.ConvASRDecoderReconstruction
      feat_in: ${model.encoder.d_model}
      feat_hidden: 128
      feat_out: ${model.decoder_out}
      stride_layers: 0
      non_stride_layers: 0
    loss:
      _target_: nemo.collections.asr.losses.ContrastiveLoss
      in_dim: ${model.preprocessor.features}
      proj_dim: ${model.decoder_out}
      combine_time_steps: 4
      quantized_targets: true
      codebook_size: 300
      sample_from_same_utterance_only: true
      sample_from_non_masked: false
    loss_alpha: 5.

  mlm:
    decoder:
      _target_: nemo.collections.asr.modules.ConvASRDecoder
      feat_in: ${model.encoder.d_model}
      num_classes: 90000
    loss:
      _target_: nemo.collections.asr.losses.MLMLoss
      combine_time_steps: 4
    targets_from_loss: "contr"
    loss_alpha: 1000.

  mlm_2:
    decoder:
      _target_: nemo.collections.asr.modules.ConvASRDecoder
      feat_in: ${model.encoder.d_model}
      num_classes: 90000
    loss:
      _target_: nemo.collections.asr.losses.MLMLoss
      combine_time_steps: 4
    targets_from_loss: "contr"
    loss_alpha: 300.
    output_from_layer: "layers.5"
    transpose_encoded: true

  mlm_3:
    decoder:
      _target_: nemo.collections.asr.modules.ConvASRDecoder
      feat_in: ${model.encoder.d_model}
      num_classes: 90000
    loss:
      _target_: nemo.collections.asr.losses.MLMLoss
      combine_time_steps: 4
    targets_from_loss: "contr"
    loss_alpha: 300.
    output_from_layer: "layers.11"
    transpose_encoded: true
```

[SSL-MODELS1] Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. Wav2vec 2.0: a framework for self-supervised learning of speech representations. 2020. URL: https://arxiv.org/abs/2006.11477, doi:10.48550/ARXIV.2006.11477.

[SSL-MODELS2] Yu-An Chung, Yu Zhang, Wei Han, Chung-Cheng Chiu, James Qin, Ruoming Pang, and Yonghui Wu. W2v-bert: combining contrastive learning and masked language modeling for self-supervised speech pre-training. 2021. URL: https://arxiv.org/abs/2108.06209, doi:10.48550/ARXIV.2108.06209.

[SSL-MODELS3] Chengyi Wang, Yu Wu, Sanyuan Chen, Shujie Liu, Jinyu Li, Yao Qian, and Zhenglu Yang. Self-supervised learning for speech recognition with intermediate layer supervision. 2021. URL: https://arxiv.org/abs/2112.08778, doi:10.48550/ARXIV.2112.08778.

© Copyright 2023-2024, NVIDIA. Last updated on Apr 22, 2024.