Date: 2023-08-02

The spec file for ASR using Conformer includes the trainer, save_to, model, training_ds, validation_ds, and optim parameters. The following is a shortened example of a spec file for training on the Mozilla Common Voice English dataset.

trainer:
  max_epochs: 100

tlt_checkpoint_interval: 1

# Name of the .tlt file where the trained Conformer model will be saved
save_to: trained-model.tlt

# Specifies parameters for the ASR model
model:
  log_prediction: true # enables logging sample predictions in the output during training
  ctc_reduction: 'mean_batch'

  # Parameters for sub-word tokenization
  tokenizer:
    dir: ???
    type: "bpe" # Can be either bpe or wpe

  # Parameters for the audio to spectrogram preprocessor.
  preprocessor:
    _target_: nemo.collections.asr.modules.AudioToMelSpectrogramPreprocessor
    sample_rate: 16000
    normalize: per_feature
    window_size: 0.025
    window_stride: 0.01
    window: hann
    features: 80
    n_fft: 512
    frame_splicing: 1
    dither: 1.0e-05
    pad_to: 0
    pad_value: 0.0

  # This adds spectrogram augmentation to the training process.
  spec_augment:
    _target_: nemo.collections.asr.modules.SpectrogramAugmentation
    freq_masks: 2 # set to zero to disable it
    # you may use lower time_masks for smaller models to have a faster convergence
    time_masks: 5 # set to zero to disable it
    freq_width: 27
    time_width: 0.05

  # The encoder and decoder sections specify your model architecture.
  encoder:
    _target_: nemo.collections.asr.modules.ConformerEncoder
    feat_in: 80
    feat_out: -1 # you may set it if you need different output size other than the default d_model
    n_layers: 16
    d_model: 176

    # Sub-sampling params
    subsampling: striding # vggnet or striding, vggnet may give better results but needs more memory
    subsampling_factor: 4 # must be power of 2
    subsampling_conv_channels: -1 # -1 sets it to d_model

    # Feed forward module's params
    ff_expansion_factor: 4

    # Multi-headed Attention Module's params
    self_attention_model: rel_pos # rel_pos or abs_pos
    n_heads: 4 # may need to be lower for smaller d_models
    # [left, right] specifies the number of steps to be seen from left and right of each step in self-attention
    xscaling: true # scales up the input embeddings by sqrt(d_model)
    untie_biases: true # unties the biases of the TransformerXL layers
    pos_emb_max_len: 5000

    # Convolution module's params
    conv_kernel_size: 31
    conv_norm_type: 'batch_norm' # batch_norm or layer_norm

    ### regularization
    dropout: 0.1 # The dropout used in most of the Conformer Modules
    dropout_emb: 0.0 # The dropout used for embeddings
    dropout_att: 0.1 # The dropout for multi-headed attention modules

  decoder:
    _target_: nemo.collections.asr.modules.ConvASRDecoder
    feat_in: null
    num_classes: -1 # filled with vocabulary size from tokenizer at runtime
    vocabulary: [] # filled with vocabulary from tokenizer at runtime

# This section specifies the dataset to be used for training.
training_ds:
  # No need to specify an audio file path, since the manifest entries contain individual
  # utterances' file paths.
  manifest_filepath: /data/cv-corpus-5.1-2020-06-22/en/train.json
  sample_rate: 16000
  batch_size: 32
  shuffle: true
  use_start_end_token: false
  trim_silence: false
  # Setting a max duration trims out files that are longer than the max.
  max_duration: 16.7 # it is set for LibriSpeech, you may need to update it for your dataset
  min_duration: 0.1
  # The is_tarred and tarred_audio_filepaths parameters should be specified if using a tarred dataset.
  is_tarred: false
  tarred_audio_filepaths: null
  # bucketing params
  bucketing_strategy: "synced_randomized"
  bucketing_batch_size: null

# Specifies the dataset to be used for validation.
validation_ds:
  manifest_filepath: /data/cv-corpus-5.1-2020-06-22/en/dev.json
  sample_rate: 16000
  batch_size: 32
  shuffle: false
  num_workers: 8
  pin_memory: true
  use_start_end_token: false

# The parameters for the training optimizer, including learning rate, lr schedule, etc.
optim:
  name: adamw
  lr: 5.0
  # optimizer arguments
  betas: [0.9, 0.98]
  # less necessity for weight_decay as we already have large augmentations with SpecAug
  # you may need weight_decay for large models, stable AMP training, small datasets, or when lower augmentations are used
  # weight decay of 0.0 with lr of 2.0 also works fine
  weight_decay: 1e-3

  # scheduler setup
  sched:
    name: NoamAnnealing
    d_model: ${model.encoder.d_model}
    # scheduler config override
    warmup_steps: 10000
    warmup_ratio: null
    min_lr: 1e-6

The specification can be grouped into roughly three categories:

Parameters that describe the training process

Parameters that describe the datasets

Parameters that describe the model

This specification can be used with the tao speech_to_text_conformer train command. Only a dataset parameter is required for tao speech_to_text_conformer evaluate, though a checkpoint must be provided.

If you would like to change a parameter for your run without changing the specification file itself, you can specify it on the command line directly. For example, to change the validation batch size, add validation_ds.batch_size=1 to your command, which will override the batch size of 32 in the configuration shown above. An example of this is shown in the training instructions below.
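The effect of a command-line override on the loaded specification can be pictured with a short Python sketch. This is only an illustration of the merge semantics described above, not the code that tao runs: the helper name apply_override is hypothetical, and the spec is represented as a plain nested dictionary.

```python
import ast


def apply_override(config, override):
    """Apply one 'dotted.key=value' override to a nested dict in place.

    Hypothetical helper for illustration; it is not part of the tao CLI.
    """
    dotted_key, _, raw_value = override.partition("=")
    *parents, leaf = dotted_key.split(".")
    node = config
    for name in parents:
        # Walk down the nested sections, creating them if they are missing.
        node = node.setdefault(name, {})
    try:
        # Interpret numbers, booleans, and lists written as Python literals.
        value = ast.literal_eval(raw_value)
    except (ValueError, SyntaxError):
        value = raw_value  # Anything else is kept as a plain string.
    node[leaf] = value
    return config


# A fragment of the spec shown above, as it would look after loading.
spec = {"validation_ds": {"batch_size": 32, "shuffle": False}}
apply_override(spec, "validation_ds.batch_size=1")
print(spec["validation_ds"])
```

Only the named key changes; every other value in the spec file is left as written.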

A few parameters configure the training run itself. They are detailed in the following table.

| Parameter | Datatype | Description | Supported Values |
| --- | --- | --- | --- |
| max_epochs | int | Specifies the maximum number of epochs to train the model. A field for the trainer parameter. | >0 |
| save_to | string | The location to save the trained model checkpoint. This should be in the form path/to/target/location/modelname.tlt. | A valid path |
| optim | | Specifies the optimizer to be used for training, as well as the parameters to configure it. name (string): The optimizer to use. lr (float): The learning rate. This parameter must be specified. sched: The learning rate schedule, if desired. If your chosen optimizer takes additional arguments, they can be placed after lr, as shown in the example above with betas and weight_decay. | |
| tlt_checkpoint_interval | int | The interval (number of epochs) at which to save the .tlt checkpoint during training. | >=0 (0 means no checkpoint) |

There is also an early_stopping config, which enables early stopping during training. It has the following parameters.

| Parameter | Datatype | Description | Supported Values |
| --- | --- | --- | --- |
| monitor | string | The metric to monitor in order to enable early stopping. | val_loss or val_wer |
| patience | int | The number of checks of the monitored value without improvement before training is stopped. | Positive integer |
| min_delta | float | The minimum decrease in the monitored value that counts as an improvement. A smaller change is treated as not decreasing. | Positive float |
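The interaction between patience and min_delta can be sketched in a few lines of Python. This is a minimal illustration that assumes the monitored metric should decrease (as with val_loss or val_wer) and that a check counts as an improvement only when the value drops by more than min_delta. The function name is hypothetical, and the actual callback used by the toolkit may differ in detail.

```python
def early_stop_check(values, patience, min_delta):
    """Return the index of the check at which training would stop, or None.

    Hypothetical helper: `values` is the monitored metric at each check.
    """
    best = float("inf")
    checks_without_improvement = 0
    for index, value in enumerate(values):
        if best - value > min_delta:
            # The metric dropped by more than min_delta: reset the counter.
            best = value
            checks_without_improvement = 0
        else:
            checks_without_improvement += 1
            if checks_without_improvement >= patience:
                return index
    return None


val_loss = [1.00, 0.80, 0.79, 0.795, 0.792, 0.70]
stop_index = early_stop_check(val_loss, patience=3, min_delta=0.05)
print(stop_index)
```

In this example the three checks after 0.80 each improve by less than 0.05, so training stops before the later drop to 0.70 is ever reached. A larger patience would let the run continue.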

The datasets that you use should be specified by <xyz>_ds parameters, depending on your use case:

For training using tao speech_to_text_conformer train, use training_ds to describe the training dataset and validation_ds to describe the validation dataset.

For evaluation using tao speech_to_text_conformer evaluate, use test_ds to describe the test dataset.

For fine-tuning using tao speech_to_text_conformer finetune, use finetuning_ds to describe the fine-tuning training dataset and validation_ds to describe the validation dataset.

The fields for each dataset config are described in the following table.

| Parameter | Datatype | Description | Supported Values |
| --- | --- | --- | --- |
| manifest_filepath | string | The filepath to the manifest (.json file) that describes the audio data. | A valid filepath |
| sample_rate | int | The target sample rate (in Hz) at which to load the audio. | |
| batch_size | int | The batch size. This may depend on memory size and how long the audio samples are. | >0 |
| trim_silence | bool | Specifies whether or not to trim silence from the beginning and end of each audio signal. The default value is False. | True/False |
| min_duration | float | All files with a duration less than the given value (in seconds) will be dropped. The default value is 0.1. | |
| max_duration | float | All files with a duration greater than the given value (in seconds) will be dropped. | |
| shuffle | bool | Specifies whether or not to shuffle the data. We recommend using True for training data and False for validation. | True/False |
| use_start_end_token | bool | Specifies whether or not to add [BOS] and [EOS] tokens to the beginning and end of speech, respectively. | True/False |
| is_tarred | bool | Specifies whether the audio files in the dataset are contained in a tarfile (.tar). If so, you must also set tarred_audio_filepaths and, if you would like the data to be shuffled, set shuffle_n. The default value is False. | True/False |
| tarred_audio_filepaths | string | The path to the tarfile (.tar) that contains the audio samples associated with the entries in manifest_filepath. Only set this parameter if is_tarred is set to True. | A valid filepath |
| shuffle_n | int | The number of audio samples to load at once from the tarfile for shuffling. For example, if set to 100 when batch size is 25, the data loader will load the next 100 samples in the tarfile, shuffle them, and use the shuffled order for the next four batches. | |
| bucketing_strategy | string | Enables bucketing during training if specified. Only set this parameter if is_tarred is set to True. Bucketing reduces the amount of padding in each batch and speeds up training significantly without a significant loss of accuracy. fixed_order: The same order of buckets is used for all epochs. synced_randomized (default): The order of the buckets is shuffled at every epoch. fully_randomized: Similar to synced_randomized, but each GPU has its own random order, so the GPUs are not synced. | fixed_order, synced_randomized, fully_randomized |
| bucketing_batch_size | int | The number of audio samples in each bucket. Only set this parameter if is_tarred is set to True. When bucketing_batch_size is set, training_ds.batch_size needs to be set to 1. bucketing_batch_size can be set as an integer or a list of integers to explicitly specify the batch size for each bucket. If bucketing_batch_size is an integer, linear scaling is used to scale up the batch sizes for buckets with shorter audio. For example, setting training_ds.bucketing_batch_size=8 for 4 buckets would use sizes [32,24,16,8] for the different buckets. | |
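The linear scaling of bucketing_batch_size can be reproduced with a small Python sketch. The formula below is inferred from the single worked example in the table (8 with 4 buckets gives [32,24,16,8]) and is an assumption, not the toolkit's code. The function name is hypothetical.

```python
def bucket_batch_sizes(bucketing_batch_size, num_buckets):
    """Return one batch size per bucket, ordered from shortest to longest audio.

    Hypothetical helper. An integer is scaled linearly so that buckets with
    shorter audio get larger batches; a list is used as given.
    """
    if isinstance(bucketing_batch_size, list):
        if len(bucketing_batch_size) != num_buckets:
            raise ValueError("need exactly one batch size per bucket")
        return list(bucketing_batch_size)
    return [bucketing_batch_size * (num_buckets - i) for i in range(num_buckets)]


print(bucket_batch_sizes(8, 4))               # integer: linear scaling
print(bucket_batch_sizes([40, 20, 10, 5], 4))  # list: explicit sizes
```

The bucket holding the longest audio keeps the base size of 8, which is why that value should be chosen to fit the longest utterances in memory.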

The Conformer model architecture and configuration are set under the model parameter, which covers the following:

Logging

Parameters for tokenizer, which defines the tokenizer type and path for sub-word tokenization

Parameters for the audio preprocessor, which determines how audio signals are transformed to spectrograms

Spectrogram augmentation, which adds a data augmentation step to the pipeline

The encoder of the model

The decoder of the model

| Parameter | Datatype | Description | Supported Values |
| --- | --- | --- | --- |
| log_prediction | bool | Whether a random sample should be printed in the output at each step, along with its predicted transcript. | True/False |
| ctc_reduction | string | The reduction type of CTC loss. The default setting is mean_batch, which takes the average over the batch after taking the average over the length of each sample. | |

The tokenizer parameters are as follows:

| Parameter | Datatype | Description | Supported Values |
| --- | --- | --- | --- |
| dir | string | The root path to the tokenizer model. This path is typically created by the create_tokenizer command. | A valid path |
| type | string | The tokenizer type, which can be either "bpe" or "wpe". | bpe, wpe |

The preprocessor parameters are as follows:

| Parameter | Datatype | Description | Supported Values |
| --- | --- | --- | --- |
| normalize | string | The normalization process for each spectrogram. Defaults to per_feature. | per_feature: Normalizes each spectrogram per channel/frequency. all_features: Normalizes over the entire spectrogram to be mean 0 with std 1. Any other value: Disables normalization. |
| sample_rate | int | The sample rate of the input audio data in Hz. This should match the sample rates of your datasets. The default value is 16000. | |
| window_size | float | The window size for FFT in seconds. The default value is 0.02. | |
| window_stride | float | The window stride for FFT in seconds. The default value is 0.01. | |
| window | string | The windowing function for FFT. The default value is hann. | hann, hamming, blackman, bartlett |
| features | int | The number of mel spectrogram frequency bins to output. The default value is 64. | |
| n_fft | int | The length of the FFT window. | |
| frame_splicing | int | The number of frames to stack across the feature dimension. Setting this to 1 disables frame splicing. The default value is 1. | |
| dither | float | The amount of white-noise dithering. The default value is 1e-5. | |
| pad_to | int | Ensures that the output size of the time dimension is a multiple of pad_to. The default value is 16. | |
| pad_value | float | The value that shorter mels are padded with. The default value is 0. | |
| stft_conv | bool | If set to True, uses pytorch_stft and convolutions. If set to False, uses torch.stft. The default setting is False. | True/False |
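Because window_size and window_stride are given in seconds while n_fft is given in samples, it can help to convert them to the same unit. The Python sketch below does this for the values in the example spec. It assumes a centered STFT (one frame per hop, plus one) and rounds to the nearest sample. The function names are hypothetical and the preprocessor itself may round differently.

```python
def stft_frame_params(sample_rate, window_size, window_stride, n_fft):
    """Convert window settings from seconds to samples (hypothetical helper)."""
    win_length = round(window_size * sample_rate)
    hop_length = round(window_stride * sample_rate)
    if n_fft < win_length:
        # The FFT must be at least as long as the window it transforms.
        raise ValueError("n_fft must be at least the window length in samples")
    return win_length, hop_length


def num_frames(num_samples, hop_length):
    """Number of spectrogram frames, assuming a centered STFT."""
    return num_samples // hop_length + 1


# Values from the example spec: 16 kHz audio, 25 ms window, 10 ms stride.
win_length, hop_length = stft_frame_params(16000, 0.025, 0.01, n_fft=512)
frames_per_second = num_frames(16000, hop_length)
print(win_length, hop_length, frames_per_second)
```

With these settings each window spans 400 samples, which fits inside the 512-point FFT, and one second of audio produces about 100 frames of 80 mel features each.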

If you wish to add spectrogram augmentation to your model, include a spec_augment block. Within this block, you can specify parameters for time and frequency cuts for augmentation, as described by SpecAugment and Cutout.

| Parameter | Datatype | Description | Supported Values |
| --- | --- | --- | --- |
| rect_masks | int | The number of rectangular masks to cut (Cutout). The default value is 5. | |
| rect_freq | int | The maximum size of cut rectangles along the frequency dimension. This parameter should only be set if rect_masks is set. The default value is 5. | |
| rect_time | int | The maximum size of cut rectangles along the time dimension. This parameter should only be set if rect_masks is set. The default value is 25. | |
| freq_masks | int | The number of frequency segments to cut (SpecAugment). The default value is 0. | |
| freq_width | int | The maximum number of frequencies to cut in one segment. This parameter should only be set if freq_masks is set. The default value is 10. | |
| time_masks | int | The number of time segments to cut (SpecAugment). The default value is 0. | |
| time_width | int | The maximum number of time steps to cut in one segment. This parameter should only be set if time_masks is set. The default value is 10. | |
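To make the roles of freq_masks, freq_width, time_masks, and time_width concrete, the following Python sketch applies SpecAugment-style masking to a spectrogram stored as a list of rows. It is a simplified illustration under stated assumptions: widths are integer counts of bins or steps, each mask width is drawn uniformly up to the configured maximum, and masked cells are set to zero. It is not the SpectrogramAugmentation module referenced in the spec.

```python
import random


def spec_augment(spec, freq_masks, freq_width, time_masks, time_width, rng):
    """Return a masked copy of `spec`, indexed as spec[frequency][time].

    Hypothetical helper for illustration only.
    """
    n_freq, n_time = len(spec), len(spec[0])
    out = [row[:] for row in spec]  # Leave the input untouched.
    for _ in range(freq_masks):
        width = rng.randint(0, min(freq_width, n_freq))
        start = rng.randint(0, n_freq - width)
        for f in range(start, start + width):
            out[f] = [0.0] * n_time  # Cut a band of frequency bins.
    for _ in range(time_masks):
        width = rng.randint(0, min(time_width, n_time))
        start = rng.randint(0, n_time - width)
        for row in out:
            row[start:start + width] = [0.0] * width  # Cut a span of time steps.
    return out


# 80 mel bins by 100 frames, filled with ones so the masks are easy to see.
spectrogram = [[1.0] * 100 for _ in range(80)]
masked = spec_augment(spectrogram, freq_masks=2, freq_width=27,
                      time_masks=5, time_width=10, rng=random.Random(0))
zeroed = sum(value == 0.0 for row in masked for value in row)
print(zeroed, "of", 80 * 100, "cells masked")
```

Setting freq_masks or time_masks to zero skips the corresponding loop, which matches the "set to zero to disable it" comments in the example spec.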

The encoder parameters describe the Conformer encoder architecture, including the number of layers, the hidden size, and the subsampling, attention, convolution, and dropout settings.

The encoder parameters are detailed in the following table.

| Parameter | Datatype | Description | Supported Values |
| --- | --- | --- | --- |
| feat_in | int | The number of input features. This value should be equal to features in the preprocessor parameters. | |
| n_layers | int | The number of layers of ConformerBlock. | |
| d_model | int | The hidden size of the model. | |
| feat_out | int | The size of the output features. The default value is -1, which sets it to d_model. | |
| subsampling | string | The method of subsampling. The default value is striding. | vggnet/striding |
| subsampling_factor | int | The subsampling factor, which should be a power of 2. The default value is 4. | |
| subsampling_conv_channels | int | The size of the convolutions in the subsampling module. The default value is -1, which sets it to d_model. | |
| ff_expansion_factor | int | The expansion factor in feed forward layers. The default value is 4. | |
| self_attention_model | string | Type of the attention layer and positional encoding. The default setting is rel_pos. rel_pos: Relative positional embedding and Transformer-XL. abs_pos: Absolute positional embedding and Transformer. | rel_pos/abs_pos |
| n_heads | int | The number of heads in multi-headed attention layers. The default value is 4. | |
| xscaling | bool | If True, scales the inputs to the multi-headed attention layers by sqrt(d_model). The default setting is True. | True/False |
| untie_biases | bool | If True, unties the bias weights of the Transformer-XL layers, so that they are not shared between layers. The default setting is True. | True/False |
| pos_emb_max_len | int | The maximum length of positional embeddings. The default value is 5000. | |
| conv_kernel_size | int | The size of the convolutions in the convolutional modules. The default value is 31. | |
| conv_norm_type | string | The type of the normalization in the convolutional modules. The default value is 'batch_norm'. | batch_norm/layer_norm |
| dropout | float | The dropout rate used in all layers except the attention layers. The default value is 0.1. | |
| dropout_emb | float | The dropout rate used for the positional embeddings. The default value is 0.1. | |
| dropout_att | float | The dropout rate used for the attention layer. The default value is 0.0. | |
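Several of the encoder settings constrain one another, and a quick calculation shows how the values in the example spec fit together. The Python sketch below is a hypothetical sanity check, not part of the toolkit. It assumes the usual multi-head attention requirement that d_model is divisible by n_heads, and it treats the encoder's output frame spacing as the preprocessor's window_stride multiplied by subsampling_factor.

```python
import math


def encoder_summary(d_model, n_heads, subsampling_factor, window_stride):
    """Derive a few quantities from the encoder settings (hypothetical helper)."""
    if d_model % n_heads != 0:
        raise ValueError("d_model should be divisible by n_heads")
    is_power_of_two = (subsampling_factor >= 1
                       and subsampling_factor & (subsampling_factor - 1) == 0)
    if not is_power_of_two:
        raise ValueError("subsampling_factor should be a power of 2")
    return {
        "head_dim": d_model // n_heads,         # width of each attention head
        "xscaling_factor": math.sqrt(d_model),  # multiplier applied when xscaling is True
        "output_stride_s": window_stride * subsampling_factor,
    }


# Values from the example spec above.
summary = encoder_summary(d_model=176, n_heads=4,
                          subsampling_factor=4, window_stride=0.01)
print(summary)
```

For the example spec this gives 44 dimensions per head and one encoder output roughly every 40 ms of audio, which is why smaller values of d_model may call for fewer heads.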

The decoder parameters are detailed in the following table.