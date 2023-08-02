The spec file for ASR using CitriNet includes the trainer , save_to , model , training_ds , validation_ds , and optim parameters. The following is a shortened example of a spec file for training on the Mozilla Common Voice English dataset.

    trainer:
      max_epochs: 100

    tlt_checkpoint_interval: 1

    # Name of the .tlt file where the trained CitriNet model will be saved
    save_to: trained-model.tlt

    # Specifies parameters for the ASR model
    model:
      # Parameters for sub-word tokenization
      tokenizer:
        dir: ???
        type: "bpe"  # Can be either bpe or wpe

      # Parameters for the audio to spectrogram preprocessor.
      preprocessor:
        _target_: nemo.collections.asr.modules.AudioToMelSpectrogramPreprocessor
        normalize: per_feature
        sample_rate: 16000
        window_size: 0.02
        window_stride: 0.01
        window: hann
        features: 64
        n_fft: 512
        frame_splicing: 1
        dither: 1.0e-05
        stft_conv: false

      # This adds spectrogram augmentation to the training process.
      spec_augment:
        _target_: nemo.collections.asr.modules.SpectrogramAugmentation
        rect_masks: 5
        rect_freq: 50
        rect_time: 120

      # The encoder and decoder sections specify your model architecture.
      encoder:
        _target_: nemo.collections.asr.modules.ConvASREncoder
        feat_in: 64
        activation: relu
        conv_mask: true

        # Several blocks were cut out here for brevity.
        jasper:
          - filters: 128
            repeat: 1
            kernel: [11]
            stride: [1]
            dilation: [1]
            dropout: 0.0
            residual: true
            separable: true
            se: true
            se_context_size: -1

          #... (Add more blocks to describe the model)

          - filters: &enc_feat_out 1024
            repeat: 1
            kernel: [1]
            stride: [1]
            dilation: [1]
            dropout: 0.0
            residual: false
            separable: true
            se: true
            se_context_size: -1

      decoder:
        _target_: nemo.collections.asr.modules.ConvASRDecoder
        feat_in: 1024
        num_classes: -1  # filled with vocabulary size from tokenizer at runtime
        vocabulary: []  # filled with vocabulary from tokenizer at runtime

    # This section specifies the dataset to be used for training.
    training_ds:
      # No need to specify an audio file path, since the manifest entries contain individual
      # utterances' file paths.
      manifest_filepath: /data/cv-corpus-5.1-2020-06-22/en/train.json
      sample_rate: 16000
      batch_size: 32
      trim_silence: true
      # Setting a max duration trims out files that are longer than the max.
      max_duration: 16.7
      shuffle: true
      # The is_tarred and tarred_audio_filepaths parameters should be specified if using a tarred dataset.
      is_tarred: false
      tarred_audio_filepaths: null

    # Specifies the dataset to be used for validation.
    validation_ds:
      manifest_filepath: /data/cv-corpus-5.1-2020-06-22/en/dev.json
      sample_rate: 16000
      batch_size: 32
      shuffle: false

    # The parameters for the training optimizer, including learning rate, lr schedule, etc.
    optim:
      name: adam
      lr: .1

      # optimizer arguments
      betas: [0.9, 0.999]
      weight_decay: 0.0001

      # scheduler setup
      sched:
        name: CosineAnnealing

        # scheduler config override
        warmup_steps: null
        warmup_ratio: 0.05
        min_lr: 1e-6
        last_epoch: -1

The specification can be roughly grouped into three categories:

Parameters that describe the training process

Parameters that describe the datasets, and

Parameters that describe the model.

This specification can be used with the tao speech_to_text_citrinet train command. Only a dataset parameter is required for tao speech_to_text_citrinet evaluate, though a checkpoint must be provided.

If you would like to change a parameter for your run without changing the specification file itself, you can specify it on the command line directly. For example, if you would like to change the validation batch size, you can add validation_ds.batch_size=1 to your command, which would override the batch size of 32 in the configuration shown above. An example of this is shown in the training instructions below.
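The effect of such an override can be illustrated with a small sketch. This is not the TAO implementation; `apply_override` and `parse_scalar` are hypothetical helpers that only show how a dotted key such as validation_ds.batch_size=1 maps onto the nested structure of the spec file.

```python
import copy


def parse_scalar(text):
    """Convert a command-line string to int, float, or bool when possible."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    if text.lower() in ("true", "false"):
        return text.lower() == "true"
    return text


def apply_override(config, override):
    """Return a copy of config with one 'dotted.key=value' override applied."""
    dotted_key, raw_value = override.split("=", 1)
    keys = dotted_key.split(".")
    result = copy.deepcopy(config)
    node = result
    # Walk down to the parent of the final key, creating levels if needed.
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = parse_scalar(raw_value)
    return result


# A fragment of the spec shown above, as a nested dictionary.
spec = {"validation_ds": {"batch_size": 32, "shuffle": False}}
updated = apply_override(spec, "validation_ds.batch_size=1")
print(updated["validation_ds"]["batch_size"])
```

The spec on disk stays unchanged; only the configuration used for that run differs.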

A few parameters configure the training run itself. They are detailed in the following table.

| Parameter | Datatype | Description | Supported Values |
|---|---|---|---|
| max_epochs | int | Specifies the maximum number of epochs to train the model. A field for the trainer parameter. | >0 |
| save_to | string | The location to save the trained model checkpoint. Should be in the form path/to/target/location/modelname.tlt. | Valid paths |
| optim | | Specifies the optimizer to be used for training, as well as the parameters to configure it, including: name (string): which optimizer to use; lr (float): the learning rate, which must be specified; sched: the learning rate schedule, if desired. If your chosen optimizer takes additional arguments, they can be listed after lr, as seen in the example above with betas and weight_decay. | |
| tlt_checkpoint_interval | int | The interval (number of epochs) to save the .tlt checkpoint during training. | >=0 (0 means no checkpoint) |

There is also a config named early_stopping to enable early stopping during training. It has the following parameters.

| Parameter | Datatype | Description | Supported Values |
|---|---|---|---|
| monitor | string | The metric to monitor in order to enable early stopping. | val_loss or val_wer |
| patience | int | The number of checks of the monitored value without improvement before training is stopped. | Positive integers |
| min_delta | float | The minimum decrease in the monitored value that counts as an improvement. Smaller changes are regarded as not decreasing. | Positive floats |
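How patience and min_delta interact can be sketched in a few lines. This is an illustration of the general early-stopping rule under the assumption that the monitored metric is better when lower (true for both val_loss and val_wer); `first_stopping_check` is a hypothetical helper, not the toolkit's own code.

```python
def first_stopping_check(history, patience, min_delta):
    """Return the index of the check at which training would stop, or None."""
    best = float("inf")
    checks_without_improvement = 0
    for index, value in enumerate(history):
        if best - value > min_delta:
            # The metric dropped by more than min_delta: count it as progress.
            best = value
            checks_without_improvement = 0
        else:
            checks_without_improvement += 1
            if checks_without_improvement >= patience:
                return index
    return None


# Validation loss recorded at successive checks (made-up numbers).
val_loss = [1.00, 0.80, 0.79, 0.795, 0.79, 0.60]
stop_at = first_stopping_check(val_loss, patience=3, min_delta=0.05)
print(stop_at)
```

With these made-up values, the three checks after 0.80 each improve by less than min_delta, so the run would stop before it ever reaches the final 0.60 entry. A larger patience trades longer training for tolerance of such plateaus.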

The datasets that you use should be specified by <xyz>_ds parameters, depending on your use case:

For training using tao speech_to_text_citrinet train, you should have training_ds to describe your training dataset, and validation_ds to describe your validation dataset.

For evaluation using tao speech_to_text_citrinet evaluate, you should have test_ds to describe your test dataset.

For fine-tuning using tao speech_to_text_citrinet finetune, you should have finetuning_ds to describe the fine-tuning training dataset, and validation_ds to describe your validation dataset.
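The mapping from subcommand to dataset sections can be written down as a small pre-flight check. This is a sketch only; `REQUIRED_DATASETS` and `missing_dataset_sections` are hypothetical names, and the spec is assumed to have been loaded into a dictionary already.

```python
# Dataset sections each subcommand expects, as listed above.
REQUIRED_DATASETS = {
    "train": ("training_ds", "validation_ds"),
    "evaluate": ("test_ds",),
    "finetune": ("finetuning_ds", "validation_ds"),
}


def missing_dataset_sections(task, spec):
    """Return the dataset sections that the given task needs but spec lacks."""
    return [name for name in REQUIRED_DATASETS[task] if name not in spec]


# A spec that only defines the training and validation datasets.
spec = {
    "training_ds": {"manifest_filepath": "/data/cv-corpus-5.1-2020-06-22/en/train.json"},
    "validation_ds": {"manifest_filepath": "/data/cv-corpus-5.1-2020-06-22/en/dev.json"},
}
print(missing_dataset_sections("train", spec))
print(missing_dataset_sections("evaluate", spec))
```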

The fields for each dataset config are described in the following table.

| Parameter | Datatype | Description | Supported Values |
|---|---|---|---|
| manifest_filepath | string | The filepath to the manifest (.json file) that describes the audio data. | Valid filepaths |
| sample_rate | int | Target sample rate to load the audio, in Hz (for example, 16000). | |
| batch_size | int | Batch size. This may depend on memory size and how long your audio samples are. | >0 |
| trim_silence | bool | Whether or not to trim silence from the beginning and end of each audio signal. Defaults to false if no value is set. | True/False |
| min_duration | float | All files with a duration less than the given value (in seconds) will be dropped. Defaults to 0.1. | |
| max_duration | float | All files with a duration greater than the given value (in seconds) will be dropped. | |
| shuffle | bool | Whether or not to shuffle the data. We recommend true for training data, and false for validation. | True/False |
| is_tarred | bool | Whether the audio files in the dataset are contained in a tarfile (.tar). If so, you must also set tarred_audio_filepaths, and set shuffle_n if you would like the data to be shuffled. Defaults to false. | True/False |
| tarred_audio_filepaths | string | Only to be set if is_tarred is set to true. Path to the tarfile (.tar) that contains the audio samples associated with the entries in manifest_filepath. | Valid filepaths |
| shuffle_n | int | Only to be set if is_tarred is set to true. The number of audio samples to load at once from the tarfile for shuffling. For example, if set to 100 when batch size is 25, the data loader will load the next 100 samples in the tarfile, shuffle them, and use the shuffled order for the next four batches. | |
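The shuffle_n behaviour described above can be sketched as a windowed shuffle. This follows the description in the table rather than the actual data loader, which may stream its buffer differently; `buffered_shuffle_batches` is a hypothetical helper.

```python
import random


def buffered_shuffle_batches(samples, shuffle_n, batch_size, seed=0):
    """Shuffle samples within consecutive windows of shuffle_n, then batch."""
    rng = random.Random(seed)
    batches = []
    for start in range(0, len(samples), shuffle_n):
        # Load the next shuffle_n samples in tarfile order and shuffle them.
        window = list(samples[start:start + shuffle_n])
        rng.shuffle(window)
        # Cut the shuffled window into batches.
        for offset in range(0, len(window), batch_size):
            batches.append(window[offset:offset + batch_size])
    return batches


# 200 sample indices standing in for the entries of a tarred dataset.
batches = buffered_shuffle_batches(list(range(200)), shuffle_n=100, batch_size=25)
print(len(batches))
```

Under this scheme, samples only mix within a window, so the first four batches contain exactly the first 100 samples in some order. A larger shuffle_n gives a more thorough shuffle at the cost of holding more samples in memory.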

Your CitriNet model architecture and configuration are set under the model parameter. This includes parameters for tokenizer, which defines the tokenizer type and path for sub-word tokenization; parameters for the audio preprocessor, which determines how audio signals are transformed to spectrograms; spectrogram augmentation, which adds a data augmentation step to the pipeline; the encoder of the model; and the decoder of the model.

The tokenizer parameters are as follows:

| Parameter | Datatype | Description | Supported Values |
|---|---|---|---|
| dir | string | Root path to the tokenizer model. This path is assumed to be created by the create_tokenizer command. | Valid path |
| type | string | Type of the tokenizer. | bpe or wpe |

The preprocessor parameters are as follows:

| Parameter | Datatype | Description | Supported Values |
|---|---|---|---|
| normalize | string | How to normalize each spectrogram. Defaults to per_feature. | per_feature: normalizes each spectrogram per channel/frequency. all_features: normalizes over the entire spectrogram to be mean 0 with std 1. Any other value: disables normalization. |
| sample_rate | int | Sample rate of the input audio data in Hz. This should match your datasets' sample rates. Defaults to 16000. | |
| window_size | float | Window size for FFT in seconds. Defaults to 0.02. | |
| window_stride | float | Window stride for FFT in seconds. Defaults to 0.01. | |
| window | string | Windowing function for FFT. Defaults to hann. | hann, hamming, blackman, bartlett |
| features | int | Number of mel spectrogram frequency bins to output. Defaults to 64. | |
| n_fft | int | Length of FFT window. | |
| frame_splicing | int | How many frames to stack across the feature dimension. Setting this to 1 disables frame splicing. Defaults to 1. | |
| dither | float | Amount of white-noise dithering. Defaults to 1e-5. | |
| stft_conv | bool | If set to true, uses pytorch_stft and convolutions. If set to false, uses torch.stft. Defaults to false. | |
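Because window_size and window_stride are given in seconds while n_fft is given in samples, it can help to convert everything to samples. The arithmetic below uses the values from the example spec; the conversion by multiplying with the sample rate is the usual convention and is assumed here rather than taken from the preprocessor source, and the frame count ignores any padding.

```python
# Values from the example spec above.
sample_rate = 16000      # Hz
window_size = 0.02       # seconds
window_stride = 0.01     # seconds
n_fft = 512              # samples
max_duration = 16.7      # seconds, from training_ds

# Convert the window and stride from seconds to samples.
win_length = int(round(window_size * sample_rate))
hop_length = int(round(window_stride * sample_rate))

# One spectrogram frame is produced per hop.
frames_per_second = sample_rate // hop_length

# Approximate frame count for the longest clip allowed by max_duration.
max_samples = int(round(max_duration * sample_rate))
approx_max_frames = max_samples // hop_length

print(win_length, hop_length, frames_per_second, approx_max_frames)
```

With these settings each window spans 320 samples, which fits inside the 512-point FFT, and a clip at the 16.7 second limit yields roughly 1670 frames of 64 features each.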

If you would like to add spectrogram augmentation to your model, then you can include a spec_augment block. Within this block, you can specify parameters for time and frequency cuts for augmentation, as described by SpecAugment and Cutout.

| Parameter | Datatype | Description | Supported Values |
|---|---|---|---|
| rect_masks | int | How many rectangular masks should be cut (Cutout). Defaults to 5. | |
| rect_freq | int | Should only be set if rect_masks was set. Maximum size of cut rectangles along the frequency dimension. Defaults to 5. | |
| rect_time | int | Should only be set if rect_masks was set. Maximum size of cut rectangles along the time dimension. Defaults to 25. | |
| freq_masks | int | How many frequency segments should be cut (SpecAugment). Defaults to 0. | |
| freq_width | int | Should only be set if freq_masks is set. Maximum number of frequencies to be cut in one segment. Defaults to 10. | |
| time_masks | int | How many time segments should be cut (SpecAugment). Defaults to 0. | |
| time_width | int | Should only be set if time_masks is set. Maximum number of time steps to be cut in one segment. Defaults to 10. | |
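To make the rectangular (Cutout) parameters concrete, the sketch below zeroes random rectangles in a toy spectrogram stored as nested lists. It is an illustration only: the real augmentation is performed by nemo.collections.asr.modules.SpectrogramAugmentation, and the way rectangle sizes and positions are sampled here is an assumption. `apply_rect_masks` is a hypothetical helper.

```python
import random


def apply_rect_masks(spectrogram, rect_masks, rect_freq, rect_time, seed=0):
    """Return a copy of a [freq][time] grid with random rectangles zeroed."""
    rng = random.Random(seed)
    n_freq = len(spectrogram)
    n_time = len(spectrogram[0])
    masked = [row[:] for row in spectrogram]
    for _ in range(rect_masks):
        # Rectangle size is bounded by rect_freq x rect_time and by the grid.
        height = rng.randint(0, min(rect_freq, n_freq))
        width = rng.randint(0, min(rect_time, n_time))
        f0 = rng.randint(0, n_freq - height)
        t0 = rng.randint(0, n_time - width)
        for f in range(f0, f0 + height):
            for t in range(t0, t0 + width):
                masked[f][t] = 0.0
    return masked


# A toy spectrogram: 64 mel bins by 200 frames, all ones.
spectrogram_in = [[1.0] * 200 for _ in range(64)]
masked = apply_rect_masks(spectrogram_in, rect_masks=5, rect_freq=50, rect_time=120)
zeroed = sum(row.count(0.0) for row in masked)
print(zeroed)
```

The freq_masks and time_masks parameters work along the same lines, except that each cut spans the full time axis or the full frequency axis respectively.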

The encoder parameters for your model include details about the CitriNet encoder architecture, including how many blocks to use, how many times to repeat each block, and convolution parameters per block.

To use CitriNet (which uses squeeze-and-excitation (SE) blocks), add separable: true, se: true, and se_context_size: -1 to all the blocks in the architecture. (Note: do not change the parameter name jasper.)

The encoder parameters are detailed in the following table.

| Parameter | Datatype | Description | Supported Values |
|---|---|---|---|
| feat_in | int | The number of input features. Should be equal to features in the preprocessor parameters. | |
| activation | string | What activation function to use in the encoder. | hardtanh, relu, selu, swish |
| conv_mask | bool | Whether to use masked convolutions in the encoder. Defaults to false. | |
| jasper | | A list of blocks that specifies your encoder architecture. Each entry in this list represents one block in the architecture and contains the parameters for that block, including convolution parameters, dropout, and the number of times the block is repeated. | |
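Two of the constraints above lend themselves to a quick consistency check before training: feat_in must equal the preprocessor's features, and every block under jasper needs the three CitriNet settings. The sketch below assumes the model section has been loaded into a dictionary; `check_encoder` is a hypothetical helper, not part of the toolkit.

```python
# Settings every block needs for CitriNet, as described above.
CITRINET_BLOCK_SETTINGS = (("separable", True), ("se", True), ("se_context_size", -1))


def check_encoder(model_cfg):
    """Return a list of human-readable problems found in the encoder config."""
    problems = []
    features = model_cfg["preprocessor"]["features"]
    encoder = model_cfg["encoder"]
    if encoder["feat_in"] != features:
        problems.append(
            f"encoder.feat_in is {encoder['feat_in']} but preprocessor.features is {features}"
        )
    for index, block in enumerate(encoder["jasper"]):
        for key, expected in CITRINET_BLOCK_SETTINGS:
            if block.get(key) != expected:
                problems.append(f"jasper[{index}].{key} should be {expected!r}")
    return problems


# A reduced model section with one deliberately incomplete block.
model_cfg = {
    "preprocessor": {"features": 64},
    "encoder": {
        "feat_in": 64,
        "jasper": [
            {"filters": 128, "separable": True, "se": True, "se_context_size": -1},
            {"filters": 1024, "separable": True, "se": True},
        ],
    },
}
for problem in check_encoder(model_cfg):
    print(problem)
```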

The decoder parameters are detailed in the following table.