.. _speech_recognition_citrinet:

Speech Recognition With CitriNet
================================

Automatic Speech Recognition (ASR) models take in audio files and predict their transcriptions. Besides Jasper and QuartzNet, we can also use CitriNet for ASR. CitriNet is a successor of QuartzNet that features sub-word tokenization and an improved backbone architecture.

Downloading Sample Spec Files
-----------------------------

Example specification files for ASR tasks can be downloaded using the following command:

.. code::

   tlt speech_to_text_citrinet download_specs -o <target_path> \
       -r <results_dir>

Required Arguments
^^^^^^^^^^^^^^^^^^

* :code:`-o`: The target path where the spec files will be stored.
* :code:`-r`: The results and output log directory.

Preparing the Dataset
---------------------

The dataset for ASR consists of a set of utterances in individual audio files (.wav) and a manifest that describes the dataset, with information about one utterance per line (.json). Each line of the manifest should be in the following format:

.. code::

   {"audio_filepath": "/path/to/audio.wav", "text": "the transcription of the utterance", "duration": 23.147}

The :code:`audio_filepath` field should provide an absolute path to the .wav file corresponding to the utterance. The :code:`text` field should contain the full transcript for the utterance, and the :code:`duration` field should reflect the duration of the utterance in seconds.

Each entry in the manifest (describing one audio file) should be bordered by '{' and '}' and must be contained on one line. The fields that describe the file should be separated by commas, and have the form :code:`"field_name": value`, as shown above.

Since the manifest specifies the path for each utterance, the audio files do not have to be located in the same directory as the manifest, or even in any specific directory structure.
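If your transcripts live alongside the audio, a manifest can be generated with a short script. The following is a minimal sketch (not part of the TLT tooling) that assumes each .wav file has a matching .txt transcript next to it; the directory layout and file names are illustrative only.

.. code::

   # Minimal manifest-generation sketch (illustrative, not part of TLT).
   # Assumes each utterance is a .wav file with a matching .txt transcript
   # next to it, e.g. /path/to/data/utt1.wav and /path/to/data/utt1.txt.
   import json
   import wave
   from pathlib import Path

   data_dir = Path("/path/to/data")  # hypothetical dataset location

   with open("train_manifest.json", "w", encoding="utf-8") as manifest:
       for wav_path in sorted(data_dir.glob("*.wav")):
           # Duration in seconds = number of frames / sample rate.
           with wave.open(str(wav_path), "rb") as wav:
               duration = wav.getnframes() / wav.getframerate()
           text = wav_path.with_suffix(".txt").read_text(encoding="utf-8").strip()
           # One JSON object per line, as the manifest format requires.
           entry = {
               "audio_filepath": str(wav_path.resolve()),
               "text": text,
               "duration": duration,
           }
           manifest.write(json.dumps(entry) + "\n")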
Creating an Experiment Spec File
--------------------------------

The spec file for ASR using CitriNet includes the :code:`trainer`, :code:`save_to`, :code:`model`, :code:`training_ds`, :code:`validation_ds`, and :code:`optim` parameters. The following is a shortened example of a spec file for training on the Mozilla Common Voice English dataset.

.. code::

   trainer:
     max_epochs: 100

   # Name of the .tlt file where the trained CitriNet model will be saved
   save_to: trained-model.tlt

   # Specifies parameters for the ASR model
   model:
     # Parameters for sub-word tokenization
     tokenizer:
       dir: ???
       type: "bpe"  # Can be either bpe or wpe

     # Parameters for the audio to spectrogram preprocessor.
     preprocessor:
       _target_: nemo.collections.asr.modules.AudioToMelSpectrogramPreprocessor
       normalize: per_feature
       sample_rate: 16000
       window_size: 0.02
       window_stride: 0.01
       window: hann
       features: 64
       n_fft: 512
       frame_splicing: 1
       dither: 1.0e-05
       stft_conv: false

     # This adds spectrogram augmentation to the training process.
     spec_augment:
       _target_: nemo.collections.asr.modules.SpectrogramAugmentation
       rect_masks: 5
       rect_freq: 50
       rect_time: 120

     # The encoder and decoder sections specify your model architecture.
     encoder:
       _target_: nemo.collections.asr.modules.ConvASREncoder
       feat_in: 64
       activation: relu
       conv_mask: true

       # Several blocks were cut out here for brevity.
       jasper:
         - filters: 128
           repeat: 1
           kernel: [11]
           stride: [1]
           dilation: [1]
           dropout: 0.0
           residual: true
           separable: true
           se: true
           se_context_size: -1

         #... (Add more blocks to describe the model)

         - filters: &enc_feat_out 1024
           repeat: 1
           kernel: [1]
           stride: [1]
           dilation: [1]
           dropout: 0.0
           residual: false
           separable: true
           se: true
           se_context_size: -1

     decoder:
       _target_: nemo.collections.asr.modules.ConvASRDecoder
       feat_in: 1024
       num_classes: -1  # filled with vocabulary size from tokenizer at runtime
       vocabulary: []   # filled with vocabulary from tokenizer at runtime

   # This section specifies the dataset to be used for training.
   training_ds:
     # No need to specify an audio file path, since the manifest entries contain
     # individual utterances' file paths.
     manifest_filepath: /data/cv-corpus-5.1-2020-06-22/en/train.json
     sample_rate: 16000
     batch_size: 32
     trim_silence: true
     # Setting a max duration trims out files that are longer than the max.
     max_duration: 16.7
     shuffle: true
     # The is_tarred and tarred_audio_filepaths parameters should be specified
     # if using a tarred dataset.
     is_tarred: false
     tarred_audio_filepaths: null

   # Specifies the dataset to be used for validation.
   validation_ds:
     manifest_filepath: /data/cv-corpus-5.1-2020-06-22/en/dev.json
     sample_rate: 16000
     batch_size: 32
     shuffle: false

   # The parameters for the training optimizer, including learning rate, lr schedule, etc.
   optim:
     name: adam
     lr: .1

     # optimizer arguments
     betas: [0.9, 0.999]
     weight_decay: 0.0001

     # scheduler setup
     sched:
       name: CosineAnnealing

       # scheduler config override
       warmup_steps: null
       warmup_ratio: 0.05
       min_lr: 1e-6
       last_epoch: -1

The specification can be roughly grouped into three categories:

* Parameters that describe the training process,
* Parameters that describe the datasets, and
* Parameters that describe the model.

This specification can be used with the :code:`tlt speech_to_text_citrinet train` command. Only a dataset parameter is required for :code:`tlt speech_to_text_citrinet evaluate`, though a checkpoint must also be provided.

If you would like to change a parameter for your run without changing the specification file itself, you can specify it on the command line directly. For example, if you would like to change the validation batch size, you can add :code:`validation_ds.batch_size=1` to your command, which would override the batch size of 32 in the configuration shown above. An example of this is shown in the training instructions below.
Training Process Configs
^^^^^^^^^^^^^^^^^^^^^^^^

There are a few parameters that help specify the parameters of your training run, which are detailed in the following table.

+-------------------------+------------------+--------------------------------------------------------------------------------------------------------------------------+------------------------------+
| **Parameter**           | **Datatype**     | **Description**                                                                                                          | **Supported Values**         |
+=========================+==================+==========================================================================================================================+==============================+
| :code:`max_epochs`      | int              | Specifies the maximum number of epochs to train the model. A field for the :code:`trainer` parameter.                   | >0                           |
+-------------------------+------------------+--------------------------------------------------------------------------------------------------------------------------+------------------------------+
| :code:`save_to`         | string           | The location to save the trained model checkpoint. Should be in the form                                                 | Valid paths.                 |
|                         |                  | :code:`path/to/target/location/modelname.tlt`.                                                                           |                              |
+-------------------------+------------------+--------------------------------------------------------------------------------------------------------------------------+------------------------------+
| :code:`optim`           |                  | Specifies the optimizer to be used for training, as well as the parameters to configure it, including:                  |                              |
|                         |                  |                                                                                                                          |                              |
|                         |                  | * :code:`name` (string): Which optimizer to use.                                                                         |                              |
|                         |                  | * :code:`lr` (float): The learning rate. Must be specified.                                                              |                              |
|                         |                  | * :code:`sched`: Specifies the learning rate schedule, if desired.                                                       |                              |
|                         |                  |                                                                                                                          |                              |
|                         |                  | If your chosen optimizer takes additional arguments, they can be placed at the same level as :code:`lr`, as seen in     |                              |
|                         |                  | the example above with :code:`betas` and :code:`weight_decay`.                                                           |                              |
+-------------------------+------------------+--------------------------------------------------------------------------------------------------------------------------+------------------------------+

Dataset Configs
^^^^^^^^^^^^^^^

The datasets that you use should be specified by :code:`*_ds` parameters, depending on your use case:

* For training using :code:`tlt speech_to_text_citrinet train`, you should have :code:`training_ds` to describe your training dataset, and :code:`validation_ds` to describe your validation dataset.
* For evaluation using :code:`tlt speech_to_text_citrinet evaluate`, you should have :code:`test_ds` to describe your test dataset.
* For fine-tuning using :code:`tlt speech_to_text_citrinet finetune`, you should have :code:`finetuning_ds` to describe the fine-tuning training dataset, and :code:`validation_ds` to describe your validation dataset.

The fields for each dataset config are described in the following table.
+---------------------------------+------------------+--------------------------------------------------------------------------------------------------------------------------+------------------------------+
| **Parameter**                   | **Datatype**     | **Description**                                                                                                          | **Supported Values**         |
+=================================+==================+==========================================================================================================================+==============================+
| :code:`manifest_filepath`       | string           | The filepath to the manifest (.json file) that describes the audio data.                                                 | Valid filepaths.             |
+---------------------------------+------------------+--------------------------------------------------------------------------------------------------------------------------+------------------------------+
| :code:`sample_rate`             | int              | Target sample rate to load the audio, in Hz.                                                                             |                              |
+---------------------------------+------------------+--------------------------------------------------------------------------------------------------------------------------+------------------------------+
| :code:`batch_size`              | int              | Batch size. This may depend on memory size and how long your audio samples are.                                          | >0                           |
+---------------------------------+------------------+--------------------------------------------------------------------------------------------------------------------------+------------------------------+
| :code:`trim_silence`            | bool             | Whether or not to trim silence from the beginning and end of each audio signal. Defaults to false if no value is set.   | True/False                   |
+---------------------------------+------------------+--------------------------------------------------------------------------------------------------------------------------+------------------------------+
| :code:`min_duration`            | float            | All files with a duration less than the given value (in seconds) will be dropped. Defaults to 0.1.                      |                              |
+---------------------------------+------------------+--------------------------------------------------------------------------------------------------------------------------+------------------------------+
| :code:`max_duration`            | float            | All files with a duration greater than the given value (in seconds) will be dropped.                                    |                              |
+---------------------------------+------------------+--------------------------------------------------------------------------------------------------------------------------+------------------------------+
| :code:`shuffle`                 | bool             | Whether or not to shuffle the data. We recommend true for training data and false for validation.                       | True/False                   |
+---------------------------------+------------------+--------------------------------------------------------------------------------------------------------------------------+------------------------------+
| :code:`is_tarred`               | bool             | Whether the audio files in the dataset are contained in a tarfile (.tar). If so, you must also set                      | True/False                   |
|                                 |                  | :code:`tarred_audio_filepaths`, and set :code:`shuffle_n` if you would like the data to be shuffled.                     |                              |
|                                 |                  | Defaults to false.                                                                                                       |                              |
+---------------------------------+------------------+--------------------------------------------------------------------------------------------------------------------------+------------------------------+
| :code:`tarred_audio_filepaths`  | string           | Only to be set if :code:`is_tarred` is set to true. Path to the tarfile (.tar) that contains the audio samples          | Valid filepaths.             |
|                                 |                  | associated with the entries in :code:`manifest_filepath`.                                                                |                              |
+---------------------------------+------------------+--------------------------------------------------------------------------------------------------------------------------+------------------------------+
| :code:`shuffle_n`               | int              | Only to be set if :code:`is_tarred` is set to true. The number of audio samples to load at once from the tarfile for    |                              |
|                                 |                  | shuffling. For example, if set to 100 when the batch size is 25, the data loader will load the next 100 samples in      |                              |
|                                 |                  | the tarfile, shuffle them, and use the shuffled order for the next four batches.                                         |                              |
+---------------------------------+------------------+--------------------------------------------------------------------------------------------------------------------------+------------------------------+
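The three tarred-dataset parameters work together: :code:`is_tarred` enables them, :code:`tarred_audio_filepaths` points at the archive, and :code:`shuffle_n` controls the shuffling buffer. The following sketch of a :code:`training_ds` block for a tarred dataset uses placeholder paths and illustrative values:

.. code::

   training_ds:
     # The manifest still describes each utterance; the audio itself lives in the tarfile.
     manifest_filepath: /data/tarred/tarred_audio_manifest.json
     sample_rate: 16000
     batch_size: 32
     shuffle: false   # for tarred data, shuffling is controlled by shuffle_n instead
     is_tarred: true
     tarred_audio_filepaths: /data/tarred/audio.tar
     shuffle_n: 100   # load 100 samples at a time and shuffle within that buffer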
Model Configs
^^^^^^^^^^^^^

Your CitriNet model architecture and configuration are set under the :code:`model` parameter. This includes parameters for the tokenizer, which defines the tokenizer type and path for sub-word tokenization; the audio preprocessor, which determines how audio signals are transformed into spectrograms; spectrogram augmentation, which adds a data augmentation step to the pipeline; the encoder of the model; and the decoder of the model.

The tokenizer parameters are as follows:

+-------------------------+------------------+--------------------------------------------------------------------------------------------------------------------------+------------------------------+
| **Parameter**           | **Datatype**     | **Description**                                                                                                          | **Supported Values**         |
+=========================+==================+==========================================================================================================================+==============================+
| :code:`dir`             | string           | Root path to the tokenizer model. This path is assumed to be created by the :code:`create_tokenizer` command.           | Valid path.                  |
+-------------------------+------------------+--------------------------------------------------------------------------------------------------------------------------+------------------------------+
| :code:`type`            | string           | Type of the tokenizer, either 'bpe' or 'wpe'.                                                                            |                              |
+-------------------------+------------------+--------------------------------------------------------------------------------------------------------------------------+------------------------------+

The preprocessor parameters are as follows:

+-------------------------+------------------+---------------------------------------------------------------------------------------------------------+----------------------------------------------------+
| **Parameter**           | **Datatype**     | **Description**                                                                                           | **Supported Values**                               |
+=========================+==================+===========================================================================================================+====================================================+
| :code:`normalize`       | string           | How to normalize each spectrogram. Defaults to :code:`per_feature`.                                      | * :code:`per_feature`: Normalizes each spectrogram |
|                         |                  |                                                                                                           |   per channel/frequency.                           |
|                         |                  |                                                                                                           | * :code:`all_features`: Normalizes over the entire |
|                         |                  |                                                                                                           |   spectrogram to be mean 0 with std 1.             |
|                         |                  |                                                                                                           | * Any other value: Disables normalization.         |
+-------------------------+------------------+---------------------------------------------------------------------------------------------------------+----------------------------------------------------+
| :code:`sample_rate`     | int              | Sample rate of the input audio data in Hz. This should match your datasets' sample rates.                |                                                    |
|                         |                  | Defaults to 16000.                                                                                        |                                                    |
+-------------------------+------------------+---------------------------------------------------------------------------------------------------------+----------------------------------------------------+
| :code:`window_size`     | float            | Window size for FFT in seconds. Defaults to 0.02.                                                         |                                                    |
+-------------------------+------------------+---------------------------------------------------------------------------------------------------------+----------------------------------------------------+
| :code:`window_stride`   | float            | Window stride for FFT in seconds. Defaults to 0.01.                                                       |                                                    |
+-------------------------+------------------+---------------------------------------------------------------------------------------------------------+----------------------------------------------------+
| :code:`window`          | string           | Windowing function for FFT. Defaults to :code:`hann`.                                                     | :code:`hann`, :code:`hamming`, :code:`blackman`,   |
|                         |                  |                                                                                                           | :code:`bartlett`                                   |
+-------------------------+------------------+---------------------------------------------------------------------------------------------------------+----------------------------------------------------+
| :code:`features`        | int              | Number of mel spectrogram frequency bins to output. Defaults to 64.                                      |                                                    |
+-------------------------+------------------+---------------------------------------------------------------------------------------------------------+----------------------------------------------------+
| :code:`n_fft`           | int              | Length of FFT window.                                                                                     |                                                    |
+-------------------------+------------------+---------------------------------------------------------------------------------------------------------+----------------------------------------------------+
| :code:`frame_splicing`  | int              | How many frames to stack across the feature dimension. Setting this to 1 disables frame splicing.        |                                                    |
|                         |                  | Defaults to 1.                                                                                            |                                                    |
+-------------------------+------------------+---------------------------------------------------------------------------------------------------------+----------------------------------------------------+
| :code:`dither`          | float            | Amount of white-noise dithering. Defaults to 1e-5.                                                        |                                                    |
+-------------------------+------------------+---------------------------------------------------------------------------------------------------------+----------------------------------------------------+
| :code:`stft_conv`       | bool             | If set to true, uses :code:`pytorch_stft` and convolutions. If set to false, uses :code:`torch.stft`.    |                                                    |
|                         |                  | Defaults to false.                                                                                        |                                                    |
+-------------------------+------------------+---------------------------------------------------------------------------------------------------------+----------------------------------------------------+

If you would like to add spectrogram augmentation to your model, you can include a :code:`spec_augment` block. Within this block, you can specify parameters for time and frequency cuts for augmentation, as described by `SpecAugment <https://arxiv.org/abs/1904.08779>`_ and `Cutout <https://arxiv.org/abs/1708.04552>`_.

+-------------------------+------------------+--------------------------------------------------------------------------------------------------------------------------+------------------------------+
| **Parameter**           | **Datatype**     | **Description**                                                                                                          | **Supported Values**         |
+=========================+==================+==========================================================================================================================+==============================+
| :code:`rect_masks`      | int              | How many rectangular masks should be cut (Cutout). Defaults to 5.                                                        |                              |
+-------------------------+------------------+--------------------------------------------------------------------------------------------------------------------------+------------------------------+
| :code:`rect_freq`       | int              | Should only be set if :code:`rect_masks` was set. Maximum size of cut rectangles along the frequency dimension.          |                              |
|                         |                  | Defaults to 5.                                                                                                           |                              |
+-------------------------+------------------+--------------------------------------------------------------------------------------------------------------------------+------------------------------+
| :code:`rect_time`       | int              | Should only be set if :code:`rect_masks` was set. Maximum size of cut rectangles along the time dimension.               |                              |
|                         |                  | Defaults to 25.                                                                                                          |                              |
+-------------------------+------------------+--------------------------------------------------------------------------------------------------------------------------+------------------------------+
| :code:`freq_masks`      | int              | How many frequency segments should be cut (SpecAugment). Defaults to 0.                                                  |                              |
+-------------------------+------------------+--------------------------------------------------------------------------------------------------------------------------+------------------------------+
| :code:`freq_width`      | int              | Should only be set if :code:`freq_masks` is set. Maximum number of frequencies to be cut in one segment. Defaults to 10. |                              |
+-------------------------+------------------+--------------------------------------------------------------------------------------------------------------------------+------------------------------+
| :code:`time_masks`      | int              | How many time segments should be cut (SpecAugment). Defaults to 0.                                                       |                              |
+-------------------------+------------------+--------------------------------------------------------------------------------------------------------------------------+------------------------------+
| :code:`time_width`      | int              | Should only be set if :code:`time_masks` is set. Maximum number of time steps to be cut in one segment. Defaults to 10.  |                              |
+-------------------------+------------------+--------------------------------------------------------------------------------------------------------------------------+------------------------------+
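For example, a :code:`spec_augment` block that combines both augmentation styles could look like the following (the values are illustrative, not recommendations):

.. code::

   spec_augment:
     _target_: nemo.collections.asr.modules.SpectrogramAugmentation
     # Cutout-style rectangular masks
     rect_masks: 5
     rect_freq: 50
     rect_time: 120
     # SpecAugment-style frequency and time masks
     freq_masks: 2
     freq_width: 15
     time_masks: 2
     time_width: 25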
The encoder parameters for your model include details about the CitriNet encoder architecture, including how many blocks to use, how many times to repeat each block, and the convolution parameters per block. To use CitriNet (which uses squeeze-and-excitation (SE) blocks), add :code:`separable: true`, :code:`se: true`, and :code:`se_context_size: -1` to all the blocks in the architecture. (Note: do not change the parameter name :code:`jasper`.) The encoder parameters are detailed in the following table; an annotated example block follows the table.

+-------------------------+------------------+---------------------------------------------------------------------------------------------------------------+---------------------------------+
| **Parameter**           | **Datatype**     | **Description**                                                                                               | **Supported Values**            |
+=========================+==================+===============================================================================================================+=================================+
| :code:`feat_in`         | int              | The number of input features. Should be equal to :code:`features` in the preprocessor parameters.            |                                 |
+-------------------------+------------------+---------------------------------------------------------------------------------------------------------------+---------------------------------+
| :code:`activation`      | string           | What activation function to use in the encoder.                                                              | :code:`hardtanh`, :code:`relu`, |
|                         |                  |                                                                                                               | :code:`selu`, :code:`swish`     |
+-------------------------+------------------+---------------------------------------------------------------------------------------------------------------+---------------------------------+
| :code:`conv_mask`       | bool             | Whether to use masked convolutions in the encoder. Defaults to false.                                        |                                 |
+-------------------------+------------------+---------------------------------------------------------------------------------------------------------------+---------------------------------+
| :code:`jasper`          |                  | A list of blocks that specifies your encoder architecture. Each entry in this list represents one block in   |                                 |
|                         |                  | the architecture and contains the parameters for that block, including convolution parameters, dropout, and  |                                 |
|                         |                  | the number of times the block is repeated.                                                                    |                                 |
+-------------------------+------------------+---------------------------------------------------------------------------------------------------------------+---------------------------------+
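As a reading aid, here is the first block from the example spec above with each field annotated (the values are just those from the example; the annotations reflect the parameter descriptions in this section):

.. code::

   jasper:
     - filters: 128          # number of output channels for this block's convolutions
       repeat: 1             # how many times the sub-block is repeated
       kernel: [11]          # convolution kernel size
       stride: [1]           # convolution stride
       dilation: [1]         # convolution dilation
       dropout: 0.0          # dropout probability for this block
       residual: true        # whether to add a residual connection
       separable: true       # depthwise-separable convolutions (set for CitriNet)
       se: true              # enable the squeeze-and-excitation sub-block (set for CitriNet)
       se_context_size: -1   # -1 uses global context for SE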
The decoder parameters are detailed in the following table.

+-------------------------+------------------+---------------------------------------------------------------------------------------------------------------+---------------------------------+
| **Parameter**           | **Datatype**     | **Description**                                                                                               | **Supported Values**            |
+=========================+==================+===============================================================================================================+=================================+
| :code:`feat_in`         | int              | The number of input features to the decoder. Should be equal to the number of filters in the last block of   |                                 |
|                         |                  | the encoder.                                                                                                  |                                 |
+-------------------------+------------------+---------------------------------------------------------------------------------------------------------------+---------------------------------+
| :code:`vocabulary`      | list             | A list of the valid output characters for your model. For CitriNet, this should always be an empty list.     |                                 |
+-------------------------+------------------+---------------------------------------------------------------------------------------------------------------+---------------------------------+
| :code:`num_classes`     | int              | Number of output classes. For CitriNet, this should always be set to -1.                                      |                                 |
+-------------------------+------------------+---------------------------------------------------------------------------------------------------------------+---------------------------------+

Subword Tokenization with the Tokenizer
---------------------------------------

Before the actual training, we need to pre-process the text. This step, called subword tokenization, creates a subword vocabulary for the text. It differs from Jasper/QuartzNet, whose vocabularies consist only of single characters; in CitriNet, a subword can be one or more characters. We can use the :code:`create_tokenizer` command to create the tokenizer that generates the subword vocabulary for use in training below.

As mentioned above, you can add additional arguments to override configurations from your experiment specification file. This allows you to create valid spec files that leave these fields blank, to be specified as command line arguments at runtime.

.. code::

   tlt speech_to_text_citrinet create_tokenizer \
       -e <experiment_spec> \
       manifests=<manifest_file_paths> \
       output_root=<output_dir> \
       vocab_size=<vocabulary_size>

Required Arguments
^^^^^^^^^^^^^^^^^^

* :code:`-e`: The experiment specification file to set up the tokenizer, described in detail below.

Creating a Config File for the Tokenizer
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The :code:`create_tokenizer` command requires a config file, which is also in YAML format. It contains the :code:`manifests`, :code:`output_root`, :code:`vocab_size`, and :code:`tokenizer` parameters, as described in the tables below.
+-------------------------+------------------+---------------------------------------------------------------------------------------------------------------+---------------------------------+
| **Parameter**           | **Datatype**     | **Description**                                                                                               | **Supported Values**            |
+=========================+==================+===============================================================================================================+=================================+
| :code:`manifests`       | string           | Comma-separated list of one or more manifest file paths. These should be the same manifests used in          |                                 |
|                         |                  | training.                                                                                                     |                                 |
+-------------------------+------------------+---------------------------------------------------------------------------------------------------------------+---------------------------------+
| :code:`output_root`     | string           | The output root path for the tokenizer model to be generated.                                                |                                 |
+-------------------------+------------------+---------------------------------------------------------------------------------------------------------------+---------------------------------+
| :code:`vocab_size`      | int              | The vocabulary size of the tokenizer.                                                                         |                                 |
+-------------------------+------------------+---------------------------------------------------------------------------------------------------------------+---------------------------------+

The :code:`tokenizer` parameter has its own sub-parameters, as described in the following table.

+-------------------------------------+------------------+---------------------------------------------------------------------------------------------------------------+---------------------------------+
| **Parameter**                       | **Datatype**     | **Description**                                                                                               | **Supported Values**            |
+=====================================+==================+===============================================================================================================+=================================+
| :code:`tokenizer_type`              | string           | Type of the tokenizer; currently :code:`'spe'` and :code:`'wpe'` are supported.                              | :code:`'spe'` or :code:`'wpe'`  |
+-------------------------------------+------------------+---------------------------------------------------------------------------------------------------------------+---------------------------------+
| :code:`spe_type`                    | string           | Sub-type of the 'spe' tokenizer. Valid only when :code:`tokenizer_type` is set to :code:`'spe'`.             | unigram, bpe, char, word        |
+-------------------------------------+------------------+---------------------------------------------------------------------------------------------------------------+---------------------------------+
| :code:`spe_character_coverage`      | float            | How much of the original vocabulary the tokenizer should cover in its "base set" of tokens.                  | <=1                             |
+-------------------------------------+------------------+---------------------------------------------------------------------------------------------------------------+---------------------------------+
| :code:`lower_case`                  | bool             | Whether to treat upper-case and lower-case characters the same. Setting this to true will not create         |                                 |
|                                     |                  | separate tokens for upper-case and lower-case characters.                                                     |                                 |
+-------------------------------------+------------------+---------------------------------------------------------------------------------------------------------------+---------------------------------+
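Putting these together, a minimal tokenizer spec file might look like the following sketch (the paths and vocabulary size are placeholders; it assumes an 'spe' tokenizer with the 'bpe' sub-type, per the parameters described above):

.. code::

   manifests: /data/train_manifest.json
   output_root: /results/tokenizer
   vocab_size: 1024

   tokenizer:
     tokenizer_type: "spe"
     spe_type: "bpe"
     spe_character_coverage: 1.0
     lower_case: true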
Training the Model
------------------

To train a model from scratch, use the following command:

.. code::

   tlt speech_to_text_citrinet train -e <experiment_spec> -g <num_gpus>

As mentioned above, you can add additional arguments to override configurations from your experiment specification file. This allows you to create valid spec files that leave these fields blank, to be specified as command line arguments at runtime. For example, the following command can be used to override the training manifest and validation manifest, the number of epochs to train, and the place to save the model checkpoint:

.. code::

   tlt speech_to_text_citrinet train -e <experiment_spec> \
       -g <num_gpus> \
       training_ds.manifest_filepath=<training_manifest> \
       validation_ds.manifest_filepath=<validation_manifest> \
       trainer.max_epochs=<epochs> \
       save_to='<checkpoint_name>.tlt'

Required Arguments
^^^^^^^^^^^^^^^^^^

* :code:`-e`: The experiment specification file to set up training, as in the example given above.

Optional Arguments
^^^^^^^^^^^^^^^^^^

* :code:`-g`: The number of GPUs to be used in the training in a multi-GPU scenario (default: 1).
* :code:`-r`: The path to the results and log directory. Log files, checkpoints, etc., will be stored here.
* :code:`-k`: The key to encrypt the model.
* Other arguments to override fields in the specification file.

Training Procedure
^^^^^^^^^^^^^^^^^^

At the start of each training experiment, TLT will print out a log of the experiment specification, including any parameters added or overridden via the command line. It will also show additional information, such as which GPUs are available, where logs will be saved, how many hours are in each loaded dataset, and how much of each dataset has been filtered.

.. code::

   GPU available: True, used: True
   TPU available: None, using: 0 TPU cores
   LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2]
   [NeMo W 2021-01-20 21:41:53 exp_manager:375] Exp_manager is logging to ./, but it already exists.
   [NeMo I 2021-01-20 21:41:53 exp_manager:323] Resuming from checkpoints/trained-model-last.ckpt
   [NeMo I 2021-01-20 21:41:53 exp_manager:186] Experiments will be logged at .
   ...
   [NeMo I 2021-01-20 21:41:54 collections:173] Dataset loaded with 948 files totaling 0.71 hours
   [NeMo I 2021-01-20 21:41:54 collections:174] 0 files were filtered totaling 0.00 hours
   [NeMo I 2021-01-20 21:41:54 collections:173] Dataset loaded with 130 files totaling 0.10 hours
   [NeMo I 2021-01-20 21:41:54 collections:174] 0 files were filtered totaling 0.00 hours

You should next see a full printout of the number of parameters in each module and submodule, as well as the total number of trainable and non-trainable parameters in the model. In the following table, the `encoder` module contains 18.9 million parameters, and its submodule `encoder.encoder.0`, the first Jasper block, contains 19,000 of those parameters. Of those 19,000 parameters, 2.1k are from the first `MaskedConv1d`, 16.4k are from the second, and 512 are from `BatchNorm1d`. Also listed are the ReLU and dropout submodules, which have no parameters.
.. code::

       | Name                            | Type                              | Params
   -----------------------------------------------------------------------------------------
   0   | preprocessor                    | AudioToMelSpectrogramPreprocessor | 0
   1   | preprocessor.featurizer         | FilterbankFeatures                | 0
   2   | encoder                         | ConvASREncoder                    | 18.9 M
   3   | encoder.encoder                 | Sequential                        | 18.9 M
   4   | encoder.encoder.0               | JasperBlock                       | 19.0 K
   5   | encoder.encoder.0.mconv         | ModuleList                        | 19.0 K
   6   | encoder.encoder.0.mconv.0       | MaskedConv1d                      | 2.1 K
   7   | encoder.encoder.0.mconv.0.conv  | Conv1d                            | 2.1 K
   8   | encoder.encoder.0.mconv.1       | MaskedConv1d                      | 16.4 K
   9   | encoder.encoder.0.mconv.1.conv  | Conv1d                            | 16.4 K
   10  | encoder.encoder.0.mconv.2       | BatchNorm1d                       | 512
   11  | encoder.encoder.0.mout          | Sequential                        | 0
   12  | encoder.encoder.0.mout.0        | ReLU                              | 0
   13  | encoder.encoder.0.mout.1        | Dropout                           | 0
   ...
   600 | decoder                         | ConvASRDecoder                    | 29.7 K
   601 | decoder.decoder_layers          | Sequential                        | 29.7 K
   602 | decoder.decoder_layers.0        | Conv1d                            | 29.7 K
   603 | loss                            | CTCLoss                           | 0
   604 | spec_augmentation               | SpectrogramAugmentation           | 0
   605 | spec_augmentation.spec_cutout   | SpecCutout                        | 0
   606 | _wer                            | WER                               | 0
   -----------------------------------------------------------------------------------------
   18.9 M    Trainable params
   0         Non-trainable params
   18.9 M    Total params

As the model starts training, you should see a progress bar per epoch.

.. code::

   Epoch 0: 100%|███████████████████████████████████████████████████| 35/35 [00:14<00:00,  2.40it/s, loss=62.5]
   Epoch 0, global step 29: val_loss reached 307.90469 (best 307.90469), saving model to "/tlt-nemo/checkpoints/trained-model---val_loss=307.90-epoch=0.ckpt" as top 3
   Epoch 1: 100%|███████████████████████████████████████████████████| 35/35 [00:14<00:00,  2.48it/s, loss=57.6]
   Epoch 1, global step 59: val_loss reached 70.93443 (best 70.93443), saving model to "/tlt-nemo/checkpoints/trained-model---val_loss=70.93-epoch=1.ckpt" as top 3
   Epoch 2: 100%|███████████████████████████████████████████████████| 35/35 [00:14<00:00,  2.42it/s, loss=55.5]
   Epoch 2, global step 89: val_loss reached 465.39551 (best 70.93443), saving model to "/tlt-nemo/checkpoints/trained-model---val_loss=465.40-epoch=2.ckpt" as top 3
   Epoch 3:  60%|███████████████████████████████▍                   | 21/35 [00:09<00:06,  2.19it/s, loss=54.5]
   ...

At the end of training, TLT will save the last checkpoint at the path specified by the experiment spec file before finishing.

.. code::

   [NeMo I 2021-01-20 22:38:48 train:120] Experiment logs saved to '.'
   [NeMo I 2021-01-20 22:38:48 train:123] Trained model saved to './checkpoints/trained-model.tlt'
   INFO: Internal process exited

Troubleshooting
^^^^^^^^^^^^^^^

* Currently, only .wav audio files are supported.
* If you are training on a non-English dataset and are consistently getting blank predictions, check that you have set `normalize_transcripts=False`. By default, the data layers have normalization on and will get rid of non-English characters.
* Similarly, if you are training on an English dataset with capital letters or additional punctuation, make sure that the data layer normalizes transcripts to lowercase, or that your custom vocabulary includes the additional valid characters.
* If you consistently run into out-of-memory errors while training, consider adding a maximum length to your audio samples using `max_duration`.

Evaluating the Model
--------------------

To run evaluation on a trained model checkpoint, use this command:
.. code::

   tlt speech_to_text_citrinet evaluate -e <experiment_spec> \
       -m <model_checkpoint> \
       -g <num_gpus>

Required Arguments
^^^^^^^^^^^^^^^^^^

* :code:`-e`: The experiment specification file to set up evaluation. This only needs a dataset config, as described in the "Dataset Configs" section; an example spec is shown after the argument lists below.
* :code:`-m`: Path to the model checkpoint, which should be a :code:`.tlt` file.

Optional Arguments
^^^^^^^^^^^^^^^^^^

* :code:`-g`: The number of GPUs to be used in evaluation in a multi-GPU scenario (default: 1).
* :code:`-r`: The path to the results and log directory. Log files, checkpoints, etc., will be stored here.
* :code:`-k`: The key to encrypt the model.
* Other arguments to override fields in the specification file.
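For reference, a minimal evaluation spec only needs a :code:`test_ds` block; a sketch (the manifest path is a placeholder) might look like:

.. code::

   test_ds:
     manifest_filepath: /data/test_manifest.json
     sample_rate: 16000
     batch_size: 32
     shuffle: false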
Evaluation Procedure
^^^^^^^^^^^^^^^^^^^^

At the start of evaluation, TLT will print out a log of the experiment specification, including any parameters added or overridden via the command line. It will also show additional information, such as which GPUs are available, where logs will be saved, and how many hours are in the test dataset.

.. code::

   GPU available: True, used: True
   TPU available: None, using: 0 TPU cores
   LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2]
   [NeMo W 2021-01-20 22:58:19 exp_manager:375] Exp_manager is logging to ./, but it already exists.
   [NeMo I 2021-01-20 22:58:19 exp_manager:186] Experiments will be logged at .
   ...
   [NeMo I 2021-01-20 22:58:20 features:235] PADDING: 16
   [NeMo I 2021-01-20 22:58:20 features:251] STFT using torch
   [NeMo I 2021-01-20 22:58:22 collections:173] Dataset loaded with 130 files totaling 0.10 hours
   [NeMo I 2021-01-20 22:58:22 collections:174] 0 files were filtered totaling 0.00 hours

Once evaluation begins, a progress bar will be shown to indicate how many batches have been processed. After evaluation, the test results will be shown.

.. code::

   Testing: 100%|████████████████████████████████████████████████████████| 5/5 [00:02<00:00,  2.43it/s]
   --------------------------------------------------------------------------------
   DATALOADER:0 TEST RESULTS
   {'test_loss': tensor(68.1998, device='cuda:0'), 'test_wer': tensor(0.9987, device='cuda:0')}
   --------------------------------------------------------------------------------
   [NeMo I 2021-01-20 22:58:24 evaluate:94] Experiment logs saved to '.'
   INFO: Internal process exited

Troubleshooting
^^^^^^^^^^^^^^^

* Currently, only .wav audio files are supported.
* Filtering should be turned off for evaluation. Make sure that `max_duration` and `min_duration` are not set.
* For best results, evaluation should be done on audio files with the same sample rate as the training data.
* The model will only predict characters that were included in the training vocabulary. Make sure that the training and test vocabularies match, including normalization.

Fine-Tuning the Model
---------------------

To fine-tune a model from a checkpoint, use the following command:

.. code::

   tlt speech_to_text_citrinet finetune -e <experiment_spec> \
       -m <model_checkpoint> \
       -g <num_gpus>

Required Arguments
^^^^^^^^^^^^^^^^^^

* :code:`-e`: The experiment specification file to set up fine-tuning. This requires the :code:`trainer`, :code:`save_to`, and :code:`optim` configurations described in the "Training Process Configs" section, as well as :code:`finetuning_ds` and :code:`validation_ds` configs, as described in the "Dataset Configs" section. Additionally, if your fine-tuning dataset has a different vocabulary (i.e., set of labels) than the trained model checkpoint, you must also set :code:`change_vocabulary: true` at the top level of your specification file. A skeleton spec is sketched after the argument lists below.
* :code:`-m`: Path to the model checkpoint from which to fine-tune. Should be a :code:`.tlt` file.

Optional Arguments
^^^^^^^^^^^^^^^^^^

* :code:`-g`: The number of GPUs to be used for fine-tuning in a multi-GPU scenario (default: 1).
* :code:`-r`: The path to the results and log directory. Log files, checkpoints, etc., will be stored here.
* :code:`-k`: The key to encrypt the model.
* Other arguments to override fields in the specification file.
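To make the required pieces concrete, here is a skeleton of a fine-tuning spec assembled from the configs named above (paths, epoch count, and learning rate are placeholders; the lowered learning rate follows the troubleshooting advice below):

.. code::

   trainer:
     max_epochs: 10

   save_to: finetuned-model.tlt

   # Required if the fine-tuning dataset's vocabulary differs from the checkpoint's.
   change_vocabulary: true

   finetuning_ds:
     manifest_filepath: /data/finetune_train_manifest.json
     sample_rate: 16000
     batch_size: 32
     shuffle: true

   validation_ds:
     manifest_filepath: /data/finetune_dev_manifest.json
     sample_rate: 16000
     batch_size: 32
     shuffle: false

   optim:
     name: adam
     lr: 0.001  # a lower learning rate than training from scratch (see Troubleshooting)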
Fine-Tuning Procedure
^^^^^^^^^^^^^^^^^^^^^

At the start of fine-tuning, TLT will print out a log of the experiment specification, including any parameters added or overridden via the command line. It will also show additional information, such as which GPUs are available, where logs will be saved, and how many hours are in the fine-tuning and evaluation datasets.

When restoring from the checkpoint, it will then log the original datasets that the checkpoint model was trained and evaluated on. If the vocabulary has been changed, the logs will show what the new vocabulary is.

.. code::

   [NeMo I 2021-01-20 23:33:12 finetune:110] Model restored from './checkpoints/trained-model.tlt'
   [NeMo I 2021-01-20 23:33:12 ctc_models:247] Changed decoder to output to [' ', 'а', 'б', 'в', 'г', 'д', 'е', 'ё', 'ж', 'з', 'и', 'й', 'к', 'л', 'м', 'н', 'о', 'п', 'р', 'с', 'т', 'у', 'ф', 'х', 'ц', 'ч', 'ш', 'щ', 'ъ', 'ы', 'ь', 'э', 'ю', 'я'] vocabulary.
   [NeMo I 2021-01-20 23:33:12 collections:173] Dataset loaded with 7242 files totaling 11.74 hours
   [NeMo I 2021-01-20 23:33:12 collections:174] 0 files were filtered totaling 0.00 hours
   [NeMo I 2021-01-20 23:33:12 collections:173] Dataset loaded with 7307 files totaling 12.56 hours
   [NeMo I 2021-01-20 23:33:12 collections:174] 0 files were filtered totaling 0.00 hours

As with training, TLT will log a full listing of the modules and submodules in the model, as well as the total number of trainable and non-trainable parameters in the model. See the Training section for more details on the parameter breakdowns.

.. code::

       | Name                            | Type                              | Params
   -----------------------------------------------------------------------------------------
   0   | preprocessor                    | AudioToMelSpectrogramPreprocessor | 0
   1   | preprocessor.featurizer         | FilterbankFeatures                | 0
   2   | encoder                         | ConvASREncoder                    | 18.9 M
   ...
   603 | decoder                         | ConvASRDecoder                    | 35.9 K
   604 | decoder.decoder_layers          | Sequential                        | 35.9 K
   605 | decoder.decoder_layers.0        | Conv1d                            | 35.9 K
   606 | loss                            | CTCLoss                           | 0
   -----------------------------------------------------------------------------------------
   18.9 M    Trainable params
   0         Non-trainable params
   18.9 M    Total params

Note that if the vocabulary has been changed, the decoder may have a different number of parameters. Fine-tuning on the new dataset should proceed afterwards as with normal training, with a progress bar per epoch and checkpoints saved to the specified directory.

Troubleshooting
^^^^^^^^^^^^^^^

* Currently, only .wav audio files are supported.
* We recommend using a lower learning rate when fine-tuning from a trained model checkpoint. A good starting point is 1/100 of the original learning rate.
* If the fine-tuning vocabulary is different from the original training vocabulary, you will need to set `change_vocabulary=True`.
* You may see a dimensionality mismatch error (example below) or other hyperparameter mismatch error if your training checkpoint directory (i.e., the model you are loading with `restore_from`) and fine-tuning checkpoint directory are the same. Make sure they are distinct by using the `-r` flag (e.g., `-r <results_dir>`).

.. code::

   RuntimeError: Error(s) in loading state_dict for EncDecCTCModel:
   size mismatch for decoder.decoder_layers.0.weight: copying a param with shape torch.Size([29, 1024, 1]) from checkpoint, the shape in current model is torch.Size([35, 1024, 1]).

Using Inference on a Model
--------------------------

To perform inference on individual audio files, use the following command:

.. code::

   tlt speech_to_text_citrinet infer -e <experiment_spec> \
       -m <model_checkpoint> \
       -g <num_gpus>

Required Arguments
^^^^^^^^^^^^^^^^^^

* :code:`-e`: The experiment specification file to set up inference. This spec file only needs a :code:`file_paths` parameter that contains a list of individual file paths; an example is shown after the argument lists below.
* :code:`-m`: Path to the model checkpoint. Should be a :code:`.tlt` file.

Optional Arguments
^^^^^^^^^^^^^^^^^^

* :code:`-g`: The number of GPUs to be used for inference in a multi-GPU scenario (default: 1).
* :code:`-r`: The path to the results and log directory. Log files, checkpoints, etc., will be stored here.
* :code:`-k`: The key to encrypt the model.
* Other arguments to override fields in the specification file.
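A minimal inference spec is just the list of audio files (these paths are placeholders):

.. code::

   file_paths:
     - /data/audio/utterance1.wav
     - /data/audio/utterance2.wav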
Inference Procedure
^^^^^^^^^^^^^^^^^^^

At the start of inference, TLT will print out the experiment specification, including the audio filepaths on which inference will be performed. When restoring from the checkpoint, it will then log the original datasets that the checkpoint model was trained and evaluated on. This will show the vocabulary that the model was trained on.

.. code::

   Train config :
   manifest_filepath: /data/an4/train_manifest.json
   batch_size: 32
   sample_rate: 16000
   labels:
   - ' '
   - a
   - b
   - c
   ...

Prediction results will be shown at the end of the log. Each prediction is preceded by the associated filename on the previous line.

.. code::

   [NeMo I 2021-01-21 00:22:00 infer:67] The prediction results:
   [NeMo I 2021-01-21 00:22:00 infer:69] File: /data/an4/wav/an4test_clstk/fcaw/an406-fcaw-b.wav
   [NeMo I 2021-01-21 00:22:00 infer:70] Predicted transcript: rubout g m e f three nine
   [NeMo I 2021-01-21 00:22:00 infer:69] File: /data/an4/wav/an4test_clstk/fcaw/an407-fcaw-b.wav
   [NeMo I 2021-01-21 00:22:00 infer:70] Predicted transcript: erase c q q f seven
   [NeMo I 2021-01-21 00:22:00 infer:73] Experiment logs saved to '.'
   INFO: Internal process exited

Troubleshooting
^^^^^^^^^^^^^^^

* Currently, only .wav audio files are supported.
* For best results, inference should be done on audio files with the same sample rate as the training data.
* The model will only predict characters that were included in the training vocabulary. Make sure that the training and test vocabularies match, including normalization.

Model Export
------------

You can export a trained ASR model to the Jarvis format, which contains all the model artifacts necessary for deployment to Jarvis Services. For more details, see the Jarvis documentation.

To export an ASR model to the Jarvis format, use the following command:

.. code::

   tlt speech_to_text_citrinet export -e <experiment_spec> \
       -m <model_checkpoint> \
       -r <results_dir> \
       -k <encryption_key> \
       export_format=JARVIS

Required Arguments
^^^^^^^^^^^^^^^^^^

* :code:`-e`: The experiment specification file for export. See the Export Spec File section below.
* :code:`-m`: Path to the model checkpoint to be exported. Should be a :code:`.tlt` file.

Optional Arguments
^^^^^^^^^^^^^^^^^^

* :code:`-k`: The encryption key.
* :code:`-r`: Path to the directory where results will be stored.

Export Spec File
^^^^^^^^^^^^^^^^

The following is an example spec file for model export.

.. code::

   # Name of the .tlt EFF archive to be loaded/model to be exported.
   restore_from: trained-model.tlt

   # Set export format: JARVIS
   export_format: JARVIS

   # Output EFF archive containing the model checkpoint and artifacts required for Jarvis Services
   export_to: exported-model.ejrvs

+-------------------------+------------------+--------------------------------------------------+---------------------------------+
| **Parameter**           | **Datatype**     | **Description**                                  | **Default**                     |
+=========================+==================+==================================================+=================================+
| :code:`restore_from`    | string           | Path to the pre-trained model to be exported.    | :code:`trained_model.tlt`       |
+-------------------------+------------------+--------------------------------------------------+---------------------------------+
| :code:`export_format`   | string           | Export format.                                   | N/A                             |
+-------------------------+------------------+--------------------------------------------------+---------------------------------+
| :code:`export_to`       | string           | Target path for the exported model.              | :code:`exported-model.ejrvs`    |
+-------------------------+------------------+--------------------------------------------------+---------------------------------+