Speech Recognition With Conformer
Automatic Speech Recognition (ASR) models take in audio files and predict their transcriptions. Besides Jasper, QuartzNet, and CitriNet, you can also use Conformer for ASR. Conformer combines self-attention and convolution modules to get the best of both approaches.
Example specification files for the following ASR tasks can be downloaded using this command:
tao speech_to_text_conformer download_specs -o <target_path> \
-r <results_path>
Required Arguments
-o: The target path where the spec files will be stored.
-r: The results and output log directory.
The dataset for ASR consists of a set of utterances in individual .wav audio files and a .json manifest that describes the dataset, with information about a single utterance per line. Each line of the manifest should be in the following format:
{"audio_filepath": "/path/to/audio.wav", "text": "the transcription of the utterance", "duration": 23.147}
The audio_filepath field should provide an absolute path to the .wav file corresponding to the utterance. The text field should contain the full transcript for the utterance, and the duration field should reflect the duration of the utterance in seconds.
Each entry in the manifest (describing one audio file) must be encompassed by { } and contained on a single line. The fields that describe the file should be separated by commas and have the form "field_name": value, as shown above.
Since the manifest specifies the path for each utterance, the audio files do not have to be located in the same directory as the manifest, or even in any specific directory structure.
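For example, a manifest with two utterances would contain two lines like the following (the paths, transcripts, and durations shown here are illustrative placeholders):
{"audio_filepath": "/data/utterance_0001.wav", "text": "the first transcription", "duration": 1.21}
{"audio_filepath": "/data/utterance_0002.wav", "text": "the second transcription", "duration": 2.57}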
The spec file for ASR using Conformer includes the trainer, save_to, model, training_ds, validation_ds, and optim parameters. The following is a shortened example of a spec file for training on the Mozilla Common Voice English dataset.
trainer:
  max_epochs: 100
  tlt_checkpoint_interval: 1

# Name of the .tlt file where the trained Conformer model will be saved
save_to: trained-model.tlt

# Specifies parameters for the ASR model
model:
  log_prediction: true # enables logging sample predictions in the output during training
  ctc_reduction: 'mean_batch'

  # Parameters for sub-word tokenization
  tokenizer:
    dir: ???
    type: "bpe" # Can be either bpe or wpe

  # Parameters for the audio to spectrogram preprocessor.
  preprocessor:
    _target_: nemo.collections.asr.modules.AudioToMelSpectrogramPreprocessor
    sample_rate: 16000
    normalize: per_feature
    window_size: 0.025
    window_stride: 0.01
    window: hann
    features: 80
    n_fft: 512
    frame_splicing: 1
    dither: 1.0e-05
    pad_to: 0
    pad_value: 0.0

  # This adds spectrogram augmentation to the training process.
  spec_augment:
    _target_: nemo.collections.asr.modules.SpectrogramAugmentation
    freq_masks: 2 # set to zero to disable it
    # you may use lower time_masks for smaller models to have a faster convergence
    time_masks: 5 # set to zero to disable it
    freq_width: 27
    time_width: 0.05

  # The encoder and decoder sections specify your model architecture.
  encoder:
    _target_: nemo.collections.asr.modules.ConformerEncoder
    feat_in: 80
    feat_out: -1 # you may set it if you need different output size other than the default d_model
    n_layers: 16
    d_model: 176

    # Sub-sampling params
    subsampling: striding # vggnet or striding, vggnet may give better results but needs more memory
    subsampling_factor: 4 # must be power of 2
    subsampling_conv_channels: -1 # -1 sets it to d_model

    # Feed forward module's params
    ff_expansion_factor: 4

    # Multi-headed Attention Module's params
    self_attention_model: rel_pos # rel_pos or abs_pos
    n_heads: 4 # may need to be lower for smaller d_models
    # [left, right] specifies the number of steps to be seen from left and right of each step in self-attention
    xscaling: true # scales up the input embeddings by sqrt(d_model)
    untie_biases: true # unties the biases of the TransformerXL layers
    pos_emb_max_len: 5000

    # Convolution module's params
    conv_kernel_size: 31
    conv_norm_type: 'batch_norm' # batch_norm or layer_norm

    ### regularization
    dropout: 0.1 # The dropout used in most of the Conformer Modules
    dropout_emb: 0.0 # The dropout used for embeddings
    dropout_att: 0.1 # The dropout for multi-headed attention modules

  decoder:
    _target_: nemo.collections.asr.modules.ConvASRDecoder
    feat_in: null
    num_classes: -1 # filled with vocabulary size from tokenizer at runtime
    vocabulary: [] # filled with vocabulary from tokenizer at runtime

# This section specifies the dataset to be used for training.
training_ds:
  # No need to specify an audio file path, since the manifest entries contain individual
  # utterances' file paths.
  manifest_filepath: /data/cv-corpus-5.1-2020-06-22/en/train.json
  sample_rate: 16000
  batch_size: 32
  shuffle: true
  use_start_end_token: false
  trim_silence: false
  # Setting a max duration trims out files that are longer than the max.
  max_duration: 16.7 # it is set for LibriSpeech, you may need to update it for your dataset
  min_duration: 0.1
  # The is_tarred and tarred_audio_filepaths parameters should be specified if using a tarred dataset.
  is_tarred: false
  tarred_audio_filepaths: null
  # bucketing params
  bucketing_strategy: "synced_randomized"
  bucketing_batch_size: null

# Specifies the dataset to be used for validation.
validation_ds:
  manifest_filepath: /data/cv-corpus-5.1-2020-06-22/en/dev.json
  sample_rate: 16000
  batch_size: 32
  shuffle: false
  num_workers: 8
  pin_memory: true
  use_start_end_token: false

# The parameters for the training optimizer, including learning rate, lr schedule, etc.
optim:
  name: adamw
  lr: 5.0
  # optimizer arguments
  betas: [0.9, 0.98]
  # less necessity for weight_decay as we already have large augmentations with SpecAug
  # you may need weight_decay for large models, stable AMP training, small datasets, or when lower augmentations are used
  # weight decay of 0.0 with lr of 2.0 also works fine
  weight_decay: 1e-3

  # scheduler setup
  sched:
    name: NoamAnnealing
    d_model: ${model.encoder.d_model}
    # scheduler config override
    warmup_steps: 10000
    warmup_ratio: null
    min_lr: 1e-6
The specification can be grouped into roughly three categories:
Parameters that describe the training process
Parameters that describe the datasets
Parameters that describe the model
This specification can be used with the tao speech_to_text_conformer train command. Only a dataset parameter is required for tao speech_to_text_conformer evaluate, though a checkpoint must be provided.
If you would like to change a parameter for your run without changing the specification file itself, you can specify it on the command line directly. For example, to change the validation batch size, add validation_ds.batch_size=1 to your command, which will override the batch size of 32 in the configuration shown above. An example of this is shown in the training instructions below.
Training Process Configs
A few parameters specify the details of your training run. They are described in the following table.
Parameter | Datatype | Description | Supported Values
---|---|---|---
trainer.max_epochs | int | Specifies the maximum number of epochs to train the model. This is a field of the trainer config. | >0
save_to | string | The location where the trained model checkpoint will be saved. This should be in the form <path>.tlt. | A valid path
optim | collection | Specifies the optimizer to be used for training, as well as the parameters to configure it. If your chosen optimizer takes additional arguments, they can be placed directly under the optim config, as betas and weight_decay are in the example above. |
trainer.tlt_checkpoint_interval | int | The interval (in epochs) at which to save the .tlt checkpoint during training. This is a field of the trainer config. | >=0 (0 means no checkpoint)
There is also an early_stopping config, which enables early stopping during training. It has the following parameters.
Parameter | Datatype | Description | Supported Values
---|---|---|---
monitor | string | The metric to monitor in order to enable early stopping. | A valid metric name (e.g. val_loss)
patience | int | The number of checks of the monitored metric with no improvement after which training is stopped. | Positive integer
min_delta | float | The minimum change in the monitored metric required to count as an improvement. | Positive float
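For example, an early_stopping block might look like the following. This is a minimal sketch that assumes the parameter names shown in the table above and a top-level placement in the spec file; the metric name and threshold values are illustrative:
early_stopping:
  monitor: val_loss   # metric to monitor (val_loss appears in the training logs below)
  patience: 5         # number of checks with no improvement before stopping
  min_delta: 0.001    # minimum change that counts as an improvement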
Dataset Configs
The datasets that you use should be specified by <xyz>_ds
parameters, depending on your use case:
For training using
tao speech_to_text_conformer train
, usetraining_ds
to describe the training dataset andvalidation_ds
to describe the validation dataset.For evaluation using
tao speech_to_text_conformer evaluate
, usetest_ds
to describe the test dataset.For fine-tuning using
tao speech_to_text_conformer finetune
, usefinetuning_ds
to describe the fine-tuning training dataset andvalidation_ds
to describe the validation dataset.
The fields for each dataset config are described in the following table.
Parameter | Datatype | Description | Supported Values
---|---|---|---
manifest_filepath | string | The filepath to the .json manifest that describes the audio data. | A valid filepath
sample_rate | int | The target sample rate at which to load the audio, in Hz. | e.g. 16000
batch_size | int | The batch size. This may depend on available memory and how long the audio samples are. | >0
trim_silence | bool | Whether to trim silence from the beginning and end of each audio signal. The default value is False. | True/False
min_duration | float | All files with a duration less than the given value (in seconds) will be dropped. The default value is 0.1. |
max_duration | float | All files with a duration greater than the given value (in seconds) will be dropped. |
shuffle | bool | Whether to shuffle the data. We recommend True for training data and False for validation. | True/False
use_start_end_token | bool | Whether to add [BOS] and [EOS] tokens to the beginning and end of each utterance, respectively. | True/False
is_tarred | bool | Whether the audio files in the dataset are contained in a tarfile (tarred dataset). | True/False
tarred_audio_filepaths | string | The path to the tarfile, if using a tarred dataset. | A valid filepath
shuffle_n | int | The number of audio samples to load at once from the tarfile for shuffling. For example, if set to 100 when the batch size is 25, the data loader will load the next 100 samples from the tarfile, shuffle them, and use the shuffled order for the next four batches. |
bucketing_strategy | string | The bucketing strategy to use; setting this enables bucketing during training, which requires a tarred dataset. With fixed_order, the same order of buckets is used for all epochs. | fixed_order / synced_randomized / fully_randomized
bucketing_batch_size | int | The number of audio samples in each bucket. Only set this parameter if bucketing is enabled. |
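For example, the dataset portion of an evaluation spec can consist of a single test_ds block. The fields below mirror the validation_ds block from the training spec above; the manifest path is a placeholder:
test_ds:
  manifest_filepath: /data/cv-corpus-5.1-2020-06-22/en/test.json
  sample_rate: 16000
  batch_size: 32
  shuffle: false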
Model Configs
The Conformer model architecture and configuration are set under the model parameter. This includes general parameters, such as the following:
Logging
Parameters for tokenizer, which defines the tokenizer type and path for sub-word tokenization
Parameters for the audio preprocessor, which determines how audio signals are transformed to spectrograms
Spectrogram augmentation, which adds a data augmentation step to the pipeline
The encoder of the model
The decoder of the model
Parameter | Datatype | Description | Supported Values
---|---|---|---
log_prediction | bool | Whether a random sample, along with its predicted transcript, should be printed in the output at each step. | True/False
ctc_reduction | string | The reduction type of CTC loss. The default setting is mean_batch. |
The tokenizer parameters are as follows:
Parameter | Datatype | Description | Supported Values
---|---|---|---
dir | string | The root path to the tokenizer model. This path is typically created by the create_tokenizer command described below. | A valid path
type | string | The tokenizer type, which can be either "bpe" or "wpe". | bpe / wpe
The preprocessor parameters are as follows:
Parameter | Datatype | Description | Supported Values
---|---|---|---
normalize | string | The normalization process for each spectrogram. Defaults to per_feature. |
sample_rate | int | The sample rate of the input audio data, in Hz. This should match the sample rate of your datasets. The default value is 16000. |
window_size | float | The window size for FFT in seconds. The default value is 0.02. |
window_stride | float | The window stride for FFT in seconds. The default value is 0.01. |
window | string | The windowing function for FFT. The default value is hann. |
features | int | The number of mel spectrogram frequency bins to output. The default value is 64. |
n_fft | int | The length of the FFT window. |
frame_splicing | int | The number of frames to stack across the feature dimension. Setting this to 1 disables frame splicing. The default value is 1. |
dither | float | The amount of white-noise dithering. The default value is 1e-5. |
pad_to | int | Ensures that the output size of the time dimension is a multiple of pad_to. The default value is 16. |
pad_value | float | The value that shorter mels are padded with. The default value is 0. |
stft_conv | bool | If set to True, uses a convolution-based STFT implementation instead of torch.stft. | True/False
If you wish to add spectrogram augmentation to your model, include a spec_augment block. Within this block, you can specify parameters for time and frequency cuts for augmentation, as described by SpecAugment and Cutout.
Parameter | Datatype | Description | Supported Values
---|---|---|---
rect_masks | int | The number of rectangular masks to cut (Cutout). The default value is 5. |
rect_freq | int | The maximum size of cut rectangles along the frequency dimension. This parameter should only be set if rect_masks is set. |
rect_time | int | The maximum size of cut rectangles along the time dimension. This parameter should only be set if rect_masks is set. |
freq_masks | int | The number of frequency segments to cut (SpecAugment). The default value is 0. |
freq_width | int | The maximum number of frequencies to cut in one segment. This parameter should only be set if freq_masks is set. |
time_masks | int | The number of time segments to cut (SpecAugment). The default value is 0. |
time_width | int | The maximum number of time steps to cut in one segment. This parameter should only be set if time_masks is set. |
The encoder parameters for the model include details about the Conformer encoder architecture, such as the number of Conformer layers and the model dimension. The encoder parameters are detailed in the following table.
Parameter | Datatype | Description | Supported Values
---|---|---|---
feat_in | int | The number of input features. This value should be equal to the preprocessor's features value. |
n_layers | int | The number of ConformerBlock layers. |
d_model | int | The hidden size of the model. |
feat_out | int | The size of the output features. The default value is -1, which sets it to d_model. |
subsampling | string | The method of subsampling. The default value is striding. | vggnet/striding
subsampling_factor | int | The subsampling factor, which should be a power of 2. The default value is 4. |
subsampling_conv_channels | int | The size of the convolutions in the subsampling module. The default value is -1, which sets it to d_model. |
ff_expansion_factor | int | The expansion factor in the feed-forward layers. The default value is 4. |
self_attention_model | string | The type of the attention layer and positional encoding. The default setting is rel_pos (relative positional embeddings); abs_pos uses absolute positional embeddings. | rel_pos/abs_pos
n_heads | int | The number of heads in the multi-headed attention layers. The default value is 4. |
xscaling | bool | If True, scales the inputs to the multi-headed attention layers by sqrt(d_model). |
untie_biases | bool | If True, unties (does not share) the bias weights between layers of Transformer-XL. The default setting is True. |
pos_emb_max_len | int | The maximum length of positional embeddings. The default value is 5000. |
conv_kernel_size | int | The kernel size of the convolutions in the convolutional modules. The default value is 31. |
conv_norm_type | string | The type of normalization in the convolutional modules. The default value is batch_norm. | batch_norm/layer_norm
dropout | float | The dropout rate used in all layers except the attention layers. The default value is 0.1. |
dropout_emb | float | The dropout rate used for the positional embeddings. The default value is 0.1. |
dropout_att | float | The dropout rate used for the attention layers. The default value is 0.0. |
The decoder parameters are detailed in the following table.
Parameter | Datatype | Description | Supported Values
---|---|---|---
feat_in | int | The number of input features for the decoder. This should be equal to the number of output features from the encoder. |
vocabulary | list | A list of the valid output characters for your model. For Conformer, this should always be an empty list. |
num_classes | int | The number of output classes. For Conformer, this should always be set to -1. |
Before performing the actual training, you need to process the text. This step is called subword tokenization: it creates a subword vocabulary for the text. This differs from Jasper/QuartzNet, whose vocabularies consist only of single characters; in Conformer, a subword can be one or more characters. You can use the create_tokenizer command to generate the subword vocabulary for use in training below.
As mentioned above, you can add additional arguments to override configurations from your experiment specification file. This allows you to create valid spec files that leave these fields blank, to be specified as command-line arguments at runtime.
tao speech_to_text_conformer create_tokenizer \
-e <experiment_spec> \
manifests=<manifest_file_paths> \
output_root=<output_dir> \
vocab_size=<vocabulary_size>
Required Arguments
-e: The experiment specification file to set up the tokenizer, described in detail below.
Creating a config file for the Tokenizer
The create_tokenizer command requires a config file in .yaml format. It contains the manifests, output_root, vocab_size, and tokenizer parameters, as described in the table below.
Parameter | Datatype | Description | Supported Values
---|---|---|---
manifests | string | A comma-separated list of one or more manifest file paths. The manifest file should be the same as the one used in training. |
output_root | string | The output root path where the tokenizer model will be generated. |
vocab_size | int | The vocabulary size of the tokenizer. |
The tokenizer parameter has its own sub-parameters, as described in the table below.
Parameter | Datatype | Description | Supported Values
---|---|---|---
 | string | The type of tokenizer. Currently, "spe" and "wpe" are supported. | spe, wpe
 | string | The sub-type of the "spe" tokenizer. This parameter is only valid when the tokenizer type is "spe". | unigram, bpe, char, word
 | float | The proportion of the original vocabulary that the tokenizer should cover in its "base set" of tokens. | <=1
 | bool | If True, the tokenizer will not create separate tokens for upper- and lower-case characters. | True/False
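The following is a minimal sketch of such a config file. The manifests, output_root, and vocab_size fields are those described above; the field names inside the tokenizer block are hypothetical placeholders (the exact sub-parameter names are not listed here), and the paths and vocabulary size are illustrative:
manifests: /data/cv-corpus-5.1-2020-06-22/en/train.json
output_root: /results/tokenizer
vocab_size: 1024
tokenizer:
  # The sub-parameter names below are hypothetical; see the table above for their meanings.
  type: "spe"       # tokenizer type, either "spe" or "wpe"
  spe_type: "bpe"   # sub-type of the "spe" tokenizer: unigram, bpe, char, or word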
To train a model from scratch, use the following command:
tao speech_to_text_conformer train -e <experiment_spec> -g <num_gpus>
As mentioned above, you can add additional arguments to override configurations from your experiment specification file. This allows you to create valid spec files that leave these fields blank, to be specified as command line arguments at runtime.
For example, the following command can be used to override the training manifest and validation manifest, the number of epochs to train, and the place to save the model checkpoint:
tao speech_to_text_conformer train -e <experiment_spec> \
-g <num_gpus> \
training_ds.manifest_filepath=<training_manifest_filepath> \
validation_ds.manifest_filepath=<val_manifest_filepath> \
trainer.max_epochs=<epochs_to_train> \
save_to='<file_path>.tlt'
Required Arguments
-e: The experiment specification file to set up training, as in the example given above.
Optional Arguments
-g: The number of GPUs to be used for training in a multi-GPU scenario. The default value is 1.
-r: The path to the results and log directory. Log files, checkpoints, etc. will be stored here.
-k: The key to encrypt the model.
Other arguments to override fields in the specification file.
Training Procedure
At the start of each training experiment, TAO Toolkit will print out a log of the experiment specification, including any parameters added or overridden via the command line. It will also show additional information, such as which GPUs are available, where logs will be saved, how many hours are in each loaded dataset, and how much of each dataset has been filtered.
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2]
[NeMo W 2021-01-20 21:41:53 exp_manager:375] Exp_manager is logging to ./, but it already exists.
[NeMo I 2021-01-20 21:41:53 exp_manager:323] Resuming from checkpoints/trained-model-last.ckpt
[NeMo I 2021-01-20 21:41:53 exp_manager:186] Experiments will be logged at .
...
[NeMo I 2021-01-20 21:41:54 collections:173] Dataset loaded with 948 files totaling 0.71 hours
[NeMo I 2021-01-20 21:41:54 collections:174] 0 files were filtered totaling 0.00 hours
[NeMo I 2021-01-20 21:41:54 collections:173] Dataset loaded with 130 files totaling 0.10 hours
[NeMo I 2021-01-20 21:41:54 collections:174] 0 files were filtered totaling 0.00 hours
You should next see a full printout of the number of parameters in each module and submodule, as well as the total number of trainable and non-trainable parameters in the model.
In the following table, the encoder module contains 121 million parameters, and its submodule encoder.pre_encode contains 7.6 million of those parameters. Of those 7.6 million parameters, 5.2 million are from the Linear layer and 2.4 million are from the Sequential layer, most of which belong to its second Conv2d layer. Also listed are the ReLU and dropout submodules, which have no parameters.
| Name | Type | Params
-------------------------------------------------------------------------------------------
0 | preprocessor | AudioToMelSpectrogramPreprocessor | 0
1 | preprocessor.featurizer | FilterbankFeatures | 0
2 | encoder | ConformerEncoder | 121 M
3 | encoder.pre_encode | ConvSubsampling | 7.6 M
4 | encoder.pre_encode.out | Linear | 5.2 M
5 | encoder.pre_encode.conv | Sequential | 2.4 M
6 | encoder.pre_encode.conv.0 | Conv2d | 5.1 K
7 | encoder.pre_encode.conv.1 | ReLU | 0
8 | encoder.pre_encode.conv.2 | Conv2d | 2.4 M
9 | encoder.pos_enc | RelPositionalEncoding | 0
10 | encoder.pos_enc.dropout | Dropout | 0
11 | encoder.layers | ModuleList | 113 M
12 | encoder.layers.0 | ConformerLayer | 6.3 M
13 | encoder.layers.0.norm_feed_forward1 | LayerNorm | 1.0 K
...
517 | decoder | ConvASRDecoder | 67.2 K
518 | decoder.decoder_layers | Sequential | 67.2K
519 | decoder.decoder_layers.0 | Conv1d | 67.2 K
520 | loss | CTCLoss | 0
521 | spec_augmentation | SpectrogramAugmentation | 0
522 | spec_augmentation.spec_augment | SpecAugment | 0
523 | _wer | WERBPE | 0
-------------------------------------------------------------------------------------------
121 M Trainable params
0 Non-trainable params
121 M Total params
486.009 Total estimated model params size (MB)
As the model starts training, you should see a progress bar per epoch.
Epoch 0: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:14<00:00, 2.40it/s, loss=62.5Epoch 0, global step 29: val_loss reached 307.90469 (best 307.90469), saving model to "/tlt-nemo/checkpoints/trained-model---val_loss=307.90-epoch=0.ckpt" as top 3
Epoch 1: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:14<00:00, 2.48it/s, loss=57.6Epoch 1, global step 59: val_loss reached 70.93443 (best 70.93443), saving model to "/tlt-nemo/checkpoints/trained-model---val_loss=70.93-epoch=1.ckpt" as top 3
Epoch 2: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:14<00:00, 2.42it/s, loss=55.5Epoch 2, global step 89: val_loss reached 465.39551 (best 70.93443), saving model to "/tlt-nemo/checkpoints/trained-model---val_loss=465.40-epoch=2.ckpt" as top 3
Epoch 3: 60%|█████████████████████████████████████████████████████▍ | 21/35 [00:09<00:06, 2.19it/s, loss=54.5]
...
At the end of training, TAO Toolkit will save the last checkpoint at the path specified by the experiment spec file before finishing.
[NeMo I 2021-01-20 22:38:48 train:120] Experiment logs saved to '.'
[NeMo I 2021-01-20 22:38:48 train:123] Trained model saved to './checkpoints/trained-model.tlt'
INFO: Internal process exited
Troubleshooting
Currently, only .wav audio files are supported.
If you are training on a non-English dataset and are consistently getting blank predictions, check that you have set normalize_transcripts=False. By default, the data layers have normalization on and will remove non-English characters.
Similarly, if you are training on an English dataset with capital letters or additional punctuation, ensure that the data layer normalizes transcripts to lowercase, or that your custom vocabulary includes the additional valid characters.
If you consistently run into out-of-memory errors while training, consider adding a maximum length to your audio samples using max_duration.
To run evaluation on a trained model checkpoint, use this command:
tao speech_to_text_conformer evaluate -e <experiment_spec> \
-m <model_checkpoint> \
-g <num_gpus>
Required Arguments
-e: The experiment specification file to set up evaluation. This only requires a dataset config, as described in the "Dataset Configs" section.
-m: The path to the model checkpoint, which should be a .tlt file.
Optional Arguments
-g: The number of GPUs to be used for evaluation in a multi-GPU scenario. The default value is 1.
-r: The path to the results and log directory. Log files, checkpoints, etc. will be stored here.
-k: The key to encrypt the model.
Other arguments to override fields in the specification file.
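As with training, fields in the spec file can be overridden on the command line. For example, to point evaluation at a different test manifest without editing the spec file:
tao speech_to_text_conformer evaluate -e <experiment_spec> \
   -m <model_checkpoint> \
   test_ds.manifest_filepath=<test_manifest_filepath>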
Evaluation Procedure
At the start of evaluation, TAO Toolkit will print out a log of the experiment specification, including any parameters added or overridden via the command line. It will also show additional information, such as which GPUs are available, where logs will be saved, and how many hours are in the test dataset.
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2]
[NeMo W 2021-01-20 22:58:19 exp_manager:375] Exp_manager is logging to ./, but it already exists.
[NeMo I 2021-01-20 22:58:19 exp_manager:186] Experiments will be logged at .
...
[NeMo I 2021-01-20 22:58:20 features:235] PADDING: 16
[NeMo I 2021-01-20 22:58:20 features:251] STFT using torch
[NeMo I 2021-01-20 22:58:22 collections:173] Dataset loaded with 130 files totaling 0.10 hours
[NeMo I 2021-01-20 22:58:22 collections:174] 0 files were filtered totaling 0.00 hours
Once evaluation begins, a progress bar will be shown to indicate how many batches have been processed. After evaluation, the test results will be shown.
Testing: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:02<00:00, 2.43it/s]
--------------------------------------------------------------------------------
DATALOADER:0 TEST RESULTS
{'test_loss': tensor(68.1998, device='cuda:0'),
'test_wer': tensor(0.9987, device='cuda:0')}
--------------------------------------------------------------------------------
[NeMo I 2021-01-20 22:58:24 evaluate:94] Experiment logs saved to '.'
INFO: Internal process exited
Troubleshooting
Currently, only .wav audio files are supported.
Filtering should be turned off for evaluation. Make sure that max_duration and min_duration are not set.
For best results, perform evaluation on audio files with the same sample rate as the training data.
The model will only predict characters that were included in the training vocabulary. Make sure that the training and test vocabularies match, including normalization.
To fine-tune a model from a checkpoint, use the following command:
tao speech_to_text_conformer finetune -e <experiment_spec> \
-m <model_checkpoint> \
-g <num_gpus>
Required Arguments
-e: The experiment specification file to set up fine-tuning. This requires the trainer, save_to, and optim configurations described in the "Training Process Configs" section, as well as the finetuning_ds and validation_ds configs described in the "Dataset Configs" section. Additionally, if your fine-tuning dataset has a different vocabulary (i.e. set of labels) than the trained model checkpoint, you must also set change_vocabulary: true at the top level of your specification file.
-m: The path to the model checkpoint from which to fine-tune. The model checkpoint should be a .tlt file.
Optional Arguments
-g: The number of GPUs to use for fine-tuning in a multi-GPU scenario. The default value is 1.
-r: The path to the results and log directory. Log files, checkpoints, etc. will be stored here.
-k: The key to encrypt the model.
Other arguments to override fields in the specification file.
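As with training, fields in the spec file can be overridden on the command line. The following sketch points fine-tuning at new manifests and lowers the learning rate to roughly 1/100 of the value in the training spec above, following the guidance in the Troubleshooting notes below; the exact value to use depends on your setup:
tao speech_to_text_conformer finetune -e <experiment_spec> \
   -m <model_checkpoint> \
   -g <num_gpus> \
   finetuning_ds.manifest_filepath=<finetuning_manifest_filepath> \
   validation_ds.manifest_filepath=<val_manifest_filepath> \
   optim.lr=0.05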
Fine-Tuning Procedure
At the start of fine-tuning, TAO Toolkit will print out a log of the experiment specification, including any parameters added or overridden via the command line. It will also show additional information, such as which GPUs are available, where logs will be saved, and how many hours are in the fine-tuning and evaluation dataset.
When restoring from the checkpoint, it will then log the original datasets that the checkpoint model was trained and evaluated on.
If the vocabulary has been changed, the logs will show what the new vocabulary is.
[NeMo I 2021-01-20 23:33:12 finetune:110] Model restored from './checkpoints/trained-model.tlt'
[NeMo I 2021-01-20 23:33:12 ctc_models:247] Changed decoder to output to [' ', 'а', 'б', 'в', 'г', 'д', 'е', 'ё', 'ж', 'з', 'и', 'й', 'к', 'л', 'м', 'н', 'о', 'п', 'р', 'с', 'т', 'у', 'ф', 'х', 'ц', 'ч', 'ш', 'щ', 'ъ', 'ы', 'ь', 'э', 'ю', 'я'] vocabulary.
[NeMo I 2021-01-20 23:33:12 collections:173] Dataset loaded with 7242 files totaling 11.74 hours
[NeMo I 2021-01-20 23:33:12 collections:174] 0 files were filtered totaling 0.00 hours
[NeMo I 2021-01-20 23:33:12 collections:173] Dataset loaded with 7307 files totaling 12.56 hours
[NeMo I 2021-01-20 23:33:12 collections:174] 0 files were filtered totaling 0.00 hours
As with training, TAO Toolkit will log a full listing of the modules and submodules in the model, as well as the total number of trainable and non-trainable parameters in the model. See the Training section for more details on the parameter breakdowns.
| Name | Type | Params
-------------------------------------------------------------------------------------------
0 | preprocessor | AudioToMelSpectrogramPreprocessor | 0
1 | preprocessor.featurizer | FilterbankFeatures | 0
2 | encoder | ConformerEncoder | 121 M
...
517 | decoder | ConvASRDecoder | 67.2 K
518 | decoder.decoder_layers | Sequential | 67.2K
519 | decoder.decoder_layers.0 | Conv1d | 67.2 K
520 | loss | CTCLoss | 0
-------------------------------------------------------------------------------------------
121 M Trainable params
0 Non-trainable params
121 M Total params
486.009 Total estimated model params size (MB)
Note that if the vocabulary has been changed, the decoder may have a different number of parameters.
Fine-tuning on the new dataset should proceed afterwards as with normal training, with a progress bar per epoch and checkpoints saved to the specified directory.
Troubleshooting
Currently, only .wav audio files are supported.
We recommend using a lower learning rate when fine-tuning from a trained model checkpoint. A good starting point is 1/100 of the original learning rate.
If the fine-tuning vocabulary is different from the original training vocabulary, you will need to set change_vocabulary=True.
You may see a dimensionality mismatch error (example below) or other hyperparameter mismatch errors if your training checkpoint directory (i.e. the directory of the model you are loading with restore_from) and your fine-tuning checkpoint directory are the same. Use the -r flag to ensure they are distinct (e.g. -r <new/log/dir>).
RuntimeError: Error(s) in loading state_dict for EncDecCTCModel:
`size mismatch for decoder.decoder_layers.0.weight: copying a param with shape torch.Size([29, 1024, 1]) from checkpoint, the shape in current model is torch.Size([35, 1024, 1]).`
To perform inference on individual audio files, use the following command:
tao speech_to_text_conformer infer -e <experiment_spec> \
-m <model_checkpoint> \
-g <num_gpus>
Required Arguments
-e: The experiment specification file to set up inference. This spec file only needs a file_paths parameter that contains a list of individual file paths.
-m: The path to the model checkpoint, which should be a .tlt file.
Optional Arguments
-g: The number of GPUs to be used for inference in a multi-GPU scenario. The default value is 1.
-r: The path to the results and log directory. Log files, checkpoints, etc. will be stored here.
-k: The key to encrypt the model.
Other arguments to override fields in the specification file.
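For example, a minimal inference spec file can contain nothing but the list of audio files to transcribe; the paths below match the example log output further down:
file_paths:
  - /data/an4/wav/an4test_clstk/fcaw/an406-fcaw-b.wav
  - /data/an4/wav/an4test_clstk/fcaw/an407-fcaw-b.wav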
Inference Procedure
At the start of inference, TAO Toolkit will print out the experiment specification, including the audio filepaths on which inference will be performed.
When restoring from the checkpoint, it will then log the original datasets that the checkpoint model was trained and evaluated on. This will show the vocabulary that the model was trained on.
Train config :
manifest_filepath: /data/an4/train_manifest.json
batch_size: 32
sample_rate: 16000
labels:
- ' '
- a
- b
- c
...
Prediction results will be shown at the end of the log. Each prediction is preceded by the associated filename on the previous line.
[NeMo I 2021-01-21 00:22:00 infer:67] The prediction results:
[NeMo I 2021-01-21 00:22:00 infer:69] File: /data/an4/wav/an4test_clstk/fcaw/an406-fcaw-b.wav
[NeMo I 2021-01-21 00:22:00 infer:70] Predicted transcript: rubout g m e f three nine
[NeMo I 2021-01-21 00:22:00 infer:69] File: /data/an4/wav/an4test_clstk/fcaw/an407-fcaw-b.wav
[NeMo I 2021-01-21 00:22:00 infer:70] Predicted transcript: erase c q q f seven
[NeMo I 2021-01-21 00:22:00 infer:73] Experiment logs saved to '.'
INFO: Internal process exited
Troubleshooting
Currently, only .wav audio files are supported.
For best results, perform inference on audio files with the same sample rate as the training data.
The model will only predict characters that were included in the training vocabulary. Ensure that the training and test vocabularies match, including normalization.
You can export a trained ASR model to the Riva format, which contains all the model artifacts necessary for deployment to Riva Services. For more details about Riva, see the Riva documentation.
To export an ASR model to the Riva format, use the following command:
tao speech_to_text_conformer export -e <experiment_spec> \
-m <model_checkpoint> \
-r <results_dir> \
-k <encryption_key> \
export_format=RIVA
Required Arguments
-e: The experiment specification file for export. See the Export Spec File section below for more details.
-m: The path to the model checkpoint to be exported. The model checkpoint should be a .tlt file.
Optional Arguments
-k: The encryption key.
-r: The path to the directory where results will be stored.
Export Spec File
The following is an example spec file for model export.
# Name of the .tlt EFF archive to be loaded/model to be exported.
restore_from: trained-model.tlt
# Set export format: RIVA
export_format: RIVA
# Output EFF archive containing model checkpoint and artifacts required for Riva Services
export_to: exported-model.riva
Parameter | Datatype | Description | Default
---|---|---|---
restore_from | string | The path to the pre-trained model to be exported. |
export_format | string | The export format. | N/A
export_to | string | The target path for the exported model. |