Important
NeMo 2.0 is an experimental feature and currently released in the dev container only: nvcr.io/nvidia/nemo:dev. Please refer to the Migration Guide for information on getting started.
NeMo Speech Classification Configuration Files
This page covers NeMo configuration file setup that is specific to models in the Speech Classification collection. For general information about how to set up and run experiments that is common to all NeMo models (e.g. experiment manager and PyTorch Lightning trainer parameters), see the NeMo Models page.
The model section of NeMo Speech Classification configuration files will generally require information about the dataset(s) being used, the preprocessor for audio files, parameters for any augmentation being performed, as well as the model architecture specification. The sections on this page cover each of these in more detail.
Example configuration files for all of the NeMo ASR scripts can be found in the
<NeMo_git_root>/examples/asr/conf>
.
Dataset Configuration
Training, validation, and test parameters are specified using the train_ds
, validation_ds
, and
test_ds
sections of your configuration file, respectively.
Depending on the task, you may have arguments specifying the sample rate of your audio files, labels, whether or not to shuffle the dataset, and so on.
You may also decide to leave fields such as the manifest_filepath
blank, to be specified via the command line
at runtime.
Any initialization parameters that are accepted for the Dataset class used in your experiment can be set in the config file. See the Datasets section of the API for a list of Datasets and their respective parameters.
An example Speech Classification train and validation configuration could look like:
model:
sample_rate: 16000
repeat: 2 # number of convolutional sub-blocks within a block, R in <MODEL>_[BxRxC]
dropout: 0.0
kernel_size_factor: 1.0
labels: ['bed', 'bird', 'cat', 'dog', 'down', 'eight', 'five', 'four', 'go', 'happy', 'house', 'left', 'marvin',
'nine', 'no', 'off', 'on', 'one', 'right', 'seven', 'sheila', 'six', 'stop', 'three', 'tree', 'two', 'up',
'wow', 'yes', 'zero']
train_ds:
manifest_filepath: ???
sample_rate: ${model.sample_rate}
labels: ${model.labels} # Uses the labels above
batch_size: 128
shuffle: True
validation_ds:
manifest_filepath: ???
sample_rate: ${model.sample_rate}
labels: ${model.labels} # Uses the labels above
batch_size: 128
shuffle: False # No need to shuffle the validation data
If you would like to use tarred dataset, have a look at Datasets Configuration.
Preprocessor Configuration
Preprocessor helps to compute MFCC or mel spectrogram features that are given as inputs to model. For details on how to write this section, refer to Preprocessor Configuration
Check config yaml files in <NeMo_git_root>/examples/asr/conf
to find the processors been used by speech classification models.
Augmentation Configurations
There are a few on-the-fly spectrogram augmentation options for NeMo ASR, which can be specified by the
configuration file using the augmentor
and spec_augment
section.
For details on how to write this section, refer to Augmentation Configuration
Check config yaml files in <NeMo_git_root>/tutorials/asr/conf
to find the processors been used by speech classification models.
Model Architecture Configurations
Each configuration file should describe the model architecture being used for the experiment.
Models in the NeMo ASR collection need a encoder
section and a decoder
section, with the _target_
field
specifying the module to use for each.
The following sections go into more detail about the specific configurations of each model architecture.
The MatchboxNet and MarbleNet models are very similar, and they are based on QuartzNet and as such the components in their configs are very similar as well.
Decoder Configurations
After features have been computed from ConvASREncoder, we pass the features to decoder to compute embeddings and then to compute log_probs for training models.
model:
...
decoder:
_target_: nemo.collections.asr.modules.ConvASRDecoderClassification
feat_in: *enc_final_filters
return_logits: true # return logits if true, else return softmax output
pooling_type: 'avg' # AdaptiveAvgPool1d 'avg' or AdaptiveMaxPool1d 'max'
Fine-tuning Execution Flow Diagram
When preparing your own training or fine-tuning scripts, please follow the execution flow diagram order for correct inference.
Depending on the type of model, there may be extra steps that must be performed -
Speech Classification models - Examples directory for Classification Models