NeMo Speech Intent Classification and Slot Filling Configuration Files#
This page covers NeMo configuration file setup that is specific to models in the Speech Intent Classification and Slot Filling collection. For general information about how to set up and run experiments that is common to all NeMo models (e.g. experiment manager and PyTorch Lightning trainer parameters), see the NeMo Models page.
Dataset Configuration#
Dataset configuration for Speech Intent Classification and Slot Filling models is mostly the same as for standard ASR training, covered here. One exception is that use_start_end_token must be set to true.
An example train and validation configuration looks like the following:
model:
  train_ds:
    manifest_filepath: ???
    sample_rate: ${model.sample_rate}
    batch_size: 16 # you may increase batch_size if your memory allows
    shuffle: true
    num_workers: 8
    pin_memory: false
    use_start_end_token: true
    trim_silence: false
    max_duration: 11.0
    min_duration: 0.0
    # tarred datasets
    is_tarred: false
    tarred_audio_filepaths: null
    shuffle_n: 2048
    # bucketing params
    bucketing_strategy: "synced_randomized"
    bucketing_batch_size: null

  validation_ds:
    manifest_filepath: ???
    sample_rate: ${model.sample_rate}
    batch_size: 16 # you may increase batch_size if your memory allows
    shuffle: false
    num_workers: 8
    pin_memory: true
    use_start_end_token: true
    min_duration: 8.0
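If your data is packaged as tarred audio shards, the tarred-dataset fields shown above are the ones to change. The sketch below is illustrative only: the paths are placeholders, and the shard pattern uses the _OP_ and _CL_ escape for braces so that Hydra does not interpret them (see the NeMo ASR dataset documentation for details):

  train_ds:
    manifest_filepath: /path/to/tarred_audio_manifest.json # placeholder path
    is_tarred: true
    # placeholder shard pattern; _OP_ and _CL_ stand in for '{' and '}'
    tarred_audio_filepaths: /path/to/audio__OP_0..511_CL_.tar
    shuffle_n: 2048 # shuffle buffer size used for tarred data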
Preprocessor Configuration#
The preprocessor computes MFCC or mel-spectrogram features that are given as inputs to the model. For details on how to write this section, refer to Preprocessor Configuration
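For reference, a typical preprocessor block for this collection uses the mel-spectrogram front end; the values below are the usual Conformer defaults and are meant as a sketch rather than required settings:

preprocessor:
  _target_: nemo.collections.asr.modules.AudioToMelSpectrogramPreprocessor
  sample_rate: ${model.sample_rate}
  normalize: per_feature
  window_size: 0.025 # 25 ms analysis window
  window_stride: 0.01 # 10 ms hop
  window: hann
  features: 80 # number of mel bins, referenced by ${model.preprocessor.features}
  n_fft: 512
  frame_splicing: 1
  dither: 0.00001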
Augmentation Configurations#
There are a few on-the-fly spectrogram augmentation options for NeMo ASR, which can be specified in the configuration file using the augmentor and spec_augment sections.
For details on how to write these sections, refer to Augmentation Configuration
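As an illustration, a typical spec_augment block (the mask counts and widths below are common Conformer defaults, shown only as a sketch) looks like:

spec_augment:
  _target_: nemo.collections.asr.modules.SpectrogramAugmentation
  freq_masks: 2 # number of frequency masks
  time_masks: 10 # number of time masks
  freq_width: 27 # maximum width of each frequency mask
  time_width: 0.05 # maximum width of each time mask, as a fraction of utterance length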
Model Architecture Configurations#
The encoder of the model is a Conformer-large model without the text decoder, and it can be initialized with pretrained checkpoints. The decoder is a Transformer model, with additional embedding and classifier modules.
An example configuration for the model is:
pretrained_encoder:
  name: stt_en_conformer_ctc_large # which model to use to initialize the encoder; set to null if not using any. Only used to initialize training, not used when resuming from a checkpoint.
  freeze: false # whether to freeze the encoder during training.

model:
  sample_rate: 16000

  encoder:
    _target_: nemo.collections.asr.modules.ConformerEncoder
    feat_in: ${model.preprocessor.features}
    feat_out: -1 # you may set it if you need a different output size than the default d_model
    n_layers: 17 # SSL conformer-large has only 17 layers
    d_model: 512

    # Sub-sampling params
    subsampling: striding # vggnet or striding, vggnet may give better results but needs more memory
    subsampling_factor: 4 # must be power of 2
    subsampling_conv_channels: -1 # -1 sets it to d_model

    # Reduction parameters: Can be used to add another subsampling layer at a given position.
    # Having a 2x reduction will speed up training and inference while keeping a similar WER.
    # Adding it at the end will give the best WER while adding it at the beginning will give the best speedup.
    reduction: null # pooling, striding, or null
    reduction_position: null # Encoder block index or -1 for subsampling at the end of encoder
    reduction_factor: 1

    # Feed forward module's params
    ff_expansion_factor: 4

    # Multi-headed Attention Module's params
    self_attention_model: rel_pos # rel_pos or abs_pos
    n_heads: 8 # may need to be lower for smaller d_models
    # [left, right] specifies the number of steps to be seen from left and right of each step in self-attention
    att_context_size: [-1, -1] # -1 means unlimited context
    xscaling: true # scales up the input embeddings by sqrt(d_model)
    untie_biases: true # unties the biases of the TransformerXL layers
    pos_emb_max_len: 5000

    # Convolution module's params
    conv_kernel_size: 31
    conv_norm_type: 'batch_norm' # batch_norm or layer_norm

    ### regularization
    dropout: 0.1 # The dropout used in most of the Conformer Modules
    dropout_pre_encoder: 0.1 # The dropout used before the encoder
    dropout_emb: 0.0 # The dropout used for embeddings
    dropout_att: 0.1 # The dropout for multi-headed attention modules

  embedding:
    _target_: nemo.collections.asr.modules.transformer.TransformerEmbedding
    vocab_size: -1
    hidden_size: ${model.encoder.d_model}
    max_sequence_length: 512
    num_token_types: 1
    embedding_dropout: 0.0
    learn_positional_encodings: false

  decoder:
    _target_: nemo.collections.asr.modules.transformer.TransformerDecoder
    num_layers: 3
    hidden_size: ${model.encoder.d_model}
    inner_size: 2048
    num_attention_heads: 8
    attn_score_dropout: 0.0
    attn_layer_dropout: 0.0
    ffn_dropout: 0.0

  classifier:
    _target_: nemo.collections.common.parts.MultiLayerPerceptron
    hidden_size: ${model.encoder.d_model}
    num_classes: -1
    num_layers: 1
    activation: 'relu'
    log_softmax: true
Loss Configurations#
By default, the loss function is the negative log-likelihood loss; optional label smoothing can be applied with the following config (the default smoothing value is 0.0):
loss:
  label_smoothing: 0.0
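As a reminder of what this parameter does: with a smoothing factor ε, the one-hot target for the ground-truth token y is replaced by a smoothed distribution over the vocabulary of size V, and the loss is the negative log-likelihood against that distribution. One common formulation (the exact normalization used by the implementation may differ slightly) is

  \tilde{p}(k) = (1-\epsilon)\,\mathbb{1}[k=y] + \frac{\epsilon}{V-1}\,\mathbb{1}[k \neq y]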
Inference Configurations#
During inference, three sequence generation strategies can be applied: greedy search, beam search, and top-k sampling.
sequence_generator:
  type: greedy # choices=[greedy, topk, beam]
  max_sequence_length: ${model.embedding.max_sequence_length}
  temperature: 1.0 # for top-k sampling
  beam_size: 1 # K for top-k sampling, N for beam search
  len_pen: 0 # for beam-search
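For example, to switch to beam search you would change the generator type and use a non-trivial beam size and length penalty; the values below are illustrative starting points, not tuned recommendations:

sequence_generator:
  type: beam
  max_sequence_length: ${model.embedding.max_sequence_length}
  beam_size: 4 # N for beam search
  len_pen: 1.0 # length penalty applied during beam search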