
Models#

There are two main approaches to Speech Intent Classification and Slot Filling: an End-to-End (E2E) model that predicts semantics directly from audio, or a cascading model composed of an ASR model followed by an NLU model. E2E methods are generally preferred over cascading models, since they avoid propagating ASR errors into the NLU stage and thus achieve better performance.

Our E2E model in NeMo is based on an encoder-decoder framework: a Conformer-large module serves as the encoder to extract acoustic features, and a Transformer decoder is applied on top of these features to predict the semantics.

[Figure: Speech Intent Classification and Slot Filling model framework]
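The following is a minimal, illustrative sketch of this wiring in plain PyTorch; it is not the actual NeMo implementation, and the module and parameter names (`SpeechIntentSlotModel`, `d_model`, the `nn.Identity` encoder stand-in) are assumptions chosen for brevity:

```python
import torch
import torch.nn as nn

class SpeechIntentSlotModel(nn.Module):
    """Conceptual encoder-decoder sketch: Conformer-style encoder + Transformer decoder."""

    def __init__(self, vocab_size, d_model=512, num_decoder_layers=6):
        super().__init__()
        # Stand-in for a Conformer-large encoder. In practice the encoder maps
        # audio features (B, T_audio, feat) to acoustic embeddings (B, T_enc, d_model);
        # here we assume the input already has width d_model so the sketch runs.
        self.encoder = nn.Identity()
        self.embedding = nn.Embedding(vocab_size, d_model)
        decoder_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_decoder_layers)
        self.output_proj = nn.Linear(d_model, vocab_size)

    def forward(self, audio_features, target_tokens):
        # Encode audio into a sequence of acoustic representations.
        memory = self.encoder(audio_features)          # (B, T_enc, d_model)
        # Causal mask so each decoder position attends only to earlier tokens.
        T = target_tokens.size(1)
        causal_mask = nn.Transformer.generate_square_subsequent_mask(T)
        tgt = self.embedding(target_tokens)            # (B, T, d_model)
        decoded = self.decoder(tgt, memory, tgt_mask=causal_mask)
        return self.output_proj(decoded)               # (B, T, vocab_size)
```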

The output is a Python dictionary flattened into its string representation, so that the task can be formulated as a sequence-to-sequence (audio-to-text) problem.
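For example, a SLURP-style annotation for a spoken utterance might be serialized as below; the exact field names and values are dataset-specific and shown here only for illustration:

```python
# Illustrative target for the utterance "wake me up at five am this week".
# The decoder is trained to generate this string token by token.
target_text = str({
    "scenario": "alarm",
    "action": "set",
    "entities": [
        {"type": "time", "filler": "five am"},
        {"type": "date", "filler": "this week"},
    ],
})
# At inference time, the predicted string is parsed back into a dictionary
# to recover the intent (scenario/action) and the slots (entities).
```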

The model is trained with a Negative Log-Likelihood (NLL) loss using teacher forcing.
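Concretely, teacher forcing conditions each prediction on the ground-truth prefix (the target sequence shifted right) rather than on the model's own previous outputs, and the per-token NLL of a one-hot target reduces to cross-entropy. A minimal sketch, assuming the hypothetical model interface from the snippet above and targets that already include BOS/EOS tokens:

```python
import torch.nn.functional as F

def nll_teacher_forcing_loss(model, audio_features, target_tokens, pad_id=0):
    # Teacher forcing: the decoder sees the ground-truth prefix at each step.
    decoder_input = target_tokens[:, :-1]          # <bos> t1 ... t_{n-1}
    labels = target_tokens[:, 1:]                  # t1 ... t_n <eos>
    logits = model(audio_features, decoder_input)  # (B, T, vocab_size)
    # Cross-entropy is the NLL of the correct token under the softmax output.
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        ignore_index=pad_id,  # do not penalize padding positions
    )
```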