Models

This page gives a brief overview of the models that NeMo’s Speech Classification collection currently supports. For Speech Classification, we support Speech Command (Keyword) Detection and Voice Activity Detection (VAD).

Each of these models can be used with the example ASR scripts (in the <NeMo_git_root>/examples/asr directory) by specifying the model architecture in the config file used. Examples of config files for each model can be found in the <NeMo_git_root>/examples/asr/conf directory.

For more information about the config files and how they should be structured, see the NeMo Speech Classification Configuration Files page.

Pretrained checkpoints for all of these models, as well as instructions on how to load them, can be found on the Checkpoints page. You can use the available checkpoints for immediate inference, or fine-tune them on your own datasets. The Checkpoints page also contains benchmark results for the available ASR models.

MatchboxNet (Speech Commands)

MatchboxNet [SC-MODELS2] is an end-to-end neural network for speech command recognition based on QuartzNet.

Similarly to QuartzNet, the MatchboxNet family of models is denoted as MatchboxNet_[BxRxC], where B is the number of blocks, R is the number of convolutional sub-blocks within a block, and C is the number of channels. Each sub-block contains a 1-D separable convolution, batch normalization, ReLU, and dropout:

MatchboxNet model

It can reach state-of-the-art accuracy on the Google Speech Commands dataset while having significantly fewer parameters than similar models. The _v1 and _v2 suffixes denote models trained on the v1 (30-way classification) and v2 (35-way classification) versions of the dataset, while _subset_task denotes the (10+2)-way subset classification task (10 specific classes, plus the remaining classes grouped together, plus silence).
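The naming convention above can be parsed mechanically. As a small illustration (this helper is not part of NeMo):

```python
# Parse a MatchboxNet-style model name of the form <Family>_[BxRxC] into
# its block (B), sub-block (R), and channel (C) counts.
# Illustrative helper only -- not part of the NeMo API.
def parse_model_name(name: str) -> dict:
    family, dims = name.split("_", 1)
    b, r, c = (int(v) for v in dims.split("x"))
    return {"family": family, "blocks": b, "sub_blocks": r, "channels": c}

print(parse_model_name("MatchboxNet_3x1x64"))
# -> {'family': 'MatchboxNet', 'blocks': 3, 'sub_blocks': 1, 'channels': 64}
```

The same convention applies to the MarbleNet names described below, e.g. MarbleNet_3x2x64.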

MatchboxNet models can be instantiated using the EncDecClassificationModel class.
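In practice, the architecture is specified through a config file passed to the example scripts. A simplified, hypothetical sketch of how the B/R/C structure might be expressed (field names are illustrative; the real schema lives in <NeMo_git_root>/examples/asr/conf):

```yaml
# Hypothetical, simplified sketch of a MatchboxNet-style model config.
# See <NeMo_git_root>/examples/asr/conf for the actual schema used by
# EncDecClassificationModel.
model:
  # Hypothetical label set for the (10+2)-way subset task.
  labels: ["yes", "no", "up", "down", "left", "right",
           "on", "off", "stop", "go", "unknown", "silence"]
  encoder:
    jasper:
      # One entry per block (B blocks total); `repeat` gives the number
      # of sub-blocks (R) and `filters` the number of channels (C).
      - filters: 64
        repeat: 1
        kernel: [13]
        separable: true
```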

Note

For model details and a deeper understanding of Speech Command Detection training, inference, fine-tuning, and more, refer to <NeMo_git_root>/tutorials/asr/Speech_Commands.ipynb and <NeMo_git_root>/tutorials/asr/Online_Offline_Speech_Commands_Demo.ipynb.

MarbleNet (VAD)

MarbleNet [SC-MODELS1] is an end-to-end neural network for voice activity detection based on MatchboxNet (Speech Commands).

Similarly to MatchboxNet, the MarbleNet family of models is denoted as MarbleNet_[BxRxC], where B is the number of blocks, R is the number of convolutional sub-blocks within a block, and C is the number of channels. Each sub-block contains a 1-D separable convolution, batch normalization, ReLU, and dropout:

MarbleNet model

It can reach state-of-the-art performance on the difficult AVA speech dataset while having significantly fewer parameters than similar models, even when trained on simple data. MarbleNet models can be instantiated using the EncDecClassificationModel class.

Note

For model details and a deeper understanding of VAD training, inference, postprocessing, threshold tuning, and more, refer to <NeMo_git_root>/tutorials/asr/06_Voice_Activity_Detection.ipynb and <NeMo_git_root>/tutorials/asr/Online_Offline_Microphone_VAD_Demo.ipynb.
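As a rough illustration of the kind of postprocessing those tutorials cover, the sketch below thresholds frame-level speech probabilities and merges consecutive speech frames into segments. This is a minimal, hypothetical example, not NeMo's actual implementation:

```python
# Minimal VAD post-processing sketch (illustrative only, not NeMo's code):
# threshold frame-level speech probabilities, then merge consecutive
# speech frames into (start, end) segments in seconds.
def probs_to_segments(probs, frame_len=0.01, threshold=0.5):
    segments = []
    start = None
    for i, p in enumerate(probs):
        is_speech = p >= threshold
        if is_speech and start is None:
            start = i * frame_len          # segment opens here
        elif not is_speech and start is not None:
            segments.append((start, i * frame_len))  # segment closes
            start = None
    if start is not None:                  # speech runs to the end
        segments.append((start, len(probs) * frame_len))
    return segments

print(probs_to_segments([0.1, 0.9, 0.8, 0.2, 0.7], frame_len=1.0))
# -> [(1.0, 3.0), (4.0, 5.0)]
```

Raising or lowering `threshold` is the simplest form of the threshold tuning discussed in the VAD tutorial.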

References

SC-MODELS1

Fei Jia, Somshubra Majumdar, and Boris Ginsburg. Marblenet: deep 1d time-channel separable convolutional neural network for voice activity detection. arXiv preprint arXiv:2010.13886, 2020.

SC-MODELS2

Somshubra Majumdar and Boris Ginsburg. MatchboxNet: 1d time-channel separable convolutional neural network architecture for speech commands recognition. Proc. Interspeech 2020, 2020.