Models

This page gives a brief overview of the models that NeMo’s ASR collection currently supports.

Each of these models can be used with the example ASR scripts (in the <NeMo_git_root>/examples/asr directory) by specifying the model architecture in the config file used. Examples of config files for each model can be found in the <NeMo_git_root>/examples/asr/conf directory.

For more information about the config files and how they should be structured, see the NeMo ASR Configuration Files page.

Pretrained checkpoints for all of these models, as well as instructions on how to load them, can be found on the Checkpoints page. You can use the available checkpoints for immediate inference, or fine-tune them on your own datasets. The Checkpoints page also contains benchmark results for the available ASR models.

Jasper

Jasper (“Just Another SPeech Recognizer”) [ASR-MODELS5] is a deep time-delay neural network (TDNN) composed of blocks of 1D-convolutional layers. Jasper models are denoted Jasper_[BxR], where B is the number of blocks and R is the number of convolutional sub-blocks within a block. Each sub-block contains a 1-D convolution, batch normalization, ReLU, and dropout:

Jasper model

Jasper models can be instantiated using the EncDecCTCModel class.
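As the class name suggests, EncDecCTCModel produces per-frame token probabilities that are decoded non-autoregressively with CTC. A minimal sketch of greedy CTC decoding (take the argmax token per frame, collapse consecutive repeats, drop blanks) is shown below; the token ids and the choice of 0 as the blank id are illustrative assumptions, not NeMo's actual vocabulary.

```python
def ctc_greedy_decode(frame_ids, blank_id=0):
    """Collapse consecutive repeated ids, then drop blanks.

    frame_ids: per-frame argmax token ids; blank_id is assumed to be 0 here.
    """
    decoded = []
    prev = None
    for t in frame_ids:
        # Keep a token only when it differs from the previous frame
        # (collapse repeats) and is not the CTC blank symbol.
        if t != prev and t != blank_id:
            decoded.append(t)
        prev = t
    return decoded

# Hypothetical per-frame argmax ids for ten audio frames ("h h _ e _ l l _ l o").
frames = [8, 8, 0, 5, 0, 12, 12, 0, 12, 15]
print(ctc_greedy_decode(frames))  # [8, 5, 12, 12, 15]
```

Note that the double "l" survives decoding only because a blank frame separates the two runs of 12s; this is the standard CTC collapsing rule.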

QuartzNet

QuartzNet [ASR-MODELS4] is a version of the Jasper [ASR-MODELS5] model with separable convolutions and larger filters. It achieves performance similar to Jasper's with an order of magnitude fewer parameters. Like Jasper, QuartzNet models are denoted QuartzNet_[BxR], where B is the number of blocks and R is the number of convolutional sub-blocks within a block. Each sub-block contains a 1-D separable convolution, batch normalization, ReLU, and dropout:

QuartzNet model

QuartzNet models can be instantiated using the EncDecCTCModel class.
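The parameter savings from time-channel separable convolutions follow from simple arithmetic: a regular 1-D convolution with kernel size K, C_in input channels, and C_out output channels has K·C_in·C_out weights, while a separable convolution splits this into a depthwise pass (K·C_in) plus a pointwise 1x1 pass (C_in·C_out). The sketch below uses illustrative kernel and channel sizes, not values from a real QuartzNet config.

```python
def conv1d_params(k, c_in, c_out):
    # Weights of a regular 1-D convolution (bias omitted).
    return k * c_in * c_out

def separable_conv1d_params(k, c_in, c_out):
    # Depthwise conv over time (k * c_in) + pointwise 1x1 conv (c_in * c_out).
    return k * c_in + c_in * c_out

# Illustrative sizes: a wide kernel and 512 channels in and out.
k, c_in, c_out = 33, 512, 512
regular = conv1d_params(k, c_in, c_out)              # 8,650,752
separable = separable_conv1d_params(k, c_in, c_out)  # 279,040
print(f"regular: {regular:,}  separable: {separable:,}  "
      f"ratio: {regular / separable:.1f}x")
```

With these sizes the separable layer is roughly 31x smaller, which is the mechanism behind QuartzNet's "order of magnitude fewer parameters" than Jasper.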

Citrinet

Citrinet is a version of QuartzNet [ASR-MODELS4] that extends ContextNet [ASR-MODELS2]. It uses subword encoding (via Word Piece tokenization) and the Squeeze-and-Excitation mechanism [ASR-MODELS3] to obtain highly accurate audio transcripts, while retaining a non-autoregressive, CTC-based decoding scheme for efficient inference.

Citrinet model

Citrinet models can be instantiated using the EncDecCTCModelBPE class.
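The Squeeze-and-Excitation step mentioned above can be sketched in a few lines: each channel is "squeezed" to a single statistic by global average pooling over time, a small bottleneck network "excites" a sigmoid gate per channel, and the gates rescale the original activations. The toy shapes and hand-set weights below are purely illustrative; in the real module the two small fully-connected layers are learned.

```python
import math

def squeeze_excite(x, w1, w2):
    """x: list of channels, each a list of time-step values.

    w1: bottleneck weights (reduced_dim x channels), w2: expansion
    weights (channels x reduced_dim). Biases omitted for brevity.
    """
    # Squeeze: global average pool each channel over the time axis.
    s = [sum(ch) / len(ch) for ch in x]
    # Excite: bottleneck FC -> ReLU -> expansion FC -> sigmoid gates.
    hidden = [max(0.0, sum(w * v for w, v in zip(row, s))) for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]
    # Scale: reweight every time step of each channel by its gate.
    return [[g * v for v in ch] for g, ch in zip(gates, x)]

# Two channels, two time steps each; weights chosen so channel 0 is
# amplified and channel 1 suppressed (hypothetical values).
x = [[1.0, 3.0], [2.0, 2.0]]
w1 = [[0.5, 0.5]]        # squeeze 2 channels down to 1 bottleneck unit
w2 = [[1.0], [-1.0]]     # expand back to one gate per channel
print(squeeze_excite(x, w1, w2))
```

The effect is a cheap form of global context: the gate for each channel depends on a pooled summary of the whole utterance, not just a local window.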

Conformer-CTC

Conformer-CTC is a CTC-based variant of the Conformer model introduced in [ASR-MODELS1]. Conformer-CTC has a similar encoder to the original Conformer but uses CTC loss and decoding instead of the RNNT loss, which makes it a non-autoregressive model. It also drops the LSTM decoder and instead uses a linear decoder on top of the encoder. The model combines self-attention and convolution modules to get the best of both approaches: the self-attention layers learn global interactions, while the convolutions efficiently capture local correlations. The self-attention modules support both regular self-attention with absolute positional encoding and Transformer-XL-style self-attention with relative positional encoding.

Here is the overall architecture of the encoder of Conformer-CTC:

Conformer-CTC Model

This model supports both sub-word-level and character-level encodings. You can find more detail on the config files for Conformer-CTC models in the Conformer-CTC section of the NeMo ASR Configuration Files page. The variant with sub-word encoding is a BPE-based model that can be instantiated using the EncDecCTCModelBPE class, while the character-based variant is based on EncDecCTCModel.
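The ordering of modules inside one Conformer encoder block can be sketched structurally as follows. This is a simplified composition sketch only: scalars stand in for tensors, the stub modules replace the real learned layers, and the per-module layer norms inside each residual branch are omitted.

```python
def conformer_block(x, ff1, mhsa, conv, ff2, norm):
    """Macaron-style Conformer block: FF/2 -> MHSA -> Conv -> FF/2 -> norm."""
    x = x + 0.5 * ff1(x)   # first feed-forward module, half-step residual
    x = x + mhsa(x)        # self-attention (absolute or relative pos. enc.)
    x = x + conv(x)        # convolution module for local correlations
    x = x + 0.5 * ff2(x)   # second feed-forward module, half-step residual
    return norm(x)         # final layer norm

# Identity stubs so the composition itself can be executed end to end.
identity = lambda v: v
out = conformer_block(1.0, identity, identity, identity, identity, identity)
print(out)  # 9.0: 1.0 -> 1.5 -> 3.0 -> 6.0 -> 9.0
```

The two half-step feed-forward modules sandwiching the attention and convolution modules are the "macaron" structure described in the Conformer paper.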

References

ASR-MODELS1

Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and others. Conformer: convolution-augmented transformer for speech recognition. arXiv preprint arXiv:2005.08100, 2020.

ASR-MODELS2

Wei Han, Zhengdong Zhang, Yu Zhang, Jiahui Yu, Chung-Cheng Chiu, James Qin, Anmol Gulati, Ruoming Pang, and Yonghui Wu. Contextnet: improving convolutional neural networks for automatic speech recognition with global context. arXiv preprint arXiv:2005.03191, 2020.

ASR-MODELS3

Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In CVPR, 2018.

ASR-MODELS4

Samuel Kriman, Stanislav Beliaev, Boris Ginsburg, Jocelyn Huang, Oleksii Kuchaiev, Vitaly Lavrukhin, Ryan Leary, Jason Li, and Yang Zhang. Quartznet: Deep automatic speech recognition with 1d time-channel separable convolutions. arXiv preprint arXiv:1910.10261, 2019.

ASR-MODELS5

Jason Li, Vitaly Lavrukhin, Boris Ginsburg, Ryan Leary, Oleksii Kuchaiev, Jonathan M Cohen, Huyen Nguyen, and Ravi Teja Gadde. Jasper: an end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288, 2019.