Speech Recognition

Conformer-CTC

The Conformer-CTC model is a non-autoregressive variant of the Conformer model for automatic speech recognition (ASR) that uses CTC loss and decoding instead of a Transducer. For more information, refer to Conformer-CTC Model.
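To make CTC decoding concrete, here is a minimal sketch of greedy (best-path) CTC decoding in plain Python. It is not the decoder that Riva ships; the vocabulary, the `_` blank symbol, and the per-frame scores are hypothetical values chosen for illustration.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol used only in this sketch


def ctc_greedy_decode(frame_scores, vocab, blank=BLANK):
    """Greedy CTC decoding: best token per frame, merge repeats, drop blanks."""
    # 1. Pick the highest-scoring token independently for every frame.
    best = [vocab[max(range(len(row)), key=row.__getitem__)] for row in frame_scores]
    # 2. Collapse runs of the same token into a single token.
    collapsed = [token for token, _ in groupby(best)]
    # 3. Remove blank tokens, which only separate repeated characters.
    return "".join(token for token in collapsed if token != blank)


vocab = ["_", "c", "a", "t"]
frame_scores = [
    [0.1, 0.7, 0.1, 0.1],    # c
    [0.1, 0.6, 0.2, 0.1],    # c (repeat, merged)
    [0.8, 0.1, 0.05, 0.05],  # blank
    [0.1, 0.1, 0.7, 0.1],    # a
    [0.1, 0.1, 0.1, 0.7],    # t
    [0.1, 0.1, 0.1, 0.7],    # t (repeat, merged)
]
print(ctc_greedy_decode(frame_scores, vocab))  # prints "cat"
```

Because every frame is labeled independently, the whole output can be produced in one pass, which is what makes the model non-autoregressive.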

The Conformer-CTC models provided with Riva are the large variant (around 120M parameters), trained on large proprietary datasets.

Citrinet

Citrinet is the recommended new end-to-end convolutional Connectionist Temporal Classification (CTC) based automatic speech recognition (ASR) model. It is a deep residual neural model that combines 1D time-channel separable convolutions with subword encoding and squeeze-and-excitation. The resulting architecture significantly narrows the accuracy gap between non-autoregressive models and sequence-to-sequence and transducer models.
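As an illustration of the squeeze-and-excitation idea mentioned above, the following is a minimal pure-Python sketch: each channel is averaged over time, passed through a small bottleneck, and the resulting gate rescales that channel. The function name, the weight shapes, and the toy values are assumptions made for this example, bias terms are omitted, and it is not taken from the Citrinet implementation.

```python
import math


def squeeze_excite(x, w_reduce, w_expand):
    """Rescale each channel of x by a gate computed from global context.

    x        : list of C channels, each a list of T time steps
    w_reduce : H x C weight matrix (bottleneck layer)
    w_expand : C x H weight matrix (back to one gate per channel)
    """
    # Squeeze: average every channel over the time axis.
    pooled = [sum(channel) / len(channel) for channel in x]
    # Excite: bottleneck + ReLU, then expand + sigmoid to get gates in (0, 1).
    hidden = [max(0.0, sum(w * p for w, p in zip(row, pooled))) for row in w_reduce]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w_expand]
    # Scale: multiply every time step of a channel by that channel's gate.
    return [[g * v for v in channel] for g, channel in zip(gates, x)]


# Toy input: 2 channels, 3 time steps, hypothetical weights.
x = [[1.0, 2.0, 3.0], [4.0, 4.0, 4.0]]
w_reduce = [[0.5, 0.5]]    # 1 hidden unit
w_expand = [[0.0], [0.0]]  # zero weights, so every gate is sigmoid(0) = 0.5
print(squeeze_excite(x, w_reduce, w_expand))  # [[0.5, 1.0, 1.5], [2.0, 2.0, 2.0]]
```

The gate depends on the whole time axis, so a purely convolutional block gains access to context beyond its local receptive field.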

Details on the model architecture can be found in the paper Citrinet: Closing the Gap between Non-Autoregressive and Autoregressive End-to-End Models for Automatic Speech Recognition.

The models provided with Riva are the larger Citrinet-1024 variant (around 142M parameters) and the smaller Citrinet-256 variant (around 9.8M parameters). Both variants are trained on large proprietary datasets.

Because of its lower resource usage, Citrinet-256 is the preferred model for deployment on embedded platforms.

Jasper

Caution

Jasper is no longer recommended for new deployments.

Jasper (“Just Another SPEech Recognizer”) is an end-to-end neural acoustic model for ASR that provides near state-of-the-art results on LibriSpeech among end-to-end ASR models without any external data. The Jasper architecture of convolutional layers was designed to facilitate fast GPU inference by allowing whole sub-blocks to be fused into a single GPU kernel. This is important for meeting the strict real-time requirements of ASR systems in deployment.

During a post-processing step called decoding, the outputs of the acoustic model are combined with the outputs of external language models to produce the top-ranked word sequences for a given audio segment.
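One common way to combine the two score sources is a weighted sum of log-probabilities, often called shallow fusion. The sketch below ranks candidate transcripts that way. It is an assumption about how such a combination can look, not a description of the Riva decoder; the `Hypothesis` type, the `lm_weight` and `word_bonus` parameters, and all scores are hypothetical.

```python
from typing import List, NamedTuple


class Hypothesis(NamedTuple):
    text: str
    am_logprob: float  # score from the acoustic model (hypothetical value)
    lm_logprob: float  # score from the external language model (hypothetical value)


def rank_hypotheses(hyps: List[Hypothesis], lm_weight: float = 0.5,
                    word_bonus: float = 0.0) -> List[Hypothesis]:
    """Sort hypotheses by acoustic score + weighted LM score + per-word bonus."""
    def combined(h: Hypothesis) -> float:
        return h.am_logprob + lm_weight * h.lm_logprob + word_bonus * len(h.text.split())
    return sorted(hyps, key=combined, reverse=True)


hyps = [
    Hypothesis("wreck a nice beach", am_logprob=-3.8, lm_logprob=-9.0),
    Hypothesis("recognize speech", am_logprob=-4.0, lm_logprob=-3.0),
]

# Acoustic score alone slightly prefers the first candidate ...
print(rank_hypotheses(hyps, lm_weight=0.0)[0].text)  # wreck a nice beach
# ... but the language model pushes the more plausible sentence to the top.
print(rank_hypotheses(hyps, lm_weight=0.5)[0].text)  # recognize speech
```

In practice the weights are tuned on held-out data, and the candidate list usually comes from a beam search rather than a fixed list.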

Details on the model architecture can be found in the paper Jasper: An End-to-End Convolutional Neural Acoustic Model.

QuartzNet

Caution

QuartzNet is no longer recommended for new deployments.

QuartzNet is the next generation of the Jasper model architecture, with separable convolutions and larger filters. It can achieve accuracy similar to Jasper with an order of magnitude fewer parameters. As with Jasper, QuartzNet model families are denoted QuartzNet_[BxR], where B is the number of blocks and R is the number of convolutional sub-blocks within a block. Each sub-block contains a 1D separable convolution, batch normalization, ReLU, and dropout.
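The parameter savings come largely from the separable convolution itself. As a rough sketch, the snippet below compares the weight count of a standard 1D convolution with that of a time-channel separable one (a depthwise convolution over time followed by a pointwise 1x1 convolution over channels). The channel count and kernel size are illustrative assumptions, and biases and batch-norm parameters are ignored.

```python
def standard_conv1d_weights(in_channels: int, out_channels: int, kernel: int) -> int:
    """Every output channel has one kernel per input channel."""
    return kernel * in_channels * out_channels


def separable_conv1d_weights(in_channels: int, out_channels: int, kernel: int) -> int:
    """Depthwise (one kernel per input channel) plus pointwise 1x1 mixing."""
    depthwise = kernel * in_channels
    pointwise = in_channels * out_channels
    return depthwise + pointwise


# Illustrative layer: 256 channels in and out, kernel width 33.
standard = standard_conv1d_weights(256, 256, 33)
separable = separable_conv1d_weights(256, 256, 33)
print(standard)   # 2162688
print(separable)  # 73984
print(f"about {standard / separable:.0f}x fewer weights")  # about 29x fewer weights
```

The gap widens as the kernel gets wider, which is why a separable design can afford larger filters.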

Details on the model architecture can be found in the paper QuartzNet: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions.
