Speech Recognition#


Parakeet#

Recommended | Streaming | Offline

The Parakeet model is based on the Fast Conformer architecture for Automatic Speech Recognition (ASR), an optimized version of the Conformer model with 8x depthwise-separable convolutional downsampling, a modified convolution kernel size, and an efficient subsampling module. The model is trained end-to-end using the Connectionist Temporal Classification (CTC) decoder. For more information, refer to Fast-Conformer-CTC Model.
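Riva performs decoding internally, but the core CTC greedy-decoding rule — collapse consecutive repeated symbols, then drop blanks — can be sketched in a few lines. This is a simplified illustration of the general technique, not Riva's implementation:

```python
def ctc_greedy_decode(frame_ids, blank_id=0):
    """Apply the standard CTC collapse rule to a per-frame argmax sequence:
    merge consecutive repeats, then remove blank symbols."""
    decoded = []
    prev = None
    for symbol in frame_ids:
        if symbol != prev and symbol != blank_id:
            decoded.append(symbol)
        prev = symbol
    return decoded
```

With a toy vocabulary `["<blank>", "c", "a", "t"]`, the per-frame sequence `[1, 1, 0, 2, 2, 0, 3]` collapses to the ids `[1, 2, 3]`, i.e. "cat"; note that a blank between two identical symbols (as in `[1, 1, 0, 1]`) keeps them as two separate outputs.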

The model provided with Riva is an xl-size (around 600M parameters) version of Fast-Conformer-CTC trained on large proprietary datasets.


Conformer-CTC#

Recommended | Streaming | Offline

The Conformer-CTC model is a non-autoregressive variant of the Conformer model for Automatic Speech Recognition (ASR) that uses CTC loss/decoding instead of Transducer. For more information, refer to Conformer-CTC Model.

The models provided with Riva are a large-size (around 120M parameters) and an xl-size (around 600M parameters) version of Conformer-CTC, both trained on large proprietary datasets.

Beyond the base model, Conformer-CTC has two additional variants that differ in their training recipes: Unified Conformer-CTC and Multilingual Code Switch Conformer-CTC. The Unified variant transcribes speech along with punctuation, whereas the Multilingual Code Switch variant can transcribe two or more languages.


Citrinet#

Streaming | Offline


Citrinet is no longer recommended for new deployments.

Citrinet is an end-to-end convolutional Connectionist Temporal Classification (CTC) based ASR model. Citrinet is a deep residual neural network that uses 1D time-channel separable convolutions combined with subword encoding and squeeze-and-excitation. The resulting architecture significantly reduces the gap between non-autoregressive models and sequence-to-sequence and transducer models.

Details on the model architecture can be found in the Citrinet: Closing the Gap between Non-Autoregressive and Autoregressive End-to-End Models for Automatic Speech Recognition paper.

The models provided with Riva are larger variant Citrinet-1024 (around 142M parameters) and smaller variant Citrinet-256 (around 9.8M parameters). Both variants are trained on large proprietary datasets.

Due to its lower resource requirements, Citrinet-256 is the preferred model for deployment on embedded platforms.


Jasper#

Streaming | Offline


Jasper is no longer recommended for new deployments.

Jasper (“Just Another SPEech Recognizer”) is an end-to-end neural acoustic model for ASR that provides near state-of-the-art results on LibriSpeech among end-to-end ASR models without any external data. The Jasper architecture of convolutional layers was designed to facilitate fast GPU inference by allowing whole sub-blocks to be fused into a single GPU kernel. This is important for meeting the strict real-time requirements of ASR systems in deployment.

During a post-processing step called decoding, the output of the acoustic model is combined with the scores of external language models to produce the top-ranked word sequences for a given audio segment.
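As a simplified illustration of this step, the sketch below rescores an n-best list by shallow fusion: each hypothesis' acoustic log-probability is combined with a weighted language-model log-probability and a word-insertion bonus. The function name, weights, and scores here are illustrative assumptions, not Riva's actual decoder API:

```python
def rescore(nbest, lm_scores, alpha=0.6, beta=0.5):
    """Pick the best hypothesis from an n-best list by shallow fusion.

    nbest:     list of (hypothesis, acoustic_logprob) pairs
    lm_scores: dict mapping hypothesis -> language-model logprob
    alpha:     LM weight; beta: word-insertion bonus per word
    """
    def combined(item):
        hyp, acoustic = item
        return acoustic + alpha * lm_scores[hyp] + beta * len(hyp.split())
    return max(nbest, key=combined)[0]
```

With toy scores, a hypothesis that the acoustic model slightly prefers can lose to one the language model finds far more plausible — which is exactly the point of the fusion.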

Details on the model architecture can be found in the Jasper: An End-to-End Convolutional Neural Acoustic Model paper.


QuartzNet#

Streaming | Offline


QuartzNet is no longer recommended for new deployments.

QuartzNet is the next generation of the Jasper model architecture with separable convolutions and larger filters. It can achieve accuracy similar to Jasper but with an order of magnitude fewer parameters. Similar to Jasper, the QuartzNet family of models is denoted QuartzNet_[BxR], where B is the number of blocks and R is the number of convolutional sub-blocks within a block. Each sub-block contains a 1D separable convolution, batch normalization, ReLU, and dropout.

Details on the model architecture can be found in the paper QuartzNet: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions.
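The parameter savings from separable convolutions can be seen with a quick count: a depthwise pass (one k-tap filter per channel) plus a 1x1 pointwise mix replaces a dense k-tap convolution over all channel pairs. The channel and kernel sizes below are illustrative, not QuartzNet's exact configuration:

```python
def conv1d_params(c_in, c_out, k):
    """Parameters of a standard 1D convolution: full c_in x c_out mixing
    at every tap, plus one bias per output channel."""
    return c_in * c_out * k + c_out

def separable_conv1d_params(c_in, c_out, k):
    """Parameters of a 1D time-channel separable convolution:
    a depthwise k-tap filter per input channel, then a 1x1 pointwise mix."""
    depthwise = c_in * k + c_in
    pointwise = c_in * c_out + c_out
    return depthwise + pointwise

# For 256 channels and a 33-tap kernel, the separable form needs
# roughly 29x fewer parameters than the dense convolution.
```

This per-layer ratio, compounded across all blocks, is where QuartzNet's order-of-magnitude reduction over Jasper comes from.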


MarbleNet#

MarbleNet is an end-to-end neural network for Voice Activity Detection (VAD). It is a deep residual network composed of blocks of 1D time-channel separable convolution, batch normalization, ReLU, and dropout layers. When compared to a state-of-the-art VAD model, MarbleNet is able to achieve similar performance with roughly one-tenth the parameters.

Details on the model architecture can be found in the MarbleNet: Deep 1d Time-channel Separable Convolutional Neural Network For Voice Activity Detection paper.
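A VAD model such as MarbleNet emits per-frame speech probabilities, and a typical downstream step thresholds them into time segments. The sketch below shows one simple way to do that; it is a hypothetical post-processing helper, not part of Riva or MarbleNet:

```python
def probs_to_segments(probs, threshold=0.5, frame_sec=0.02):
    """Convert per-frame speech probabilities into (start, end) segments
    in seconds, assuming a fixed frame duration of frame_sec."""
    segments, start = [], None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i                      # speech segment opens here
        elif p < threshold and start is not None:
            segments.append((start * frame_sec, i * frame_sec))
            start = None                   # segment closed by non-speech frame
    if start is not None:                  # audio ends mid-segment
        segments.append((start * frame_sec, len(probs) * frame_sec))
    return segments
```

Production systems usually add smoothing (for example, a minimum segment duration or hangover frames) so brief dips in probability do not split a single utterance.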


TitaNet#

TitaNet is a novel neural network architecture for extracting speaker representations. It employs 1D depth-wise separable convolutions and Squeeze-and-Excitation (SE) layers with global context, followed by a channel-attention-based statistics pooling layer, to map variable-length utterances to a fixed-length embedding (t-vector). TitaNet is a scalable architecture that achieves state-of-the-art performance on speaker verification tasks. It is available in various sizes, including a light TitaNet-S model with only 6M parameters that achieves near state-of-the-art results on diarization tasks.

Details on the model architecture can be found in the TitaNet: Neural Model For Speaker Representation With 1d Depth-wise Separable Convolutions And Global Context paper.
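Speaker verification with fixed-length embeddings like t-vectors typically reduces to comparing two embeddings by cosine similarity against a tuned threshold. The sketch below illustrates the idea in plain Python; the threshold value and helper names are illustrative assumptions:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def same_speaker(emb1, emb2, threshold=0.7):
    """Accept the verification trial if the two t-vectors are close enough
    in cosine space. The 0.7 threshold is illustrative; real systems tune
    it on a development set to balance false accepts and rejects."""
    return cosine_similarity(emb1, emb2) >= threshold
```

For diarization, the same similarity measure is used to cluster segment-level embeddings by speaker rather than to make a binary accept/reject decision.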