ASR Overview

Automatic Speech Recognition (ASR) takes an audio stream or audio buffer as input and returns one or more text transcripts, along with optional metadata. Speech recognition in Riva is a GPU-accelerated compute pipeline optimized for performance and accuracy. Riva supports two recognition modes: offline (batch) and streaming.

Figure: Customization Across Riva ASR Pipeline

Language Support

Riva Speech AI Skills provides high-quality pretrained models across a variety of languages. Upgraded models and new languages are released regularly.


Language Code, Acoustic Model, Language Model, Text Norm

  • (35000 hrs): Streaming, Offline

  • (2800 hrs): Streaming, Offline

  • (3500 hrs): Streaming, Offline

  • (3320 hrs): Streaming, Offline

  • (1908 hrs): Streaming, Offline

  • (1700 hrs): Streaming, Offline

  • (1000 hrs): Streaming, Offline

  • (4100 hrs): Streaming, Offline

For data center deployments, select the language to deploy by changing the language_code variable in the file within the quickstart directory of the Quick Start scripts.

For embedded deployments, only English pretrained models are included.


Riva ASR features include:

  • Support for offline and streaming use cases

  • A streaming mode that returns intermediate transcripts with low latency

  • GPU-accelerated feature extraction

  • Multiple (and growing) acoustic model architecture options accelerated by NVIDIA TensorRT

  • Beam search decoder based on n-gram language models

  • Voice activity detection algorithms (CTC-based)

  • Automatic punctuation

  • Ability to return top-N transcripts from beam decoder

  • Word-level timestamps

  • Inverse Text Normalization (ITN)

For more information, refer to the Speech To Text Citrinet notebook and the Speech To Text Jasper and QuartzNet notebook. These notebooks provide an end-to-end workflow for speech recognition. This workflow starts with training in TAO Toolkit and ends with deployment using Riva.

Offline Recognition

In offline or batch mode, the full audio signal is first read from a file or captured from a microphone. Following the capture of the entire signal, the client makes a request to the Riva Speech AI server to transcribe it. The client then waits for the response from the server.


This method can have long latency because the processing of the audio signal begins only after the full audio signal has been captured or read from the file.
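The offline flow described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the Riva client API: the transcribe callable is a hypothetical stand-in for the request to the server, and make_silent_wav exists only to produce test audio. The point it shows is that nothing is sent until the whole signal has been read.

```python
import io
import wave
from typing import Callable, Tuple


def read_full_audio(wav_source) -> Tuple[bytes, int, float]:
    """Read every frame of a WAV source before any request is made."""
    with wave.open(wav_source, "rb") as wav:
        sample_rate = wav.getframerate()
        n_frames = wav.getnframes()
        pcm = wav.readframes(n_frames)
    # Duration is a lower bound on latency: no work starts before this point.
    return pcm, sample_rate, n_frames / sample_rate


def offline_transcribe(wav_source, transcribe: Callable[[bytes, int], str]) -> str:
    """Capture the whole signal, then issue a single blocking request."""
    pcm, sample_rate, _ = read_full_audio(wav_source)
    # `transcribe` is hypothetical; it stands in for the server round trip.
    return transcribe(pcm, sample_rate)


def make_silent_wav(seconds: float, sample_rate: int = 16000) -> io.BytesIO:
    """Build an in-memory 16-bit mono WAV of silence for demonstration."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    buf.seek(0)
    return buf
```

In a real client, the stand-in callable would wrap the offline recognition request, and the caller would block until the full transcript comes back.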

Streaming Recognition

In streaming recognition mode, as soon as an audio segment of a specified length is captured or read, a request is made to the server to process that segment. On the server side, a response is returned as soon as an intermediate transcript is available.


You can select the length of the audio segments based on speed and memory requirements.

Refer to the riva/proto/riva_asr.proto documentation for more details.
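As a rough illustration of how the segment length shapes a streaming client, the sketch below splits raw PCM audio into fixed-length segments and hands each one to a handler as soon as it is available. It assumes 16-bit mono PCM by default; audio_segments, stream_segments, and the send_segment callable are hypothetical names, not part of the Riva API.

```python
from typing import Callable, Iterator, List


def audio_segments(pcm: bytes, sample_rate: int, segment_ms: int,
                   bytes_per_sample: int = 2) -> Iterator[bytes]:
    """Yield consecutive segments of `segment_ms` milliseconds of audio."""
    # Compute the size in whole samples first so segments never split a sample.
    segment_bytes = (sample_rate * segment_ms // 1000) * bytes_per_sample
    if segment_bytes <= 0:
        raise ValueError("segment_ms is too short for this sample rate")
    for start in range(0, len(pcm), segment_bytes):
        # The final segment may be shorter than the others.
        yield pcm[start:start + segment_bytes]


def stream_segments(pcm: bytes, sample_rate: int, segment_ms: int,
                    send_segment: Callable[[bytes], str]) -> List[str]:
    """Send each segment as soon as it is cut and collect the responses."""
    # `send_segment` is hypothetical; it stands in for one streaming request.
    return [send_segment(seg) for seg in audio_segments(pcm, sample_rate, segment_ms)]
```

Shorter segments mean more requests and earlier intermediate results; longer segments mean fewer requests but a longer wait before the first response.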

Multiple Deployed Models

The Riva server supports multiple speech recognition models deployed simultaneously, up to the limit of your GPU’s memory. As a result, a single server process can host models tailored for streaming or batch use, and for various languages, accents, or channel characteristics.

When it receives a request from the client application, the Riva server selects the deployed ASR model to use based on the RecognitionConfig of the request. If no deployed model can fulfill the request, an error is returned. If multiple models can fulfill the request, one is selected at random. You can also explicitly select which ASR model to use by setting the model field of the RecognitionConfig protobuf object to the <pipeline_name> value that was used with the riva-build command. This enables you to deploy multiple ASR pipelines concurrently and select which one to use at runtime.
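The selection rules above can be modeled with a small sketch. This is an illustration of the described behavior only, not the server's implementation: the DeployedModel fields and the matching criteria (language code and streaming mode) are assumptions chosen for the example, and select_model is a hypothetical name.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DeployedModel:
    name: str            # assumed to correspond to the pipeline name
    language_code: str   # example matching attribute (assumption)
    streaming: bool      # True for streaming, False for offline (assumption)


def select_model(models: List[DeployedModel], language_code: str,
                 streaming: bool, model: Optional[str] = None) -> DeployedModel:
    """Pick a deployed model for a request, following the rules in the text."""
    if model:
        # An explicit model name takes precedence over attribute matching.
        matches = [m for m in models if m.name == model]
    else:
        matches = [m for m in models
                   if m.language_code == language_code and m.streaming == streaming]
    if not matches:
        raise LookupError("no deployed model can fulfill the request")
    # With several candidates, one is chosen at random.
    return random.choice(matches)
```

Setting the model name explicitly makes the choice deterministic when several deployed pipelines would otherwise match the same request.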