Support Matrix#

This documentation describes the software and hardware that Riva ASR NIM supports.

Hardware#

NVIDIA Riva ASR NIM is supported on NVIDIA GPUs with Compute Capability > 7.0. When selecting models to deploy, make sure their combined memory requirements do not exceed the available GPU memory; 16 GB or more of VRAM is recommended.

GPUs Supported#

| GPU | Precision |
| --- | --- |
| A30, A100 | FP16 |
| H100 | FP16 |
| A2, A10, A16, A40 | FP16 |
| L4, L40, GeForce RTX 40xx | FP16 |
| GeForce RTX 50xx | FP16 |
| Blackwell RTX 60xx | FP16 |

WSL2-compatible models are supported on all RTX 40xx and later GPUs.

Software#

  • Linux operating systems (Ubuntu 22.04 or later recommended)

  • NVIDIA Driver >= 535

  • NVIDIA Docker >= 23.0.1

Supported Models#

Riva ASR NIM supports the following models.

NIM automatically downloads the prebuilt model if one is available for the target GPU (GPUs with Compute Capability >= 8.0); on other GPUs (Compute Capability > 7.0), it generates an optimized model on the fly from the RMIR model.

The NIM_TAGS_SELECTOR environment variable specifies the desired model and inference mode as comma-separated key-value pairs. Some ASR models support multiple inference modes tuned for different use cases. The available modes are streaming low latency (str), streaming high throughput (str-thr), and offline (ofl). Setting the mode to all deploys all applicable inference modes.
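
For example, the selector can be exported before launching the container. The following is a minimal sketch only: the image path, tag, port mapping, and the NGC_API_KEY variable are illustrative assumptions, not taken from this page; use the exact values given in Launching the NIM.

```bash
# Illustrative sketch; image path, tag, and port are placeholders.
export CONTAINER_ID=parakeet-0-6b-ctc-en-us
export NIM_TAGS_SELECTOR="mode=str"   # str, str-thr, ofl, or all, as comma-separated key=value pairs

docker run -it --rm --name=riva-asr-nim \
    --gpus all \
    --shm-size=8g \
    -e NGC_API_KEY \
    -e NIM_TAGS_SELECTOR \
    -p 50051:50051 \
    nvcr.io/nim/nvidia/${CONTAINER_ID}:latest   # placeholder image path
```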

Note

All models use FP16 precision.

Parakeet 0.6b CTC English#

Model information

To use this model, set CONTAINER_ID to parakeet-0-6b-ctc-en-us. Choose a value for NIM_TAGS_SELECTOR from the following table as needed. For further instructions, refer to Launching the NIM.

| Profile (Selected using NIM_TAGS_SELECTOR) | Inference Mode | Batch Size | CPU Memory (GB) | GPU Memory (GB) |
| --- | --- | --- | --- | --- |
| mode=ofl | offline | 1024 | 3 | 5.8 |
| mode=str | streaming | 1024 | 3 | 4 |
| mode=str-thr | streaming-throughput | 1024 | 3 | 5 |
| mode=all | all | 1024 | 5.3 | 11.5 |
| mode=ofl,bs=1 | offline | 1 | 3 | 3 |
| mode=str,bs=1 | streaming | 1 | 3 | 3 |

Note

Profiles with a Batch Size of 1 are optimized for the lowest memory usage and support only a single session at a time. These profiles are recommended for WSL2 deployment or scenarios with a single inference request client.
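
As a sketch of the variables only (the launch command itself is described in Launching the NIM), a WSL2 or single-client deployment of this model could select the low-memory bs=1 streaming profile:

```bash
# Hypothetical selection of the lowest-memory streaming profile
# (single session at a time); launch the container as described in
# Launching the NIM.
export CONTAINER_ID=parakeet-0-6b-ctc-en-us
export NIM_TAGS_SELECTOR="mode=str,bs=1"
```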

Parakeet 1.1b CTC English#

Model information

To use this model, set CONTAINER_ID to parakeet-1-1b-ctc-en-us. Choose a value for NIM_TAGS_SELECTOR from the following tables as needed. For further instructions, refer to Launching the NIM.

Standard English Speech Recognition#

The following table lists standard profiles for general English speech recognition.

| Profile (Selected using NIM_TAGS_SELECTOR) | Inference Mode | Batch Size | CPU Memory (GB) | GPU Memory (GB) |
| --- | --- | --- | --- | --- |
| name=parakeet-1-1b-ctc-en-us,mode=ofl,vad=default,diarizer=disabled | offline | 1024 | 4.432 | 6.61 |
| name=parakeet-1-1b-ctc-en-us,mode=str,vad=default,diarizer=disabled | streaming | 1024 | 4.687 | 4.93 |
| name=parakeet-1-1b-ctc-en-us,mode=str-thr,vad=default,diarizer=disabled | streaming-throughput | 1024 | 3.633 | 5.79 |
| name=parakeet-1-1b-ctc-en-us,mode=all | all | 1024 | 11.44 | 13.71 |
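
Profiles for this model combine several keys; the full key-value string from the Profile column is passed as-is in NIM_TAGS_SELECTOR. For example, selecting the standard low-latency streaming profile above (a sketch of the variables only; launch as described in Launching the NIM):

```bash
# Select the standard streaming profile from the table above.
export CONTAINER_ID=parakeet-1-1b-ctc-en-us
export NIM_TAGS_SELECTOR="name=parakeet-1-1b-ctc-en-us,mode=str,vad=default,diarizer=disabled"
```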

Telephony-Optimized Speech Recognition#

Profiles with tele in the name are recommended for telephony use cases where the speech contains channel distortions or artifacts.

| Profile (Selected using NIM_TAGS_SELECTOR) | Inference Mode | Batch Size | CPU Memory (GB) | GPU Memory (GB) |
| --- | --- | --- | --- | --- |
| name=parakeet-1-1b-ctc-tele-en-us,mode=ofl,vad=default,diarizer=disabled | offline | 1024 | 3.058 | 6.60 |
| name=parakeet-1-1b-ctc-tele-en-us,mode=str,vad=default,diarizer=disabled | streaming | 1024 | 3.987 | 4.93 |
| name=parakeet-1-1b-ctc-tele-en-us,mode=str-thr,vad=default,diarizer=disabled | streaming-throughput | 1024 | 3.461 | 5.81 |
| name=parakeet-1-1b-ctc-tele-en-us,mode=all | all | 1024 | 9.865 | 15.68 |

Speech Recognition with VAD based End of Utterance#

Profiles with silero in the name use Silero VAD to detect the start and end of an utterance. VAD-based end-of-utterance detection is more accurate than the acoustic-model-based detection used in the other profiles, is more robust to noise, and generates fewer spurious transcripts.

Standard English with Silero VAD:

| Profile (Selected using NIM_TAGS_SELECTOR) | Inference Mode | Batch Size | CPU Memory (GB) | GPU Memory (GB) |
| --- | --- | --- | --- | --- |
| name=parakeet-1-1b-ctc-en-us,mode=ofl,vad=silero,diarizer=disabled | offline | 1024 | 5.068 | 7.26 |
| name=parakeet-1-1b-ctc-en-us,mode=str,vad=silero,diarizer=disabled | streaming | 1024 | 4.53 | 5.58 |
| name=parakeet-1-1b-ctc-en-us,mode=str-thr,vad=silero,diarizer=disabled | streaming-throughput | 1024 | 3.797 | 6.46 |
| name=parakeet-1-1b-ctc-en-us,mode=all,vad=silero | all | 1024 | 10.97 | 15.68 |

Telephony with Silero VAD:

| Profile (Selected using NIM_TAGS_SELECTOR) | Inference Mode | Batch Size | CPU Memory (GB) | GPU Memory (GB) |
| --- | --- | --- | --- | --- |
| name=parakeet-1-1b-ctc-tele-en-us,mode=ofl,vad=silero,diarizer=disabled | offline | 1024 | 4.183 | 7.26 |
| name=parakeet-1-1b-ctc-tele-en-us,mode=str,vad=silero,diarizer=disabled | streaming | 1024 | 4.32 | 5.58 |
| name=parakeet-1-1b-ctc-tele-en-us,mode=str-thr,vad=silero,diarizer=disabled | streaming-throughput | 1024 | 4.481 | 6.46 |
| name=parakeet-1-1b-ctc-tele-en-us,mode=all,vad=silero | all | 1024 | 11.93 | 15.75 |

Speech Recognition with Speaker Diarization#

Profiles with sortformer in the name use the Sortformer model for speaker diarization. This is useful for applications where multiple speakers are present in the audio, such as call centers or meetings.

Standard English with Speaker Diarization:

| Profile (Selected using NIM_TAGS_SELECTOR) | Inference Mode | Batch Size | CPU Memory (GB) | GPU Memory (GB) |
| --- | --- | --- | --- | --- |
| name=parakeet-1-1b-ctc-en-us,mode=ofl,vad=default,diarizer=sortformer | offline | 1024 | 3.314 | 12.77 |
| name=parakeet-1-1b-ctc-en-us,mode=str,vad=default,diarizer=sortformer | streaming | 1024 | 5.053 | 7.19 |
| name=parakeet-1-1b-ctc-en-us,mode=str-thr,vad=default,diarizer=sortformer | streaming-throughput | 1024 | 2.952 | 7.83 |
| name=parakeet-1-1b-ctc-en-us,mode=all,diarizer=sortformer | all | 1024 | 13.55 | 24.21 |

Telephony with Speaker Diarization:

| Profile (Selected using NIM_TAGS_SELECTOR) | Inference Mode | Batch Size | CPU Memory (GB) | GPU Memory (GB) |
| --- | --- | --- | --- | --- |
| name=parakeet-1-1b-ctc-tele-en-us,mode=ofl,vad=default,diarizer=sortformer | offline | 1024 | 4.937 | 12.77 |
| name=parakeet-1-1b-ctc-tele-en-us,mode=str,vad=default,diarizer=sortformer | streaming | 1024 | 2.842 | 7.19 |
| name=parakeet-1-1b-ctc-tele-en-us,mode=str-thr,vad=default,diarizer=sortformer | streaming-throughput | 1024 | 5.029 | 7.83 |
| name=parakeet-1-1b-ctc-tele-en-us,mode=all,diarizer=sortformer | all | 1024 | 13.27 | 24.21 |

Parakeet 0.6b TDT v2 English#

Parakeet 0.6b TDT v2 is a 600-million-parameter automatic speech recognition (ASR) model designed for high-quality English transcription, featuring support for punctuation, capitalization, and accurate timestamp prediction.

These are the key features of this model:

  • Accurate word-level timestamp predictions

  • Automatic punctuation and capitalization

  • Robust performance on spoken numbers and song lyrics transcription

Refer to Parakeet TDT 0.6B V2 for more details.

To use this model, set CONTAINER_ID to parakeet-tdt-0.6b-v2. For further instructions, refer to Launching the NIM.

| Profile (Selected using NIM_TAGS_SELECTOR) | Inference Mode | Batch Size | CPU Memory (GB) | GPU Memory (GB) |
| --- | --- | --- | --- | --- |
| name=parakeet-tdt-0.6b-v2,mode=ofl | offline | 1024 | 4.7 | 14 |

Parakeet 1.1b RNNT Multilingual#

Model information

The Parakeet 1.1b RNNT Multilingual model supports streaming speech-to-text transcription in multiple languages. It identifies the spoken language and provides the corresponding transcript.

Supported languages: en-US, en-GB, es-ES, ar-AR, es-US, pt-BR, fr-FR, de-DE, it-IT, ja-JP, ko-KR, ru-RU, hi-IN, he-IL, nb-NO, nl-NL, cs-CZ, da-DK, fr-CA, pl-PL, sv-SE, th-TH, tr-TR, pt-PT, and nn-NO.

Recommended languages: en-US, en-GB, es-ES, ar-AR, es-US, pt-BR, fr-FR, de-DE, it-IT, ja-JP, ko-KR, ru-RU, and hi-IN.

To use this model, set CONTAINER_ID to parakeet-1-1b-rnnt-multilingual. Choose a value for NIM_TAGS_SELECTOR from the following table as needed. For further instructions, refer to Launching the NIM.

| Profile (Selected using NIM_TAGS_SELECTOR) | Inference Mode | Batch Size | CPU Memory (GB) | GPU Memory (GB) |
| --- | --- | --- | --- | --- |
| mode=ofl | offline | 1024 | 5.6 | 10.8 |
| mode=str | streaming | 1024 | 6 | 7.9 |
| mode=str-thr | streaming-throughput | 1024 | 5.5 | 8.8 |
| mode=all | all | 1024 | 16.5 | 24.5 |

Conformer CTC Spanish#

Model information

To use this model, set CONTAINER_ID to riva-asr. Choose a value for NIM_TAGS_SELECTOR from the following table as needed. For further instructions, refer to Launching the NIM.

| Profile (Selected using NIM_TAGS_SELECTOR) | Inference Mode | Batch Size | CPU Memory (GB) | GPU Memory (GB) |
| --- | --- | --- | --- | --- |
| name=conformer-ctc-riva-es-us,mode=ofl | offline | 1024 | 2 | 5.8 |
| name=conformer-ctc-riva-es-us,mode=str | streaming | 1024 | 2 | 3.6 |
| name=conformer-ctc-riva-es-us,mode=str-thr | streaming-throughput | 1024 | 2 | 4.2 |
| name=conformer-ctc-riva-es-us,mode=all | all | 1024 | 3.1 | 9.8 |

Canary 1b Multilingual#

Canary 1b is an encoder-decoder model with a FastConformer encoder and a Transformer decoder. It is a multilingual, multi-task model that supports automatic speech recognition (ASR) and translation.

Model information

To use this model, set CONTAINER_ID to riva-asr. Choose a value for NIM_TAGS_SELECTOR from the following table as needed. For further instructions, refer to Launching the NIM.

| Profile (Selected using NIM_TAGS_SELECTOR) | Inference Mode | Batch Size | CPU Memory (GB) | GPU Memory (GB) |
| --- | --- | --- | --- | --- |
| name=canary-1b,mode=ofl | offline | 1024 | 6.5 | 13.4 |
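
Several models in this document (Conformer CTC Spanish, Canary 1b, and Canary 0.6b Turbo) share CONTAINER_ID riva-asr, so the name key in NIM_TAGS_SELECTOR is what selects the model to deploy. For example, for the Canary 1b offline profile above (a sketch of the variables only; launch as described in Launching the NIM):

```bash
# Select the Canary 1b offline profile inside the shared riva-asr container.
export CONTAINER_ID=riva-asr
export NIM_TAGS_SELECTOR="name=canary-1b,mode=ofl"
```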

Canary 0.6b Turbo Multilingual#

Canary 0.6b Turbo is an encoder-decoder model with a FastConformer encoder and a Transformer decoder. It is a multilingual, multi-task model that supports automatic speech recognition (ASR) and translation.

Model information

To use this model, set CONTAINER_ID to riva-asr. Choose a value for NIM_TAGS_SELECTOR from the following table as needed. For further instructions, refer to Launching the NIM.

| Profile (Selected using NIM_TAGS_SELECTOR) | Inference Mode | Batch Size | CPU Memory (GB) | GPU Memory (GB) |
| --- | --- | --- | --- | --- |
| name=canary-0-6b-turbo,mode=ofl | offline | 1024 | 5.3 | 12.2 |

Whisper Large v3 Multilingual#

Whisper is a state-of-the-art multilingual model for automatic speech recognition (ASR) and speech translation, proposed in the paper Robust Speech Recognition via Large-Scale Weak Supervision by Alec Radford et al. from OpenAI. Refer to the Whisper GitHub repository for more details.

To use this model, set CONTAINER_ID to whisper-large-v3. Choose a value for NIM_TAGS_SELECTOR from the following table as needed. For further instructions, refer to Launching the NIM.

| Profile (Selected using NIM_TAGS_SELECTOR) | Inference Mode | Batch Size | CPU Memory (GB) | GPU Memory (GB) |
| --- | --- | --- | --- | --- |
| name=whisper-large-v3,mode=ofl | offline | 1024 | 4.3 | 12.5 |