Support Matrix#

This documentation describes the software and hardware that Riva ASR NIM supports.

Hardware#

NVIDIA Riva ASR NIM is supported on NVIDIA GPUs with Compute Capability > 7.0. Avoid exceeding the available memory when selecting models to deploy; 16+ GB VRAM is recommended.

GPUs Supported#

GPU	Precision
A30, A100	FP16
H100	FP16, FP8
A2, A10, A16, A40	FP16
L4, L40, GeForce RTX 40xx	FP16, FP8
GeForce RTX 50xx	FP16
Blackwell RTX 60xx	FP16

WSL2-compatible models include support for all RTX 40xx GPUs and later.

Software#

Linux operating systems (Ubuntu 22.04 or later recommended)
NVIDIA Driver >= 535
NVIDIA Docker >= 23.0.1
A Windows 11 operating system (Build 23H2 and later) that is supported via Windows Subsystem for Linux:
1. The minimum supported driver version is 570.
2. The minimum supported Linux distribution is Ubuntu 24.04.
3. The recommended container management tool is Podman.

Supported Models#

Riva ASR NIM supports the following models.

NIM automatically downloads the prebuilt model if it is available on the target GPU (GPUs with Compute Capability >= 8.0) or generates an optimized model on-the-fly using RMIR model on other GPUs (Compute Capability > 7.0).

Model	Publisher	WSL2 Support
Parakeet 0.6b CTC English (en-US)	NVIDIA	✅
Parakeet 1.1b CTC English (en-US)	NVIDIA	✅
Parakeet 0.6b TDT v2 English (en-US)	NVIDIA	❌
Parakeet 1.1b RNNT Multilingual	NVIDIA	❌
Parakeet 0.6b CTC Vietnamese (vi-VN)	NVIDIA	❌
Parakeet 0.6b CTC Mandarin English (zh-CN)	NVIDIA	❌
Parakeet 0.6b CTC Mandarin Taiwanese English (zh-TW)	NVIDIA	❌
Parakeet 0.6b CTC Spanish English (es-US)	NVIDIA	❌
Conformer CTC Spanish (es-US)	NVIDIA	❌
Canary 1b Multilingual	NVIDIA	❌
Whisper Large v3 Multilingual	OpenAI	❌

The environment variable NIM_TAGS_SELECTOR is used to specify the desired model and inference mode. It is specified as comma-separated key-value pairs. Some ASR models support different inference modes tuned for different use cases. Available modes include streaming low latency (str), streaming high throughput (str-thr), and offline (ofl). Setting the mode to all deploys all inference modes where applicable.

Note

Parakeet 0.6b CTC English (en-US) uses FP8 on supported hardware. All other models use FP16.

Parakeet 0.6b CTC English#

Model information

To use this model, set CONTAINER_ID to parakeet-0-6b-ctc-en-us . Choose a value for NIM_TAGS_SELECTOR from the following table as needed. For further instructions, refer to Launching the NIM.

Profile (Selected using `NIM_TAGS_SELECTOR`)	Inference Mode	Batch Size	CPU Memory (GB)	GPU Memory (GB)
`name=parakeet-0-6b-ctc-en-us,bs=1,mode=ofl,diarizer=disabled,vad=default`	offline	1	4.511	3.08
`name=parakeet-0-6b-ctc-en-us,bs=1,mode=str,diarizer=disabled,vad=default`	streaming	1	4.676	3.07
`name=parakeet-0-6b-ctc-en-us,mode=ofl,diarizer=disabled,vad=default`	offline	1024	5.201	11.93
`name=parakeet-0-6b-ctc-en-us,mode=str,diarizer=disabled,vad=default`	streaming	1024	1.54	3.07
`name=parakeet-0-6b-ctc-en-us,mode=str-thr,diarizer=disabled,vad=default`	streaming-throughput	1024	2.257	7.02
`name=parakeet-0-6b-ctc-en-us,mode=all,diarizer=disabled,vad=default`	all	1024	13.85	21.73

Note

Profiles with a Batch Size of 1 are optimized for the lowest memory usage and support only a single session at a time. These profiles are recommended for WSL2 deployment or scenarios with a single inference request client.

Speech Recognition with VAD and Speaker Diarization#

The profiles with silero and sortformer use Silero VAD to detect start and end of utterance and Sortformer SD for speaker diarization. End of utterance detection using VAD is more accurate than the Acoustic model based end of utterance detection, which is used in other profiles. This profile has better robustness to noise and generates lesser spurious transcripts compared to other profiles. It is also useful for applications where multiple speakers are present in the audio, such as call centers or meetings.

Standard English with Silero VAD & Sortformer Diarizer:

Profile (Selected using `NIM_TAGS_SELECTOR`)	Inference Mode	Batch Size	CPU Memory (GB)	GPU Memory (GB)
`name=parakeet-0-6b-ctc-en-us,mode=ofl,diarizer=sortformer,vad=silero`	offline	1024	2.885	11.93
`name=parakeet-0-6b-ctc-en-us,mode=str-thr,diarizer=sortformer,vad=silero`	streaming-throughput	1024	5.387	7.02
`name=parakeet-0-6b-ctc-en-us,mode=str,diarizer=sortformer,vad=silero`	streaming	1024	4.967	6.39
`name=parakeet-0-6b-ctc-en-us,mode=all,diarizer=sortformer,vad=silero`	all	1024	5.32	21.73

Parakeet 1.1b CTC English#

Model information

To use this model, set CONTAINER_ID to parakeet-1-1b-ctc-en-us. Choose a value for NIM_TAGS_SELECTOR from the following tables as needed. For further instructions, refer to Launching the NIM.

Standard English Speech Recognition#

The following table lists standard profiles for general English speech recognition.

Profile (Selected using `NIM_TAGS_SELECTOR`)	Inference Mode	Batch Size	CPU Memory (GB)	GPU Memory (GB)
`name=parakeet-1-1b-ctc-en-us,mode=ofl,vad=default,diarizer=disabled`	offline	1024	4.432	6.61
`name=parakeet-1-1b-ctc-en-us,mode=str,vad=default,diarizer=disabled`	streaming	1024	4.687	4.93
`name=parakeet-1-1b-ctc-en-us,mode=str-thr,vad=default,diarizer=disabled`	streaming-throughput	1024	3.633	5.79
`name=parakeet-1-1b-ctc-en-us,mode=all`	all	1024	11.44	13.71

Telephony-Optimized Speech Recognition#

Profiles with tele are recommended for telephony use cases where speech has channel distortions/artifacts.

Profile (Selected using `NIM_TAGS_SELECTOR`)	Inference Mode	Batch Size	CPU Memory (GB)	GPU Memory (GB)
`name=parakeet-1-1b-ctc-tele-en-us,mode=ofl,vad=default,diarizer=disabled`	offline	1024	3.058	6.60
`name=parakeet-1-1b-ctc-tele-en-us,mode=str,vad=default,diarizer=disabled`	streaming	1024	3.987	4.93
`name=parakeet-1-1b-ctc-tele-en-us,mode=str-thr,vad=default,diarizer=disabled`	streaming-throughput	1024	3.461	5.81
`name=parakeet-1-1b-ctc-tele-en-us,mode=all`	all	1024	9.865	15.68

Speech Recognition with VAD and Speaker Diarization#

The profiles with silero and sortformer use Silero VAD to detect start and end of utterance and Sortformer SD for speaker diarization. End of utterance detection using VAD is more accurate than the Acoustic model based end of utterance detection, which is used in other profiles. This profile has better robustness to noise and generates lesser spurious transcripts compared to other profiles. It is also useful for applications where multiple speakers are present in the audio, such as call centers or meetings.

Standard English with Silero VAD & Sortformer Diarizer#

Profile (Selected using `NIM_TAGS_SELECTOR`)	Inference Mode	Batch Size	CPU Memory (GB)	GPU Memory (GB)
`name=parakeet-1-1b-ctc-en-us,mode=ofl,vad=silero,diarizer=sortformer`	offline	1024	3.46	11.90
`name=parakeet-1-1b-ctc-en-us,mode=str,vad=silero,diarizer=sortformer`	streaming	1024	4.413	6.32
`name=parakeet-1-1b-ctc-en-us,mode=str-thr,vad=silero,diarizer=sortformer`	streaming-throughput	1024	4.4	6.96
`name=parakeet-1-1b-ctc-en-us,mode=all,vad=silero,diarizer=sortformer`	all	1024	11.93	22.4

Telephony with Silero VAD & Sortformer Diarizer#

Profile (Selected using `NIM_TAGS_SELECTOR`)	Inference Mode	Batch Size	CPU Memory (GB)	GPU Memory (GB)
`name=parakeet-1-1b-ctc-tele-en-us,mode=ofl,vad=silero,diarizer=sortformer`	offline	1024	3.85	11.90
`name=parakeet-1-1b-ctc-tele-en-us,mode=str,vad=silero,diarizer=sortformer`	streaming	1024	4.58	6.32
`name=parakeet-1-1b-ctc-tele-en-us,mode=str-thr,vad=silero,diarizer=sortformer`	streaming-throughput	1024	3.16	6.96
`name=parakeet-1-1b-ctc-tele-en-us,mode=all,vad=silero,diarizer=sortformer`	all	1024	8.02	22.4

Speech Recognition in True Offline Mode#

The profiles with true-ofl use Silero VAD to detect silences to segment long audio files into chunks of upto 30s and then parallelize the inference for all chunks. This profile is useful for applications where the audios are long and the user wants to process it in offline fasion.

Profile (Selected using `NIM_TAGS_SELECTOR`)	Inference Mode	Batch Size	CPU Memory (GB)	GPU Memory (GB)
`name=parakeet-1-1b-ctc-en-us,mode=true-ofl,vad=silero`	offline	1024	4.66	17.4
`name=parakeet-1-1b-ctc-tele-en-us,mode=true-ofl,vad=silero`	offline	1024	4.66	17.4

Parakeet 0.6b TDT v2 English#

Parakeet 0.6b TDT v2 is a 600-million-parameter automatic speech recognition (ASR) model designed for high-quality English transcription, featuring support for punctuation, capitalization, and accurate timestamp prediction.

These are the key features of this model:

Accurate word-level timestamp predictions
Automatic punctuation and capitalization
Robust performance on spoken numbers and song lyrics transcription

Refer to Parakeet TDT 0.6B V2 for more details.

To use this model, set CONTAINER_ID to parakeet-tdt-0.6b-v2. For further instructions, refer to Launching the NIM.

Profile (Selected using `NIM_TAGS_SELECTOR`)	Inference Mode	Batch Size	CPU Memory (GB)	GPU Memory (GB)
`name=parakeet-tdt-0.6b-v2,mode=ofl`	offline	1024	4.7	14

Note: Parakeet 0.6b TDT is not supported on Blackwell GPUs.

Parakeet 1.1b RNNT Multilingual#

Model information

Parakeet 1.1b RNNT Multilingual model supports streaming speech-to-text transcription in multiple languages. The model identifies the spoken language and provides the transcript corresponding to the spoken language.

List of supported languages - en-US, en-GB, es-ES, ar-AR, es-US, pt-BR, fr-FR, de-DE, it-IT, ja-JP, ko-KR, ru-RU, hi-IN, he-IL, nb-NO, nl-NL, cs-CZ, da-DK, fr-CA, pl-PL, sv-SE, th-TH, tr-TR, pt-PT, and nn-NO Recommended languages - en-US, en-GB, es-ES, ar-AR, es-US, pt-BR, fr-FR, de-DE, it-IT, ja-JP, ko-KR, ru-RU, and hi-IN

To use this model, set CONTAINER_ID to parakeet-1-1b-rnnt-multilingual. Choose a value for NIM_TAGS_SELECTOR from the following table as needed. For further instructions, refer to Launching the NIM.

Profile (Selected using `NIM_TAGS_SELECTOR`)	Inference Mode	Batch Size	CPU Memory (GB)	GPU Memory (GB)
`mode=ofl,diarizer=disabled`	offline	1024	11.7	10.57
`mode=str,diarizer=disabled`	streaming	1024	12.73	9.61
`mode=str-thr,diarizer=disabled`	streaming-throughput	1024	13.08	10.33
`mode=all,diarizer=disabled`	all	1024	28.66	27.37

Speaker Diarization#

The profiles with sortformer use Sortformer SD for speaker diarization. It is useful for applications where multiple speakers are present in the audio, such as call centers or meetings.

Standard English with Sortformer Diarizer:

Profile (Selected using `NIM_TAGS_SELECTOR`)	Inference Mode	Batch Size	CPU Memory (GB)	GPU Memory (GB)
`mode=ofl,diarizer=sortformer`	offline	1024	11.02	20.14
`mode=str,diarizer=sortformer`	streaming	1024	13.27	12.29
`mode=str-thr,diarizer=sortformer`	streaming-throughput	1024	12.19	13.05
`mode=all,diarizer=sortformer`	all	1024	36.42	42.32

Note: Parakeet 1.1b RNNT Multilingual is not supported on Blackwell GPUs.

Parakeet 0.6b CTC Vietnamese English#

Model information

Parakeet 0.6b CTC Vietnamese English code switch model supports streaming and offline speech-to-text transcription in Vietnamese + English with punctuations.

To use this model, set CONTAINER_ID to parakeet-ctc-0.6b-vi. Choose a value for NIM_TAGS_SELECTOR from the following table as needed. For further instructions, refer to Launching the NIM.

Speech recognition base profiles#

Base profiles use acoustic model based end of utterance detection.

Profile (Selected using `NIM_TAGS_SELECTOR`)	Inference Mode	Batch Size	CPU Memory (GB)	GPU Memory (GB)
`mode=ofl,vad=default,diarizer=disabled`	offline	1024	3.8	7.5
`mode=str,vad=default,diarizer=disabled`	streaming	1024	3.3	5.5
`mode=str-thr,vad=default,diarizer=disabled`	streaming-throughput	1024	4	6.5
`mode=all,vad=default,diarizer=disabled`	all	1024	5.4	16.2

Speech recognition profiles with VAD and Speaker Diarization#

The profiles with silero and sortformer use Silero VAD to detect start and end of utterance and Sortformer SD for speaker diarization. End of utterance detection using VAD is more accurate than the Acoustic model based end of utterance detection, which is used in other profiles. This profile has better robustness to noise and generates lesser spurious transcripts compared to other profiles. It is also useful for applications where multiple speakers are present in the audio, such as call centers or meetings.

Profile (Selected using `NIM_TAGS_SELECTOR`)	Inference Mode	Batch Size	CPU Memory (GB)	GPU Memory (GB)
`mode=ofl,vad=silero,diarizer=sortformer`	offline	1024	4.4	13.8
`mode=str,vad=silero,diarizer=sortformer`	streaming	1024	3.6	7.8
`mode=str-thr,vad=silero,diarizer=sortformer`	streaming-throughput	1024	4.2	8.6
`mode=all,vad=silero,diarizer=sortformer`	all	1024	16.5	24.5

Parakeet 0.6b CTC Mandarin English#

Model information

Parakeet 0.6b CTC Mandarin English code switch model supports streaming and offline speech-to-text transcription in Mandarin + English with punctuations.

To use this model, set CONTAINER_ID to parakeet-ctc-0.6b-zh-cn. Choose a value for NIM_TAGS_SELECTOR from the following table as needed. For further instructions, refer to Launching the NIM.

Speech recognition base profiles#

Base profiles use acoustic model based end of utterance detection.

Profile (Selected using `NIM_TAGS_SELECTOR`)	Inference Mode	Batch Size	CPU Memory (GB)	GPU Memory (GB)
`mode=ofl,vad=default,diarizer=disabled`	offline	1024	7.9	5.6
`mode=str,vad=default,diarizer=disabled`	streaming	1024	4.9	4.7
`mode=str-thr,vad=default,diarizer=disabled`	streaming-throughput	1024	5.1	5.7
`mode=all,vad=default,diarizer=disabled`	all	1024	13.1	13.4

Speech recognition profiles with VAD and Speaker Diarization#

The profiles with silero and sortformer use Silero VAD to detect start and end of utterance and Sortformer SD for speaker diarization. End of utterance detection using VAD is more accurate than the Acoustic model based end of utterance detection, which is used in other profiles. This profile has better robustness to noise and generates lesser spurious transcripts compared to other profiles. It is also useful for applications where multiple speakers are present in the audio, such as call centers or meetings.

Profile (Selected using `NIM_TAGS_SELECTOR`)	Inference Mode	Batch Size	CPU Memory (GB)	GPU Memory (GB)
`mode=ofl,vad=silero,diarizer=sortformer`	offline	1024	5.6	11.5
`mode=str,vad=silero,diarizer=sortformer`	streaming	1024	5.0	6.7
`mode=str-thr,vad=silero,diarizer=sortformer`	streaming-throughput	1024	5.1	7.4
`mode=all,vad=silero,diarizer=sortformer`	all	1024	14.2	22.9

Parakeet 0.6b CTC Taiwanese Mandarin English#

Model information

Parakeet 0.6b CTC Taiwanese Mandarin English code switch model supports streaming and offline speech-to-text transcription in Taiwanese/Mandarin + English.

To use this model, set CONTAINER_ID to parakeet-ctc-0.6b-zh-tw. Choose a value for NIM_TAGS_SELECTOR from the following table as needed. For further instructions, refer to Launching the NIM.

Speech recognition base profiles#

Base profiles use acoustic model based end of utterance detection.

Profile (Selected using `NIM_TAGS_SELECTOR`)	Inference Mode	Batch Size	CPU Memory (GB)	GPU Memory (GB)
`mode=ofl,vad=default,diarizer=disabled`	offline	1024	4.12	5.8
`mode=str,vad=default,diarizer=disabled`	streaming	1024	3.75	4.9
`mode=str-thr,vad=default,diarizer=disabled`	streaming-throughput	1024	6.29	5.9
`mode=all,vad=default,diarizer=disabled`	all	1024	14.78	13.96

Speech recognition profiles with VAD and Speaker Diarization#

The profiles with silero and sortformer use Silero VAD to detect start and end of utterance and Sortformer SD for speaker diarization. End of utterance detection using VAD is more accurate than the Acoustic model based end of utterance detection, which is used in other profiles. This profile has better robustness to noise and generates lesser spurious transcripts compared to other profiles. It is also useful for applications where multiple speakers are present in the audio, such as call centers or meetings.

Profile (Selected using `NIM_TAGS_SELECTOR`)	Inference Mode	Batch Size	CPU Memory (GB)	GPU Memory (GB)
`mode=ofl,vad=silero,diarizer=sortformer`	offline	1024	5.21	12.03
`mode=str,vad=silero,diarizer=sortformer`	streaming	1024	4.78	6.95
`mode=str-thr,vad=silero,diarizer=sortformer`	streaming-throughput	1024	4.06	7.68
`mode=all,vad=silero,diarizer=sortformer`	all	1024	12.93	23.91

Parakeet 0.6b CTC Spanish English#

Model information

Parakeet 0.6b CTC Spanish English code switch model supports streaming and offline speech-to-text transcription in Spanish + English with punctuations.

To use this model, set CONTAINER_ID to parakeet-ctc-0.6b-es. Choose a value for NIM_TAGS_SELECTOR from the following table as needed. For further instructions, refer to Launching the NIM.

Speech recognition base profiles#

Base profiles use acoustic model based end of utterance detection.

Profile (Selected using `NIM_TAGS_SELECTOR`)	Inference Mode	Batch Size	CPU Memory (GB)	GPU Memory (GB)
`mode=ofl,vad=default,diarizer=disabled`	offline	1024	8.05	5.2
`mode=str,vad=default,diarizer=disabled`	streaming	1024	9.9	4.5
`mode=str-thr,vad=default,diarizer=disabled`	streaming-throughput	1024	5.3	8.0
`mode=all,vad=default,diarizer=disabled`	all	1024	13.1	12.5

Speech recognition profiles with VAD and Speaker Diarization#

The profiles with silero and sortformer use Silero VAD to detect start and end of utterance and Sortformer SD for speaker diarization. End of utterance detection using VAD is more accurate than the Acoustic model based end of utterance detection, which is used in other profiles. This profile has better robustness to noise and generates lesser spurious transcripts compared to other profiles. It is also useful for applications where multiple speakers are present in the audio, such as call centers or meetings.

Profile (Selected using `NIM_TAGS_SELECTOR`)	Inference Mode	Batch Size	CPU Memory (GB)	GPU Memory (GB)
`mode=ofl,vad=silero,diarizer=sortformer`	offline	1024	8.8	11.2
`mode=str,vad=silero,diarizer=sortformer`	streaming	1024	7.9	6.5
`mode=str-thr,vad=silero,diarizer=sortformer`	streaming-throughput	1024	7.0	8.4
`mode=all,vad=silero,diarizer=sortformer`	all	1024	21.15	22.2

Conformer CTC Spanish#

Model information

To use this model, set CONTAINER_ID to riva-asr. Choose a value for NIM_TAGS_SELECTOR from the following table as needed. For further instructions, refer to Launching the NIM.

Profile (Selected using `NIM_TAGS_SELECTOR`)	Inference Mode	Batch Size	CPU Memory (GB)	GPU Memory (GB)
`name=conformer-ctc-riva-es-us,mode=ofl`	offline	1024	2	5.8
`name=conformer-ctc-riva-es-us,mode=str`	streaming	1024	2	3.6
`name=conformer-ctc-riva-es-us,mode=str-thr`	streaming-throughput	1024	2	4.2
`name=conformer-ctc-riva-es-us,mode=all`	all	1024	3.1	9.8

Canary 1b Multilingual#

Canary 1b is encoder-decoder model with a FastConformer Encoder and Transformer Decoder. It is a multi-lingual, multi-task model, supporting automatic speech-to-text recognition (ASR) and translation.

Model information

To use this model, set CONTAINER_ID to canary-1b. Choose a value for NIM_TAGS_SELECTOR from the following table as needed. For further instructions, refer to Launching the NIM.

Profile (Selected using `NIM_TAGS_SELECTOR`)	Inference Mode	Batch Size	CPU Memory (GB)	GPU Memory (GB)
`mode=ofl`	offline	1024	6.5	13.4

Whisper Large v3 Multilingual#

Whisper is a state-of-the-art model for automatic speech recognition (ASR) and speech translation supporting multiple languages, proposed in the paper Robust Speech Recognition via Large-Scale Weak Supervision by Alec Radford et al. from OpenAI. Refer to Whisper GitHub for more details.

To use this model, set CONTAINER_ID to whisper-large-v3. Choose a value for NIM_TAGS_SELECTOR from the following table as needed. For further instructions, refer to Launching the NIM.

Profile (Selected using `NIM_TAGS_SELECTOR`)	Inference Mode	Batch Size	CPU Memory (GB)	GPU Memory (GB)
`name=whisper-large-v3,mode=ofl`	offline	1024	4.3	12.5