NVIDIA ASR NIM Support Matrix#

This page describes the software and hardware that the NVIDIA ASR NIM supports.

Hardware#

The NVIDIA ASR NIM requires an NVIDIA GPU with a Compute Capability of 8.0 or higher and at least 16 GB of VRAM. Ensure that the models you deploy do not exceed the available GPU memory.

| GPU | Precision |
| --- | --- |
| A30, A100 | FP16 |
| H100 | FP16, FP8 |
| A2, A10, A16, A40 | FP16 |
| L4, L40, GeForce RTX 40xx | FP16, FP8 |
| GeForce RTX 50xx | FP16 |
| Blackwell RTX 60xx | FP16 |
| DGX Spark * | FP16 |

WSL2-compatible models include support for all RTX 40xx GPUs and later.

Note

Only the Parakeet 1.1b CTC English and Parakeet 1.1b RNNT Multilingual models support the DGX Spark platform (support was extended to this platform in Riva ASR NIM Release 1.8.0).
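The hardware requirements in this section can be expressed as a quick pre-deployment check. The following is a minimal sketch, not part of the NIM: the function name `meets_hardware_requirements` and its parameters are hypothetical, and the caller is assumed to obtain the compute capability and VRAM size from a tool such as `nvidia-smi`.

```python
# Thresholds from the Hardware section: Compute Capability >= 8.0 and >= 16 GB of VRAM.
MIN_COMPUTE_CAPABILITY = (8, 0)
MIN_VRAM_GB = 16.0


def meets_hardware_requirements(compute_capability, vram_gb, model_gpu_memory_gb=0.0):
    """Return True if a GPU satisfies the documented minimums and fits the model.

    compute_capability: (major, minor) tuple, for example (8, 9).
    vram_gb: total GPU memory in GB.
    model_gpu_memory_gb: GPU memory listed for the chosen profile on this page.
    """
    if tuple(compute_capability) < MIN_COMPUTE_CAPABILITY:
        return False
    if vram_gb < MIN_VRAM_GB:
        return False
    # The deployed model must not exceed the available GPU memory.
    return model_gpu_memory_gb <= vram_gb


print(meets_hardware_requirements((8, 9), 24.0, model_gpu_memory_gb=11.93))
print(meets_hardware_requirements((7, 5), 16.0))
```

The first call passes all three checks; the second fails on compute capability alone.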

Software#

  • Linux operating systems (Ubuntu 22.04 or later recommended)

  • NVIDIA Driver >= 535

  • NVIDIA Docker >= 23.0.1

  • Windows 11 (Build 23H2 or later), supported through Windows Subsystem for Linux, with the following requirements:

    1. The minimum supported driver version is 570.

    2. The minimum supported Linux distribution is Ubuntu 24.04.

    3. The recommended container management tool is Podman.
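The version minimums listed above lend themselves to a simple comparison helper. This is an illustrative sketch only: `parse_version` and `at_least` are hypothetical names, and the code assumes purely numeric dotted version strings (a suffix such as `-ce` would need to be stripped first).

```python
def parse_version(text):
    """Turn a dotted version string such as '23.0.1' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def at_least(installed, minimum):
    """Compare two dotted versions, padding the shorter one with zeros."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


# Minimums taken from the Software list.
MIN_DRIVER_LINUX = "535"
MIN_DRIVER_WSL2 = "570"
MIN_DOCKER = "23.0.1"

print(at_least("550.54.14", MIN_DRIVER_LINUX))  # driver check on native Linux
print(at_least("550.54.14", MIN_DRIVER_WSL2))   # same driver checked against the WSL2 minimum
print(at_least("24.0.7", MIN_DOCKER))
```

A driver that is new enough for native Linux can still fall short of the higher WSL2 minimum, which is why the two thresholds are kept separate.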

Supported Models#

The NVIDIA ASR NIM microservice supports the following Nemotron ASR models.

The microservice requires an NVIDIA GPU with Compute Capability >= 8.0. It automatically downloads a pre-built model when one is available for the target GPU; otherwise, it generates an optimized model on the fly from the RMIR model.

The environment variable NIM_TAGS_SELECTOR is used to specify the desired model and inference mode. It is specified as comma-separated key-value pairs. Some ASR models support different inference modes tuned for different use cases. Available modes include streaming low latency (str), streaming high throughput (str-thr), and offline (ofl). Setting the mode to all deploys all inference modes where applicable.
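As a small illustration of the comma-separated key-value format, the sketch below builds and parses a selector string. The helper names `build_selector` and `parse_selector` are hypothetical and not part of the NIM; the mode list combines the modes named above with `true-ofl`, which appears in the Parakeet 1.1b CTC English profiles on this page.

```python
# Modes named on this page; true-ofl is listed for Parakeet 1.1b CTC English.
VALID_MODES = {"str", "str-thr", "ofl", "all", "true-ofl"}


def build_selector(**tags):
    """Join key-value pairs into a comma-separated selector string."""
    mode = tags.get("mode")
    if mode is not None and mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    return ",".join(f"{key}={value}" for key, value in tags.items())


def parse_selector(selector):
    """Split a selector string back into a dict of tags."""
    tags = {}
    for pair in selector.split(","):
        key, _, value = pair.partition("=")
        tags[key.strip()] = value.strip()
    return tags


selector = build_selector(
    name="parakeet-0-6b-ctc-en-us", mode="str", diarizer="disabled", vad="default"
)
print(selector)
print(parse_selector(selector)["mode"])
```

The resulting string is the value to assign to NIM_TAGS_SELECTOR in the environment of the container before launch.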

Note

Parakeet 0.6b CTC English (en-US) uses FP8 on supported hardware. All other models use FP16.

Parakeet 0.6b CTC English#

Model information

To use this model, set CONTAINER_ID to parakeet-0-6b-ctc-en-us. Choose a value for NIM_TAGS_SELECTOR from the following table as needed. For further instructions, refer to Launching the ASR NIM.

| Profile (selected using NIM_TAGS_SELECTOR) | Inference Mode | Batch Size | CPU Memory (GB) | GPU Memory (GB) |
| --- | --- | --- | --- | --- |
| name=parakeet-0-6b-ctc-en-us,bs=1,mode=ofl,diarizer=disabled,vad=default | offline | 1 | 4.511 | 3.08 |
| name=parakeet-0-6b-ctc-en-us,bs=1,mode=str,diarizer=disabled,vad=default | streaming | 1 | 4.676 | 3.07 |
| name=parakeet-0-6b-ctc-en-us,mode=ofl,diarizer=disabled,vad=default | offline | 1024 | 5.201 | 11.93 |
| name=parakeet-0-6b-ctc-en-us,mode=str,diarizer=disabled,vad=default | streaming | 1024 | 1.54 | 3.07 |
| name=parakeet-0-6b-ctc-en-us,mode=str-thr,diarizer=disabled,vad=default | streaming-throughput | 1024 | 2.257 | 7.02 |
| name=parakeet-0-6b-ctc-en-us,mode=all,diarizer=disabled,vad=default | all | 1024 | 13.85 | 21.73 |

Note

Profiles with a Batch Size of 1 are optimized for the lowest memory usage and support only a single session at a time. These profiles are recommended for WSL2 deployment or scenarios with a single inference request client.
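One way to use the memory figures listed for this model is to filter profiles by the memory that is actually free on the host. The sketch below copies the Parakeet 0.6b CTC English figures from this section; the function `profiles_that_fit` and its `headroom_gb` safety margin are hypothetical and not documented values.

```python
# (selector, CPU memory in GB, GPU memory in GB), copied from the profile table for this model.
PROFILES = [
    ("name=parakeet-0-6b-ctc-en-us,bs=1,mode=ofl,diarizer=disabled,vad=default", 4.511, 3.08),
    ("name=parakeet-0-6b-ctc-en-us,bs=1,mode=str,diarizer=disabled,vad=default", 4.676, 3.07),
    ("name=parakeet-0-6b-ctc-en-us,mode=ofl,diarizer=disabled,vad=default", 5.201, 11.93),
    ("name=parakeet-0-6b-ctc-en-us,mode=str,diarizer=disabled,vad=default", 1.54, 3.07),
    ("name=parakeet-0-6b-ctc-en-us,mode=str-thr,diarizer=disabled,vad=default", 2.257, 7.02),
    ("name=parakeet-0-6b-ctc-en-us,mode=all,diarizer=disabled,vad=default", 13.85, 21.73),
]


def profiles_that_fit(free_gpu_gb, free_cpu_gb, headroom_gb=1.0):
    """Return selectors whose listed memory, plus a margin, fits in free memory.

    headroom_gb is an assumed safety margin, not a value from the documentation.
    """
    return [
        selector
        for selector, cpu_gb, gpu_gb in PROFILES
        if gpu_gb + headroom_gb <= free_gpu_gb and cpu_gb + headroom_gb <= free_cpu_gb
    ]


for selector in profiles_that_fit(free_gpu_gb=8.0, free_cpu_gb=16.0):
    print(selector)
```

With 8 GB of free GPU memory and a 1 GB margin, only the profiles that list roughly 3 GB of GPU memory remain, which matches the guidance to prefer the low-memory profiles on constrained systems.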

Speech Recognition with VAD and Speaker Diarization#

The profiles with silero and sortformer use Silero VAD to detect the start and end of each utterance and Sortformer SD for speaker diarization. End-of-utterance detection using VAD is more accurate than the acoustic-model-based end-of-utterance detection used in other profiles. These profiles are more robust to noise and generate fewer spurious transcripts than the other profiles. They are also useful for applications where multiple speakers are present in the audio, such as call centers or meetings.

Standard English with Silero VAD & Sortformer Diarizer:

| Profile (selected using NIM_TAGS_SELECTOR) | Inference Mode | Batch Size | CPU Memory (GB) | GPU Memory (GB) |
| --- | --- | --- | --- | --- |
| name=parakeet-0-6b-ctc-en-us,mode=ofl,diarizer=sortformer,vad=silero | offline | 1024 | 2.885 | 11.93 |
| name=parakeet-0-6b-ctc-en-us,mode=str-thr,diarizer=sortformer,vad=silero | streaming-throughput | 1024 | 5.387 | 7.02 |
| name=parakeet-0-6b-ctc-en-us,mode=str,diarizer=sortformer,vad=silero | streaming | 1024 | 4.967 | 6.39 |
| name=parakeet-0-6b-ctc-en-us,mode=all,diarizer=sortformer,vad=silero | all | 1024 | 5.32 | 21.73 |

Parakeet 1.1b CTC English#

Model information

To use this model, set CONTAINER_ID to parakeet-1-1b-ctc-en-us. Choose a value for NIM_TAGS_SELECTOR from the following tables as needed. For further instructions, refer to Launching the ASR NIM.

Standard English Speech Recognition#

The following table lists standard profiles for general English speech recognition.

| Profile (selected using NIM_TAGS_SELECTOR) | Inference Mode | Batch Size | CPU Memory (GB) | GPU Memory (GB) |
| --- | --- | --- | --- | --- |
| diarizer=disabled,mode=all,vad=default | all | 1024 | 3.734 | 11.39 |
| diarizer=disabled,mode=ofl,vad=default | offline | 1024 | 2.248 | 5.83 |
| diarizer=disabled,mode=str-thr,vad=default | streaming-throughput | 1024 | 2.512 | 5.05 |
| diarizer=disabled,mode=str,vad=default | streaming | 1024 | 2.19 | 4.13 |

Speech Recognition with VAD and Speaker Diarization#

The profiles with silero and sortformer use Silero VAD to detect the start and end of each utterance and Sortformer SD for speaker diarization. End-of-utterance detection using VAD is more accurate than the acoustic-model-based end-of-utterance detection used in other profiles. These profiles are more robust to noise and generate fewer spurious transcripts than the other profiles. They are also useful for applications where multiple speakers are present in the audio, such as call centers or meetings.

Standard English with Silero VAD & Sortformer Diarizer#

| Profile (selected using NIM_TAGS_SELECTOR) | Inference Mode | Batch Size | CPU Memory (GB) | GPU Memory (GB) |
| --- | --- | --- | --- | --- |
| mode=ofl,vad=silero,diarizer=sortformer | offline | 1024 | 2.917 | 12.82 |
| mode=str,vad=silero,diarizer=sortformer | streaming | 1024 | 2.758 | 7.23 |
| mode=str-thr,vad=silero,diarizer=sortformer | streaming-throughput | 1024 | 2.657 | 7.87 |
| diarizer=sortformer,mode=all,vad=silero | all | 1024 | 9.632 | 47.22 |

Speech Recognition in True Offline Mode#

The profiles with true-ofl use Silero VAD to detect silences, segment long audio files into chunks of up to 30 seconds, and then run inference on all chunks in parallel. These profiles are useful for applications that process long audio files offline.

| Profile (selected using NIM_TAGS_SELECTOR) | Inference Mode | Batch Size | CPU Memory (GB) | GPU Memory (GB) |
| --- | --- | --- | --- | --- |
| diarizer=disabled,mode=true-ofl,vad=silero | offline | 1024 | 2.761 | 17.59 |
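The segmentation idea behind the true-ofl profiles can be sketched as follows. This is not the NIM's implementation: the silence timestamps are assumed to be given (in the microservice they come from Silero VAD), `split_at_silences` and `transcribe_chunk` are hypothetical names, and the thread pool only illustrates that chunks can be processed in parallel.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_CHUNK_SECONDS = 30.0  # upper bound on chunk length stated in this section


def split_at_silences(duration, silences, max_len=MAX_CHUNK_SECONDS):
    """Greedily cut [0, duration] at silence timestamps into chunks <= max_len.

    silences: sorted timestamps (seconds) where a VAD reported silence.
    If no silence falls inside a window, the chunk is cut at max_len.
    """
    chunks, start = [], 0.0
    while duration - start > max_len:
        # Pick the latest silence that keeps the chunk within the limit.
        candidates = [t for t in silences if start < t <= start + max_len]
        end = candidates[-1] if candidates else start + max_len
        chunks.append((start, end))
        start = end
    chunks.append((start, duration))
    return chunks


def transcribe_chunk(chunk):
    """Placeholder for a real recognizer call; returns a label only."""
    start, end = chunk
    return f"[{start:.1f}s-{end:.1f}s]"


chunks = split_at_silences(70.0, silences=[12.0, 28.5, 41.0, 55.0, 66.0])
with ThreadPoolExecutor() as pool:
    # map() preserves chunk order, so results line up with the audio timeline.
    results = list(pool.map(transcribe_chunk, chunks))
print(chunks)
print(results)
```

Cutting at the latest silence inside each 30-second window keeps chunks long while avoiding cuts in the middle of speech whenever a silence is available.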

Note

Parakeet 1.1b CTC English is supported on the DGX Spark platform.

Parakeet 0.6b TDT#

Parakeet 0.6b TDT is a 600-million-parameter automatic speech recognition (ASR) model designed for high-quality transcription, featuring support for punctuation, capitalization, and accurate timestamp prediction.

These are the key features of this model:

  • Accurate word-level timestamp predictions

  • Automatic punctuation and capitalization

  • Robust performance on spoken numbers and song lyrics transcription

Model Types#

Parakeet-tdt-0.6b-v2 (type=default)

  • English-only model optimized for en-US transcription

  • Refer to Parakeet TDT 0.6B V2 for more details

Parakeet-tdt-0.6b-v3 (type=multi)

  • Multilingual model supporting 25 European languages

  • Refer to Parakeet TDT 0.6B V3 for more details

Supported Languages by Model Type#

Language

Language Code

Default

Multi

Bulgarian

bg-BG

Croatian

hr-HR

Czech

cs-CZ

Danish

da-DK

Dutch

nl-NL

English (UK)

en-GB

English (US)

en-US

Estonian

et-EE

Finnish

fi-FI

French

fr-FR

German

de-DE

Greek

el-GR

Hungarian

hu-HU

Italian

it-IT

Latvian

lv-LV

Lithuanian

lt-LT

Maltese

mt-MT

Polish

pl-PL

Portuguese

pt-PT

Romanian

ro-RO

Russian

ru-RU

Slovak

sk-SK

Slovenian

sl-SI

Spanish

es-ES

Swedish

sv-SE

Ukrainian

uk-UA

To use this model, set CONTAINER_ID to parakeet-0.6b-tdt. For further instructions, refer to Launching the ASR NIM.

| Profile (selected using NIM_TAGS_SELECTOR) | Inference Mode | Batch Size | CPU Memory (GB) | GPU Memory (GB) |
| --- | --- | --- | --- | --- |
| name=parakeet-0.6b-tdt,type=default | offline | 1024 | 4.7 | 14 |
| name=parakeet-0.6b-tdt,type=multi | offline | 1024 | 4.7 | 14 |

Parakeet 1.1b RNNT Multilingual#

Model information

The Parakeet 1.1b RNNT Multilingual model supports streaming speech-to-text transcription in multiple languages. Three model types are available, each optimized for a different use case:

Model Types#

Default Type (type=default)

  • Supports automatic language detection: the model identifies the spoken language and provides the transcript accordingly

  • Produces the detected language code as output

Prompt Type (type=prompt)

  • Offers better accuracy than the default type

  • Does not support automatic language detection; the client must pass the language code

Indic Type (type=indic)

  • Optimized for Indic languages

  • Supports automatic language detection but does not produce the detected language code in the output

Supported Languages by Model Type#

Language

Language Code

Default

Prompt

Indic

Arabic

ar-AR

Bengali (India)

bn-IN

Czech

cs-CZ

Danish

da-DK

German

de-DE

English (UK)

en-GB

English (US)

en-US

Spanish (Spain)

es-ES

Spanish (US)

es-US

French (Canada)

fr-CA

French (France)

fr-FR

Hebrew

he-IL

Hindi (India)

hi-IN

Italian

it-IT

Japanese

ja-JP

Korean

ko-KR

Norwegian Bokmål

nb-NO

Dutch

nl-NL

Norwegian Nynorsk

nn-NO

Polish

pl-PL

Portuguese (Brazil)

pt-BR

Portuguese (Portugal)

pt-PT

Russian

ru-RU

Swedish

sv-SE

Tamil (India)

ta-IN

Thai

th-TH

Turkish

tr-TR

Deployment Instructions#

To use this model, set CONTAINER_ID to parakeet-1-1b-rnnt-multilingual. Choose a value for NIM_TAGS_SELECTOR from the following table based on the model type needed. For further instructions, refer to Launching the ASR NIM.

| Profile (selected using NIM_TAGS_SELECTOR) | Inference Mode | Batch Size | CPU Memory (GB) | GPU Memory (GB) |
| --- | --- | --- | --- | --- |
| mode=ofl,diarizer=disabled | offline | 1024 | 2.884 | 11.41 |
| mode=str,diarizer=disabled | streaming | 1024 | 2.759 | 9.77 |
| mode=str-thr,diarizer=disabled | streaming-throughput | 1024 | 2.919 | 10.69 |
| mode=all,diarizer=disabled | all | 1024 | 6.057 | 28.64 |

Speaker Diarization#

The profiles with sortformer use Sortformer SD for speaker diarization. They are useful for applications where multiple speakers are present in the audio, such as call centers or meetings.

Standard English with Sortformer Diarizer:

| Profile (selected using NIM_TAGS_SELECTOR) | Inference Mode | Batch Size | CPU Memory (GB) | GPU Memory (GB) |
| --- | --- | --- | --- | --- |
| mode=ofl,diarizer=sortformer | offline | 1024 | 3.277 | 13.97 |
| mode=str,diarizer=sortformer | streaming | 1024 | 3.17 | 12.46 |
| mode=str-thr,diarizer=sortformer | streaming-throughput | 1024 | 3.23 | 13.40 |
| mode=all,diarizer=sortformer | all | 1024 | 7.556 | 36.74 |

Note

Parakeet 1.1b RNNT Multilingual is supported on the Blackwell and DGX Spark platforms.

Parakeet 0.6b CTC Vietnamese English#

Model information

The Parakeet 0.6b CTC Vietnamese English code-switching model supports streaming and offline speech-to-text transcription of Vietnamese and English speech, with punctuation.

To use this model, set CONTAINER_ID to parakeet-ctc-0.6b-vi. Choose a value for NIM_TAGS_SELECTOR from the following table as needed. For further instructions, refer to Launching the ASR NIM.

Speech Recognition Base Profiles#

Base profiles use acoustic-model-based end-of-utterance detection.

| Profile (selected using NIM_TAGS_SELECTOR) | Inference Mode | Batch Size | CPU Memory (GB) | GPU Memory (GB) |
| --- | --- | --- | --- | --- |
| mode=ofl,vad=default,diarizer=disabled | offline | 1024 | 2.089 | 7.27 |
| mode=str,vad=default,diarizer=disabled | streaming | 1024 | 2.152 | 5.27 |
| mode=str-thr,vad=default,diarizer=disabled | streaming-throughput | 1024 | 2.122 | 6.27 |
| mode=all,vad=default,diarizer=disabled | all | 1024 | 3.824 | 15.75 |

Speech Recognition Profiles with VAD and Speaker Diarization#

The profiles with silero and sortformer use Silero VAD to detect the start and end of each utterance and Sortformer SD for speaker diarization. End-of-utterance detection using VAD is more accurate than the acoustic-model-based end-of-utterance detection used in other profiles. These profiles are more robust to noise and generate fewer spurious transcripts than the other profiles. They are also useful for applications where multiple speakers are present in the audio, such as call centers or meetings.

| Profile (selected using NIM_TAGS_SELECTOR) | Inference Mode | Batch Size | CPU Memory (GB) | GPU Memory (GB) |
| --- | --- | --- | --- | --- |
| mode=ofl,vad=silero,diarizer=sortformer | offline | 1024 | 2.698 | 13.47 |
| mode=str,vad=silero,diarizer=sortformer | streaming | 1024 | 2.504 | 7.58 |
| mode=str-thr,vad=silero,diarizer=sortformer | streaming-throughput | 1024 | 2.635 | 8.28 |
| mode=all,vad=silero,diarizer=sortformer | all | 1024 | 5.207 | 26.24 |

Parakeet 0.6b CTC Mandarin English#

Model information

The Parakeet 0.6b CTC Mandarin English code-switching model supports streaming and offline speech-to-text transcription of Mandarin and English speech, with punctuation.

To use this model, set CONTAINER_ID to parakeet-ctc-0.6b-zh-cn. Choose a value for NIM_TAGS_SELECTOR from the following table as needed. For further instructions, refer to Launching the ASR NIM.

Speech Recognition Base Profiles#

Base profiles use acoustic-model-based end-of-utterance detection.

| Profile (selected using NIM_TAGS_SELECTOR) | Inference Mode | Batch Size | CPU Memory (GB) | GPU Memory (GB) |
| --- | --- | --- | --- | --- |
| mode=ofl,vad=default,diarizer=disabled | offline | 1024 | 7.9 | 5.6 |
| mode=str,vad=default,diarizer=disabled | streaming | 1024 | 4.9 | 4.7 |
| mode=str-thr,vad=default,diarizer=disabled | streaming-throughput | 1024 | 5.1 | 5.7 |
| mode=all,vad=default,diarizer=disabled | all | 1024 | 13.1 | 13.4 |

Speech Recognition Profiles with VAD and Speaker Diarization#

The profiles with silero and sortformer use Silero VAD to detect the start and end of each utterance and Sortformer SD for speaker diarization. End-of-utterance detection using VAD is more accurate than the acoustic-model-based end-of-utterance detection used in other profiles. These profiles are more robust to noise and generate fewer spurious transcripts than the other profiles. They are also useful for applications where multiple speakers are present in the audio, such as call centers or meetings.

| Profile (selected using NIM_TAGS_SELECTOR) | Inference Mode | Batch Size | CPU Memory (GB) | GPU Memory (GB) |
| --- | --- | --- | --- | --- |
| mode=ofl,vad=silero,diarizer=sortformer | offline | 1024 | 5.6 | 11.5 |
| mode=str,vad=silero,diarizer=sortformer | streaming | 1024 | 5.0 | 6.7 |
| mode=str-thr,vad=silero,diarizer=sortformer | streaming-throughput | 1024 | 5.1 | 7.4 |
| mode=all,vad=silero,diarizer=sortformer | all | 1024 | 14.2 | 22.9 |

Parakeet 0.6b CTC Taiwanese Mandarin English#

Model information

The Parakeet 0.6b CTC Taiwanese Mandarin English code-switching model supports streaming and offline speech-to-text transcription of Taiwanese/Mandarin and English speech.

To use this model, set CONTAINER_ID to parakeet-ctc-0.6b-zh-tw. Choose a value for NIM_TAGS_SELECTOR from the following table as needed. For further instructions, refer to Launching the ASR NIM.

Speech Recognition Base Profiles#

Base profiles use acoustic-model-based end-of-utterance detection.

| Profile (selected using NIM_TAGS_SELECTOR) | Inference Mode | Batch Size | CPU Memory (GB) | GPU Memory (GB) |
| --- | --- | --- | --- | --- |
| mode=ofl,vad=default,diarizer=disabled | offline | 1024 | 4.12 | 5.8 |
| mode=str,vad=default,diarizer=disabled | streaming | 1024 | 3.75 | 4.9 |
| mode=str-thr,vad=default,diarizer=disabled | streaming-throughput | 1024 | 6.29 | 5.9 |
| mode=all,vad=default,diarizer=disabled | all | 1024 | 14.78 | 13.96 |

Speech Recognition Profiles with VAD and Speaker Diarization#

The profiles with silero and sortformer use Silero VAD to detect the start and end of each utterance and Sortformer SD for speaker diarization. End-of-utterance detection using VAD is more accurate than the acoustic-model-based end-of-utterance detection used in other profiles. These profiles are more robust to noise and generate fewer spurious transcripts than the other profiles. They are also useful for applications where multiple speakers are present in the audio, such as call centers or meetings.

| Profile (selected using NIM_TAGS_SELECTOR) | Inference Mode | Batch Size | CPU Memory (GB) | GPU Memory (GB) |
| --- | --- | --- | --- | --- |
| mode=ofl,vad=silero,diarizer=sortformer | offline | 1024 | 5.21 | 12.03 |
| mode=str,vad=silero,diarizer=sortformer | streaming | 1024 | 4.78 | 6.95 |
| mode=str-thr,vad=silero,diarizer=sortformer | streaming-throughput | 1024 | 4.06 | 7.68 |
| mode=all,vad=silero,diarizer=sortformer | all | 1024 | 12.93 | 23.91 |

Parakeet 0.6b CTC Spanish English#

Model information

The Parakeet 0.6b CTC Spanish English code-switching model supports streaming and offline speech-to-text transcription of Spanish and English speech, with punctuation.

To use this model, set CONTAINER_ID to parakeet-ctc-0.6b-es. Choose a value for NIM_TAGS_SELECTOR from the following table as needed. For further instructions, refer to Launching the ASR NIM.

Speech Recognition Base Profiles#

Base profiles use acoustic-model-based end-of-utterance detection.

| Profile (selected using NIM_TAGS_SELECTOR) | Inference Mode | Batch Size | CPU Memory (GB) | GPU Memory (GB) |
| --- | --- | --- | --- | --- |
| mode=ofl,vad=default,diarizer=disabled | offline | 1024 | 8.05 | 5.2 |
| mode=str,vad=default,diarizer=disabled | streaming | 1024 | 9.9 | 4.5 |
| mode=str-thr,vad=default,diarizer=disabled | streaming-throughput | 1024 | 5.3 | 8.0 |
| mode=all,vad=default,diarizer=disabled | all | 1024 | 13.1 | 12.5 |

Speech Recognition Profiles with VAD and Speaker Diarization#

The profiles with silero and sortformer use Silero VAD to detect the start and end of each utterance and Sortformer SD for speaker diarization. End-of-utterance detection using VAD is more accurate than the acoustic-model-based end-of-utterance detection used in other profiles. These profiles are more robust to noise and generate fewer spurious transcripts than the other profiles. They are also useful for applications where multiple speakers are present in the audio, such as call centers or meetings.

| Profile (selected using NIM_TAGS_SELECTOR) | Inference Mode | Batch Size | CPU Memory (GB) | GPU Memory (GB) |
| --- | --- | --- | --- | --- |
| mode=ofl,vad=silero,diarizer=sortformer | offline | 1024 | 8.8 | 11.2 |
| mode=str,vad=silero,diarizer=sortformer | streaming | 1024 | 7.9 | 6.5 |
| mode=str-thr,vad=silero,diarizer=sortformer | streaming-throughput | 1024 | 7.0 | 8.4 |
| mode=all,vad=silero,diarizer=sortformer | all | 1024 | 21.15 | 22.2 |

Conformer CTC Spanish#

Model information

To use this model, set CONTAINER_ID to riva-asr. Choose a value for NIM_TAGS_SELECTOR from the following table as needed. For further instructions, refer to Launching the ASR NIM.

| Profile (selected using NIM_TAGS_SELECTOR) | Inference Mode | Batch Size | CPU Memory (GB) | GPU Memory (GB) |
| --- | --- | --- | --- | --- |
| name=conformer-ctc-riva-es-us,mode=ofl | offline | 1024 | 2 | 5.8 |
| name=conformer-ctc-riva-es-us,mode=str | streaming | 1024 | 2 | 3.6 |
| name=conformer-ctc-riva-es-us,mode=str-thr | streaming-throughput | 1024 | 2 | 4.2 |
| name=conformer-ctc-riva-es-us,mode=all | all | 1024 | 3.1 | 9.8 |

Canary 1b Multilingual#

Canary 1b is an encoder-decoder model with a FastConformer encoder and a Transformer decoder. It is a multilingual, multi-task model that supports automatic speech recognition (ASR) and translation.

Model information

To use this model, set CONTAINER_ID to canary-1b. Choose a value for NIM_TAGS_SELECTOR from the following table as needed. For further instructions, refer to Launching the ASR NIM.

| Profile (selected using NIM_TAGS_SELECTOR) | Inference Mode | Batch Size | CPU Memory (GB) | GPU Memory (GB) |
| --- | --- | --- | --- | --- |
| mode=ofl | offline | 1024 | 6.5 | 13.4 |

Whisper Large V3 Multilingual#

Whisper is a state-of-the-art model for automatic speech recognition (ASR) and speech translation that supports multiple languages. It was proposed in the paper Robust Speech Recognition via Large-Scale Weak Supervision by Alec Radford et al. from OpenAI. Refer to Whisper GitHub for more details.

To use this model, set CONTAINER_ID to whisper-large-v3. Choose a value for NIM_TAGS_SELECTOR from the following table as needed. For further instructions, refer to Launching the ASR NIM.

| Profile (selected using NIM_TAGS_SELECTOR) | Inference Mode | Batch Size | CPU Memory (GB) | GPU Memory (GB) |
| --- | --- | --- | --- | --- |
| name=whisper-large-v3,mode=ofl | offline | 1024 | 4.3 | 12.5 |