NVIDIA ASR NIM Support Matrix#

This page describes the software and hardware that the NVIDIA ASR NIM supports.

Hardware#

The NVIDIA ASR NIM requires an NVIDIA GPU with a Compute Capability of 8.0 or higher and at least 16 GB of VRAM. Ensure that the models you deploy do not exceed the available GPU memory.

GPU	Precision
A30, A100	FP16
H100	FP16, FP8
A2, A10, A16, A40	FP16
L4, L40, GeForce RTX 40xx	FP16, FP8
GeForce RTX 50xx	FP16
Blackwell RTX 60xx	FP16
DGX Spark*	FP16

Models marked as WSL2-compatible support all GeForce RTX 40xx and later GPUs.

Note

Only the Parakeet 1.1b CTC English and Parakeet 1.1b RNNT Multilingual models support the DGX Spark platform (support was added in Riva ASR NIM Release 1.8.0).
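The hardware requirements above can be checked before deployment. The following is a minimal sketch, not part of the NIM itself. It assumes the input text comes from `nvidia-smi --query-gpu=name,compute_cap,memory.total --format=csv,noheader,nounits` on a driver that exposes the `compute_cap` query field. The function name `check_gpus` and the memory tolerance are hypothetical choices made for illustration.

```python
import csv
import io

MIN_COMPUTE_CAPABILITY = (8, 0)  # from the Hardware section
# Assumption: boards marketed as 16 GB often report slightly less than
# 16384 MiB, so a hypothetical tolerance is applied here.
MIN_VRAM_MIB = 15000


def check_gpus(smi_csv: str):
    """Parse CSV rows of `name, compute_cap, memory.total` and flag each GPU.

    Returns a list of (name, compute_capability, memory_mib, meets_requirements).
    """
    results = []
    for row in csv.reader(io.StringIO(smi_csv), skipinitialspace=True):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        name = row[0].strip()
        capability = tuple(int(part) for part in row[1].strip().split("."))
        memory_mib = float(row[2])
        ok = capability >= MIN_COMPUTE_CAPABILITY and memory_mib >= MIN_VRAM_MIB
        results.append((name, capability, memory_mib, ok))
    return results


if __name__ == "__main__":
    # Sample text in the assumed nvidia-smi output format.
    sample = "NVIDIA A100-SXM4-40GB, 8.0, 40960\nTesla T4, 7.5, 15360\n"
    for name, capability, memory_mib, ok in check_gpus(sample):
        print(name, capability, memory_mib, "OK" if ok else "not supported")
```

Comparing the compute capability as a tuple of integers avoids the pitfalls of comparing version numbers as floats.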

Software#

Linux operating systems (Ubuntu 22.04 or later recommended)
NVIDIA Driver >= 535
NVIDIA Docker >= 23.0.1
Windows 11 (Build 23H2 or later), supported through Windows Subsystem for Linux (WSL2), with the following requirements:
1. The minimum supported driver version is 570.
2. The minimum supported Linux distribution is Ubuntu 24.04.
3. The recommended container management tool is Podman.

Supported Models#

The NVIDIA ASR NIM microservice supports the following Nemotron ASR models.

The microservice requires an NVIDIA GPU with Compute Capability >= 8.0. It automatically downloads a pre-built model when available for the target GPU or generates an optimized model on-the-fly using the RMIR model.

Model	Publisher	WSL2 Support
Parakeet 0.6b CTC English (en-US)	NVIDIA	✅
Parakeet 1.1b CTC English (en-US)	NVIDIA	✅
Parakeet 0.6b TDT	NVIDIA	❌
Parakeet 1.1b RNNT Multilingual	NVIDIA	❌
Parakeet 0.6b CTC Vietnamese (vi-VN)	NVIDIA	❌
Parakeet 0.6b CTC Mandarin English (zh-CN)	NVIDIA	❌
Parakeet 0.6b CTC Mandarin Taiwanese English (zh-TW)	NVIDIA	❌
Parakeet 0.6b CTC Spanish English (es-US)	NVIDIA	❌
Conformer CTC Spanish (es-US)	NVIDIA	❌
Canary 1b Multilingual	NVIDIA	❌
Whisper Large v3 Multilingual	OpenAI	❌
Nemotron ASR Streaming	NVIDIA	❌

The environment variable NIM_TAGS_SELECTOR is used to specify the desired model and inference mode. It is specified as comma-separated key-value pairs. Some ASR models support different inference modes tuned for different use cases. Available modes include streaming low latency (str), streaming high throughput (str-thr), and offline (ofl). Setting the mode to all deploys all inference modes where applicable.
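As a minimal sketch of the comma-separated key-value format described above, the following helpers build and parse a selector string. The function names `build_selector` and `parse_selector` are hypothetical and not part of the NIM. The tag values are taken from the profile tables on this page.

```python
def build_selector(**tags) -> str:
    """Join key-value pairs into the comma-separated form used by NIM_TAGS_SELECTOR."""
    return ",".join(f"{key}={value}" for key, value in tags.items())


def parse_selector(selector: str) -> dict:
    """Split a selector string back into a dict of tags, ignoring malformed items."""
    pairs = (item.split("=", 1) for item in selector.split(",") if "=" in item)
    return {key.strip(): value.strip() for key, value in pairs}


# Streaming low-latency profile for Parakeet 0.6b CTC English.
selector = build_selector(
    name="parakeet-0-6b-ctc-en-us",
    mode="str",
    diarizer="disabled",
    vad="default",
)
print(selector)
print(parse_selector(selector)["mode"])
```

The resulting string is the value you would assign to `NIM_TAGS_SELECTOR` when launching the container.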

Note

Parakeet 0.6b CTC English (en-US) uses FP8 on supported hardware. All other models use FP16.

Parakeet 0.6b CTC English#

Model information

To use this model, set CONTAINER_ID to parakeet-0-6b-ctc-en-us. Choose a value for NIM_TAGS_SELECTOR from the following table as needed. For further instructions, refer to Launching the ASR NIM.

Profile (Selected using `NIM_TAGS_SELECTOR`)	Inference Mode	Batch Size	CPU Memory (GB)	GPU Memory (GB)
`name=parakeet-0-6b-ctc-en-us,bs=1,mode=ofl,diarizer=disabled,vad=default`	offline	1	4.511	3.08
`name=parakeet-0-6b-ctc-en-us,bs=1,mode=str,diarizer=disabled,vad=default`	streaming	1	4.676	3.07
`name=parakeet-0-6b-ctc-en-us,mode=ofl,diarizer=disabled,vad=default`	offline	1024	5.201	11.93
`name=parakeet-0-6b-ctc-en-us,mode=str,diarizer=disabled,vad=default`	streaming	1024	1.54	3.07
`name=parakeet-0-6b-ctc-en-us,mode=str-thr,diarizer=disabled,vad=default`	streaming-throughput	1024	2.257	7.02
`name=parakeet-0-6b-ctc-en-us,mode=all,diarizer=disabled,vad=default`	all	1024	13.85	21.73

Note

Profiles with a Batch Size of 1 are optimized for the lowest memory usage and support only a single session at a time. These profiles are recommended for WSL2 deployment or scenarios with a single inference request client.
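The GPU memory column can be used to shortlist profiles for a given device. The sketch below copies the selector, inference mode, batch size, and GPU memory values from the table above and filters them against the free GPU memory. The function `profiles_that_fit` and its `headroom_gb` safety margin are hypothetical, and the reported memory figures should be treated as approximate.

```python
# (selector, inference mode, batch size, GPU memory in GB), copied from the table above.
PROFILES = [
    ("name=parakeet-0-6b-ctc-en-us,bs=1,mode=ofl,diarizer=disabled,vad=default", "offline", 1, 3.08),
    ("name=parakeet-0-6b-ctc-en-us,bs=1,mode=str,diarizer=disabled,vad=default", "streaming", 1, 3.07),
    ("name=parakeet-0-6b-ctc-en-us,mode=ofl,diarizer=disabled,vad=default", "offline", 1024, 11.93),
    ("name=parakeet-0-6b-ctc-en-us,mode=str,diarizer=disabled,vad=default", "streaming", 1024, 3.07),
    ("name=parakeet-0-6b-ctc-en-us,mode=str-thr,diarizer=disabled,vad=default", "streaming-throughput", 1024, 7.02),
    ("name=parakeet-0-6b-ctc-en-us,mode=all,diarizer=disabled,vad=default", "all", 1024, 21.73),
]


def profiles_that_fit(free_gpu_gb, mode=None, headroom_gb=1.0):
    """Return profiles whose GPU memory plus a safety margin fits in free_gpu_gb.

    headroom_gb is an assumed margin for other GPU consumers, not a documented value.
    """
    return [
        profile
        for profile in PROFILES
        if profile[3] + headroom_gb <= free_gpu_gb
        and (mode is None or profile[1] == mode)
    ]


# Example: an 8 GB budget leaves only the batch-size-1 offline profile.
for selector, mode, batch_size, gpu_gb in profiles_that_fit(8.0, mode="offline"):
    print(selector, batch_size, gpu_gb)
```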

Speech Recognition with VAD and Speaker Diarization#

Profiles with vad=silero and diarizer=sortformer use Silero VAD to detect the start and end of each utterance and Sortformer SD for speaker diarization. VAD-based end-of-utterance detection is more accurate than the acoustic-model-based detection used in the other profiles. These profiles are more robust to noise and generate fewer spurious transcripts than the other profiles. They are also useful when the audio contains multiple speakers, such as in call centers or meetings.

Standard English with Silero VAD & Sortformer Diarizer:

Profile (Selected using `NIM_TAGS_SELECTOR`)	Inference Mode	Batch Size	CPU Memory (GB)	GPU Memory (GB)
`name=parakeet-0-6b-ctc-en-us,mode=ofl,diarizer=sortformer,vad=silero`	offline	1024	2.885	11.93
`name=parakeet-0-6b-ctc-en-us,mode=str-thr,diarizer=sortformer,vad=silero`	streaming-throughput	1024	5.387	7.02
`name=parakeet-0-6b-ctc-en-us,mode=str,diarizer=sortformer,vad=silero`	streaming	1024	4.967	6.39
`name=parakeet-0-6b-ctc-en-us,mode=all,diarizer=sortformer,vad=silero`	all	1024	5.32	21.73

Parakeet 1.1b CTC English#

Model information

To use this model, set CONTAINER_ID to parakeet-1-1b-ctc-en-us. Choose a value for NIM_TAGS_SELECTOR from the following tables as needed. For further instructions, refer to Launching the ASR NIM.

This NIM includes automatic punctuation and capitalization. Punctuation is off by default. Enable it with --automatic-punctuation (CLI) or enable_automatic_punctuation=true (API). No additional model is required.

Standard English Speech Recognition#

The following table lists standard profiles for general English speech recognition.

Profile (Selected using `NIM_TAGS_SELECTOR`)	Inference Mode	Batch Size	CPU Memory (GB)	GPU Memory (GB)
`diarizer=disabled,mode=all,vad=default`	all	1024	3.734	11.39
`diarizer=disabled,mode=ofl,vad=default`	offline	1024	2.248	5.83
`diarizer=disabled,mode=str-thr,vad=default`	streaming-throughput	1024	2.512	5.05
`diarizer=disabled,mode=str,vad=default`	streaming	1024	2.19	4.13

Speech Recognition with VAD and Speaker Diarization#

Profiles with vad=silero and diarizer=sortformer use Silero VAD to detect the start and end of each utterance and Sortformer SD for speaker diarization. VAD-based end-of-utterance detection is more accurate than the acoustic-model-based detection used in the other profiles. These profiles are more robust to noise and generate fewer spurious transcripts than the other profiles. They are also useful when the audio contains multiple speakers, such as in call centers or meetings.

Standard English with Silero VAD & Sortformer Diarizer#

Profile (Selected using `NIM_TAGS_SELECTOR`)	Inference Mode	Batch Size	CPU Memory (GB)	GPU Memory (GB)
`mode=ofl,vad=silero,diarizer=sortformer`	offline	1024	2.917	12.82
`mode=str,vad=silero,diarizer=sortformer`	streaming	1024	2.758	7.23
`mode=str-thr,vad=silero,diarizer=sortformer`	streaming-throughput	1024	2.657	7.87
`diarizer=sortformer,mode=all,vad=silero`	all	1024	9.632	47.22

Speech Recognition in True Offline Mode#

Profiles with mode=true-ofl use Silero VAD to detect silences, segment long audio files into chunks of up to 30 seconds, and run inference on all chunks in parallel. These profiles are useful when the audio files are long and can be processed offline.
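To illustrate the segmentation idea only, the sketch below splits a recording into chunks of at most 30 seconds, cutting at a detected silence whenever one falls inside the window. It is not the NIM's implementation. The function `split_at_silences` is hypothetical, the silence positions are assumed to come from a VAD step that is not shown, and the hard cut used when no silence is available is an assumption.

```python
def split_at_silences(duration_s, silence_points_s, max_chunk_s=30.0):
    """Return (start, end) chunks no longer than max_chunk_s.

    Each chunk ends at the latest silence point inside its window. If the
    window contains no silence, the chunk is cut at the window limit.
    """
    cuts = sorted(p for p in silence_points_s if 0.0 < p < duration_s)
    chunks = []
    start = 0.0
    while duration_s - start > max_chunk_s:
        limit = start + max_chunk_s
        candidates = [p for p in cuts if start < p <= limit]
        end = candidates[-1] if candidates else limit
        chunks.append((start, end))
        start = end
    chunks.append((start, duration_s))  # remaining audio fits in one chunk
    return chunks


# A 70-second file with silences detected at these (made-up) timestamps.
print(split_at_silences(70.0, [12.0, 28.5, 41.0, 55.0, 66.0]))
# -> [(0.0, 28.5), (28.5, 55.0), (55.0, 70.0)]
```

Because the chunks are independent, they can be batched and transcribed in parallel, which is the property the true offline profile relies on.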

Profile (Selected using `NIM_TAGS_SELECTOR`)	Inference Mode	Batch Size	CPU Memory (GB)	GPU Memory (GB)
`diarizer=disabled,mode=true-ofl,vad=silero`	offline	1024	2.761	17.59

Note

Parakeet 1.1b CTC English is supported on the DGX Spark platform.

Parakeet 0.6b TDT#

Parakeet 0.6b TDT is a 600-million-parameter automatic speech recognition (ASR) model designed for high-quality transcription, featuring support for punctuation, capitalization, and accurate timestamp prediction.

These are the key features of this model:

Accurate word-level timestamp predictions
Automatic punctuation and capitalization
Robust performance on spoken numbers and song lyrics transcription

Model Types#

Parakeet-tdt-0.6b-v2 (type=default)

English-only model optimized for en-US transcription
Refer to Parakeet TDT 0.6B V2 for more details

Parakeet-tdt-0.6b-v3 (type=multi)

Multilingual model supporting 25 European languages
Refer to Parakeet TDT 0.6B V3 for more details

Supported Languages by Model Type#

Language	Language Code	Default	Multi
Bulgarian	bg-BG		✓
Croatian	hr-HR		✓
Czech	cs-CZ		✓
Danish	da-DK		✓
Dutch	nl-NL		✓
English (UK)	en-GB		✓
English (US)	en-US	✓
Estonian	et-EE		✓
Finnish	fi-FI		✓
French	fr-FR		✓
German	de-DE		✓
Greek	el-GR		✓
Hungarian	hu-HU		✓
Italian	it-IT		✓
Latvian	lv-LV		✓
Lithuanian	lt-LT		✓
Maltese	mt-MT		✓
Polish	pl-PL		✓
Portuguese	pt-PT		✓
Romanian	ro-RO		✓
Russian	ru-RU		✓
Slovak	sk-SK		✓
Slovenian	sl-SI		✓
Spanish	es-ES		✓
Swedish	sv-SE		✓
Ukrainian	uk-UA		✓

To use this model, set CONTAINER_ID to parakeet-0.6b-tdt. For further instructions, refer to Launching the ASR NIM.

Profile (Selected using `NIM_TAGS_SELECTOR`)	Inference Mode	Batch Size	CPU Memory (GB)	GPU Memory (GB)
`type=default`	offline	1024	5.329	13.75
`type=multi`	offline	1024	5.681	14.02

Parakeet 1.1b RNNT Multilingual#

Model information

The Parakeet 1.1b RNNT Multilingual model supports streaming speech-to-text transcription in multiple languages. Three model types are available, each optimized for different use cases.

Tip

This NIM includes automatic punctuation and capitalization. Punctuation is off by default. Enable it with --automatic-punctuation (CLI) or enable_automatic_punctuation=true (API). No additional model is required.

Model Types#

Default Type (type=default)

Supports automatic language detection: the model identifies the spoken language and transcribes it accordingly
Produces the detected language code as output

Prompt Type (type=prompt)

Offers better accuracy than the default type
Does not support automatic language detection: the client must pass the language code

Indic Type (type=indic)

Optimized for Indic languages
Supports automatic language detection but does not produce the detected language code as output

Supported Languages by Model Type#

Language	Language Code	Default	Prompt	Indic
Arabic	ar-AR	✓	✓
Bengali (India)	bn-IN			✓
Czech	cs-CZ	✓
Danish	da-DK	✓
German	de-DE	✓	✓
English (UK)	en-GB	✓	✓
English (US)	en-US	✓	✓	✓
Spanish (Spain)	es-ES	✓	✓
Spanish (US)	es-US	✓	✓
French (Canada)	fr-CA	✓
French (France)	fr-FR	✓	✓
Hebrew	he-IL	✓
Hindi (India)	hi-IN	✓	✓	✓
Italian	it-IT	✓	✓
Japanese	ja-JP	✓	✓
Korean	ko-KR	✓	✓
Norwegian Bokmål	nb-NO	✓
Dutch	nl-NL	✓
Norwegian Nynorsk	nn-NO	✓
Polish	pl-PL	✓
Portuguese (Brazil)	pt-BR	✓	✓
Portuguese (Portugal)	pt-PT	✓
Russian	ru-RU	✓	✓
Swedish	sv-SE	✓
Tamil (India)	ta-IN			✓
Thai	th-TH	✓
Turkish	tr-TR	✓

Deployment Instructions#

To use this model, set CONTAINER_ID to parakeet-1-1b-rnnt-multilingual. Choose a value for NIM_TAGS_SELECTOR from the following table based on the model type needed. For further instructions, refer to Launching the ASR NIM.

Profile (Selected using `NIM_TAGS_SELECTOR`)	Inference Mode	Batch Size	CPU Memory (GB)	GPU Memory (GB)
`diarizer=sortformer,mode=all,type=default,vad=silero`	all	1024	19.32	49.99
`diarizer=sortformer,mode=all,type=indic,vad=silero`	all	1024	18.22	50.07
`diarizer=sortformer,mode=all,type=prompt,vad=silero`	all	1024	22.15	49.72

Speech Recognition with VAD and Speaker Diarization#

By default, all profiles include Silero VAD to detect the start and end of each utterance and Sortformer SD for speaker diarization. VAD-based end-of-utterance detection is more accurate than acoustic-model-based detection, which makes these profiles more robust to noise and reduces spurious transcripts. The profiles are also useful when the audio contains multiple speakers, such as in call centers or meetings.

Here are all the profiles available for each type:

Profile (Selected using `NIM_TAGS_SELECTOR`)	Inference Mode	Batch Size	CPU Memory (GB)	GPU Memory (GB)
`diarizer=sortformer,mode=ofl,type=default,vad=silero`	offline	1024	7.381	25.56
`diarizer=sortformer,mode=str,type=default,vad=silero`	streaming	1024	7.061	12.83
`diarizer=sortformer,mode=str-thr,type=default,vad=silero`	streaming-throughput	1024	7.216	14.75
`diarizer=sortformer,mode=all,type=default,vad=silero`	all	1024	19.32	49.99
`diarizer=sortformer,mode=ofl,type=indic,vad=silero`	offline	1024	7.067	25.64
`diarizer=sortformer,mode=str,type=indic,vad=silero`	streaming	1024	6.791	13.25
`diarizer=sortformer,mode=str-thr,type=indic,vad=silero`	streaming-throughput	1024	6.709	14.23
`diarizer=sortformer,mode=all,type=indic,vad=silero`	all	1024	18.22	50.07
`diarizer=sortformer,mode=ofl,type=prompt,vad=silero`	offline	1024	8.303	25.75
`diarizer=sortformer,mode=str,type=prompt,vad=silero`	streaming	1024	8.195	12.83
`diarizer=sortformer,mode=str-thr,type=prompt,vad=silero`	streaming-throughput	1024	8.043	14.35
`diarizer=sortformer,mode=all,type=prompt,vad=silero`	all	1024	22.15	49.72

Note

Parakeet 1.1b RNNT Multilingual is supported on the Blackwell and DGX Spark platforms.

Parakeet 0.6b CTC Vietnamese English#

Model information

The Parakeet 0.6b CTC Vietnamese English code-switching model supports streaming and offline speech-to-text transcription of Vietnamese and English speech, with punctuation.

To use this model, set CONTAINER_ID to parakeet-ctc-0.6b-vi. Choose a value for NIM_TAGS_SELECTOR from the following table as needed. For further instructions, refer to Launching the ASR NIM.

Speech Recognition Base Profiles#

Base profiles use acoustic model based end of utterance detection.

Profile (Selected using `NIM_TAGS_SELECTOR`)	Inference Mode	Batch Size	CPU Memory (GB)	GPU Memory (GB)
`mode=ofl,vad=default,diarizer=disabled`	offline	1024	2.089	7.27
`mode=str,vad=default,diarizer=disabled`	streaming	1024	2.152	5.27
`mode=str-thr,vad=default,diarizer=disabled`	streaming-throughput	1024	2.122	6.27
`mode=all,vad=default,diarizer=disabled`	all	1024	3.824	15.75

Speech Recognition Profiles with VAD and Speaker Diarization#

Profiles with vad=silero and diarizer=sortformer use Silero VAD to detect the start and end of each utterance and Sortformer SD for speaker diarization. VAD-based end-of-utterance detection is more accurate than the acoustic-model-based detection used in the base profiles. These profiles are more robust to noise and generate fewer spurious transcripts than the base profiles. They are also useful when the audio contains multiple speakers, such as in call centers or meetings.

Profile (Selected using `NIM_TAGS_SELECTOR`)	Inference Mode	Batch Size	CPU Memory (GB)	GPU Memory (GB)
`mode=ofl,vad=silero,diarizer=sortformer`	offline	1024	2.698	13.47
`mode=str,vad=silero,diarizer=sortformer`	streaming	1024	2.504	7.58
`mode=str-thr,vad=silero,diarizer=sortformer`	streaming-throughput	1024	2.635	8.28
`mode=all,vad=silero,diarizer=sortformer`	all	1024	5.207	26.24

Parakeet 0.6b CTC Mandarin English#

Model information

The Parakeet 0.6b CTC Mandarin English code-switching model supports streaming and offline speech-to-text transcription of Mandarin and English speech, with punctuation.

To use this model, set CONTAINER_ID to parakeet-ctc-0.6b-zh-cn. Choose a value for NIM_TAGS_SELECTOR from the following table as needed. For further instructions, refer to Launching the ASR NIM.

Speech Recognition Base Profiles#

Base profiles use acoustic model based end of utterance detection.

Profile (Selected using `NIM_TAGS_SELECTOR`)	Inference Mode	Batch Size	CPU Memory (GB)	GPU Memory (GB)
`mode=ofl,vad=default,diarizer=disabled`	offline	1024	7.9	5.6
`mode=str,vad=default,diarizer=disabled`	streaming	1024	4.9	4.7
`mode=str-thr,vad=default,diarizer=disabled`	streaming-throughput	1024	5.1	5.7
`mode=all,vad=default,diarizer=disabled`	all	1024	13.1	13.4

Speech Recognition Profiles with VAD and Speaker Diarization#

Profiles with vad=silero and diarizer=sortformer use Silero VAD to detect the start and end of each utterance and Sortformer SD for speaker diarization. VAD-based end-of-utterance detection is more accurate than the acoustic-model-based detection used in the base profiles. These profiles are more robust to noise and generate fewer spurious transcripts than the base profiles. They are also useful when the audio contains multiple speakers, such as in call centers or meetings.

Profile (Selected using `NIM_TAGS_SELECTOR`)	Inference Mode	Batch Size	CPU Memory (GB)	GPU Memory (GB)
`mode=ofl,vad=silero,diarizer=sortformer`	offline	1024	5.6	11.5
`mode=str,vad=silero,diarizer=sortformer`	streaming	1024	5.0	6.7
`mode=str-thr,vad=silero,diarizer=sortformer`	streaming-throughput	1024	5.1	7.4
`mode=all,vad=silero,diarizer=sortformer`	all	1024	14.2	22.9

Parakeet 0.6b CTC Taiwanese Mandarin English#

Model information

The Parakeet 0.6b CTC Taiwanese Mandarin English code-switching model supports streaming and offline speech-to-text transcription of Taiwanese, Mandarin, and English speech.

To use this model, set CONTAINER_ID to parakeet-ctc-0.6b-zh-tw. Choose a value for NIM_TAGS_SELECTOR from the following table as needed. For further instructions, refer to Launching the ASR NIM.

Speech Recognition Base Profiles#

Base profiles use acoustic model based end of utterance detection.

Profile (Selected using `NIM_TAGS_SELECTOR`)	Inference Mode	Batch Size	CPU Memory (GB)	GPU Memory (GB)
`mode=ofl,vad=default,diarizer=disabled`	offline	1024	4.12	5.8
`mode=str,vad=default,diarizer=disabled`	streaming	1024	3.75	4.9
`mode=str-thr,vad=default,diarizer=disabled`	streaming-throughput	1024	6.29	5.9
`mode=all,vad=default,diarizer=disabled`	all	1024	14.78	13.96

Speech Recognition Profiles with VAD and Speaker Diarization#

Profiles with vad=silero and diarizer=sortformer use Silero VAD to detect the start and end of each utterance and Sortformer SD for speaker diarization. VAD-based end-of-utterance detection is more accurate than the acoustic-model-based detection used in the base profiles. These profiles are more robust to noise and generate fewer spurious transcripts than the base profiles. They are also useful when the audio contains multiple speakers, such as in call centers or meetings.

Profile (Selected using `NIM_TAGS_SELECTOR`)	Inference Mode	Batch Size	CPU Memory (GB)	GPU Memory (GB)
`mode=ofl,vad=silero,diarizer=sortformer`	offline	1024	5.21	12.03
`mode=str,vad=silero,diarizer=sortformer`	streaming	1024	4.78	6.95
`mode=str-thr,vad=silero,diarizer=sortformer`	streaming-throughput	1024	4.06	7.68
`mode=all,vad=silero,diarizer=sortformer`	all	1024	12.93	23.91

Parakeet 0.6b CTC Spanish English#

Model information

The Parakeet 0.6b CTC Spanish English code-switching model supports streaming and offline speech-to-text transcription of Spanish and English speech, with punctuation.

To use this model, set CONTAINER_ID to parakeet-ctc-0.6b-es. Choose a value for NIM_TAGS_SELECTOR from the following table as needed. For further instructions, refer to Launching the ASR NIM.

Speech Recognition Base Profiles#

Base profiles use acoustic model based end of utterance detection.

Profile (Selected using `NIM_TAGS_SELECTOR`)	Inference Mode	Batch Size	CPU Memory (GB)	GPU Memory (GB)
`mode=ofl,vad=default,diarizer=disabled`	offline	1024	8.05	5.2
`mode=str,vad=default,diarizer=disabled`	streaming	1024	9.9	4.5
`mode=str-thr,vad=default,diarizer=disabled`	streaming-throughput	1024	5.3	8.0
`mode=all,vad=default,diarizer=disabled`	all	1024	13.1	12.5

Speech Recognition Profiles with VAD and Speaker Diarization#

Profiles with vad=silero and diarizer=sortformer use Silero VAD to detect the start and end of each utterance and Sortformer SD for speaker diarization. VAD-based end-of-utterance detection is more accurate than the acoustic-model-based detection used in the base profiles. These profiles are more robust to noise and generate fewer spurious transcripts than the base profiles. They are also useful when the audio contains multiple speakers, such as in call centers or meetings.

Profile (Selected using `NIM_TAGS_SELECTOR`)	Inference Mode	Batch Size	CPU Memory (GB)	GPU Memory (GB)
`mode=ofl,vad=silero,diarizer=sortformer`	offline	1024	8.8	11.2
`mode=str,vad=silero,diarizer=sortformer`	streaming	1024	7.9	6.5
`mode=str-thr,vad=silero,diarizer=sortformer`	streaming-throughput	1024	7.0	8.4
`mode=all,vad=silero,diarizer=sortformer`	all	1024	21.15	22.2

Conformer CTC Spanish#

Model information

To use this model, set CONTAINER_ID to riva-asr. Choose a value for NIM_TAGS_SELECTOR from the following table as needed. For further instructions, refer to Launching the ASR NIM.

Profile (Selected using `NIM_TAGS_SELECTOR`)	Inference Mode	Batch Size	CPU Memory (GB)	GPU Memory (GB)
`name=conformer-ctc-riva-es-us,mode=ofl`	offline	1024	2	5.8
`name=conformer-ctc-riva-es-us,mode=str`	streaming	1024	2	3.6
`name=conformer-ctc-riva-es-us,mode=str-thr`	streaming-throughput	1024	2	4.2
`name=conformer-ctc-riva-es-us,mode=all`	all	1024	3.1	9.8

Canary 1b Multilingual#

Canary 1b is an encoder-decoder model with a FastConformer encoder and a Transformer decoder. It is a multilingual, multitask model that supports automatic speech recognition (ASR) and translation.

Model information

To use this model, set CONTAINER_ID to canary-1b. Choose a value for NIM_TAGS_SELECTOR from the following table as needed. For further instructions, refer to Launching the ASR NIM.

Profile (Selected using `NIM_TAGS_SELECTOR`)	Inference Mode	Batch Size	CPU Memory (GB)	GPU Memory (GB)
`mode=ofl`	offline	1024	6.5	13.4

Whisper Large V3 Multilingual#

Whisper is a state-of-the-art model for automatic speech recognition (ASR) and speech translation that supports multiple languages. It was proposed in the paper Robust Speech Recognition via Large-Scale Weak Supervision by Alec Radford et al. from OpenAI. Refer to Whisper GitHub for more details.

To use this model, set CONTAINER_ID to whisper-large-v3. Choose a value for NIM_TAGS_SELECTOR from the following table as needed. For further instructions, refer to Launching the ASR NIM.

Profile (Selected using `NIM_TAGS_SELECTOR`)	Inference Mode	Batch Size	CPU Memory (GB)	GPU Memory (GB)
`name=whisper-large-v3,mode=ofl`	offline	1024	4.3	12.5

Nemotron ASR Streaming#

Nemotron ASR Streaming supports streaming speech-to-text transcription. Two model types are available, each optimized for different use cases.

Model information

Tip

This NIM includes automatic punctuation and capitalization. Punctuation is off by default. Enable it with --automatic-punctuation (CLI) or enable_automatic_punctuation=true (API). No additional model is required.

Model Types#

type=en-US: Optimized for English (US) speech-to-text transcription and the lowest latency profile for English-only deployments.
type=multi: Supports automatic language detection across 40 language locales. Set language_code=auto to have the model identify the spoken language, or pass a specific code (for example, fr-FR) to constrain decoding.

For the full list of supported language codes, refer to Supported Languages by Model Type.

Supported Languages by Model Type#

The 40 locales in the type=multi variant are grouped into three tiers based on expected transcription quality:

Transcription-ready (19 locales): Highest-accuracy ASR, ready out of the box.
Broad-coverage (13 locales): Production ASR across an additional 13 locales.
Adaptation-ready (8 locales): Recognized by the tokenizer and designed for fine-tuning on in-domain data to unlock full transcription.

Language	Language Code	English Type	Multilingual Type	Tier
Arabic	ar-AR	No	Yes	Transcription-ready
Bulgarian	bg-BG	No	Yes	Broad-coverage
Chinese (Simplified)	zh-CN	No	Yes	Broad-coverage
Croatian	hr-HR	No	Yes	Broad-coverage
Czech	cs-CZ	No	Yes	Broad-coverage
Danish	da-DK	No	Yes	Broad-coverage
Dutch	nl-NL	No	Yes	Transcription-ready
English (UK)	en-GB	No	Yes	Transcription-ready
English (US)	en-US	Yes	Yes	Transcription-ready
Estonian	et-EE	No	Yes	Broad-coverage
Finnish	fi-FI	No	Yes	Broad-coverage
French (Canada)	fr-CA	No	Yes	Transcription-ready
French (France)	fr-FR	No	Yes	Transcription-ready
German	de-DE	No	Yes	Transcription-ready
Greek	el-GR	No	Yes	Adaptation-ready
Hebrew	he-IL	No	Yes	Adaptation-ready
Hindi (India)	hi-IN	No	Yes	Transcription-ready
Hungarian	hu-HU	No	Yes	Broad-coverage
Italian	it-IT	No	Yes	Transcription-ready
Japanese	ja-JP	No	Yes	Transcription-ready
Korean	ko-KR	No	Yes	Transcription-ready
Latvian	lv-LV	No	Yes	Adaptation-ready
Lithuanian	lt-LT	No	Yes	Adaptation-ready
Maltese	mt-MT	No	Yes	Adaptation-ready
Norwegian Bokmål	nb-NO	No	Yes	Broad-coverage
Norwegian Nynorsk	nn-NO	No	Yes	Adaptation-ready
Polish	pl-PL	No	Yes	Broad-coverage
Portuguese (Brazil)	pt-BR	No	Yes	Transcription-ready
Portuguese (Portugal)	pt-PT	No	Yes	Transcription-ready
Romanian	ro-RO	No	Yes	Broad-coverage
Russian	ru-RU	No	Yes	Transcription-ready
Slovak	sk-SK	No	Yes	Broad-coverage
Slovenian	sl-SI	No	Yes	Adaptation-ready
Spanish (Spain)	es-ES	No	Yes	Transcription-ready
Spanish (US)	es-US	No	Yes	Transcription-ready
Swedish	sv-SE	No	Yes	Broad-coverage
Thai	th-TH	No	Yes	Adaptation-ready
Turkish	tr-TR	No	Yes	Transcription-ready
Ukrainian	uk-UA	No	Yes	Transcription-ready
Vietnamese	vi-VN	No	Yes	Transcription-ready

Deployment Instructions#

To use this model, set CONTAINER_ID to nemotron-asr-streaming. Choose a value for NIM_TAGS_SELECTOR from the following table as needed. For further instructions, refer to Launching the ASR NIM.

Profile (Selected using `NIM_TAGS_SELECTOR`)	Inference Mode	Batch Size (ASR / Pipeline)	CPU Memory (GB)	GPU Memory (GB)
`name=nemotron-asr-streaming,type=en-US,batch_size=32`	streaming	32 / 64	8	6
`name=nemotron-asr-streaming,type=en-US,batch_size=64`	streaming	64 / 1024	8	14.6
`name=nemotron-asr-streaming,type=multi,batch_size=32`	streaming	32 / 64	8	6
`name=nemotron-asr-streaming,type=multi,batch_size=64`	streaming	64 / 1024	8	14.6