About NVIDIA ASR NIM Microservice#

The NVIDIA Automatic Speech Recognition (ASR) NIM microservice converts spoken audio into text. It packages pre-trained NeMo models with the full NVIDIA inference stack (TensorRT, Triton) into self-contained containers that handle model download, optimization, and serving.

ASR NIMs support two inference modes:

Streaming: Returns partial transcripts as audio arrives. Use for real-time applications such as live captioning and voice assistants.
Offline: Processes the full audio and returns a complete transcript. Use for batch processing of recorded files.
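The difference between the two modes shows up in how a client submits audio. The following is a minimal sketch, not the NIM client API: `iter_chunks` is a hypothetical helper, and the 16 kHz sample rate, 16-bit mono PCM format, and 100 ms chunk size are assumptions chosen for illustration.

```python
from typing import Iterator


def iter_chunks(
    pcm: bytes,
    sample_rate_hz: int = 16000,   # assumed sample rate
    chunk_ms: int = 100,           # assumed chunk duration
    bytes_per_sample: int = 2,     # 16-bit mono PCM (assumption)
) -> Iterator[bytes]:
    """Yield fixed-duration slices of raw PCM audio (hypothetical helper)."""
    chunk_bytes = sample_rate_hz * chunk_ms // 1000 * bytes_per_sample
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]


# One second of silence: 16,000 samples of 16-bit mono PCM.
audio = bytes(16000 * 2)

# Streaming: the client sends many small chunks and can receive a
# partial transcript after each one.
streaming_requests = list(iter_chunks(audio))

# Offline: the client sends the whole recording in a single request
# and receives one complete transcript.
offline_request = audio

print(len(streaming_requests))  # 10 chunks of 100 ms each
```

Smaller chunks generally reduce the delay before the first partial transcript at the cost of more requests. The right size depends on the model and profile you deploy.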

Available Models#

ASR NIMs ship multiple model families optimized for different use cases. Choose based on your language, latency, and capability requirements.

Model	Languages	Modes	Key Capability
Parakeet CTC	English, Vietnamese, Spanish, Mandarin, Taiwanese	Streaming + Offline	Low-latency transcription across multiple languages
Parakeet TDT v2	English	Offline	Word-level timestamps
Parakeet TDT v3	25 European languages	Offline	Word-level timestamps
Parakeet RNNT Multilingual	25+ languages	Streaming + Offline	Automatic language detection
Nemotron ASR Streaming	English, multilingual (40 language locales)	Streaming only	Low-latency streaming transcription with optional multilingual automatic language detection
Conformer CTC	Spanish	Streaming + Offline	Spanish transcription
Whisper Large v3	100+ languages	Offline	Transcription and translation to English
Canary 1b	26 languages	Offline	Transcription and bidirectional translation

For GPU memory requirements and all available model profiles, refer to the ASR support matrix.

Choosing a Model#

Match your requirements against the criteria in the following sections: language, inference mode, and capability.

By Language#

Pick a model based on the languages you need to support:

Language Requirement	Supported Models
English	Parakeet CTC English (streaming + offline), Nemotron ASR Streaming (streaming only), or Parakeet TDT v2 (offline with word-level timestamps).
A single non-English language	Parakeet CTC variants cover Spanish, Mandarin, Vietnamese, and Taiwanese. Each variant is optimized for its target language plus English code-switching. Conformer CTC Spanish is an alternative Spanish option. To choose between the two Spanish models, refer to Disambiguating Overlapping Models.
Many languages, one deployment	Parakeet RNNT Multilingual (25+ languages, auto-detect, streaming + offline), Parakeet TDT v3 (25 European languages, offline), or Canary 1b (26 languages, offline).
Maximum language coverage	Whisper Large v3 supports 100+ languages (offline only).

By Inference Mode#

Pick a model based on how you plan to process audio:

Inference Mode	Supported Models
Real-time streaming	Parakeet CTC variants, Parakeet RNNT Multilingual, Conformer CTC Spanish, or Nemotron ASR Streaming.
Offline only (complete audio, one-shot)	Parakeet TDT (v2 and v3), Whisper Large v3, or Canary 1b.
High-throughput streaming	Models that expose a `mode=str-thr` profile: Parakeet CTC variants, Parakeet RNNT Multilingual, and Conformer CTC Spanish.
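Profile tags such as `mode=str-thr` can be checked programmatically when you script deployments. The following sketch assumes a profile is described by a comma-separated list of `key=value` tags; that layout, the sample strings, and the helper names are assumptions for illustration, so confirm the actual profile format in the ASR support matrix.

```python
from typing import Dict


def parse_profile_tags(description: str) -> Dict[str, str]:
    """Split a 'key=value,key=value' string into a dict (assumed format)."""
    tags: Dict[str, str] = {}
    for field in description.split(","):
        key, sep, value = field.strip().partition("=")
        if sep:  # skip fields that are not key=value pairs
            tags[key] = value
    return tags


def is_high_throughput_streaming(description: str) -> bool:
    """Return True when the profile carries the mode=str-thr tag."""
    return parse_profile_tags(description).get("mode") == "str-thr"


# Hypothetical profile descriptions, built only from tags named on this page.
profiles = [
    "mode=str",
    "mode=str-thr",
    "mode=str-thr,diarizer=sortformer",
]

throughput_profiles = [p for p in profiles if is_high_throughput_streaming(p)]
print(throughput_profiles)
```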

By Capability#

Pick a model based on additional features you need:

Capability	Supported Models
Word-level timestamps	Parakeet TDT and Parakeet RNNT return start and end times for each word.
Translation	Whisper Large v3 translates any supported language to English. Canary 1b supports bidirectional translation across 26 languages.
Speaker diarization	Models that expose a `diarizer=sortformer` profile: Parakeet CTC variants and Parakeet RNNT Multilingual.
Automatic punctuation and capitalization	All models can produce punctuated and capitalized text. Refer to each model’s deploy page for the default behavior and the flags that enable it.
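The selection criteria above can be combined into a simple lookup. The following sketch transcribes the inference modes and capabilities from the tables on this page into a small catalog and filters it. The `ModelInfo` class, the `select_models` function, and the capability labels are hypothetical names introduced for illustration; they are not part of any NIM API, and language filtering is omitted for brevity.

```python
from dataclasses import dataclass
from typing import AbstractSet, FrozenSet, List


@dataclass(frozen=True)
class ModelInfo:
    name: str
    modes: FrozenSet[str]          # "streaming" and/or "offline"
    capabilities: FrozenSet[str]   # labels chosen for this sketch


# Transcribed from the tables on this page.
CATALOG = [
    ModelInfo("Parakeet CTC", frozenset({"streaming", "offline"}),
              frozenset({"diarization"})),
    ModelInfo("Parakeet TDT v2", frozenset({"offline"}),
              frozenset({"timestamps"})),
    ModelInfo("Parakeet TDT v3", frozenset({"offline"}),
              frozenset({"timestamps"})),
    ModelInfo("Parakeet RNNT Multilingual", frozenset({"streaming", "offline"}),
              frozenset({"timestamps", "diarization", "language-detection"})),
    ModelInfo("Nemotron ASR Streaming", frozenset({"streaming"}),
              frozenset()),
    ModelInfo("Conformer CTC", frozenset({"streaming", "offline"}),
              frozenset()),
    ModelInfo("Whisper Large v3", frozenset({"offline"}),
              frozenset({"translation"})),
    ModelInfo("Canary 1b", frozenset({"offline"}),
              frozenset({"translation"})),
]


def select_models(mode: str, required: AbstractSet[str] = frozenset()) -> List[str]:
    """Return models that support `mode` and every capability in `required`."""
    return [
        m.name for m in CATALOG
        if mode in m.modes and required <= m.capabilities
    ]


print(select_models("streaming", {"diarization"}))
# ['Parakeet CTC', 'Parakeet RNNT Multilingual']
print(select_models("offline", {"translation"}))
# ['Whisper Large v3', 'Canary 1b']
```

A lookup like this only narrows the candidates. Whether a specific profile combines a given mode with a given capability still has to be confirmed in the ASR support matrix.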

Disambiguating Overlapping Models#

Use the following table to pick between models that cover similar use cases:

If You Need…	Choose	Alternative	Why
English streaming transcription with the option to also run offline or enable diarization	Parakeet CTC English	Nemotron ASR Streaming	Parakeet CTC exposes streaming, offline, and throughput profiles plus a `diarizer=sortformer` option. Nemotron ASR Streaming ships a single streaming-only profile.
A dedicated single-profile English streaming model	Nemotron ASR Streaming	Parakeet CTC English	Use when you want a purpose-built streaming-only deployment and do not need offline or diarization modes.
Spanish transcription with diarization, throughput mode, or English code-switching	Parakeet CTC Spanish	Conformer CTC Spanish	Parakeet CTC Spanish exposes `diarizer=sortformer` profiles and supports Spanish plus English code-switching.
Spanish-only transcription with a smaller streaming footprint	Conformer CTC Spanish	Parakeet CTC Spanish	Conformer CTC Spanish streaming profiles (`mode=str`, `mode=str-thr`) have smaller GPU footprints than the Parakeet equivalents.
Multilingual European transcription with word-level timestamps	Parakeet TDT v3	Parakeet RNNT Multilingual	TDT v3 returns word-level timestamps (offline only). Choose RNNT Multilingual if you need streaming or broader language coverage.

Next Steps#

Deploy and Run ASR Models: Step-by-step deployment and inference commands for each model.
Customize ASR Models: Word boosting, custom vocabularies, and deploying fine-tuned NeMo checkpoints.
ASR Tutorial: Deploy your first ASR NIM from scratch.