About NVIDIA ASR NIM Microservice#
The NVIDIA Automatic Speech Recognition (ASR) NIM microservice converts spoken audio into text. It packages pre-trained NeMo models with the full NVIDIA inference stack (TensorRT, Triton) into self-contained containers that handle model download, optimization, and serving.
ASR NIMs support two inference modes:
Streaming: Returns partial transcripts as audio arrives. Use for real-time applications such as live captioning and voice assistants.
Offline: Processes the full audio and returns a complete transcript. Use for batch processing of recorded files.
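To make the difference between the two modes concrete, the following sketch shows how a client might prepare audio for each one. It does not call any NIM client API. The helper names `chunk_size_bytes` and `iter_chunks`, and the 16 kHz, 16-bit mono, 100 ms chunk parameters, are assumptions chosen for illustration only.

```python
from typing import Iterator


def chunk_size_bytes(sample_rate_hz: int, chunk_ms: int, bytes_per_sample: int = 2) -> int:
    """Return the number of bytes of mono PCM audio that span chunk_ms milliseconds."""
    return sample_rate_hz * chunk_ms // 1000 * bytes_per_sample


def iter_chunks(pcm: bytes, size: int) -> Iterator[bytes]:
    """Yield consecutive slices of at most `size` bytes, as a streaming client would send them."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]


# One second of silent 16 kHz, 16-bit mono audio stands in for real input (assumed format).
audio = bytes(16000 * 2)
size = chunk_size_bytes(16000, 100)

# Streaming: the audio is delivered piece by piece, so partial results can come back early.
streaming_input = list(iter_chunks(audio, size))

# Offline: the whole recording is submitted in a single request.
offline_input = audio
```

In a streaming deployment each chunk would be sent as soon as it is captured, whereas an offline request waits until the recording is complete.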
Available Models#
ASR NIMs ship multiple model families optimized for different use cases. Choose based on your language, latency, and capability requirements.
| Model | Languages | Modes | Key Capability |
|---|---|---|---|
| Parakeet CTC | English, Vietnamese, Spanish, Mandarin, Taiwanese | Streaming + Offline | Low-latency transcription across multiple languages |
| Parakeet TDT v2 | English | Offline | Word-level timestamps |
| Parakeet TDT v3 | Multilingual | Offline | Word-level timestamps |
| Parakeet RNNT Multilingual | 25+ languages | Streaming + Offline | Auto language detection across 25+ languages |
| Nemotron ASR Streaming | English | Streaming only | Low-latency streaming transcription |
| Conformer CTC Spanish | Spanish | Streaming + Offline | Spanish transcription |
| Whisper Large v3 | 100+ languages | Offline | Transcription and translation to English |
| Canary 1b | 26 languages | Offline | Transcription and bidirectional translation |
For GPU memory requirements and all available model profiles, refer to the ASR support matrix.
Choosing a Model#
Match your requirements against the criteria in the following sections: language, inference mode, and capability. For GPU memory footprints and all profile options, refer to the ASR support matrix.
By Language#
Pick a model based on the languages you need to support:
| Language Requirement | Supported Models |
|---|---|
| English | Parakeet CTC English (streaming + offline), Nemotron ASR Streaming (streaming only), or Parakeet TDT v2 (offline with word-level timestamps). |
| A single non-English language | Parakeet CTC variants cover Spanish, Mandarin, Vietnamese, and Taiwanese, each optimized for its target language plus English code-switching. Conformer CTC Spanish is an alternate Spanish option. Refer to Disambiguating Overlapping Models. |
| Many languages, one deployment | Parakeet RNNT Multilingual (25+ languages, auto-detect, streaming + offline), Parakeet TDT v3 (25 European languages, offline), or Canary 1b (26 languages, offline). |
| Maximum language coverage | Whisper Large v3 supports 100+ languages (offline only). |
By Inference Mode#
Pick a model based on how you plan to process audio:
| Inference Mode | Supported Models |
|---|---|
| Real-time streaming | Parakeet CTC variants, Parakeet RNNT Multilingual, Conformer CTC Spanish, or Nemotron ASR Streaming. |
| Offline only (complete audio, one-shot) | Parakeet TDT, Whisper Large v3, Canary 1b. |
| High-throughput streaming | Models that expose a |
By Capability#
Pick a model based on additional features you need:
| Capability | Supported Models |
|---|---|
| Word-level timestamps | Parakeet TDT and Parakeet RNNT return start and end times for each word. |
| Translation | Whisper Large v3 translates any supported language to English. Canary 1b supports bidirectional translation across 26 languages. |
| Speaker diarization | Models that expose a |
| Automatic punctuation and capitalization | All models produce punctuated and capitalized text. Refer to each model’s deploy page for defaults and activation flags. |
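The mode and capability criteria above can be combined into a small lookup. The following is a minimal sketch, not an official API: the `AsrModel` class, the `CATALOG` list, and the `select_models` function are hypothetical names, and the flags only encode what the tables on this page state. Speaker diarization and throughput profiles are left out because this page does not list which models expose them.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AsrModel:
    """One row of the selection tables, reduced to boolean flags (illustrative only)."""
    name: str
    streaming: bool
    offline: bool
    word_timestamps: bool = False
    translation: bool = False


# Flags transcribed from the "By Inference Mode" and "By Capability" tables.
CATALOG = [
    AsrModel("Parakeet CTC", streaming=True, offline=True),
    AsrModel("Parakeet TDT", streaming=False, offline=True, word_timestamps=True),
    AsrModel("Parakeet RNNT Multilingual", streaming=True, offline=True, word_timestamps=True),
    AsrModel("Nemotron ASR Streaming", streaming=True, offline=False),
    AsrModel("Conformer CTC Spanish", streaming=True, offline=True),
    AsrModel("Whisper Large v3", streaming=False, offline=True, translation=True),
    AsrModel("Canary 1b", streaming=False, offline=True, translation=True),
]


def select_models(mode: str, word_timestamps: bool = False, translation: bool = False) -> List[str]:
    """Return the names of catalog models that support `mode` and every requested capability."""
    if mode not in ("streaming", "offline"):
        raise ValueError("mode must be 'streaming' or 'offline'")
    matches = []
    for model in CATALOG:
        if not getattr(model, mode):
            continue  # model does not support the requested inference mode
        if word_timestamps and not model.word_timestamps:
            continue
        if translation and not model.translation:
            continue
        matches.append(model.name)
    return matches


print(select_models("streaming", word_timestamps=True))
print(select_models("offline", translation=True))
```

An empty result, for example when asking for streaming translation, means that no single model on this page covers the combination and the requirements need to be relaxed or split across deployments.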
Disambiguating Overlapping Models#
Use the following table to pick between models that cover similar use cases:
| If You Need… | Choose | Alternative | Why |
|---|---|---|---|
| English streaming transcription with the option to also run offline or enable diarization | Parakeet CTC English | Nemotron ASR Streaming | Parakeet CTC exposes streaming, offline, and throughput profiles plus a |
| A dedicated single-profile English streaming model | Nemotron ASR Streaming | Parakeet CTC English | Use when you want a purpose-built streaming-only deployment and do not need offline or diarization modes. |
| Spanish transcription with diarization, throughput mode, or English code-switching | Parakeet CTC Spanish | Conformer CTC Spanish | Parakeet CTC Spanish exposes |
| Spanish-only transcription with a smaller streaming footprint | Conformer CTC Spanish | Parakeet CTC Spanish | Conformer CTC Spanish streaming profiles ( |
| Multilingual European transcription with word-level timestamps | Parakeet TDT v3 | Parakeet RNNT Multilingual | TDT v3 returns word-level timestamps (offline only). Choose RNNT Multilingual if you need streaming or broader language coverage. |
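The disambiguation rules above can also be written as plain decision functions. This is a hedged sketch that only restates the table: the function names and boolean parameters are hypothetical, and real deployments should still be checked against the ASR support matrix.

```python
def choose_english_streaming(needs_offline: bool = False, needs_diarization: bool = False) -> str:
    """Pick between the two English streaming options."""
    if needs_offline or needs_diarization:
        return "Parakeet CTC English"
    # Purpose-built, streaming-only deployment.
    return "Nemotron ASR Streaming"


def choose_spanish(
    needs_diarization: bool = False,
    needs_throughput_mode: bool = False,
    needs_code_switching: bool = False,
) -> str:
    """Pick between the two Spanish options."""
    if needs_diarization or needs_throughput_mode or needs_code_switching:
        return "Parakeet CTC Spanish"
    # Spanish-only transcription with a smaller streaming footprint.
    return "Conformer CTC Spanish"


def choose_multilingual(needs_streaming: bool = False, needs_broader_coverage: bool = False) -> str:
    """Pick between the two multilingual Parakeet options."""
    if needs_streaming or needs_broader_coverage:
        return "Parakeet RNNT Multilingual"
    # Offline only, with word-level timestamps.
    return "Parakeet TDT v3"


print(choose_english_streaming(needs_diarization=True))
print(choose_spanish())
print(choose_multilingual(needs_streaming=True))
```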
Next Steps#
Deploy and Run ASR Models: Step-by-step deployment and inference commands for each model.
Customize ASR Models: Word boosting, custom vocabularies, and deploying fine-tuned NeMo checkpoints.
ASR Tutorial: Deploy your first ASR NIM from scratch.