About NVIDIA ASR NIM Microservice

The NVIDIA Automatic Speech Recognition (ASR) NIM microservice converts spoken audio into text. It packages pre-trained NeMo models with the full NVIDIA inference stack (TensorRT, Triton) into self-contained containers that handle model download, optimization, and serving.

ASR NIMs support two inference modes:

  • Streaming: Returns partial transcripts as audio arrives. Use for real-time applications such as live captioning and voice assistants.

  • Offline: Processes the full audio and returns a complete transcript. Use for batch processing of recorded files.
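
The difference between the two modes can be sketched from the client's point of view. The following Python sketch is illustrative only: `fake_transcribe` is a hypothetical stand-in for a recognizer and is not an ASR NIM API, and the 16 kHz, 16-bit mono format and 100 ms chunk size are assumptions chosen for the example. A streaming client sends audio in small chunks and can read a partial result after each one, while an offline client submits the whole buffer and reads a single result.

```python
from typing import Iterator

SAMPLE_RATE_HZ = 16_000   # assumed sample rate
BYTES_PER_SAMPLE = 2      # assumed 16-bit mono PCM
CHUNK_MS = 100            # assumed chunk duration


def chunk_audio(pcm: bytes, chunk_ms: int = CHUNK_MS) -> Iterator[bytes]:
    """Yield fixed-duration slices of raw PCM, as a streaming client would send them."""
    chunk_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]


def fake_transcribe(pcm: bytes) -> str:
    """Hypothetical recognizer stand-in: reports how much audio it has seen."""
    seconds = len(pcm) / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
    return f"<transcript of {seconds:.1f}s of audio>"


def streaming_results(pcm: bytes) -> Iterator[str]:
    """Streaming: one partial result per chunk, covering the audio received so far."""
    received = bytearray()
    for chunk in chunk_audio(pcm):
        received.extend(chunk)
        yield fake_transcribe(bytes(received))


def offline_result(pcm: bytes) -> str:
    """Offline: a single result for the complete recording."""
    return fake_transcribe(pcm)


one_second = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)  # 1 s of silence
partials = list(streaming_results(one_second))
final = offline_result(one_second)
```

In this sketch the streaming path produces ten partial results for one second of audio, and the last partial matches the offline result.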

Available Models

ASR NIMs ship multiple model families optimized for different use cases. Choose based on your language, latency, and capability requirements.

| Model | Languages | Modes | Key Capability |
|---|---|---|---|
| Parakeet CTC | English, Vietnamese, Spanish, Mandarin, Taiwanese | Streaming + Offline | Low-latency transcription across multiple languages |
| Parakeet TDT v2 | English | Offline | Word-level timestamps |
| Parakeet TDT v3 | Multilingual | Offline | Word-level timestamps |
| Parakeet RNNT Multilingual | 25+ languages | Streaming + Offline | Auto language detection across 25+ languages |
| Nemotron ASR Streaming | English | Streaming only | Low-latency streaming transcription |
| Conformer CTC | Spanish | Streaming + Offline | Spanish transcription |
| Whisper Large v3 | 100+ languages | Offline | Transcription and translation to English |
| Canary 1b | 26 languages | Offline | Transcription and bidirectional translation |

For GPU memory requirements and all available model profiles, refer to the ASR support matrix.

Choosing a Model

Match your requirements against the criteria in the following sections: language, inference mode, and capability. For GPU memory footprints and all profile options, refer to the ASR support matrix.

By Language

Pick a model based on the languages you need to support:

| Language Requirement | Supported Models |
|---|---|
| English | Parakeet CTC English (streaming + offline), Nemotron ASR Streaming (streaming only), or Parakeet TDT v2 (offline with word-level timestamps). |
| A single non-English language | Parakeet CTC variants cover Spanish, Mandarin, Vietnamese, and Taiwanese, each optimized for its target language plus English code-switching. Conformer CTC Spanish is an alternative Spanish option. Refer to Disambiguating Overlapping Models. |
| Many languages, one deployment | Parakeet RNNT Multilingual (25+ languages, auto-detect, streaming + offline), Parakeet TDT v3 (25 European languages, offline), or Canary 1b (26 languages, offline). |
| Maximum language coverage | Whisper Large v3 supports 100+ languages (offline only). |
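
The language guidance above can be read as a small lookup. The following Python sketch is one possible encoding, not part of the NIM tooling: `candidates_for_languages` is a hypothetical helper, the variant names for Mandarin, Vietnamese, and Taiwanese are formed by analogy with "Parakeet CTC English" and "Parakeet CTC Spanish", and the sketch does not check whether a multilingual model actually covers a given language. Confirm coverage against the ASR support matrix.

```python
from __future__ import annotations

# Languages with a dedicated Parakeet CTC variant, per the table above.
PARAKEET_CTC_LANGUAGES = {"English", "Spanish", "Mandarin", "Vietnamese", "Taiwanese"}

# Multi-language options, ordered as they appear in the table above.
MULTILINGUAL = [
    "Parakeet RNNT Multilingual",
    "Parakeet TDT v3",
    "Canary 1b",
    "Whisper Large v3",
]


def candidates_for_languages(languages: set[str]) -> list[str]:
    """Return candidate models for a set of required languages (hypothetical helper)."""
    if not languages:
        raise ValueError("at least one language is required")
    if languages == {"English"}:
        return ["Parakeet CTC English", "Nemotron ASR Streaming", "Parakeet TDT v2"]
    if len(languages) == 1:
        (language,) = languages
        picks = []
        if language in PARAKEET_CTC_LANGUAGES:
            picks.append(f"Parakeet CTC {language}")  # assumed variant naming
        if language == "Spanish":
            picks.append("Conformer CTC Spanish")
        # A language without a dedicated variant falls back to the multilingual models.
        return picks or list(MULTILINGUAL)
    # Several languages in one deployment: shortlist the multilingual models.
    return list(MULTILINGUAL)
```

For example, `candidates_for_languages({"Spanish"})` returns both Spanish options, which the Disambiguating Overlapping Models section then helps separate.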

By Inference Mode

Pick a model based on how you plan to process audio:

| Inference Mode | Supported Models |
|---|---|
| Real-time streaming | Parakeet CTC variants, Parakeet RNNT Multilingual, Conformer CTC Spanish, or Nemotron ASR Streaming. |
| Offline only (complete audio, one-shot) | Parakeet TDT, Whisper Large v3, Canary 1b. |
| High-throughput streaming | Models that expose a mode=str-thr profile: Parakeet CTC variants, Parakeet RNNT Multilingual, and Conformer CTC Spanish. |
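
The distinction between "supports a mode" and "supports only that mode" is easy to blur, so a short sketch may help. The Python below encodes the Modes column of the Available Models table as bit flags. The `Mode` enum, the `MODES` mapping, and both helper functions are hypothetical names introduced for illustration and are not part of any NVIDIA API.

```python
from enum import Flag, auto


class Mode(Flag):
    STREAMING = auto()
    OFFLINE = auto()


# Modes per model family, copied from the Available Models table.
MODES = {
    "Parakeet CTC": Mode.STREAMING | Mode.OFFLINE,
    "Parakeet TDT v2": Mode.OFFLINE,
    "Parakeet TDT v3": Mode.OFFLINE,
    "Parakeet RNNT Multilingual": Mode.STREAMING | Mode.OFFLINE,
    "Nemotron ASR Streaming": Mode.STREAMING,
    "Conformer CTC": Mode.STREAMING | Mode.OFFLINE,
    "Whisper Large v3": Mode.OFFLINE,
    "Canary 1b": Mode.OFFLINE,
}


def supporting(mode: Mode) -> list:
    """Models that can run in the given mode, possibly among others."""
    return [name for name, modes in MODES.items() if mode in modes]


def restricted_to(mode: Mode) -> list:
    """Models that run in the given mode and no other."""
    return [name for name, modes in MODES.items() if modes == mode]
```

Here `supporting(Mode.STREAMING)` corresponds to the "Real-time streaming" row and `restricted_to(Mode.OFFLINE)` to the "Offline only" row. The sketch does not model the mode=str-thr throughput profile.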

By Capability

Pick a model based on additional features you need:

| Capability | Supported Models |
|---|---|
| Word-level timestamps | Parakeet TDT and Parakeet RNNT return start and end times for each word. |
| Translation | Whisper Large v3 translates any supported language to English. Canary 1b supports bidirectional translation across 26 languages. |
| Speaker diarization | Models that expose a diarizer=sortformer profile: Parakeet CTC variants and Parakeet RNNT Multilingual. |
| Automatic punctuation and capitalization | All models produce punctuated and capitalized text. Refer to each model’s deploy page for defaults and activation flags. |
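
When more than one capability is required, the shortlist is the intersection of the rows above. The Python sketch below shows that calculation. `CAPABILITIES` and `models_with` are hypothetical names, the capability keys are labels invented for this example, and the sketch assumes that "Parakeet RNNT" in the timestamps row refers to Parakeet RNNT Multilingual. Punctuation and capitalization are omitted because the table lists them for every model.

```python
from __future__ import annotations

# Capability -> model families, transcribed from the table above.
CAPABILITIES = {
    "word_timestamps": {"Parakeet TDT", "Parakeet RNNT Multilingual"},
    "translation": {"Whisper Large v3", "Canary 1b"},
    "diarization": {"Parakeet CTC", "Parakeet RNNT Multilingual"},
}


def models_with(*required: str) -> set[str]:
    """Return the model families that offer every requested capability."""
    if not required:
        raise ValueError("name at least one capability")
    unknown = set(required) - set(CAPABILITIES)
    if unknown:
        raise KeyError(f"unknown capabilities: {sorted(unknown)}")
    return set.intersection(*(CAPABILITIES[name] for name in required))
```

For example, `models_with("word_timestamps", "diarization")` leaves only Parakeet RNNT Multilingual, and `models_with("translation", "diarization")` is empty, which suggests those two needs would require separate deployments.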

Disambiguating Overlapping Models

Use the following table to pick between models that cover similar use cases:

| If You Need… | Choose | Alternative | Why |
|---|---|---|---|
| English streaming transcription with the option to also run offline or enable diarization | Parakeet CTC English | Nemotron ASR Streaming | Parakeet CTC exposes streaming, offline, and throughput profiles plus a diarizer=sortformer option. Nemotron ASR Streaming ships a single streaming-only profile. |
| A dedicated single-profile English streaming model | Nemotron ASR Streaming | Parakeet CTC English | Use when you want a purpose-built streaming-only deployment and do not need offline or diarization modes. |
| Spanish transcription with diarization, throughput mode, or English code-switching | Parakeet CTC Spanish | Conformer CTC Spanish | Parakeet CTC Spanish exposes diarizer=sortformer profiles and supports Spanish plus English code-switching. |
| Spanish-only transcription with a smaller streaming footprint | Conformer CTC Spanish | Parakeet CTC Spanish | Conformer CTC Spanish streaming profiles (mode=str, mode=str-thr) have smaller GPU footprints than the Parakeet equivalents. |
| Multilingual European transcription with word-level timestamps | Parakeet TDT v3 | Parakeet RNNT Multilingual | TDT v3 returns word-level timestamps (offline only). Choose RNNT Multilingual if you need streaming or broader language coverage. |
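
The rows above reduce to a few yes/no questions. The Python sketch below expresses them as three small functions. The function and parameter names are hypothetical, and the Spanish rule uses only the two differentiators named in the "Why" column (diarization and English code-switching). It is a reading aid, not an official selection tool.

```python
def pick_english_streaming(needs_offline: bool, needs_diarization: bool) -> str:
    """English streaming: Parakeet CTC if offline or diarization is also needed."""
    if needs_offline or needs_diarization:
        return "Parakeet CTC English"
    return "Nemotron ASR Streaming"


def pick_spanish(needs_diarization: bool, needs_code_switching: bool) -> str:
    """Spanish: Parakeet CTC for diarization or code-switching, else Conformer CTC."""
    if needs_diarization or needs_code_switching:
        return "Parakeet CTC Spanish"
    # Conformer CTC Spanish is listed as the smaller streaming footprint.
    return "Conformer CTC Spanish"


def pick_multilingual(needs_streaming: bool, needs_broader_coverage: bool) -> str:
    """Multilingual: TDT v3 is offline only, so streaming points to RNNT Multilingual."""
    if needs_streaming or needs_broader_coverage:
        return "Parakeet RNNT Multilingual"
    return "Parakeet TDT v3"
```

For example, `pick_english_streaming(needs_offline=False, needs_diarization=False)` returns the streaming-only Nemotron model, matching the second row of the table.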

Next Steps