Choosing a Model

NeMo offers many pretrained speech models. This guide helps you pick the right one for your use case.

ASR: Which Model Should I Use?

I want to…	Recommended Model	Why
Get the best accuracy on English	Canary-Qwen 2.5B	State-of-the-art English ASR. For much faster offline transcription with near state-of-the-art accuracy, use Parakeet-TDT V2 or Parakeet-TDT V3.
Transcribe multiple languages	Canary-1B V2	Supports 25 European languages and translation between them. Attention encoder-decoder (AED) architecture.
Transcribe European languages (ASR only, auto language detection)	Parakeet-TDT 0.6B V3	25 European languages in one model; automatic language detection; punctuation, capitalization, and word/segment timestamps; long-form and streaming options. Does not do speech-to-text translation; use Canary-1B V2 if you need translation.
Stream audio in real time	Nemotron-3.5-ASR-Streaming	Low-latency streaming ASR for 40 languages, with controllable latency (80 ms to 1 s) and configurable chunk sizes. Cache-aware FastConformer.
Minimize model size	Canary-180M Flash	Smallest multilingual model. Good for edge deployment.
Use CTC decoding (simpler pipeline)	Parakeet-CTC-1.1B	Non-autoregressive. Fast. Good with external language models.
Integrate with an external LM	Any Parakeet model + NGPU-LM	GPU-accelerated n-gram LM fusion for CTC, RNNT, and TDT models.
Transcribe multi-speaker meetings	Multitalker Parakeet Streaming	Handles overlapping speech in real time with speaker-adapted decoding.
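
The table above can be read as a lookup from use case to model. The following is a minimal sketch, not part of NeMo: the `ASR_RECOMMENDATIONS` keys and the `recommend_asr_model` and `load_asr_checkpoint` helpers are hypothetical names introduced for illustration. The loader assumes NeMo's ASR collection is installed and relies on `ASRModel.from_pretrained`. This guide does not list checkpoint IDs, so the ID must be taken from the model's card.

```python
# Hypothetical lookup mirroring the ASR table above; the keys are illustrative.
ASR_RECOMMENDATIONS = {
    "best_english_accuracy": "Canary-Qwen 2.5B",
    "multilingual_with_translation": "Canary-1B V2",
    "european_asr_auto_language": "Parakeet-TDT 0.6B V3",
    "realtime_streaming": "Nemotron-3.5-ASR-Streaming",
    "smallest_model": "Canary-180M Flash",
    "ctc_decoding": "Parakeet-CTC-1.1B",
    "multi_speaker_meetings": "Multitalker Parakeet Streaming",
}


def recommend_asr_model(use_case: str) -> str:
    """Return the model name the table recommends for a use case."""
    try:
        return ASR_RECOMMENDATIONS[use_case]
    except KeyError:
        known = ", ".join(sorted(ASR_RECOMMENDATIONS))
        raise ValueError(
            f"Unknown use case {use_case!r}; expected one of: {known}"
        ) from None


def load_asr_checkpoint(checkpoint_id: str):
    """Load a pretrained ASR checkpoint by its identifier.

    NeMo is imported lazily so the lookup above works without it installed.
    The caller supplies the checkpoint ID (taken from the model card).
    """
    from nemo.collections.asr.models import ASRModel

    return ASRModel.from_pretrained(model_name=checkpoint_id)
```

For example, `recommend_asr_model("ctc_decoding")` returns `"Parakeet-CTC-1.1B"`. A model returned by `load_asr_checkpoint` can then transcribe a list of audio file paths through its `transcribe` method. Speech language models such as Canary-Qwen 2.5B may need a different model class than `ASRModel`.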

TTS: Which Model Should I Use?

I want to…	Recommended Model	Why
Generate high-quality multilingual speech	MagpieTTS	End-to-end LLM-based TTS. Supports voice cloning and multiple languages.
Fast, controllable English synthesis	FastPitch + HiFi-GAN	Cascaded pipeline with pitch/duration control. Well-tested.
Generate discrete audio tokens	Audio Codec	Neural audio codec for tokenizing audio. Used by MagpieTTS internally.
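
The cascaded FastPitch + HiFi-GAN pipeline runs in two stages: a spectrogram generator turns text into a mel spectrogram, and a vocoder turns the spectrogram into a waveform. The sketch below shows that flow. The helper names `synthesize` and `load_english_cascade` are hypothetical, and the default checkpoint IDs are assumptions that should be checked against the model cards before use.

```python
def synthesize(text, spec_generator, vocoder):
    """Run the two-stage cascade: text -> mel spectrogram -> waveform."""
    # Stage 1: tokenize the text and predict a mel spectrogram.
    tokens = spec_generator.parse(text)
    spectrogram = spec_generator.generate_spectrogram(tokens=tokens)
    # Stage 2: the vocoder converts the spectrogram into audio samples.
    return vocoder.convert_spectrogram_to_audio(spec=spectrogram)


def load_english_cascade(
    fastpitch_id="nvidia/tts_en_fastpitch",  # assumed checkpoint ID; verify first
    hifigan_id="nvidia/tts_hifigan",  # assumed checkpoint ID; verify first
):
    """Load a spectrogram generator and a vocoder (requires NeMo's TTS collection)."""
    # Imported lazily so that synthesize() can be used without NeMo installed.
    from nemo.collections.tts.models import FastPitchModel, HifiGanModel

    spec_generator = FastPitchModel.from_pretrained(model_name=fastpitch_id).eval()
    vocoder = HifiGanModel.from_pretrained(model_name=hifigan_id).eval()
    return spec_generator, vocoder
```

Keeping the two stages as separate arguments makes it easy to swap the vocoder or the spectrogram generator independently, which is the main practical benefit of a cascaded design.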

Speaker Tasks: Which Model Should I Use?

I want to…	Recommended Model	Why
Determine who spoke when	Streaming Sortformer, Offline Sortformer	End-to-end diarization for up to 4 speakers. Use streaming for real-time; use offline for batch.
Verify/identify a speaker	TitaNet	Extracts speaker embeddings for verification and identification.
Detect voice activity	MarbleNet	Frame-level VAD. Multilingual. Works as a preprocessing step.
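
Speaker verification with embeddings usually comes down to comparing two vectors. The sketch below uses cosine similarity with a decision threshold, a common approach that is not necessarily the exact scoring TitaNet-based pipelines use. The function names and the `0.7` threshold are illustrative assumptions; in practice the embeddings would come from a speaker model and the threshold would be tuned on held-out data.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("Embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("Embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def same_speaker(embedding_a, embedding_b, threshold=0.7):
    """Decide whether two embeddings belong to the same speaker.

    The threshold is an illustrative assumption, not a recommended value.
    """
    return cosine_similarity(embedding_a, embedding_b) >= threshold
```

Identification follows the same idea: compare a query embedding against one enrolled embedding per known speaker and pick the highest-scoring match.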

Speech Language Models: Which Model Should I Use?

I want to…	Recommended Model	Why
Ask questions about audio content	Canary-Qwen 2.5B (SALM)	LLM augmented with speech understanding. Can transcribe, translate, and answer questions about audio.
Build a speech-to-speech system	DuplexS2SModel	Full-duplex model that both understands and generates speech.

Where to Find Models

All pretrained NeMo models are available on:

HuggingFace Hub (nvidia) — search for “nemo” or specific model names
NGC Model Catalog — NVIDIA’s model registry
Featured Community Checkpoints — fine-tunes from external users

See Checkpoint Formats for instructions on loading pretrained models.