Choosing a Model
NeMo offers many pretrained speech models. This guide helps you pick the right one for your use case.
ASR: Which Model Should I Use?
| I want to… | Recommended Model | Why |
|---|---|---|
| Get the best accuracy on English | Canary-Qwen 2.5B | State-of-the-art English ASR. For very fast offline alternatives with near-SOTA accuracy, use Parakeet-TDT V2 or Parakeet-TDT V3. |
| Transcribe multiple languages | Canary-1B V2 | Supports 25 EU languages plus translation between them. AED decoder. |
| Transcribe European languages (ASR only, auto language detection) | Parakeet-TDT 0.6B V3 | 25 European languages in one model; automatic language detection; punctuation, capitalization, and word/segment timestamps; long-form and streaming options. No speech-to-text translation; use Canary-1B V2 if you need translation. |
| Stream audio in real time | Nemotron-Speech-Streaming | Low-latency streaming English ASR with configurable chunk sizes. Cache-aware FastConformer + RNN-T. |
| Minimize model size | | Smallest multilingual model. Good for edge deployment. |
| Use CTC decoding (simpler pipeline) | | Non-autoregressive. Fast. Good with external language models. |
| Integrate with an external LM | Any Parakeet model + NGPU-LM | GPU-accelerated n-gram LM fusion for CTC, RNNT, and TDT models. |
| Transcribe multi-speaker meetings | Multitalker Parakeet Streaming | Handles overlapping speech in real time with speaker-adapted decoding. |
TTS: Which Model Should I Use?
| I want to… | Recommended Model | Why |
|---|---|---|
| Generate high-quality multilingual speech | MagpieTTS | End-to-end LLM-based TTS. Supports voice cloning and multiple languages. |
| Fast, controllable English synthesis | FastPitch + HiFi-GAN | Cascaded pipeline with pitch/duration control. Well-tested. |
| Generate discrete audio tokens | Audio Codec | Neural audio codec for tokenizing audio. Used by MagpieTTS internally. |
Speaker Tasks: Which Model Should I Use?
| I want to… | Recommended Model | Why |
|---|---|---|
| Determine who spoke when | Streaming Sortformer or Offline Sortformer | End-to-end diarization for up to 4 speakers. Use the streaming model for real-time processing and the offline model for batch processing. |
| Verify/identify a speaker | TitaNet | Extracts speaker embeddings for verification and identification. |
| Detect voice activity | | Frame-level VAD. Multilingual. Works as a preprocessing step. |
Speech Language Models: Which Model Should I Use?
| I want to… | Recommended Model | Why |
|---|---|---|
| Ask questions about audio content | Canary-Qwen 2.5B (SALM) | LLM augmented with speech understanding. Can transcribe, translate, and answer questions about audio. |
| Build a speech-to-speech system | DuplexS2SModel | Full-duplex model that both understands and generates speech. |
Decision Flowchart
What do you want to do?
│
├─ Transcribe speech to text (ASR)
│ ├─ Best accuracy on English? → Canary-Qwen 2.5B (or Parakeet-TDT V2/V3 for fast offline)
│ ├─ Multiple languages + translation? → Canary-1B V2
│ ├─ European multilingual ASR (auto LID)? → Parakeet-TDT 0.6B V3
│ ├─ Stream audio in real time? → Nemotron-Speech-Streaming
│ └─ Multi-speaker meeting? → Multitalker Parakeet Streaming
│
├─ Generate speech from text (TTS)
│ ├─ Multilingual / voice cloning? → MagpieTTS
│ └─ English with pitch control? → FastPitch + HiFi-GAN
│
├─ Identify speakers
│ ├─ Who spoke when? → Streaming Sortformer or Offline Sortformer
│ └─ Verify identity? → TitaNet
│
├─ Enhance audio quality → See Audio Processing models
│
└─ Speech-aware LLM → Canary-Qwen 2.5B (SALM)
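
The flowchart can also be expressed as a small lookup table, which is handy if you want to select a model programmatically (for example, from a config file). The sketch below is illustrative only: the `FLOWCHART` dictionary, the `recommend` helper, and the key strings such as `"english_accuracy"` are hypothetical names chosen for this example and are not part of NeMo. The model names are copied from the flowchart above.

```python
from typing import Optional

# Hypothetical (task, need) keys mapped to the recommendations in the flowchart.
# These labels are made up for this sketch; they are not NeMo identifiers.
FLOWCHART = {
    ("asr", "english_accuracy"): "Canary-Qwen 2.5B",
    ("asr", "translation"): "Canary-1B V2",
    ("asr", "european_multilingual"): "Parakeet-TDT 0.6B V3",
    ("asr", "streaming"): "Nemotron-Speech-Streaming",
    ("asr", "multi_speaker"): "Multitalker Parakeet Streaming",
    ("tts", "multilingual"): "MagpieTTS",
    ("tts", "pitch_control"): "FastPitch + HiFi-GAN",
    ("speaker", "diarization"): "Streaming Sortformer or Offline Sortformer",
    ("speaker", "verification"): "TitaNet",
    ("speech_llm", None): "Canary-Qwen 2.5B (SALM)",
}


def recommend(task: str, need: Optional[str] = None) -> str:
    """Return the flowchart's recommendation for a task and an optional need."""
    try:
        return FLOWCHART[(task, need)]
    except KeyError:
        raise ValueError(
            f"No flowchart branch for task={task!r}, need={need!r}"
        ) from None


if __name__ == "__main__":
    print(recommend("asr", "streaming"))
    print(recommend("speech_llm"))
```

Unknown combinations raise a `ValueError` rather than falling back to a default, so a typo in a key does not silently select the wrong model.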
Where to Find Models
All pretrained NeMo models are available on:
HuggingFace Hub (nvidia) — search for “nemo” or specific model names
NGC Model Catalog — NVIDIA’s model registry
See Checkpoints for instructions on loading pretrained models.
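
As a rough illustration of what loading one of these models can look like, the sketch below wraps NeMo's `ASRModel.from_pretrained` and `transcribe` calls in a helper. It rests on several assumptions: NeMo is installed with the ASR collection, the default identifier `nvidia/parakeet-tdt-0.6b-v2` is the Hugging Face name for Parakeet-TDT V2 (confirm it on the Hub before relying on it), and the helper name `transcribe_files` is made up for this example. The Checkpoints page remains the authoritative reference.

```python
def transcribe_files(audio_paths, model_name="nvidia/parakeet-tdt-0.6b-v2"):
    """Transcribe audio files with a pretrained NeMo ASR model (sketch).

    The default model_name is an assumed Hugging Face identifier; replace it
    with the model you selected from the tables above.
    """
    # Imported lazily so this module can be imported without NeMo installed.
    import nemo.collections.asr as nemo_asr

    # Downloads the checkpoint on first use, then restores it from the cache.
    model = nemo_asr.models.ASRModel.from_pretrained(model_name=model_name)

    results = model.transcribe(list(audio_paths))

    # Recent NeMo releases return hypothesis objects with a .text attribute;
    # if plain strings come back instead, pass them through unchanged.
    return [getattr(result, "text", result) for result in results]
```

Calling `transcribe_files(["meeting.wav"])` (with a path to a real audio file) would return a list with one transcript per input file. Other model families, such as SALM or TTS models, use different classes, so this helper only covers the ASR case.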