ASR Model Checkpoints
This page lists all supported ASR model checkpoints released by NVIDIA NeMo. Benchmark scores for each model can be found on its Hugging Face model card.
Glossary
| Term | Definition |
|---|---|
| ASR | Automatic Speech Recognition: transcribing speech to text |
| AST | Automatic Speech Translation: translating speech to text from one language to another |
| AED | Attention Encoder-Decoder: autoregressive decoder using cross-attention (Canary family) |
| CTC | Connectionist Temporal Classification: non-autoregressive decoder |
| RNN-T | Recurrent Neural Network Transducer: autoregressive, streaming-friendly decoder |
| TDT | Token-and-Duration Transducer: extends RNN-T with duration prediction for faster inference |
| Hybrid | Joint RNN-T + CTC model: both decoders trained together, either usable at inference |
| PnC | Punctuation and Capitalization in the output |
| SALM | Speech Augmented Language Model: combines a speech encoder with a large language model |
| Streaming | Real-time / cache-aware inference capability |
| EU4 | English, German, Spanish, French |
| EU25 | English, German, Spanish, French, Italian, Polish, Portuguese, Dutch, Russian, Ukrainian, Belarusian, Croatian, Czech, Bulgarian, Danish, Estonian, Finnish, Greek, Hungarian, Latvian, Lithuanian, Maltese, Romanian, Slovak, Slovenian, Swedish |
Canary Models (AED)
Multi-task encoder-decoder models supporting ASR, AST, PnC, and timestamps across multiple languages.
| Model | Decoder | Capabilities | Language | Size |
|---|---|---|---|---|
| | AED | ASR, AST, PnC, timestamps | EU25 | 1B |
| | SALM | ASR, AST, PnC, timestamps | EU25 | 2.5B |
| | AED | ASR, AST, PnC, timestamps, fast | EU4 | 1B |
| | AED | ASR, AST, PnC, timestamps, fast | EU4 | 180M |
| | AED | ASR, AST, PnC | EU4 | 1B |
Parakeet Models
High-accuracy ASR models built on the FastConformer encoder architecture.
Parakeet, Nemotron Speech, and the stt_*_fastconformer_* models below all share the same underlying FastConformer encoder;
the different names reflect release branding, not architectural differences.
| Model | Decoder | Capabilities | Language | Size |
|---|---|---|---|---|
| | TDT | ASR, PnC, timestamps | English | 0.6B |
| | TDT | ASR, PnC, timestamps | English | 0.6B |
| | TDT | ASR, timestamps | English | 1.1B |
| | Hybrid TDT+CTC | ASR, timestamps | English | 1.1B |
| | Hybrid TDT+CTC | ASR, timestamps | Japanese | 0.6B |
| | Hybrid TDT+CTC | ASR, timestamps | English | 110M |
| | RNN-T | ASR, timestamps | English | 1.1B |
| | RNN-T | ASR, timestamps | English | 0.6B |
| | CTC | ASR | English | 1.1B |
| | CTC | ASR | English | 0.6B |
| | CTC | ASR | Vietnamese | 0.6B |
| | RNN-T | ASR | Danish | 110M |
Streaming Models
Cache-aware models for real-time / low-latency inference.
| Model | Decoder | Capabilities | Language | Size |
|---|---|---|---|---|
| | Hybrid | ASR, streaming | English | 0.6B |
| | RNN-T | ASR, multitalker, streaming | English | 0.6B |
| | RNN-T | ASR, end-of-utterance, streaming | English | 120M |
| | Hybrid | ASR, streaming, multiple look-aheads | English | Large |
| | Hybrid | ASR, PnC, streaming | English | Medium |
| | Hybrid | ASR, streaming | English | Medium |
| stt_ka_fastconformer_hybrid_transducer_ctc_large_streaming_80ms_pc | Hybrid | ASR, PnC, streaming | Georgian | Large |
| | Hybrid | ASR, streaming | English | Large |
FastConformer English Models (Non-Streaming)
| Model | Decoder | Capabilities | Language | Size |
|---|---|---|---|---|
| | Hybrid | ASR, PnC | English | Large |
| | CTC | ASR | English | Large |
| | CTC | ASR | English | XLarge |
| | CTC | ASR | English | XXLarge |
| | RNN-T | ASR | English | Large |
| | RNN-T | ASR | English | XLarge |
| | RNN-T | ASR | English | XXLarge |
| | TDT | ASR | English | Large |
FastConformer Multilingual Models
FastConformer Hybrid models, mostly single-language; the table also includes a multilingual EU model and a Kazakh + Russian model. Models whose names end in _pc support punctuation and capitalization (PnC).
| Model | Decoder | Capabilities | Language | Size |
|---|---|---|---|---|
| | Hybrid | ASR, PnC | Multilingual EU | Large |
| | Hybrid | ASR, PnC | German | Large |
| | Hybrid | ASR, PnC | Spanish | Large |
| | Hybrid | ASR, Punctuation only | Spanish | Large |
| | Hybrid | ASR, PnC | French | Large |
| | Hybrid | ASR, PnC | Italian | Large |
| | Hybrid | ASR, PnC | Russian | Large |
| | Hybrid | ASR, PnC | Ukrainian | Large |
| | Hybrid | ASR, PnC | Polish | Large |
| | Hybrid | ASR, PnC | Croatian | Large |
| | Hybrid | ASR, PnC | Belarusian | Large |
| | Hybrid | ASR, PnC | Dutch | Large |
| | Hybrid | ASR, PnC | Portuguese | Large |
| | Hybrid | ASR | Farsi | Large |
| | Hybrid | ASR, PnC | Georgian | Large |
| | Hybrid | ASR, PnC | Armenian | Large |
| | Hybrid | ASR, PnC | Arabic | Large |
| | Hybrid | ASR, PnC (diacritized) | Arabic | Large |
| | Hybrid | ASR, PnC | Uzbek | Large |
| | Hybrid | ASR | Kazakh + Russian | Large |
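The _pc naming convention can be checked mechanically when choosing a checkpoint by name. The sketch below is a minimal illustration and not part of NeMo: describe_checkpoint is a hypothetical helper, and it assumes that names follow the stt_*_fastconformer_* pattern used on this page, with the language code in the second underscore-separated field and the word "streaming" present in streaming checkpoint names.

```python
from fnmatch import fnmatch


def describe_checkpoint(name: str) -> dict:
    """Derive basic properties from an stt_*_fastconformer_* checkpoint name.

    Hypothetical helper (not a NeMo API); it relies only on the naming
    pattern used on this page.
    """
    if not fnmatch(name, "stt_*_fastconformer_*"):
        raise ValueError(f"not an stt_*_fastconformer_* name: {name!r}")
    parts = name.split("_")
    return {
        # Assumption: the field right after "stt" is the language code.
        "language_code": parts[1],
        # A trailing "_pc" marks punctuation and capitalization support.
        "pnc": name.endswith("_pc"),
        # Assumption: streaming checkpoints carry "streaming" in the name.
        "streaming": "streaming" in parts,
    }


info = describe_checkpoint(
    "stt_ka_fastconformer_hybrid_transducer_ctc_large_streaming_80ms_pc"
)
print(info)
```

Names that encode more than one language, or that use a different prefix such as the Parakeet checkpoints, would need extra handling; the helper deliberately rejects anything outside the stt_*_fastconformer_* pattern.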
Loading Models
All models except SALM (see SpeechLM2) can be loaded with the from_pretrained() API:
import nemo.collections.asr as nemo_asr
model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v2")
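Once loaded, a model is typically used through its transcribe() method. The following is a minimal sketch rather than reference usage: transcribe_files and its loader parameter are hypothetical names introduced here so that the flow can be exercised without downloading a checkpoint, and the audio path in the example call is a placeholder. It assumes that transcribe() accepts a list of audio file paths and returns either plain strings or hypothesis objects with a text attribute, depending on the NeMo version, and that Hybrid models expose change_decoding_strategy(decoder_type=...) for choosing between their RNN-T and CTC decoders.

```python
from typing import Callable, List, Optional


def _nemo_loader(model_name: str):
    """Default loader: the from_pretrained() call shown above."""
    import nemo.collections.asr as nemo_asr  # imported lazily; requires NeMo

    return nemo_asr.models.ASRModel.from_pretrained(model_name)


def transcribe_files(
    model_name: str,
    audio_paths: List[str],
    decoder_type: Optional[str] = None,
    loader: Callable = _nemo_loader,
) -> List[str]:
    """Load a checkpoint and return one transcript string per audio file."""
    model = loader(model_name)
    if decoder_type is not None:
        # Assumption: only Hybrid models support switching decoders,
        # for example decoder_type="ctc" or decoder_type="rnnt".
        model.change_decoding_strategy(decoder_type=decoder_type)
    results = model.transcribe(audio_paths)
    # Assumption: results are strings or objects exposing a .text attribute.
    return [r if isinstance(r, str) else r.text for r in results]


# Example call (downloads the checkpoint; "sample.wav" is a placeholder path):
# transcripts = transcribe_files("nvidia/parakeet-tdt-0.6b-v2", ["sample.wav"])
```

Passing the loader as a parameter keeps the helper testable with a stub model; in normal use the default loader is enough and only the model name and audio paths need to be supplied.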