ASR Model Checkpoints

This page lists all supported ASR model checkpoints released by NVIDIA NeMo. Benchmark scores for each model can be found on its Hugging Face model card.

Glossary

| Term | Definition |
| --- | --- |
| ASR | Automatic Speech Recognition: transcribing speech to text |
| AST | Automatic Speech Translation: translating speech to text from one language to another |
| AED | Attention Encoder-Decoder: autoregressive decoder using cross-attention (Canary family) |
| CTC | Connectionist Temporal Classification: non-autoregressive decoder |
| RNN-T | Recurrent Neural Network Transducer: autoregressive streaming-friendly decoder |
| TDT | Token-and-Duration Transducer: extends RNN-T with duration prediction for faster inference |
| Hybrid | Joint RNN-T + CTC model: both decoders trained together, either usable at inference |
| PnC | Punctuation and Capitalization in the output |
| SALM | Speech Augmented Language Model: combines a speech encoder with a large language model |
| Streaming | Real-time / cache-aware inference capability |
| EU4 | English, German, Spanish, French |
| EU25 | English, German, Spanish, French, Italian, Polish, Portuguese, Dutch, Russian, Ukrainian, Belarusian, Croatian, Czech, Bulgarian, Danish, Estonian, Finnish, Greek, Hungarian, Latvian, Lithuanian, Maltese, Romanian, Slovak, Slovenian, Swedish |

Canary Models (AED)

Multi-task encoder-decoder models supporting ASR, AST, PnC, and timestamps across multiple languages.

| Model | Decoder | Capabilities | Language | Size |
| --- | --- | --- | --- | --- |
| canary-1b-v2 | AED | ASR, AST, PnC, timestamps | EU25 | 1B |
| canary-qwen-2.5b | SALM | ASR, AST, PnC, timestamps | EU25 | 2.5B |
| canary-1b-flash | AED | ASR, AST, PnC, timestamps, fast | EU4 | 1B |
| canary-180m-flash | AED | ASR, AST, PnC, timestamps, fast | EU4 | 180M |
| canary-1b | AED | ASR, AST, PnC | EU4 | 1B |
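As a rough illustration of how ASR and AST differ for the Canary family, the sketch below builds the language arguments for a single request. It assumes the `source_lang` and `target_lang` keyword arguments described on the Canary model cards and ISO 639-1 codes for the EU4 languages. `canary_request` is a hypothetical helper, not part of NeMo, and which language pairs a given checkpoint accepts should be checked on its model card.

```python
# Hypothetical helper, not part of NeMo. The language codes are an assumption
# (ISO 639-1 codes for the EU4 set defined in the glossary).
EU4 = {"en": "English", "de": "German", "es": "Spanish", "fr": "French"}


def canary_request(source_lang: str, target_lang: str) -> dict:
    """Return the task label and language kwargs for one request."""
    for code in (source_lang, target_lang):
        if code not in EU4:
            raise ValueError(f"{code!r} is not an EU4 language code")
    # Same language in and out is transcription (ASR); otherwise translation (AST).
    task = "ASR" if source_lang == target_lang else "AST"
    return {
        "task": task,
        "kwargs": {"source_lang": source_lang, "target_lang": target_lang},
    }


print(canary_request("en", "en")["task"])  # ASR
print(canary_request("en", "de")["task"])  # AST
```

On a loaded Canary checkpoint, the returned kwargs could then be passed along the lines of `model.transcribe(["audio.wav"], **request["kwargs"])`, where the audio path is a placeholder.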

Parakeet Models

High-accuracy ASR models built on the FastConformer encoder architecture. Parakeet, Nemotron Speech, and the stt_*_fastconformer_* models below all share the same underlying FastConformer encoder; the different names reflect release branding, not architectural differences.

| Model | Decoder | Capabilities | Language | Size |
| --- | --- | --- | --- | --- |
| parakeet-tdt-0.6b-v3 | TDT | ASR, PnC, timestamps | English | 0.6B |
| parakeet-tdt-0.6b-v2 | TDT | ASR, PnC, timestamps | English | 0.6B |
| parakeet-tdt-1.1b | TDT | ASR, timestamps | English | 1.1B |
| parakeet-tdt_ctc-1.1b | Hybrid TDT+CTC | ASR, timestamps | English | 1.1B |
| parakeet-tdt_ctc-0.6b-ja | Hybrid TDT+CTC | ASR, timestamps | Japanese | 0.6B |
| parakeet-tdt_ctc-110m | Hybrid TDT+CTC | ASR, timestamps | English | 110M |
| parakeet-rnnt-1.1b | RNN-T | ASR, timestamps | English | 1.1B |
| parakeet-rnnt-0.6b | RNN-T | ASR, timestamps | English | 0.6B |
| parakeet-ctc-1.1b | CTC | ASR | English | 1.1B |
| parakeet-ctc-0.6b | CTC | ASR | English | 0.6B |
| parakeet-ctc-0.6b-Vietnamese | CTC | ASR | Vietnamese | 0.6B |
| parakeet-rnnt-110m-da-dk | RNN-T | ASR | Danish | 110M |
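For the checkpoints that list timestamps, the Parakeet model cards describe calling `transcribe(..., timestamps=True)` and reading segment entries with `start`, `end`, and `segment` keys from the result. The sketch below only formats such entries. The dictionary layout is an assumption taken from those model cards, `format_segments` is a hypothetical helper, and the sample data is made up for illustration.

```python
def format_segments(segments: list[dict]) -> list[str]:
    """Render segment-level timestamps as 'start-end: text' lines."""
    return [
        f"{seg['start']:.2f}s-{seg['end']:.2f}s: {seg['segment']}"
        for seg in segments
    ]


# Made-up sample in the assumed layout (not real model output).
sample = [
    {"start": 0.0, "end": 1.44, "segment": "Hello there."},
    {"start": 1.6, "end": 3.2, "segment": "How are you?"},
]

for line in format_segments(sample):
    print(line)
```

With a loaded model, the input would come from something like `model.transcribe(["audio.wav"], timestamps=True)[0].timestamp["segment"]`, again assuming the interface shown on the model cards.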

Streaming Models

Cache-aware models for real-time / low-latency inference.

| Model | Decoder | Capabilities | Language | Size |
| --- | --- | --- | --- | --- |
| nemotron-speech-streaming-en-0.6b | Hybrid | ASR, streaming | English | 0.6B |
| multitalker-parakeet-streaming-0.6b-v1 | RNN-T | ASR, multitalker, streaming | English | 0.6B |
| parakeet_realtime_eou_120m-v1 | RNN-T | ASR, end-of-utterance, streaming | English | 120M |
| stt_en_fastconformer_hybrid_large_streaming_multi | Hybrid | ASR, streaming, multiple look-aheads | English | Large |
| stt_en_fastconformer_hybrid_medium_streaming_80ms_pc | Hybrid | ASR, PnC, streaming | English | Medium |
| stt_en_fastconformer_hybrid_medium_streaming_80ms | Hybrid | ASR, streaming | English | Medium |
| stt_ka_fastconformer_hybrid_transducer_ctc_large_streaming_80ms_pc | Hybrid | ASR, PnC, streaming | Georgian | Large |
| stt_en_fastconformer_hybrid_large_streaming_1040ms | Hybrid | ASR, streaming | English | Large |
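The `80ms` and `1040ms` suffixes in the streaming checkpoint names refer to look-ahead latency. As a back-of-the-envelope sketch, assuming each FastConformer encoder frame covers 80 ms of audio (8x subsampling of 10 ms features) and that look-ahead is expressed as a number of right-context frames, the latency works out as below. `FRAME_MS` and `lookahead_ms` are hypothetical names introduced for illustration.

```python
FRAME_MS = 80  # assumed duration of one encoder output frame, in milliseconds


def lookahead_ms(right_context_frames: int) -> int:
    """Look-ahead latency implied by a right-context size, in milliseconds."""
    if right_context_frames < 0:
        raise ValueError("right context must be non-negative")
    return right_context_frames * FRAME_MS


for frames in (0, 1, 6, 13):
    print(frames, "frames ->", lookahead_ms(frames), "ms")
```

Under these assumptions, one frame of right context corresponds to the `80ms` checkpoints and thirteen frames to the `1040ms` checkpoint. A checkpoint with multiple look-aheads supports several such settings; its model card describes how to select one.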

FastConformer English Models (Non-Streaming)

| Model | Decoder | Capabilities | Language | Size |
| --- | --- | --- | --- | --- |
| stt_en_fastconformer_hybrid_large_pc | Hybrid | ASR, PnC | English | Large |
| stt_en_fastconformer_ctc_large | CTC | ASR | English | Large |
| stt_en_fastconformer_ctc_xlarge | CTC | ASR | English | XLarge |
| stt_en_fastconformer_ctc_xxlarge | CTC | ASR | English | XXLarge |
| stt_en_fastconformer_transducer_large | RNN-T | ASR | English | Large |
| stt_en_fastconformer_transducer_xlarge | RNN-T | ASR | English | XLarge |
| stt_en_fastconformer_transducer_xxlarge | RNN-T | ASR | English | XXLarge |
| stt_en_fastconformer_tdt_large | TDT | ASR | English | Large |
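The glossary notes that a Hybrid model carries both decoders and that either can be used at inference. A minimal sketch of switching between them follows, assuming the loaded model exposes NeMo's `change_decoding_strategy(decoder_type=...)` method with `"rnnt"` and `"ctc"` as accepted values. `select_decoder` is a hypothetical wrapper, not a NeMo function.

```python
def select_decoder(model, decoder_type: str):
    """Switch a Hybrid model between its transducer and CTC decoders."""
    if decoder_type not in ("rnnt", "ctc"):
        raise ValueError("decoder_type must be 'rnnt' or 'ctc'")
    # Assumption: the model provides change_decoding_strategy(decoder_type=...),
    # as NeMo's hybrid transducer-CTC models do.
    model.change_decoding_strategy(decoder_type=decoder_type)
    return model
```

For example, `select_decoder(model, "ctc")` on a loaded Hybrid checkpoint would route subsequent `transcribe()` calls through the CTC decoder, which is non-autoregressive.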

FastConformer Multilingual Models

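FastConformer Hybrid models for languages other than English. Most cover a single language; stt_multilingual_fastconformer_hybrid_large_pc_blend_eu and stt_kk_ru_fastconformer_hybrid_large each cover more than one. Models with the _pc suffix support punctuation and capitalization.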

| Model | Decoder | Capabilities | Language | Size |
| --- | --- | --- | --- | --- |
| stt_multilingual_fastconformer_hybrid_large_pc_blend_eu | Hybrid | ASR, PnC | Multilingual EU | Large |
| stt_de_fastconformer_hybrid_large_pc | Hybrid | ASR, PnC | German | Large |
| stt_es_fastconformer_hybrid_large_pc | Hybrid | ASR, PnC | Spanish | Large |
| stt_es_fastconformer_hybrid_large_pc_nc | Hybrid | ASR, Punctuation only | Spanish | Large |
| stt_fr_fastconformer_hybrid_large_pc | Hybrid | ASR, PnC | French | Large |
| stt_it_fastconformer_hybrid_large_pc | Hybrid | ASR, PnC | Italian | Large |
| stt_ru_fastconformer_hybrid_large_pc | Hybrid | ASR, PnC | Russian | Large |
| stt_ua_fastconformer_hybrid_large_pc | Hybrid | ASR, PnC | Ukrainian | Large |
| stt_pl_fastconformer_hybrid_large_pc | Hybrid | ASR, PnC | Polish | Large |
| stt_hr_fastconformer_hybrid_large_pc | Hybrid | ASR, PnC | Croatian | Large |
| stt_be_fastconformer_hybrid_large_pc | Hybrid | ASR, PnC | Belarusian | Large |
| stt_nl_fastconformer_hybrid_large_pc | Hybrid | ASR, PnC | Dutch | Large |
| stt_pt_fastconformer_hybrid_large_pc | Hybrid | ASR, PnC | Portuguese | Large |
| stt_fa_fastconformer_hybrid_large | Hybrid | ASR | Farsi | Large |
| stt_ka_fastconformer_hybrid_large_pc | Hybrid | ASR, PnC | Georgian | Large |
| stt_hy_fastconformer_hybrid_large_pc | Hybrid | ASR, PnC | Armenian | Large |
| stt_ar_fastconformer_hybrid_large_pc_v1.0 | Hybrid | ASR, PnC | Arabic | Large |
| stt_ar_fastconformer_hybrid_large_pcd_v1.0 | Hybrid | ASR, PnC (diacritized) | Arabic | Large |
| stt_uz_fastconformer_hybrid_large_pc | Hybrid | ASR, PnC | Uzbek | Large |
| stt_kk_ru_fastconformer_hybrid_large | Hybrid | ASR | Kazakh + Russian | Large |
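The naming convention in the table above can be read mechanically. The sketch below infers punctuation and capitalization support from a checkpoint name. The handling of the `pc`, `pc_nc`, and `pcd` markers is inferred only from the rows listed here, and `pnc_support` is a hypothetical helper rather than anything NeMo provides.

```python
def pnc_support(checkpoint_name: str) -> str:
    """Guess PnC support from the underscore-separated parts of a name."""
    tokens = checkpoint_name.split("_")
    if "pcd" in tokens:
        return "PnC (diacritized)"
    if "pc" in tokens:
        # A following "nc" marks the punctuation-only variant in the table.
        return "Punctuation only" if "nc" in tokens else "PnC"
    return "none"


for name in (
    "stt_de_fastconformer_hybrid_large_pc",
    "stt_es_fastconformer_hybrid_large_pc_nc",
    "stt_ar_fastconformer_hybrid_large_pcd_v1.0",
    "stt_fa_fastconformer_hybrid_large",
):
    print(name, "->", pnc_support(name))
```

Splitting on underscores rather than checking `endswith("_pc")` keeps names such as `stt_ar_fastconformer_hybrid_large_pc_v1.0`, where a version tag follows the marker, classified correctly.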

Loading Models

All models except SALM (see SpeechLM2) can be loaded via the from_pretrained() API:

```python
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v2")
```
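Building on the call above, a small wrapper can load any checkpoint name from the tables on this page and transcribe a list of files. This is a sketch under two assumptions: that the checkpoints are published under the `nvidia/` organization on Hugging Face, as in the example above, and that `transcribe()` returns either plain strings or objects with a `.text` attribute, depending on the NeMo release. `hf_model_id` and `transcribe_files` are hypothetical helper names.

```python
def hf_model_id(checkpoint_name: str, org: str = "nvidia") -> str:
    """Build a Hugging Face model id such as 'nvidia/parakeet-tdt-0.6b-v2'."""
    return checkpoint_name if "/" in checkpoint_name else f"{org}/{checkpoint_name}"


def transcribe_files(checkpoint_name: str, audio_paths: list[str]) -> list[str]:
    """Load a checkpoint and return one transcript string per audio file."""
    # Imported lazily so this module can be imported without NeMo installed.
    import nemo.collections.asr as nemo_asr

    model = nemo_asr.models.ASRModel.from_pretrained(hf_model_id(checkpoint_name))
    results = model.transcribe(audio_paths)
    # Assumption: results are strings or objects exposing .text; handle both.
    return [getattr(result, "text", result) for result in results]


print(hf_model_id("parakeet-tdt-0.6b-v2"))  # nvidia/parakeet-tdt-0.6b-v2
```

A call such as `transcribe_files("parakeet-tdt-0.6b-v2", ["audio.wav"])` would download the checkpoint on first use; the audio path is a placeholder.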