Featured Models

NeMo’s ASR collection supports several model architectures. This page covers the key model families and their capabilities. For pretrained checkpoints, see All Checkpoints. For config file details, see Configuration Files.

Parakeet

Parakeet is a family of ASR models with a FastConformer Encoder and CTC, RNN-T, or TDT decoders.

Parakeet-TDT-0.6B V3 — 25 languages, PnC, blazing fast
Parakeet-TDT-0.6B V2 — English-only, PnC, blazing fast
Parakeet-TDT/CTC-110M — Edge deployment
Nemotron-3.5-ASR-Streaming — Real-time streaming, 40 languages
Multitalker-Parakeet — Multi-speaker streaming

Canary

Canary models are encoder-decoder models with a FastConformer Encoder and Transformer Decoder [ASR-MODELS2]. They support ASR in 25 EU languages, speech translation (AST), and punctuation/capitalization (PnC).

Canary-1B V2 — Flagship: 25 languages, PnC, timestamps
Canary-Qwen-2.5B — English only, PnC, highest accuracy
Canary-1B Flash / 180M Flash — Optimized for speed

Canary supports chunked and streaming inference.
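
As an illustration of the chunked case, the sketch below splits a long recording into fixed-length windows that could each be transcribed separately. It is not NeMo's implementation: the helper name chunk_bounds, the 16 kHz sample rate, and the 40-second window are assumptions chosen only for this example.

```python
def chunk_bounds(num_samples, chunk_len, overlap=0):
    """Return (start, end) sample-index pairs that cover [0, num_samples).

    Hypothetical helper for illustration; consecutive chunks share
    `overlap` samples, and the last chunk may be shorter than chunk_len.
    """
    if chunk_len <= 0:
        raise ValueError("chunk_len must be positive")
    if not 0 <= overlap < chunk_len:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_len")

    step = chunk_len - overlap
    bounds = []
    start = 0
    while start < num_samples:
        end = min(start + chunk_len, num_samples)
        bounds.append((start, end))
        if end == num_samples:
            break  # the whole signal is covered
        start += step
    return bounds


# Assumed values: 16 kHz audio, 100 s recording, 40 s chunks, no overlap.
SAMPLE_RATE = 16_000
for start, end in chunk_bounds(100 * SAMPLE_RATE, 40 * SAMPLE_RATE):
    print(f"{start / SAMPLE_RATE:.0f}s to {end / SAMPLE_RATE:.0f}s")
```

A non-zero overlap gives each window some shared context at the boundary, at the cost of having to merge the duplicated text afterwards.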

Conformer

The Conformer [ASR-MODELS1] combines self-attention and convolution modules. NeMo supports CTC, Transducer, and HAT variants.

Conformer-CTC: Non-autoregressive, uses EncDecCTCModelBPE
Conformer-Transducer: Autoregressive, uses EncDecRNNTBPEModel
Conformer-HAT: Separates label and blank predictions for better external LM integration (paper)

Configs: examples/asr/conf/conformer/
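
To make "non-autoregressive" concrete for the CTC variant: every frame is labeled independently, so greedy decoding reduces to a per-frame argmax followed by the standard CTC collapse (merge repeats, then drop blanks). The sketch below is plain Python for illustration, not the decoding code used by EncDecCTCModelBPE; the function names and the toy scores are assumptions.

```python
def ctc_collapse(frame_ids, blank_id):
    """Merge consecutive repeats, then drop blank tokens."""
    tokens = []
    prev = None
    for token in frame_ids:
        if token != prev and token != blank_id:
            tokens.append(token)
        prev = token
    return tokens


def ctc_greedy_decode(frame_scores, blank_id):
    """Pick the best-scoring label for each frame, then collapse.

    frame_scores is a list of per-frame score lists (one score per label).
    No frame depends on a previously emitted token.
    """
    frame_ids = [
        max(range(len(scores)), key=scores.__getitem__)
        for scores in frame_scores
    ]
    return ctc_collapse(frame_ids, blank_id)


# Toy example with 3 labels, where label 0 is the blank.
scores = [
    [0.1, 0.8, 0.1],  # label 1
    [0.2, 0.7, 0.1],  # label 1 again (repeat, merged)
    [0.9, 0.05, 0.05],  # blank
    [0.1, 0.8, 0.1],  # label 1 (kept, separated by a blank)
    [0.1, 0.2, 0.7],  # label 2
]
print(ctc_greedy_decode(scores, blank_id=0))  # [1, 1, 2]
```

A Transducer decoder, by contrast, conditions each prediction on the tokens emitted so far, which is why it is described as autoregressive.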

Fast-Conformer

Fast Conformer uses 8x depthwise convolutional subsampling and reduced kernel sizes, making it ~2.4x faster than the standard Conformer with minimal quality loss. It also supports Longformer-style local attention for audio longer than 1 hour.

Configs: examples/asr/conf/fastconformer/
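
The effect of the subsampling factor on sequence length can be worked through with some back-of-the-envelope arithmetic. The sketch below assumes a 10 ms feature hop and 4x subsampling for the standard Conformer; the helper names and the 64-frame local window are hypothetical, and the numbers only count attention pairs, so they do not reproduce the ~2.4x end-to-end figure.

```python
import math


def encoder_frames(duration_s, hop_ms=10.0, subsampling=8):
    """Approximate number of encoder frames for an utterance.

    Assumes one feature frame per hop_ms of audio, reduced by the
    subsampling factor before the encoder layers.
    """
    feature_frames = math.ceil(duration_s * 1000.0 / hop_ms)
    return math.ceil(feature_frames / subsampling)


def attention_pairs(num_frames, window=None):
    """Upper bound on query-key pairs scored by one attention layer.

    window=None means full self-attention (quadratic in num_frames);
    otherwise each frame sees at most `window` frames on each side.
    """
    if window is None:
        return num_frames * num_frames
    return num_frames * min(num_frames, 2 * window + 1)


one_hour = 3600
n8 = encoder_frames(one_hour, subsampling=8)
n4 = encoder_frames(one_hour, subsampling=4)
print(n8, n4)  # 45000 90000
print(attention_pairs(n4) / attention_pairs(n8))  # 4.0
print(attention_pairs(n8) / attention_pairs(n8, window=64))
```

Halving the sequence length cuts full-attention pairs by four, and a local window makes the cost grow linearly with duration, which is what makes hour-long inputs practical.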

Cache-aware Streaming Conformer

These streaming models are trained with limited right context for real-time inference and use caching to avoid duplicate computation. Three modes are supported: fully causal, regular look-ahead, and chunk-aware look-ahead (recommended).

Tutorial notebook
Simulation script: examples/asr/asr_cache_aware_streaming/speech_to_text_cache_aware_streaming_infer.py
Supports multiple look-aheads with att_context_size lists

Configs: examples/asr/conf/fastconformer/cache_aware_streaming/
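
One way to picture chunk-aware look-ahead is as an attention mask in which frames are grouped into chunks and each frame sees its own chunk plus a limited number of past chunks. The sketch below is a simplified pure-Python illustration under stated assumptions, not the encoder's actual code: it treats att_context_size as a [left, right] pair in encoder frames, takes the chunk size to be right + 1, and assumes 80 ms per encoder frame. The function names and example values are hypothetical.

```python
def chunked_lookahead_mask(num_frames, left, right):
    """Build mask[i][j], True when frame i may attend to frame j.

    Frames are grouped into chunks of (right + 1) frames. A frame sees
    every frame in its own chunk and in up to left // (right + 1)
    preceding chunks, and never anything in a later chunk.
    """
    if num_frames < 0 or left < 0 or right < 0:
        raise ValueError("arguments must be non-negative")
    chunk_size = right + 1
    left_chunks = left // chunk_size
    chunk_of = [i // chunk_size for i in range(num_frames)]
    return [
        [0 <= chunk_of[i] - chunk_of[j] <= left_chunks for j in range(num_frames)]
        for i in range(num_frames)
    ]


def max_lookahead_ms(right, frame_ms=80):
    """Largest future context any frame sees, in milliseconds (assumed frame_ms)."""
    return right * frame_ms


# Small example: 6 frames, left context 2, right context 1.
for row in chunked_lookahead_mask(num_frames=6, left=2, right=1):
    print("".join("x" if allowed else "." for allowed in row))

# Illustrative list of [left, right] settings for a multi-look-ahead model.
for left, right in [[70, 13], [70, 6], [70, 1], [70, 0]]:
    print([left, right], max_lookahead_ms(right), "ms")
```

With right set to 0 the chunk size is 1 and the mask becomes fully causal. Because no frame attends beyond its own chunk, look-ahead does not accumulate across layers, and activations from past chunks can be cached and reused.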

With Prompt Conditioning (RNN-T only): Cache-aware streaming RNN-T model with language-ID prompt conditioning for multilingual ASR via EncDecRNNTBPEModelWithPrompt. The streaming inference script accepts a target_lang flag to select the prompt at runtime (see RNN-T with Prompt Conditioning Configuration). Config: fastconformer_transducer_bpe_streaming_prompt.yaml

Multitalker Streaming

Streaming multi-speaker ASR based on cache-aware FastConformer with speaker kernel injection [ASR-MODELS3]. Deploys one model instance per speaker for robust transcription of overlapped speech.

Hybrid-Transducer-CTC

Models with both RNN-T and CTC decoders trained jointly. Switch between them at inference time by calling asr_model.change_decoding_strategy() with decoder_type='ctc' or decoder_type='rnnt'.

EncDecHybridRNNTCTCBPEModel (BPE) / EncDecHybridRNNTCTCModel (char)
Configs: examples/asr/conf/fastconformer/hybrid_transducer_ctc/

With Prompt Conditioning: Extends Hybrid models with learnable prompt embeddings for multilingual/multi-domain ASR via EncDecHybridRNNTCTCBPEModelWithPrompt. Config: fastconformer_hybrid_transducer_ctc_bpe_prompt.yaml

References

[ASR-MODELS1]

Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and others. Conformer: convolution-augmented transformer for speech recognition. arXiv preprint arXiv:2005.08100, 2020.

[ASR-MODELS2]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 6000–6010. 2017.

[ASR-MODELS3]

Weiqing Wang, Taejin Park, Ivan Medennikov, Jinhan Wang, Kunal Dhawan, He Huang, Nithin Rao Koluguri, Jagadeesh Balam, and Boris Ginsburg. Speaker Targeting via Self-Speaker Adaptation for Multi-talker ASR. In Interspeech 2025, 5498–5502. 2025. doi:10.21437/Interspeech.2025-2142.