Featured Models#
NeMo’s ASR collection supports several model architectures. This page covers the key model families and their capabilities. For pretrained checkpoints, see All Checkpoints. For config file details, see Configuration Files.
Parakeet#
Parakeet is a family of ASR models with a FastConformer Encoder and CTC, RNN-T, or TDT decoders.
Parakeet-TDT-0.6B V3 — 25 languages, PnC, blazing fast
Parakeet-TDT-0.6B V2 — English-only, PnC, blazing fast
Parakeet-TDT/CTC-110M — Edge deployment
Nemotron-Speech-Streaming — Real-time streaming
Multitalker-Parakeet — Multi-speaker streaming
Canary#
Canary models are encoder-decoder models with a FastConformer Encoder and Transformer Decoder [ASR-MODELS2]. They support ASR in 25 EU languages, speech translation (AST), and punctuation/capitalization (PnC).
Canary-1B V2 — Flagship: 25 languages, PnC, timestamps
Canary-Qwen-2.5B — English only, PnC, highest accuracy
Canary-1B Flash / 180M Flash — Optimized for speed
Canary supports chunked and streaming inference.
Conformer#
The Conformer [ASR-MODELS1] combines self-attention and convolution modules. NeMo supports CTC, Transducer, and Hybrid Autoregressive Transducer (HAT) variants.
Conformer-CTC: Non-autoregressive, uses EncDecCTCModelBPE
Conformer-Transducer: Autoregressive, uses EncDecRNNTBPEModel
Conformer-HAT: Separates label and blank predictions for better external LM integration (paper)
Configs: examples/asr/conf/conformer/
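To make "separates label and blank predictions" concrete, here is a minimal pure-Python sketch of a HAT-style output factorization: blank is scored by its own sigmoid, and the remaining probability mass is spread over the labels by a softmax. The function name and the toy logits are hypothetical and chosen for illustration; this is not NeMo's implementation.

```python
import math


def hat_distribution(blank_logit, label_logits):
    """Return (p_blank, label_probs) under a HAT-style factorization.

    Hypothetical helper for illustration only; not part of NeMo.
    """
    # Blank is a Bernoulli decision, scored independently of the labels.
    p_blank = 1.0 / (1.0 + math.exp(-blank_logit))

    # Labels get their own softmax (subtract the max for numerical stability).
    max_logit = max(label_logits)
    exps = [math.exp(x - max_logit) for x in label_logits]
    total = sum(exps)
    label_posteriors = [e / total for e in exps]

    # Scale the label softmax by the non-blank mass so everything sums to 1.
    label_probs = [(1.0 - p_blank) * p for p in label_posteriors]
    return p_blank, label_probs


p_blank, label_probs = hat_distribution(blank_logit=0.0, label_logits=[2.0, 1.0, 0.0])
print(round(p_blank, 3), [round(p, 3) for p in label_probs])
```

Because the label softmax is normalized without blank, the label scores can be treated much like a language-model distribution, which is generally what makes it easier to discount the model's internal LM when fusing an external one.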
Fast-Conformer#
Fast Conformer uses 8x depthwise convolutional subsampling and reduced kernel sizes, making it about 2.4x faster than the standard Conformer with minimal quality loss. It also supports Longformer-style local attention for audio longer than 1 hour.
Configs: examples/asr/conf/fastconformer/
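The effect of subsampling and local attention on sequence length can be sketched with some back-of-the-envelope arithmetic. The helpers below are hypothetical, and the numbers rest on assumptions not stated on this page: a 10 ms feature hop, 4x subsampling for the standard Conformer baseline, and a local window of 128 frames on each side. The sketch only counts encoder frames and query-key pairs; it does not reproduce the ~2.4x speed figure.

```python
def encoder_frames(audio_seconds, subsampling, feature_hop_ms=10.0):
    """Number of encoder frames for a clip (hypothetical helper).

    feature_hop_ms=10.0 is an assumed feature hop, not taken from this page.
    """
    feature_frames = audio_seconds * 1000.0 / feature_hop_ms
    return int(feature_frames // subsampling)


def attention_pairs(num_frames, local_window=None):
    """Count query-key pairs for full or sliding-window self-attention.

    local_window is the number of frames visible on each side of a query;
    None means every frame attends to every other frame.
    """
    if local_window is None:
        return num_frames * num_frames
    return sum(
        min(i, local_window) + min(num_frames - 1 - i, local_window) + 1
        for i in range(num_frames)
    )


one_hour = 3600
frames_8x = encoder_frames(one_hour, subsampling=8)  # Fast Conformer
frames_4x = encoder_frames(one_hour, subsampling=4)  # assumed baseline
print("frames at 8x:", frames_8x, "frames at 4x:", frames_4x)
print("full attention pairs at 8x:", attention_pairs(frames_8x))
print("local attention pairs at 8x:", attention_pairs(frames_8x, local_window=128))
```

Halving the number of frames cuts the full-attention pair count by four, and a fixed local window makes the pair count grow linearly with audio length instead of quadratically, which is why local attention helps on very long recordings.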
Cache-aware Streaming Conformer#
These streaming models are trained with limited right context for real-time inference and use caching to avoid duplicate computation. Three modes are supported: fully causal, regular look-ahead, and chunk-aware look-ahead (recommended).
Simulation script: examples/asr/asr_cache_aware_streaming/speech_to_text_cache_aware_streaming_infer.py
Supports multiple look-aheads with att_context_size lists
Configs: examples/asr/conf/fastconformer/cache_aware_streaming/
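The three modes differ in which frames each encoder frame may attend to. The sketch below is a simplified illustration, not NeMo's implementation. It assumes each att_context_size entry is a [left, right] pair measured in encoder frames, that chunk-aware look-ahead uses chunks of right + 1 frames with left context rounded down to whole chunks, and that one encoder frame covers 80 ms. The helper names and example values are hypothetical.

```python
def visible_frames(i, num_frames, left, right, mode):
    """Indices that frame i may attend to (hypothetical helper)."""
    if mode == "causal":
        # Fully causal: past context only, no future frames.
        lo, hi = i - left, i
    elif mode == "regular":
        # Regular look-ahead: every frame sees `right` future frames.
        lo, hi = i - left, i + right
    elif mode == "chunked":
        # Chunk-aware look-ahead: frames see their whole chunk plus
        # a fixed number of preceding chunks, and nothing beyond it.
        chunk = right + 1
        chunk_start = (i // chunk) * chunk
        left_chunks = left // chunk
        lo = chunk_start - left_chunks * chunk
        hi = chunk_start + chunk - 1
    else:
        raise ValueError(f"unknown mode: {mode}")
    return list(range(max(lo, 0), min(hi, num_frames - 1) + 1))


def lookahead_ms(att_context_sizes, frame_ms=80.0):
    """Maximum look-ahead per [left, right] entry, assuming frame_ms per frame."""
    return [right * frame_ms for _left, right in att_context_sizes]


for mode in ("causal", "regular", "chunked"):
    print(mode, visible_frames(3, num_frames=8, left=4, right=1, mode=mode))

# Illustrative list of look-aheads, from largest to fully causal.
print(lookahead_ms([[70, 13], [70, 6], [70, 1], [70, 0]]))
```

In this toy setup, frame 3 is the last frame of its chunk, so the chunked mode gives it no future frames, while the regular mode still lets it see frame 4. With regular look-ahead, the per-layer look-ahead can compound across stacked layers, whereas chunk boundaries keep the total look-ahead bounded by the chunk size.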
With Prompt Conditioning (RNN-T only): Cache-aware streaming RNN-T model with language-ID prompt conditioning for multilingual ASR via EncDecRNNTBPEModelWithPrompt. The streaming inference script accepts a target_lang flag to select the prompt at runtime (see RNN-T with Prompt Conditioning Configuration).
Config: fastconformer_transducer_bpe_streaming_prompt.yaml
Multitalker Streaming#
Streaming multi-speaker ASR based on cache-aware FastConformer with speaker kernel injection [ASR-MODELS3]. Deploys one model instance per speaker for robust transcription of overlapped speech.
Hybrid-Transducer-CTC#
Models with both RNN-T and CTC decoders trained jointly. Switch decoders at inference time via asr_model.change_decoding_strategy(decoder_type=...), passing 'ctc' or 'rnnt'.
EncDecHybridRNNTCTCBPEModel (BPE) / EncDecHybridRNNTCTCModel (char)
Configs: examples/asr/conf/fastconformer/hybrid_transducer_ctc/
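As a rough sketch of switching decoders at inference time, the helper below transcribes the same files once per decoder. The function name is hypothetical. It only assumes the model object exposes change_decoding_strategy(decoder_type=...) as described above, plus a transcribe method that takes a list of audio paths. The return type of transcribe varies between NeMo versions, so the sketch accepts either plain strings or objects with a text attribute.

```python
def transcribe_with_each_decoder(asr_model, audio_paths, decoder_types=("rnnt", "ctc")):
    """Transcribe the same files once per decoder of a hybrid model.

    Hypothetical helper: asr_model is any object exposing
    change_decoding_strategy(decoder_type=...) and transcribe(paths).
    """
    results = {}
    for decoder_type in decoder_types:
        # Select which decoder is used for the following transcribe call.
        asr_model.change_decoding_strategy(decoder_type=decoder_type)
        outputs = asr_model.transcribe(audio_paths)
        # Outputs may be plain strings or hypothesis objects with `.text`.
        results[decoder_type] = [getattr(o, "text", o) for o in outputs]
    return results


# Usage sketch (requires NeMo; the checkpoint name is a placeholder):
# import nemo.collections.asr as nemo_asr
# asr_model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.from_pretrained("<hybrid-checkpoint>")
# print(transcribe_with_each_decoder(asr_model, ["sample.wav"]))
```

Comparing the two outputs on a held-out set is a simple way to decide which decoder to serve, since the CTC branch is usually the cheaper one to run while the RNN-T branch is often the more accurate.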
With Prompt Conditioning: Extends Hybrid models with learnable prompt embeddings for multilingual/multi-domain ASR via EncDecHybridRNNTCTCBPEModelWithPrompt. Config: fastconformer_hybrid_transducer_ctc_bpe_prompt.yaml
References#
[ASR-MODELS1] Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and others. Conformer: convolution-augmented transformer for speech recognition. arXiv preprint arXiv:2005.08100, 2020.
[ASR-MODELS2] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 6000–6010. 2017.
[ASR-MODELS3] Weiqing Wang, Taejin Park, Ivan Medennikov, Jinhan Wang, Kunal Dhawan, He Huang, Nithin Rao Koluguri, Jagadeesh Balam, and Boris Ginsburg. Speaker Targeting via Self-Speaker Adaptation for Multi-talker ASR. In Interspeech 2025, 5498–5502. 2025. doi:10.21437/Interspeech.2025-2142.