Featured Models

NeMo’s ASR collection supports several model architectures. This page covers the key model families and their capabilities. For pretrained checkpoints, see All Checkpoints. For config file details, see Configuration Files.

Parakeet

Parakeet is a family of ASR models with a FastConformer Encoder and CTC, RNN-T, or TDT decoders.

  • Parakeet-TDT-0.6B V3 — 25 languages, PnC, blazing fast

  • Parakeet-TDT-0.6B V2 — English-only, PnC, blazing fast

  • Parakeet-TDT/CTC-110M — Edge deployment

  • Nemotron-Speech-Streaming — Real-time streaming

  • Multitalker-Parakeet — Multi-speaker streaming

Canary

Canary models are encoder-decoder models with a FastConformer Encoder and Transformer Decoder [ASR-MODELS2]. They support ASR in 25 EU languages, speech translation (AST), and punctuation/capitalization (PnC).

  • Canary-1B V2 — Flagship: 25 languages, PnC, timestamps

  • Canary-Qwen-2.5B — English only, PnC, highest accuracy

  • Canary-1B Flash / 180M Flash — Optimized for speed

Canary supports chunked and streaming inference.

Conformer

The Conformer [ASR-MODELS1] combines self-attention and convolution modules. NeMo supports CTC, Transducer, and HAT variants.

  • Conformer-CTC: Non-autoregressive, uses EncDecCTCModelBPE

  • Conformer-Transducer: Autoregressive, uses EncDecRNNTBPEModel

  • Conformer-HAT: Separates label and blank predictions for better external LM integration (paper)

Configs: examples/asr/conf/conformer/
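
To make the HAT idea of separate label and blank predictions concrete, here is a minimal sketch of the factorization described in the HAT paper: the blank probability comes from a sigmoid over a dedicated logit, and the label probabilities come from a softmax over the label logits, scaled by the remaining non-blank mass. The function name `hat_distribution` and the choice to pass the blank logit as a separate argument are assumptions for illustration; this is not NeMo's implementation.

```python
import math


def hat_distribution(blank_logit, label_logits):
    """Return (p_blank, label_probs) under a HAT-style factorization."""
    # Blank is modeled as a Bernoulli variable through a sigmoid.
    p_blank = 1.0 / (1.0 + math.exp(-blank_logit))

    # Labels get their own softmax (max-shifted for numerical stability).
    shift = max(label_logits)
    exps = [math.exp(x - shift) for x in label_logits]
    total = sum(exps)

    # Scale the label softmax by the probability of emitting a non-blank.
    label_probs = [(1.0 - p_blank) * e / total for e in exps]
    return p_blank, label_probs


p_blank, label_probs = hat_distribution(0.0, [1.0, 2.0, 3.0])
print(p_blank, label_probs)
```

Because the label softmax is normalized independently of blank, the label scores can be combined with an external LM without the blank probability interfering, which is the property the HAT entry in the list above refers to.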

Fast-Conformer

Fast Conformer uses 8x depthwise convolutional subsampling and reduced kernel sizes, making it ~2.4x faster than the standard Conformer with minimal quality loss. It also supports Longformer-style local attention for audio longer than 1 hour.
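
As a rough illustration of what 8x subsampling means for sequence length, the sketch below estimates how many encoder frames a clip produces. It assumes a 10 ms feature hop and 4x subsampling for the standard Conformer; both values, and the helper name `encoder_frames`, are assumptions for illustration. The exact frame count in a real model also depends on convolution padding.

```python
import math


def encoder_frames(audio_seconds, subsampling_factor, feature_hop_s=0.01):
    """Approximate number of encoder frames for a clip of the given length."""
    # Number of feature frames before subsampling (assumed 10 ms hop).
    feature_frames = round(audio_seconds / feature_hop_s)
    # Subsampling divides the time axis by the subsampling factor.
    return math.ceil(feature_frames / subsampling_factor)


one_minute_8x = encoder_frames(60, subsampling_factor=8)
one_minute_4x = encoder_frames(60, subsampling_factor=4)
frame_duration_s = 8 * 0.01  # each 8x-subsampled frame covers about 80 ms
print(one_minute_8x, one_minute_4x, frame_duration_s)
```

Under these assumptions, one minute of audio yields half as many encoder frames at 8x as at 4x, so the attention layers operate on a shorter sequence.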

Configs: examples/asr/conf/fastconformer/

Cache-aware Streaming Conformer

These streaming models are trained with limited right context for real-time inference and use caching to avoid duplicate computation. They support three modes: fully causal, regular look-ahead, and chunk-aware look-ahead (recommended).

  • Tutorial notebook

  • Simulation script: examples/asr/asr_cache_aware_streaming/speech_to_text_cache_aware_streaming_infer.py

  • Supports multiple look-aheads with att_context_size lists

Configs: examples/asr/conf/fastconformer/cache_aware_streaming/
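
To show how an att_context_size list relates to look-ahead latency, here is a minimal sketch. It assumes each entry is a [left, right] pair measured in encoder frames, that the encoder uses 8x subsampling, and that the feature hop is 10 ms, so one encoder frame covers about 80 ms. The example values and the helper name `lookahead_ms` are illustrative assumptions, not values taken from a specific config.

```python
def lookahead_ms(att_context_size, subsampling_factor=8, feature_hop_ms=10):
    """Look-ahead in milliseconds implied by a [left, right] context pair."""
    _left, right = att_context_size
    # Right context is counted in encoder frames; convert frames to ms.
    return right * subsampling_factor * feature_hop_ms


# Hypothetical list of look-aheads for a single multi-look-ahead model.
contexts = [[70, 13], [70, 6], [70, 1], [70, 0]]
latencies = [lookahead_ms(c) for c in contexts]
print(latencies)
```

A right context of 0 corresponds to the fully causal mode, while larger values trade additional latency for more future context.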

With Prompt Conditioning (RNN-T only): Cache-aware streaming RNN-T model with language-ID prompt conditioning for multilingual ASR via EncDecRNNTBPEModelWithPrompt. The streaming inference script accepts a target_lang flag to select the prompt at runtime (see RNN-T with Prompt Conditioning Configuration). Config: fastconformer_transducer_bpe_streaming_prompt.yaml

Multitalker Streaming

Streaming multi-speaker ASR based on cache-aware FastConformer with speaker kernel injection [ASR-MODELS3]. Deploys one model instance per speaker for robust transcription of overlapped speech.

  • Model card

  • Tutorial

Hybrid-Transducer-CTC

These models have both RNN-T and CTC decoders trained jointly. Switch decoders at inference time via asr_model.change_decoding_strategy(decoder_type='ctc') or asr_model.change_decoding_strategy(decoder_type='rnnt').

  • EncDecHybridRNNTCTCBPEModel (BPE) / EncDecHybridRNNTCTCModel (char)

  • Configs: examples/asr/conf/fastconformer/hybrid_transducer_ctc/
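
The decoder switch can be wrapped in a small helper that validates the requested decoder before calling change_decoding_strategy. The sketch below is an assumption-laden illustration: `switch_decoder` and `_StubHybridModel` are hypothetical names, and the stub stands in for a loaded hybrid model so the snippet runs without NeMo installed. With a real model, you would pass the loaded model object instead of the stub.

```python
VALID_DECODERS = ("rnnt", "ctc")


def switch_decoder(asr_model, decoder_type):
    """Select the active decoder on a hybrid model, rejecting unknown types."""
    if decoder_type not in VALID_DECODERS:
        raise ValueError(
            f"decoder_type must be one of {VALID_DECODERS}, got {decoder_type!r}"
        )
    # The call named in the text above; only decoder_type is passed here.
    asr_model.change_decoding_strategy(decoder_type=decoder_type)
    return decoder_type


class _StubHybridModel:
    """Hypothetical stand-in for a hybrid model; records the active decoder."""

    def __init__(self):
        self.active = "rnnt"

    def change_decoding_strategy(self, decoder_type):
        self.active = decoder_type


model = _StubHybridModel()
switch_decoder(model, "ctc")
print(model.active)
```

Validating the string up front avoids the pitfall of writing `decoder_type='ctc' or 'rnnt'` literally, which Python evaluates to `'ctc'`.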

With Prompt Conditioning: Extends Hybrid models with learnable prompt embeddings for multilingual/multi-domain ASR via EncDecHybridRNNTCTCBPEModelWithPrompt. Config: fastconformer_hybrid_transducer_ctc_bpe_prompt.yaml

References

[ASR-MODELS1]

Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and others. Conformer: convolution-augmented transformer for speech recognition. arXiv preprint arXiv:2005.08100, 2020.

[ASR-MODELS2]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 6000–6010. 2017.

[ASR-MODELS3]

Weiqing Wang, Taejin Park, Ivan Medennikov, Jinhan Wang, Kunal Dhawan, He Huang, Nithin Rao Koluguri, Jagadeesh Balam, and Boris Ginsburg. Speaker Targeting via Self-Speaker Adaptation for Multi-talker ASR. In Interspeech 2025, 5498–5502. 2025. doi:10.21437/Interspeech.2025-2142.


Copyright © 2026, NVIDIA Corporation.