Why NeMo?

NeMo simplifies Speech AI development through its modular approach. It provides neural modules (logical building blocks of AI applications with typed inputs and outputs) that can be composed into complete models. This modularity accelerates development, helps improve accuracy on domain-specific data, and promotes flexibility and reuse within AI workflows.
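To make the idea of typed inputs and outputs concrete, here is a minimal sketch in plain Python. It is not NeMo's API: `PortType`, `Module`, and `compose` are hypothetical names invented for this illustration, and the "modules" are string-formatting stand-ins rather than neural networks. It only shows how declaring types on each block lets a mismatch be caught when blocks are connected, before any data flows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PortType:
    """Hypothetical port descriptor: a semantic kind plus an axis layout."""
    kind: str    # e.g. "audio", "spectrogram", "logits"
    axes: tuple  # e.g. ("B", "T") for batch x time


class Module:
    """Toy stand-in for a neural module with declared input/output types."""

    def __init__(self, name, input_type, output_type, fn):
        self.name = name
        self.input_type = input_type
        self.output_type = output_type
        self.fn = fn

    def __call__(self, value):
        return self.fn(value)


def compose(*modules):
    """Chain modules, rejecting any adjacent pair whose port types differ."""
    for left, right in zip(modules, modules[1:]):
        if left.output_type != right.input_type:
            raise TypeError(
                f"{left.name} outputs {left.output_type.kind}, "
                f"but {right.name} expects {right.input_type.kind}"
            )

    def pipeline(value):
        for module in modules:
            value = module(value)
        return value

    return pipeline


# Assumed port types for a toy speech pipeline.
AUDIO = PortType("audio", ("B", "T"))
FEATURES = PortType("spectrogram", ("B", "D", "T"))
LOGITS = PortType("logits", ("B", "T", "V"))

preprocessor = Module("preprocessor", AUDIO, FEATURES, lambda x: f"features({x})")
encoder = Module("encoder", FEATURES, LOGITS, lambda x: f"logits({x})")

asr = compose(preprocessor, encoder)  # types line up, so this succeeds
result = asr("waveform")
```

Connecting the blocks in the wrong order, for example `compose(encoder, preprocessor)`, raises a `TypeError` at construction time in this sketch, which is the kind of early feedback typed modules are meant to give.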

Automatic Speech Recognition (ASR)

NeMo provides state-of-the-art ASR models for a wide range of applications:

Parakeet family of models, including the leaderboard-topping Parakeet-TDT, built on the FastConformer encoder architecture
Canary multilingual ASR with support for translation and code-switching
FastConformer encoder with CTC, RNNT, and TDT decoder variants
GPU-accelerated decoding algorithms for real-time and batch transcription
Multi-language support, including English, Mandarin, German, French, Spanish, and more
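As background for the CTC decoder variant mentioned above, the following is a minimal sketch of greedy CTC decoding in plain Python. It is not NeMo's GPU-accelerated implementation; the function name, the three-letter vocabulary, and the frame scores are all made up for illustration. The rule it shows is the standard one: take the best token in each frame, merge consecutive repeats, then drop the blank token. RNNT and TDT decoders work differently and are not covered here.

```python
def ctc_greedy_decode(frame_scores, vocabulary, blank_id):
    """Pick the best token per frame, merge repeats, then drop blanks."""
    best_ids = [
        max(range(len(scores)), key=scores.__getitem__)
        for scores in frame_scores
    ]
    decoded = []
    previous = None
    for token_id in best_ids:
        # Emit a token only when it differs from the previous frame's
        # token and is not the blank symbol.
        if token_id != previous and token_id != blank_id:
            decoded.append(vocabulary[token_id])
        previous = token_id
    return "".join(decoded)


# Hypothetical character vocabulary; index 3 is reserved for the CTC blank.
vocabulary = ["a", "c", "t"]
BLANK = 3

# Made-up per-frame scores over [a, c, t, blank].
frame_scores = [
    [0.10, 0.70, 0.10, 0.10],  # c
    [0.10, 0.60, 0.10, 0.20],  # c (repeat, merged with the previous frame)
    [0.10, 0.10, 0.10, 0.70],  # blank
    [0.80, 0.10, 0.05, 0.05],  # a
    [0.10, 0.10, 0.10, 0.70],  # blank
    [0.10, 0.10, 0.70, 0.10],  # t
    [0.10, 0.10, 0.60, 0.20],  # t (repeat, merged)
]

text = ctc_greedy_decode(frame_scores, vocabulary, BLANK)  # "cat"
```

The blank token is what allows genuinely doubled letters to survive: two identical tokens separated by a blank frame are kept as two outputs, while two identical tokens in adjacent frames collapse into one.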

Text-to-Speech (TTS)

NeMo offers production-ready speech synthesis:

MagpieTTS for high-quality, multilingual speech generation
FastPitch spectrogram generator for fast, controllable synthesis
HiFi-GAN neural vocoder for high-fidelity audio waveform generation
Multi-language and multi-speaker support

Speaker Tasks

NeMo includes models for speaker-related tasks:

Sortformer for streaming speaker diarization – determining “who spoke when” in multi-speaker audio
Speaker recognition and verification models
Speaker embedding extraction
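To illustrate what a "who spoke when" answer looks like, here is a minimal sketch that turns per-frame speaker activity into time segments. It is not Sortformer's actual output format or post-processing, which this page does not describe; the function name, the 0.5-second frame length, the 0.5 threshold, and the probabilities are all assumptions chosen for the example.

```python
def frames_to_segments(activity, frame_seconds, threshold=0.5):
    """Convert per-frame speaker probabilities into (speaker, start, end) tuples.

    activity[t][s] is the assumed probability that speaker s talks in frame t.
    Speakers may overlap, so each speaker is tracked independently.
    """
    if not activity:
        return []
    segments = []
    for speaker in range(len(activity[0])):
        start = None
        for t, frame in enumerate(activity):
            active = frame[speaker] >= threshold
            if active and start is None:
                start = t  # speaker begins talking
            elif not active and start is not None:
                segments.append((speaker, start * frame_seconds, t * frame_seconds))
                start = None  # speaker stops talking
        if start is not None:
            # Speaker was still active when the audio ended.
            segments.append(
                (speaker, start * frame_seconds, len(activity) * frame_seconds)
            )
    return sorted(segments, key=lambda seg: (seg[1], seg[0]))


# Made-up activity for two speakers over five frames of 0.5 s each.
activity = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.7, 0.6],  # both speakers active: overlapping speech
    [0.2, 0.9],
    [0.1, 0.8],
]

segments = frames_to_segments(activity, frame_seconds=0.5)
# [(0, 0.0, 1.5), (1, 1.0, 2.5)]
```

Tracking each speaker separately, rather than assigning one label per frame, is what lets the overlapping region between 1.0 s and 1.5 s appear in both speakers' segments.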

Speech Language Models (SpeechLM2)

NeMo’s SpeechLM2 collection enables speech-aware language models:

SALM (Speech-Augmented Language Models), powering models like Canary-Qwen 2.5B
Duplex Speech-to-Speech for real-time conversational AI
Integration with Hugging Face Transformers for backbone LLMs

Audio Processing

NeMo provides tools for audio signal processing:

Speech enhancement for improving audio quality
Source separation for isolating individual speakers or sounds

Training and Tools

NeMo provides a comprehensive set of training utilities and tools:

Multi-GPU and multi-node training via PyTorch Lightning
Mixed precision training (FP16, BF16) for faster training with lower memory usage
NeMo Forced Aligner for aligning audio with transcripts at word and segment level
Speech Data Explorer for interactive exploration and analysis of ASR/TTS datasets
CTC Segmentation for creating training data from long audio files with transcripts