NVIDIA NeMo Speech Developer Docs
NVIDIA NeMo Speech is an open-source toolkit for speech, audio, and multimodal language model research, with a clear path from experimentation to production deployment.
Models
ASR: Parakeet, Canary, FastConformer – with CTC, Transducer, TDT, and hybrid decoders
TTS: MagpieTTS, FastPitch + HiFi-GAN – multi-language, multi-speaker
Speaker: Sortformer streaming diarization, TitaNet speaker recognition, MarbleNet VAD
Audio: Speech enhancement, source separation, neural audio codecs
SpeechLM2: Canary-Qwen 2.5B (SALM), Duplex Speech-to-Speech – HuggingFace Transformers backbone integration
Inference & Deployment
Streaming and real-time ASR with cache-aware Conformer
GPU-accelerated decoding with NGPU-LM language model fusion
Export to ONNX
Voice Agent
Open-source conversational agent framework built on Pipecat
Streaming STT + LLM + TTS pipeline with natural turn-taking
Live speaker diarization and tool calling support
NeMo is built for researchers and engineers. Each collection provides prebuilt, modular components that can be customized, extended, and composed – from rapid prototyping to multi-node training to production inference.
The NVIDIA NeMo Toolkit provides a separate collection for each of these areas: ASR, TTS, speaker tasks, audio processing, and speech language models.
For quick guides and tutorials, see the “Getting Started” section below.
Getting Started
For more information, browse the developer docs for your area of interest in the contents section below or on the left sidebar.
Training
Model Checkpoints
APIs
Collections
Speech AI Tools