NVIDIA NeMo Speech Developer Docs#

NVIDIA NeMo Speech is an open-source toolkit for speech, audio, and multimodal language model research, with a clear path from experimentation to production deployment.

Models#

Inference & Deployment#

  • Streaming and real-time ASR with cache-aware Conformer

  • GPU-accelerated decoding with NGPU-LM language model fusion

  • Export to ONNX
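As a rough illustration of the ONNX export path, the sketch below loads a pretrained ASR checkpoint and writes it out as an ONNX file. It assumes the `nemo.collections.asr` package is installed; the helper name `export_asr_to_onnx` is hypothetical and introduced only for this example, and whichever checkpoint name you pass is your own choice, not something this page prescribes.

```python
def export_asr_to_onnx(model_name: str, output_path: str) -> str:
    """Load a pretrained NeMo ASR model and export it to ONNX.

    Hypothetical helper for illustration; not part of the NeMo API.
    """
    # Imported lazily so this module can be loaded without NeMo installed.
    import nemo.collections.asr as nemo_asr

    # Download (or load from the local cache) a pretrained checkpoint by name.
    model = nemo_asr.models.ASRModel.from_pretrained(model_name=model_name)

    # Switch to inference mode before tracing the model for export.
    model.eval()

    # NeMo models that support export expose `export()`; the output format
    # is chosen from the file extension (".onnx" here).
    model.export(output_path)
    return output_path
```

Calling `export_asr_to_onnx("<pretrained-model-name>", "asr_model.onnx")` downloads the checkpoint on first use, so expect network access and a working PyTorch install. Not every model architecture supports export; check the export documentation for the model you pick.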

Voice Agent#

  • Open-source conversational agent framework built on Pipecat

  • Streaming STT + LLM + TTS pipeline with natural turn-taking

  • Live speaker diarization and tool calling support
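To make the shape of a streaming STT + LLM + TTS pipeline concrete, here is a minimal, framework-free sketch. It does not use the Pipecat or NeMo APIs: `Turn`, `run_pipeline`, and the `fake_*` stages are hypothetical stand-ins, and turn-taking is reduced to "an empty audio chunk means the user stopped speaking", which is an assumption made purely for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List


@dataclass
class Turn:
    """One completed exchange: what the user said and how the agent replied."""
    user_text: str
    agent_text: str
    agent_audio: bytes


def run_pipeline(
    audio_chunks: Iterable[bytes],
    stt: Callable[[bytes], str],
    llm: Callable[[str], str],
    tts: Callable[[str], bytes],
    is_end_of_turn: Callable[[bytes], bool],
) -> Iterator[Turn]:
    """Consume audio chunk by chunk and yield a Turn at each end of turn."""
    partial: List[str] = []  # transcript pieces for the turn in progress
    for chunk in audio_chunks:
        text = stt(chunk)
        if text:
            partial.append(text)
        # Only hand off to the LLM once the user has finished speaking.
        if partial and is_end_of_turn(chunk):
            user_text = " ".join(partial)
            reply = llm(user_text)
            yield Turn(user_text, reply, tts(reply))
            partial = []
    # Text still pending when the stream ends is discarded in this sketch.


# Stand-in stages (assumptions): "audio" is just UTF-8 bytes of the words.
def fake_stt(chunk: bytes) -> str:
    return chunk.decode("utf-8")


def fake_llm(prompt: str) -> str:
    return f"You said: {prompt}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


def silence_ends_turn(chunk: bytes) -> bool:
    return chunk == b""
```

Feeding it `[b"hello", b"there", b"", b"bye", b""]` with the stand-in stages yields two turns, one per silence marker. In a real agent each stage would be a streaming model and the end-of-turn decision would come from a voice activity or turn-detection component rather than an empty chunk.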


NeMo is built for researchers and engineers. Each collection provides prebuilt, modular components that can be customized, extended, and composed – from rapid prototyping to multi-node training to production inference.

NVIDIA NeMo Toolkit has separate collections for:

For quick guides and tutorials, see the “Getting started” section below.

For more information, browse the developer docs for your area of interest in the contents section below or in the left sidebar.

Model Checkpoints

APIs

Collections

Speech AI Tools