NVIDIA NeMo Speech Developer Docs
NVIDIA NeMo Speech is an open-source toolkit for speech, audio, and multimodal language model research, with a clear path from experimentation to production deployment.
🎙️ Transcribe Speech (ASR)
Convert audio to text with state-of-the-art accuracy. Supports 14+ languages, streaming, and timestamps.

🔊 Synthesize Speech (TTS)
Generate natural human speech from text. Multi-language, multi-speaker, with controllable prosody.

👥 Identify Speakers
Determine "who spoke when" in multi-speaker audio. Speaker diarization, recognition, and verification.

🧠 Speech Language Models
Audio-aware LLMs that understand and generate speech. Speech-to-text, speech-to-speech, and more.

🎧 Process Audio
Enhance, restore, and separate audio signals. Improve audio quality for downstream tasks.

🛠️ Speech AI Tools
Forced alignment, data exploration, CTC segmentation, and evaluation utilities for speech workflows.

What is NeMo?
NVIDIA NeMo is an open-source toolkit for building, customizing, and deploying speech, audio, and multimodal language models. It provides:
Pretrained models — production-ready checkpoints on NGC and the Hugging Face Hub
Modular architecture — neural modules you can mix, match, and extend
Scalable training — multi-GPU/multi-node via PyTorch Lightning with mixed-precision support
Simple configuration — YAML-based experiment configs with Hydra
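Hydra configs describe a whole experiment declaratively, with any field overridable from the command line. A minimal, hypothetical sketch of such a config (the top-level `model`/`trainer`/`exp_manager` sections follow NeMo conventions; the specific fields and values here are illustrative, not an actual shipped config):

```yaml
# Illustrative NeMo-style experiment config (field values are examples only)
name: my_asr_experiment

model:
  train_ds:
    manifest_filepath: /data/train_manifest.json
    batch_size: 32
  optim:
    name: adamw
    lr: 1.0e-3

trainer:
  devices: -1            # all available GPUs
  max_epochs: 100
  precision: bf16-mixed

exp_manager:
  exp_dir: /results
```

With Hydra, any field can be overridden at launch without editing the file, e.g. `python train.py model.optim.lr=5e-4 trainer.max_epochs=50` (script name illustrative).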
Get started in 30 seconds:

pip install "nemo_toolkit[asr,tts]"

import nemo.collections.asr as nemo_asr

# Download a pretrained ASR checkpoint and transcribe an audio file
model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v2")
print(model.transcribe(["audio.wav"])[0].text)
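The TTS collection follows the same `from_pretrained` pattern, pairing a spectrogram generator with a vocoder. A minimal sketch, assuming the `tts_en_fastpitch` and `tts_hifigan` English checkpoints published by NVIDIA are available (verify checkpoint names against the current model catalog):

```python
import soundfile as sf
from nemo.collections.tts.models import FastPitchModel, HifiGanModel

# Spectrogram generator (text -> mel spectrogram) and vocoder (mel -> waveform)
spec_gen = FastPitchModel.from_pretrained("nvidia/tts_en_fastpitch")
vocoder = HifiGanModel.from_pretrained("nvidia/tts_hifigan")

tokens = spec_gen.parse("Hello from NeMo!")
spectrogram = spec_gen.generate_spectrogram(tokens=tokens)
audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)

# These English FastPitch/HiFi-GAN checkpoints operate at 22.05 kHz
sf.write("hello.wav", audio.detach().cpu().numpy()[0], samplerate=22050)
```

Both models run on GPU automatically when one is available; the first call downloads and caches the checkpoints.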
Getting Started
Training
Model Checkpoints
APIs
Collections
Speech AI Tools