NVIDIA NeMo Speech Developer Docs
NVIDIA NeMo Speech is an open-source toolkit for speech, audio, and multimodal language model research, with a clear path from experimentation to production deployment.
Models
ASR: Parakeet, Canary, FastConformer – with CTC, Transducer, TDT, and hybrid decoders
TTS: MagpieTTS, FastPitch + HiFi-GAN – multi-language, multi-speaker
Speaker: Sortformer streaming diarization, TitaNet speaker recognition, MarbleNet VAD
Audio: Speech enhancement, source separation, neural audio codecs
SpeechLM2: Canary-Qwen 2.5B (SALM), Duplex Speech-to-Speech – HuggingFace Transformers backbone integration
Inference & Deployment
Streaming and real-time ASR with cache-aware Conformer
GPU-accelerated decoding with NGPU-LM language model fusion
Export to ONNX
Voice Agent
Open-source conversational agent framework built on Pipecat
Streaming STT + LLM + TTS pipeline with natural turn-taking
Live speaker diarization and tool calling support
NeMo is built for researchers and engineers. Each collection provides prebuilt, modular components that can be customized, extended, and composed – from rapid prototyping to multi-node training to production inference.
The NVIDIA NeMo Toolkit provides a separate collection for each of these areas: ASR, TTS, speaker tasks, audio processing, and speech language models.
For quick guides and tutorials, see the “Getting Started” section below.
Getting Started
For more information, browse the developer docs for your area of interest in the contents section below or on the left sidebar.
Training
Model Checkpoints
APIs
Collections
Speech AI Tools