NVIDIA NeMo Speech Developer Docs
NVIDIA NeMo Speech is an open-source toolkit for speech, audio, and multimodal language model research, with a clear path from experimentation to production deployment.
🎙️ Transcribe Speech (ASR)
Convert audio to text with state-of-the-art accuracy. Supports 14+ languages, streaming, and timestamps.

🔊 Synthesize Speech (TTS)
Generate natural human speech from text. Multi-language, multi-speaker, with controllable prosody.

👥 Identify Speakers
Determine "who spoke when" in multi-speaker audio. Speaker diarization, recognition, and verification.

🧠 Speech Language Models
Audio-aware LLMs that understand and generate speech. Speech-to-text, speech-to-speech, and more.

🎧 Process Audio
Enhance, restore, and separate audio signals. Improve audio quality for downstream tasks.

🛠️ Speech AI Tools
Forced alignment, data exploration, CTC segmentation, and evaluation utilities for speech workflows.

What is NeMo?
NVIDIA NeMo is an open-source toolkit for building, customizing, and deploying speech, audio, and multimodal language models. It provides:
Pretrained models — production-ready checkpoints on NGC and the Hugging Face Hub
Modular architecture — neural modules you can mix, match, and extend
Scalable training — multi-GPU/multi-node via PyTorch Lightning with mixed-precision support
Simple configuration — YAML-based experiment configs with Hydra
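Hydra configs describe a whole experiment declaratively, with any field overridable from the command line. A minimal, hypothetical sketch of such a config (the top-level `model`/`trainer`/`exp_manager` sections follow NeMo conventions; the specific fields and values here are illustrative, not an actual shipped config):

```yaml
# Illustrative NeMo-style experiment config (field values are examples only)
name: my_asr_experiment

model:
  train_ds:
    manifest_filepath: /data/train_manifest.json
    batch_size: 32
  optim:
    name: adamw
    lr: 1.0e-3

trainer:
  devices: -1            # all available GPUs
  max_epochs: 100
  precision: bf16-mixed

exp_manager:
  exp_dir: /results
```

With Hydra, any field can be overridden at launch without editing the file, e.g. `python train.py model.optim.lr=5e-4 trainer.max_epochs=50` (script name illustrative).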
Get started in 30 seconds:

pip install "nemo_toolkit[asr,tts]"

import nemo.collections.asr as nemo_asr

# Download a pretrained ASR checkpoint and transcribe an audio file
model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v2")
print(model.transcribe(["audio.wav"])[0].text)
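The TTS collection follows the same `from_pretrained` pattern, pairing a spectrogram generator with a vocoder. A minimal sketch, assuming the `tts_en_fastpitch` and `tts_hifigan` English checkpoints published by NVIDIA are available (verify checkpoint names against the current model catalog):

```python
import soundfile as sf
from nemo.collections.tts.models import FastPitchModel, HifiGanModel

# Spectrogram generator (text -> mel spectrogram) and vocoder (mel -> waveform)
spec_gen = FastPitchModel.from_pretrained("nvidia/tts_en_fastpitch")
vocoder = HifiGanModel.from_pretrained("nvidia/tts_hifigan")

tokens = spec_gen.parse("Hello from NeMo!")
spectrogram = spec_gen.generate_spectrogram(tokens=tokens)
audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)

# These English FastPitch/HiFi-GAN checkpoints operate at 22.05 kHz
sf.write("hello.wav", audio.detach().cpu().numpy()[0], samplerate=22050)
```

Both models run on GPU automatically when one is available; the first call downloads and caches the checkpoints.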
Getting Started
Training
Model Checkpoints
APIs
Collections
Speech AI Tools