About NVIDIA TTS NIM Microservice#
The NVIDIA Text-to-Speech (TTS) NIM microservice synthesizes natural-sounding speech from text. It packages pre-trained NeMo models with the full NVIDIA inference stack into self-contained containers that handle model download, optimization, and serving.
The NVIDIA TTS NIM microservice supports two synthesis modes:
Offline: Generates the complete audio and returns it in a single response. Simpler to use, but subject to a 4 MB gRPC message size limit.
Streaming: Returns audio in chunks as they are generated. Provides lower time-to-first-audio and handles arbitrarily long text.
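The two modes can be sketched as a pair of Python helpers. This is a minimal illustration, assuming a gRPC client object in the style of the `nvidia-riva-client` package, where `synthesize` returns one response carrying the full audio and `synthesize_online` yields responses chunk by chunk; the parameter names are placeholders, and the client object is passed in rather than constructed here:

```python
# Sketch: offline vs. streaming synthesis against a TTS NIM gRPC endpoint.
# `tts` stands in for a speech-synthesis service client (for example, one
# built with the nvidia-riva-client package); names here are illustrative.

def synthesize_offline(tts, text, voice, out_path):
    """One request, one response holding the complete audio
    (subject to the 4 MB gRPC message size limit)."""
    resp = tts.synthesize(text=text, voice_name=voice)
    with open(out_path, "wb") as f:
        f.write(resp.audio)

def synthesize_streaming(tts, text, voice, out_path):
    """Audio arrives in chunks as it is generated, so playback can
    begin before synthesis finishes; handles arbitrarily long text."""
    with open(out_path, "wb") as f:
        for chunk in tts.synthesize_online(text=text, voice_name=voice):
            f.write(chunk.audio)
```

Against a live endpoint you would first build the client (for example, `riva.client.SpeechSynthesisService(riva.client.Auth(uri="localhost:50051", use_ssl=False))` with the hypothetical default port) and then call either helper.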
Available Models#
| Model | Languages | Modes | Key Capability |
|---|---|---|---|
| Magpie TTS Multilingual | English, Spanish, French, German, Mandarin, Vietnamese, Italian | Streaming + Offline | Multi-voice, multi-language synthesis |
| Magpie TTS Zeroshot | English | Streaming + Offline | Voice cloning from a reference audio sample |
| Magpie TTS Flow | English | Offline | Voice cloning with audio prompt and transcript |
Note
Magpie TTS Zeroshot and Magpie TTS Flow require access approval. Request access through this form.
For GPU requirements and all model profiles, refer to the TTS support matrix.
Key Capabilities#
Voice Selection and Emotional Styles#
Each TTS model includes multiple voices organized by language and locale. The Magpie Multilingual model provides voices with optional emotional style variants, such as Magpie-Multilingual.EN-US.Aria.Happy and Magpie-Multilingual.FR-FR.Pascal.Calm. Refer to Voices and Emotional Styles for the naming convention and available emotions.
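The naming pattern visible in those examples is `<Model>.<LOCALE>.<Speaker>[.<Emotion>]`. As a small sketch of that convention (the helper function is hypothetical, not part of any client library):

```python
# Sketch: assemble a fully qualified voice name following the
# "<Model>.<LOCALE>.<Speaker>[.<Emotion>]" pattern, where the
# emotional style suffix is optional.

def voice_name(model, locale, speaker, emotion=None):
    parts = [model, locale, speaker]
    if emotion:
        parts.append(emotion)
    return ".".join(parts)

# voice_name("Magpie-Multilingual", "EN-US", "Aria", "Happy")
# -> "Magpie-Multilingual.EN-US.Aria.Happy"
```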
Voice Cloning#
The Magpie TTS Zeroshot and Magpie TTS Flow models clone a voice from a short reference audio recording (3–10 seconds, 16-bit mono WAV at 22.05 kHz or higher). Magpie TTS Flow also requires a transcript of the audio prompt. Refer to Voice Cloning for audio prompt preparation, quality tuning, and examples.
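The reference-audio requirements above can be checked locally before sending a request. A minimal sketch using only Python's standard-library `wave` module (the function name and the exact set of checks are this document's illustration, not an API of the microservice):

```python
# Sketch: validate a reference recording against the stated
# audio-prompt requirements: 3-10 seconds, 16-bit mono WAV,
# sample rate of 22.05 kHz or higher.
import wave

def check_audio_prompt(path):
    """Return a list of problems; an empty list means the clip qualifies."""
    problems = []
    with wave.open(path, "rb") as w:
        duration = w.getnframes() / w.getframerate()
        if w.getnchannels() != 1:
            problems.append("audio must be mono")
        if w.getsampwidth() != 2:
            problems.append("audio must be 16-bit PCM")
        if w.getframerate() < 22050:
            problems.append("sample rate must be at least 22.05 kHz")
        if not 3.0 <= duration <= 10.0:
            problems.append("duration must be between 3 and 10 seconds")
    return problems
```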
Batch Synthesis#
The WebSocket realtime client supports synthesizing speech from text files and processing multiple lines in parallel for higher throughput. Refer to Batch Synthesis for details.
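The parallel pattern behind batch synthesis can be sketched independently of the WebSocket client itself. In this illustration, `synth_fn` stands in for whatever call performs a single synthesis request; the function and its signature are assumptions for the sketch, not the client's actual API:

```python
# Sketch: synthesize the non-empty lines of a text file concurrently.
# `synth_fn` is a placeholder for a single-request synthesis call;
# results come back in the original line order.
from concurrent.futures import ThreadPoolExecutor

def batch_synthesize(lines, synth_fn, max_workers=4):
    texts = [ln.strip() for ln in lines if ln.strip()]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(synth_fn, texts))
```

A thread pool suits this workload because each request spends most of its time waiting on the network, so several lines can be in flight at once without extra processes.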
SSML Customization#
The TTS NIM microservice supports a subset of Speech Synthesis Markup Language (SSML) for controlling phoneme pronunciation. Refer to Customizing TTS Models for details and examples.
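For illustration, a standard SSML fragment of the kind this feature covers, overriding one word's pronunciation with an explicit IPA string (treat this as a sketch; the exact tags and attributes the microservice accepts are listed in Customizing TTS Models):

```xml
<speak>
  The word <phoneme alphabet="ipa" ph="təˈmɑːtoʊ">tomato</phoneme>
  is rendered with the IPA pronunciation given in the ph attribute.
</speak>
```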
Custom Pronunciation#
Define a text-based dictionary that maps words to IPA phonetic representations. The TTS model applies these custom pronunciations during synthesis. Refer to Phoneme Support for supported phonemes.
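Such a dictionary might look like the following sketch, one word and its IPA representation per line; the exact file format and delimiter are specified in the referenced documentation, so treat this shape as an assumption:

```text
tomato  təˈmɑːtoʊ
gpu  ˌdʒiˌpiˈjuː
```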
Next Steps#
TTS Tutorial: Deploy your first TTS NIM and synthesize speech step by step.
Deploy and Run TTS NIM: Docker and Helm deployment with inference examples for all models and APIs.
Voices and Emotional Styles: Voice naming convention, speakers, and emotional variants.
Voice Cloning: Clone a voice from a reference audio recording.
Batch Synthesis: Synthesize from text files with parallel processing.
Customizing TTS Models: SSML tags and custom pronunciation dictionaries.
Phoneme Support: Supported IPA phonemes for English.
TTS Support Matrix: GPU requirements and model profiles.