About NVIDIA TTS NIM Microservice#

The NVIDIA Text-to-Speech (TTS) NIM microservice synthesizes natural-sounding speech from text. It packages pre-trained NeMo models with the full NVIDIA inference stack into self-contained containers that handle model download, optimization, and serving.

The NVIDIA TTS NIM microservice supports two synthesis modes:

  • Offline: Generates the complete audio and returns it in a single response. Simpler to use, but the response is subject to a 4 MB gRPC message size limit, which caps how much audio a single request can return.

  • Streaming: Returns audio in chunks as they are generated. Provides lower time-to-first-audio and handles arbitrarily long text.
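To get a feel for when the offline limit matters, the arithmetic can be sketched as follows. This is a rough estimate only: it assumes uncompressed 16-bit PCM audio, treats "4 MB" as 4 × 1024 × 1024 bytes, and ignores any message framing overhead. The sample rates used are illustrative, not values confirmed for any particular model.

```python
BYTES_PER_SAMPLE = 2                 # assumption: 16-bit PCM output
GRPC_LIMIT_BYTES = 4 * 1024 * 1024   # assumption: "4 MB" read as 4 MiB


def max_offline_seconds(sample_rate_hz: int, channels: int = 1,
                        limit_bytes: int = GRPC_LIMIT_BYTES) -> float:
    """Upper bound on the audio duration that fits in one response."""
    bytes_per_second = sample_rate_hz * channels * BYTES_PER_SAMPLE
    return limit_bytes / bytes_per_second


def fits_offline(duration_s: float, sample_rate_hz: int, channels: int = 1) -> bool:
    """True if audio of this duration stays within the estimated limit."""
    return duration_s <= max_offline_seconds(sample_rate_hz, channels)


if __name__ == "__main__":
    for rate in (22050, 44100):
        print(f"{rate} Hz mono: about {max_offline_seconds(rate):.1f} s per response")
```

Under these assumptions, a higher output sample rate roughly halves the audio that fits in one offline response, which is one reason to prefer streaming for long passages.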

Available Models#

| Model | Languages | Modes | Key Capability |
|---|---|---|---|
| Magpie TTS Multilingual | English, Spanish, French, German, Mandarin, Vietnamese, Italian | Streaming + Offline | Multi-voice, multi-language synthesis |
| Magpie TTS Zeroshot | English | Streaming + Offline | Voice cloning from a reference audio sample |
| Magpie TTS Flow | English | Offline | Voice cloning with audio prompt and transcript |

Note: Magpie TTS Zeroshot and Magpie TTS Flow require access approval. Request access through this form.

For GPU requirements and all model profiles, refer to the TTS support matrix.

Key Capabilities#

Voice Selection and Emotional Styles#

Each TTS model includes multiple voices organized by language and locale. The Magpie Multilingual model provides voices with optional emotional style variants, such as Magpie-Multilingual.EN-US.Aria.Happy and Magpie-Multilingual.FR-FR.Pascal.Calm. Refer to Voices and Emotional Styles for the naming convention and available emotions.
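The two example names above suggest a dot-separated pattern of model, locale, voice, and an optional emotion. The sketch below splits a name along those lines. The pattern is inferred from the examples only, and `VoiceName` and `parse_voice_name` are hypothetical helpers, not part of the NIM API; the Voices and Emotional Styles page remains the authoritative reference.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VoiceName:
    model: str
    locale: str
    speaker: str
    emotion: Optional[str] = None  # absent when no emotional style is requested


def parse_voice_name(name: str) -> VoiceName:
    """Split a voice name of the assumed form Model.Locale.Speaker[.Emotion]."""
    parts = name.split(".")
    if len(parts) not in (3, 4) or not all(parts):
        raise ValueError(f"unexpected voice name format: {name!r}")
    model, locale, speaker = parts[:3]
    emotion = parts[3] if len(parts) == 4 else None
    return VoiceName(model, locale, speaker, emotion)


if __name__ == "__main__":
    print(parse_voice_name("Magpie-Multilingual.EN-US.Aria.Happy"))
    print(parse_voice_name("Magpie-Multilingual.FR-FR.Pascal.Calm"))
```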

Voice Cloning#

The Magpie TTS Zeroshot and Magpie TTS Flow models clone a voice from a short reference audio recording (3–10 seconds, 16-bit mono WAV at 22.05 kHz or higher). Magpie TTS Flow also requires a transcript of the audio prompt. Refer to Voice Cloning for audio prompt preparation, quality tuning, and examples.
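Before sending a reference recording, it can help to check it against the constraints listed above. The following is a minimal sketch using Python's standard `wave` module; `check_reference_wav` is a hypothetical helper, and it only checks the format properties stated here (duration, bit depth, channel count, sample rate), not recording quality.

```python
import wave

MIN_SECONDS, MAX_SECONDS = 3.0, 10.0
MIN_SAMPLE_RATE_HZ = 22050


def check_reference_wav(path: str) -> list[str]:
    """Return a list of problems; an empty list means the stated constraints are met."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_width = wav.getsampwidth()   # bytes per sample
        rate = wav.getframerate()
        frames = wav.getnframes()

    duration = frames / rate if rate else 0.0
    problems = []
    if channels != 1:
        problems.append(f"expected mono, found {channels} channels")
    if sample_width != 2:
        problems.append(f"expected 16-bit samples, found {sample_width * 8}-bit")
    if rate < MIN_SAMPLE_RATE_HZ:
        problems.append(f"sample rate {rate} Hz is below {MIN_SAMPLE_RATE_HZ} Hz")
    if not MIN_SECONDS <= duration <= MAX_SECONDS:
        problems.append(f"duration {duration:.2f} s is outside 3-10 s")
    return problems
```

Calling `check_reference_wav("prompt.wav")` on a hypothetical file returns an empty list when the file is usable as described, or a list of human-readable issues otherwise.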

Batch Synthesis#

The WebSocket realtime client supports synthesizing speech from text files and processing multiple lines in parallel for higher throughput. Refer to Batch Synthesis for details.
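The general idea of processing lines in parallel can be illustrated with a small, client-agnostic sketch. It does not use the actual WebSocket realtime client: `synthesize_lines` is a hypothetical helper that takes any callable mapping text to audio bytes, and `fake_synthesize` is a placeholder standing in for a real request.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def synthesize_lines(lines: Iterable[str],
                     synthesize: Callable[[str], bytes],
                     max_workers: int = 4) -> List[bytes]:
    """Synthesize each non-empty line in parallel, preserving input order."""
    texts = [line.strip() for line in lines if line.strip()]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the inputs.
        return list(pool.map(synthesize, texts))


def fake_synthesize(text: str) -> bytes:
    """Placeholder: a real implementation would call the TTS service here."""
    return text.encode("utf-8")


if __name__ == "__main__":
    results = synthesize_lines(["Hello there.", "", "How are you?"], fake_synthesize)
    print(len(results), "clips")
```

Threads suit this workload because each request spends most of its time waiting on the network; the worker count here is an arbitrary illustrative value.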

SSML Customization#

The TTS NIM microservice supports a subset of Speech Synthesis Markup Language (SSML) for controlling phoneme pronunciation. Refer to Customizing TTS Models for details and examples.
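As an illustration of what phoneme markup looks like, the sketch below builds an SSML string using the standard W3C `<phoneme>` element with an IPA `ph` attribute. Whether the microservice accepts exactly these attributes is an assumption here; `phoneme` and `speak` are hypothetical helpers, and Customizing TTS Models lists the supported subset.

```python
from xml.sax.saxutils import escape, quoteattr


def phoneme(word: str, ipa: str) -> str:
    """Wrap a word in an SSML <phoneme> element carrying its IPA pronunciation."""
    # quoteattr adds the surrounding quotes and escapes the attribute value.
    return f"<phoneme alphabet=\"ipa\" ph={quoteattr(ipa)}>{escape(word)}</phoneme>"


def speak(*fragments: str) -> str:
    """Join already-escaped fragments inside an SSML <speak> root element."""
    return "<speak>" + "".join(fragments) + "</speak>"


if __name__ == "__main__":
    ssml = speak(escape("You say "), phoneme("tomato", "təˈmeɪtoʊ"), escape("."))
    print(ssml)
```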

Custom Pronunciation#

You can define a text-based dictionary that maps words to their International Phonetic Alphabet (IPA) representations. The TTS model applies these custom pronunciations during synthesis. Refer to Phoneme Support for supported phonemes.

Next Steps#