Run Your First NVIDIA Speech NIM Microservice for Text-to-Speech#
In this tutorial, you will deploy a TTS (Text-to-Speech) NIM microservice and use it to synthesize spoken audio from text. You will learn how to select voices and languages, understand the difference between offline and streaming synthesis, and know when to use each API.
If you completed the ASR tutorial first, you will recognize the same deployment workflow: set environment variables, start the container, check readiness, and send requests.
What You Learn#
By completing this tutorial, you:
Deploy a TTS NIM using the same CONTAINER_ID and NIM_TAGS_SELECTOR pattern you learned in the ASR tutorial.
Discover and select voices from the available voice catalog.
Understand the difference between offline synthesis (full audio in one response) and streaming synthesis (audio returned in chunks for low-latency playback).
Know the gRPC message size limit and when streaming is required instead of offline.
Interact with the TTS NIM through gRPC, HTTP, and WebSocket.
What You Need#
A Linux system with a supported NVIDIA GPU (refer to the support matrix).
Completed setup: prerequisites and NGC access.
A terminal with Docker available and NGC_API_KEY exported.
The NVIDIA Riva Python client installed.
Approximately 30-45 minutes (includes model download time on first run).
Key Concepts#
| Concept | Description |
|---|---|
| Voices and languages | TTS models contain multiple voices, each associated with a language and locale. For example, Magpie-Multilingual.EN-US.Aria is an English (US) voice. |
| Offline vs. streaming synthesis | Offline synthesis generates the full audio and returns it in a single response. This is simpler but is subject to the default gRPC message size limit of 4 MB. Streaming synthesis returns audio in chunks as they are generated, which provides lower time-to-first-audio and handles arbitrarily long text. |
Step 1: Deploy the TTS NIM Microservice#
Choose a model from the following tabs and deploy it as a TTS NIM microservice.
Tip
To find all available TTS models and languages, refer to Supported Models.
export CONTAINER_ID=magpie-tts-multilingual
export NIM_TAGS_SELECTOR=name=magpie-tts-multilingual
docker run -it --rm --name=$CONTAINER_ID \
--runtime=nvidia \
--gpus '"device=0"' \
--shm-size=8GB \
-e NGC_API_KEY \
-e NIM_HTTP_API_PORT=9000 \
-e NIM_GRPC_API_PORT=50051 \
-p 9000:9000 \
-p 50051:50051 \
-e NIM_TAGS_SELECTOR \
nvcr.io/nim/nvidia/$CONTAINER_ID:latest
Zeroshot TTS can clone a voice from a reference audio sample.
export CONTAINER_ID=magpie-tts-zeroshot
export NIM_TAGS_SELECTOR=name=magpie-tts-zeroshot
docker run -it --rm --name=$CONTAINER_ID \
--runtime=nvidia \
--gpus '"device=0"' \
--shm-size=8GB \
-e NGC_API_KEY \
-e NIM_HTTP_API_PORT=9000 \
-e NIM_GRPC_API_PORT=50051 \
-p 9000:9000 \
-p 50051:50051 \
-e NIM_TAGS_SELECTOR \
nvcr.io/nim/nvidia/$CONTAINER_ID:latest
export CONTAINER_ID=magpie-tts-flow
export NIM_TAGS_SELECTOR=name=magpie-tts-flow
docker run -it --rm --name=$CONTAINER_ID \
--runtime=nvidia \
--gpus '"device=0"' \
--shm-size=8GB \
-e NGC_API_KEY \
-e NIM_HTTP_API_PORT=9000 \
-e NIM_GRPC_API_PORT=50051 \
-p 9000:9000 \
-p 50051:50051 \
-e NIM_TAGS_SELECTOR \
nvcr.io/nim/nvidia/$CONTAINER_ID:latest
Note
Access to Magpie TTS Zeroshot and Magpie TTS Flow is restricted. Request access via this form.
Like the ASR NIM, the first startup downloads the model and can build optimized engines. This can take up to 30 minutes. The container is ready when the health check returns ready.
Step 2: Check Service Readiness#
Open a new terminal and verify the service is ready.
curl -X 'GET' 'http://localhost:9000/v1/health/ready'
Expected response:
{"status":"ready"}
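Because the first startup can take up to 30 minutes, it can be convenient to poll the readiness endpoint from a script instead of re-running curl by hand. A minimal sketch using only the Python standard library (the wait_until_ready helper and its timeout values are illustrative, not part of the NIM):

```python
import json
import time
import urllib.error
import urllib.request

# Illustrative helper: poll the readiness endpoint until it reports
# {"status": "ready"} or the timeout elapses.
def wait_until_ready(url="http://localhost:9000/v1/health/ready",
                     timeout_s=1800.0, poll_s=10.0):
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if json.loads(resp.read().decode()).get("status") == "ready":
                    return True
        except (urllib.error.URLError, ValueError):
            pass  # container still starting up; keep polling
        time.sleep(poll_s)
    return False
```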
Step 3: Discover Available Voices#
Before synthesizing speech, list the voices the NIM is serving. This is the TTS equivalent of listing models in ASR.
gRPC#
python3 python-clients/scripts/tts/talk.py \
--server 0.0.0.0:50051 \
--list-voices
HTTP#
curl -sS http://localhost:9000/v1/audio/list_voices | jq
The output lists voice names organized by language. For example, Magpie-Multilingual.EN-US.Aria and Magpie-Multilingual.EN-US.Jason are two English (US) voices. You will use these voice names in synthesis requests.
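Because voice names follow a model.LOCALE.speaker pattern, you can filter the listing programmatically. A small sketch using a hard-coded sample of the response shown in this tutorial (the voices_for_locale helper is illustrative):

```python
# Hard-coded sample of the list_voices response, truncated from this tutorial.
listing = {
    "en-US,es-US,fr-FR,de-DE,zh-CN": {
        "voices": [
            "Magpie-Multilingual.EN-US.Jason",
            "Magpie-Multilingual.EN-US.Aria",
            "Magpie-Multilingual.FR-FR.Pascal",
            "Magpie-Multilingual.ZH-CN.Mia",
        ]
    }
}

# Illustrative helper: the locale is the middle dot-separated component
# of each voice name (model.LOCALE.speaker).
def voices_for_locale(listing, locale):
    matches = []
    for entry in listing.values():
        for name in entry.get("voices", []):
            parts = name.split(".")
            if len(parts) == 3 and parts[1].lower() == locale.lower():
                matches.append(name)
    return matches

print(voices_for_locale(listing, "en-US"))
# ['Magpie-Multilingual.EN-US.Jason', 'Magpie-Multilingual.EN-US.Aria']
```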
Different applications need different voices (gender, language, accent). Knowing how to discover and select voices is essential for integrating TTS into your application.
Step 4: Synthesize Speech#
Choose the TTS model you deployed in Step 1 and follow the instructions to synthesize speech.
The Magpie TTS Multilingual model supports text-to-speech in multiple languages.
Ensure that you have deployed the Magpie TTS Multilingual model by referring to the Supported Models section.
The following sections show how to use the model with a sample Python client and curl commands for the gRPC and HTTP APIs, respectively.
List available models and voices
python3 python-clients/scripts/tts/talk.py \
--server 0.0.0.0:50051 \
--list-voices
curl -sS http://localhost:9000/v1/audio/list_voices | jq
The output is piped to the jq command to format the JSON for readability.
The response lists the available voices for each supported language. The output below is truncated for brevity.
{
"en-US,es-US,fr-FR,de-DE,zh-CN": {
"voices": [
"Magpie-Multilingual.EN-US.Jason",
"Magpie-Multilingual.EN-US.Aria",
"Magpie-Multilingual.EN-US.Leo",
...
"Magpie-Multilingual.DE-DE.Leo",
"Magpie-Multilingual.DE-DE.Aria",
...
"Magpie-Multilingual.ZH-CN.Mia",
"Magpie-Multilingual.ZH-CN.Aria"
]
}
}
Synthesize speech with Offline API
With the Offline API, the entire synthesized speech is returned to the client at once. The synthesized speech is saved in output.wav.
python3 python-clients/scripts/tts/talk.py --server 0.0.0.0:50051 \
--language-code en-US \
--text "Experience the future of speech AI with Riva, where every word comes to life with clarity and emotion." \
--voice Magpie-Multilingual.EN-US.Aria \
--output output.wav
curl -sS http://localhost:9000/v1/audio/synthesize --fail-with-body \
-F language=en-US \
-F text="Experience the future of speech AI with Riva, where every word comes to life with clarity and emotion." \
-F voice=Magpie-Multilingual.EN-US.Aria \
--output output.wav
You can intermix voices and languages to generate speech with different accents. For example, the following command synthesizes English speech with a French accent.
python3 python-clients/scripts/tts/talk.py --server 0.0.0.0:50051 \
--language-code en-US \
--text "Experience the future of speech AI with Riva, where every word comes to life with clarity and emotion." \
--voice Magpie-Multilingual.FR-FR.Pascal \
--output output.wav
Note
By default, gRPC limits incoming message size to 4 MB. As the Offline API returns synthesized speech in a single chunk, an error will occur if the synthesized speech exceeds this size. In such cases, we recommend using the Streaming API instead.
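As a rough back-of-envelope check, assuming 16-bit mono LPCM at the 22.05 kHz sample rate used elsewhere in this tutorial (and treating the limit as 4 MiB), the default cap corresponds to about 95 seconds of audio:

```python
# Back-of-envelope: how much 16-bit mono LPCM fits under the default cap?
GRPC_LIMIT_BYTES = 4 * 1024 * 1024  # default gRPC inbound message size limit
SAMPLE_RATE_HZ = 22050              # sample rate used in this tutorial
BYTES_PER_SAMPLE = 2                # 16-bit samples, mono

max_seconds = GRPC_LIMIT_BYTES / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
print(f"~{max_seconds:.0f} seconds of audio fit in one offline response")
# ~95 seconds of audio fit in one offline response
```

Text that synthesizes to longer than this must use the Streaming API.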
Synthesize speech with Streaming API
With the Streaming API, synthesized speech is returned in chunks as it is generated. The Streaming API is recommended for real-time applications that require the lowest latency.
Streaming synthesis using gRPC and the Realtime WebSocket API
python3 python-clients/scripts/tts/talk.py --server 0.0.0.0:50051 \
--language-code en-US \
--text "Experience the future of speech AI with Riva, where every word comes to life with clarity and emotion." \
--voice Magpie-Multilingual.EN-US.Aria \
--stream \
--output output.wav
curl -sS http://localhost:9000/v1/audio/synthesize_online --fail-with-body \
-F language=en-US \
-F text="Experience the future of speech AI with Riva, where every word comes to life with clarity and emotion." \
-F voice=Magpie-Multilingual.EN-US.Aria \
-F sample_rate_hz=22050 \
--output output.raw
The streaming HTTP API output is raw LPCM without a WAV header. You can use sox to prepend a WAV header and save the result as a WAV file.
sox -b 16 -e signed -c 1 -r 22050 output.raw output.wav
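If sox is not available, the Python standard library wave module can wrap the raw LPCM in a WAV container instead. A minimal sketch (the raw_to_wav helper is illustrative and assumes 16-bit mono output at the sample_rate_hz passed in the request):

```python
import wave

# Illustrative alternative to sox: wrap headerless 16-bit mono LPCM
# in a WAV container using only the Python standard library.
def raw_to_wav(raw_path, wav_path, sample_rate_hz=22050):
    with open(raw_path, "rb") as f:
        pcm = f.read()
    with wave.open(wav_path, "wb") as wav:
        wav.setnchannels(1)              # mono
        wav.setsampwidth(2)              # 16-bit samples
        wav.setframerate(sample_rate_hz)
        wav.writeframes(pcm)

# Usage: raw_to_wav("output.raw", "output.wav")
```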
python3 python-clients/scripts/tts/realtime_tts_client.py \
--server 0.0.0.0:9000 \
--language-code en-US \
--text "Experience the future of speech AI with Riva, where every word comes to life with clarity and emotion." \
--voice Magpie-Multilingual.EN-US.Aria \
--output output.wav
Realtime WebSocket API provides the lowest latency for interactive applications by streaming text data and receiving real-time speech audio output.
The Magpie TTS Zeroshot model supports text-to-speech in English using an audio prompt. Voice characteristics from the audio prompt are applied to the synthesized output speech. This model supports streaming and offline inference.
The following sections show how to use the model with a sample Python client and curl commands for the gRPC and HTTP APIs, respectively.
Make sure that you have deployed the Magpie TTS Zeroshot model by referring to the Supported Models section.
You can create an audio prompt using any voice recording application.
Guidelines for creating an effective audio prompt
The audio must be a 16-bit mono WAV file with a sample rate of 22.05 kHz or higher.
Aim for a duration of five seconds.
Trim silence from the beginning and end so that speech fills most of the prompt.
Record the prompt in a noise-free environment.
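Before sending your own recording, you can sanity-check it against these guidelines with the Python standard library wave module. A small sketch (the check_audio_prompt helper and its 3-10 second duration window are illustrative assumptions around "aim for five seconds", not limits enforced by the NIM):

```python
import wave

# Illustrative sanity check against the audio-prompt guidelines above.
# The 3-10 second duration window is an assumption, not a NIM limit.
def check_audio_prompt(path):
    problems = []
    with wave.open(path, "rb") as w:
        if w.getnchannels() != 1:
            problems.append("audio must be mono")
        if w.getsampwidth() != 2:
            problems.append("audio must be 16-bit")
        if w.getframerate() < 22050:
            problems.append("sample rate must be at least 22.05 kHz")
        duration = w.getnframes() / w.getframerate()
        if not 3.0 <= duration <= 10.0:
            problems.append(f"duration is {duration:.1f}s; aim for about five seconds")
    return problems
```

An empty list means the file matches the guidelines; otherwise each entry names a problem to fix before using the file as a prompt.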
The following commands use the sample audio prompt provided at python-clients/data/examples/sample_audio_prompt.wav. Synthesized speech will be saved in output.wav and the voice will have characteristics similar to those in the provided audio prompt. If you are using your own audio prompt, make sure to update the value passed to the --zero_shot_audio_prompt_file argument.
Synthesize speech with Offline API
With the Offline API, the entire synthesized speech is returned to the client at once.
python3 python-clients/scripts/tts/talk.py --server 0.0.0.0:50051 \
--language-code en-US \
--text "Experience the future of speech AI with Riva, where every word comes to life with clarity and emotion." \
--zero_shot_audio_prompt_file python-clients/data/examples/sample_audio_prompt.wav \
--output output.wav
curl -sS http://localhost:9000/v1/audio/synthesize --fail-with-body \
-F language=en-US \
-F text="Experience the future of speech AI with Riva, where every word comes to life with clarity and emotion." \
-F audio_prompt=@$HOME/python-clients/data/examples/sample_audio_prompt.wav \
--output output.wav
Note
The @ prefix is mandatory in the HTTP audio_prompt parameter; it instructs curl to upload the file contents.
Synthesize speech with Streaming API
With the Streaming API, synthesized speech is returned in chunks as it is generated. The Streaming API is recommended for real-time applications that require the lowest latency.
Streaming synthesis using gRPC and the Realtime WebSocket API
python3 python-clients/scripts/tts/talk.py --server 0.0.0.0:50051 \
--language-code en-US \
--text "Experience the future of speech AI with Riva, where every word comes to life with clarity and emotion." \
--zero_shot_audio_prompt_file python-clients/data/examples/sample_audio_prompt.wav \
--stream \
--output output.wav
curl -sS http://localhost:9000/v1/audio/synthesize_online --fail-with-body \
-F language=en-US \
-F text="Experience the future of speech AI with Riva, where every word comes to life with clarity and emotion." \
-F audio_prompt=@$HOME/python-clients/data/examples/sample_audio_prompt.wav \
--output output.wav
python3 python-clients/scripts/tts/realtime_tts_client.py \
--server 0.0.0.0:9000 \
--language-code en-US \
--text "Experience the future of speech AI with Riva, where every word comes to life with clarity and emotion." \
--zero_shot_audio_prompt_file python-clients/data/examples/sample_audio_prompt.wav \
--output output.wav
Note
The @ prefix is mandatory in the HTTP audio_prompt parameter; it instructs curl to upload the file contents.
The Magpie TTS Flow model supports text-to-speech in English using an audio prompt and its transcript text. Voice characteristics from the audio prompt are applied to the synthesized output speech. Unlike the Magpie TTS Zeroshot model, this model additionally requires the prompt transcript as input. This model supports only offline inference.
The following sections show how to use the model with a sample Python client and curl commands for the gRPC and HTTP APIs, respectively.
Make sure that you have deployed the Magpie TTS Flow model by referring to the Supported Models section.
You can create an audio prompt using any voice recording application.
Guidelines for creating an effective audio prompt
The audio must be a 16-bit mono WAV file with a sample rate of 22.05 kHz or higher.
Aim for a duration of five seconds.
Trim silence from the beginning and end so that speech fills most of the prompt.
Record the prompt in a noise-free environment.
The following commands use the sample audio prompt provided at python-clients/data/examples/sample_audio_prompt.wav. The synthesized speech is saved in output.wav, and the voice has characteristics similar to those in the provided audio prompt. If you are using your own audio prompt, make sure to update the values passed to the --zero_shot_audio_prompt_file and --zero_shot_transcript arguments.
python3 python-clients/scripts/tts/talk.py --server 0.0.0.0:50051 \
--language-code en-US \
--text "Experience the future of speech AI with Riva, where every word comes to life with clarity and emotion." \
--zero_shot_audio_prompt_file python-clients/data/examples/sample_audio_prompt.wav \
--zero_shot_transcript "I consent to use my voice to create a synthetic voice." \
--output output.wav
curl -sS http://localhost:9000/v1/audio/synthesize --fail-with-body \
-F language=en-US \
-F text="Experience the future of speech AI with Riva, where every word comes to life with clarity and emotion." \
-F sample_rate_hz=22050 \
-F audio_prompt="@$HOME/python-clients/data/examples/sample_audio_prompt.wav" \
-F audio_prompt_transcript="I consent to use my voice to create a synthetic voice." \
--output output.wav
Note
The Magpie TTS Flow model supports only offline APIs.
The @ prefix is mandatory in the HTTP audio_prompt parameter; it instructs curl to upload the file contents.
What You Learned#
In this tutorial, you have learned the following:
Deployed a TTS NIM using the same workflow as ASR: set CONTAINER_ID and NIM_TAGS_SELECTOR, start the container, and check readiness.
Discovered voices using --list-voices and the HTTP voice listing endpoint, and learned that voices are identified by model, locale, and name.
Compared offline and streaming synthesis. Offline returns the full audio in one response and is simplest to use. Streaming returns audio in chunks, providing lower latency and handling longer text without hitting the gRPC message size limit.
Used three protocols: gRPC (primary, supports both modes), HTTP/REST (simple integration), and WebSocket (browser-friendly streaming).
Next Steps#
NMT Tutorial: Continue the learning path by deploying a translation NIM. You will learn how to control translation output with exclusion tags and custom dictionaries.
TTS Developer Guide: Explore TTS capabilities in depth, including SSML, custom pronunciation, and phoneme support.
Deploy TTS with Helm: Kubernetes deployment for production.