Run Your First NVIDIA Speech NIM Microservice for Text-to-Speech#
In this tutorial, you will deploy a TTS (Text-to-Speech) NIM microservice and use it to synthesize spoken audio from text. You will learn how to select voices and languages, understand the difference between offline and streaming synthesis, and know when to use each API.
If you completed the ASR tutorial first, you will recognize the same deployment workflow: set environment variables, start the container, check readiness, and send requests.
What You Learn#
By completing this tutorial, you:
Deploy a TTS NIM using the same CONTAINER_ID and NIM_TAGS_SELECTOR pattern you learned in the ASR tutorial.
Discover and select voices from the available voice catalog.
Understand the difference between offline synthesis (full audio in one response) and streaming synthesis (audio returned in chunks for low-latency playback).
Know the gRPC message size limit and when streaming is required instead of offline.
Interact with the TTS NIM through gRPC, HTTP, and WebSocket.
What You Need#
A Linux system with a supported NVIDIA GPU (refer to the support matrix).
Completed setup: prerequisites and NGC access.
A terminal with Docker available and NGC_API_KEY exported.
The NVIDIA Riva Python client installed.
Approximately 30-45 minutes (includes model download time on first run).
Key Concepts#
| Concept | Description |
|---|---|
| Voices and languages | TTS models contain multiple voices, each associated with a language and locale. For example, Magpie-Multilingual.EN-US.Aria is an English (US) voice. |
| Offline vs. streaming synthesis | Offline synthesis generates the full audio and returns it in a single response. This is simpler but is subject to the default gRPC message size limit of 4 MB. Streaming synthesis returns audio in chunks as they are generated, which provides lower time-to-first-audio and handles arbitrarily long text. |
Step 1: Deploy the TTS NIM Microservice#
Choose a model from the following tabs and deploy it as a TTS NIM microservice.
Tip
To find all available TTS models and languages, refer to Supported Models.
export CONTAINER_ID=magpie-tts-multilingual
export NIM_TAGS_SELECTOR=name=magpie-tts-multilingual
docker run -it --rm --name=$CONTAINER_ID \
--runtime=nvidia \
--gpus '"device=0"' \
--shm-size=8GB \
-e NGC_API_KEY \
-e NIM_HTTP_API_PORT=9000 \
-e NIM_GRPC_API_PORT=50051 \
-p 9000:9000 \
-p 50051:50051 \
-e NIM_TAGS_SELECTOR \
nvcr.io/nim/nvidia/$CONTAINER_ID:latest
Zeroshot TTS can clone a voice from a reference audio sample.
export CONTAINER_ID=magpie-tts-zeroshot
export NIM_TAGS_SELECTOR=name=magpie-tts-zeroshot
docker run -it --rm --name=$CONTAINER_ID \
--runtime=nvidia \
--gpus '"device=0"' \
--shm-size=8GB \
-e NGC_API_KEY \
-e NIM_HTTP_API_PORT=9000 \
-e NIM_GRPC_API_PORT=50051 \
-p 9000:9000 \
-p 50051:50051 \
-e NIM_TAGS_SELECTOR \
nvcr.io/nim/nvidia/$CONTAINER_ID:latest
export CONTAINER_ID=magpie-tts-flow
export NIM_TAGS_SELECTOR=name=magpie-tts-flow
docker run -it --rm --name=$CONTAINER_ID \
--runtime=nvidia \
--gpus '"device=0"' \
--shm-size=8GB \
-e NGC_API_KEY \
-e NIM_HTTP_API_PORT=9000 \
-e NIM_GRPC_API_PORT=50051 \
-p 9000:9000 \
-p 50051:50051 \
-e NIM_TAGS_SELECTOR \
nvcr.io/nim/nvidia/$CONTAINER_ID:latest
Note
Access to Magpie TTS Zeroshot and Magpie TTS Flow is restricted. Request access via this form.
Like the ASR NIM, the first startup downloads the model and can build optimized engines. This can take up to 30 minutes. The container is ready when the health check returns ready.
Step 2: Check Service Readiness#
Open a new terminal and verify the service is ready.
curl -X 'GET' 'http://localhost:9000/v1/health/ready'
Expected response:
{"status":"ready"}
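Because the first startup can take up to 30 minutes, it can be convenient to poll the readiness endpoint from a script instead of re-running curl by hand. A minimal sketch using only the Python standard library (the wait_until_ready helper and its timeout values are illustrative, not part of the NIM):

```python
import json
import time
import urllib.error
import urllib.request

# Illustrative helper: poll the readiness endpoint until it reports
# {"status": "ready"} or the timeout elapses.
def wait_until_ready(url="http://localhost:9000/v1/health/ready",
                     timeout_s=1800.0, poll_s=10.0):
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if json.loads(resp.read().decode()).get("status") == "ready":
                    return True
        except (urllib.error.URLError, ValueError):
            pass  # container still starting up; keep polling
        time.sleep(poll_s)
    return False
```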
Step 3: Discover Available Voices#
Before synthesizing speech, list the voices the NIM is serving. This is the TTS equivalent of listing models in ASR.
gRPC#
python3 python-clients/scripts/tts/talk.py \
--server 0.0.0.0:50051 \
--list-voices
HTTP#
curl -sS http://localhost:9000/v1/audio/list_voices | jq
The output lists voice names organized by language. For example, Magpie-Multilingual.EN-US.Aria and Magpie-Multilingual.EN-US.Jason are two English (US) voices. You will use these voice names in synthesis requests.
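Because voice names follow a model.LOCALE.speaker pattern, you can filter the listing programmatically. A small sketch using a hard-coded sample of the response shown in this tutorial (the voices_for_locale helper is illustrative):

```python
# Hard-coded sample of the list_voices response, truncated from this tutorial.
listing = {
    "en-US,es-US,fr-FR,de-DE,zh-CN": {
        "voices": [
            "Magpie-Multilingual.EN-US.Jason",
            "Magpie-Multilingual.EN-US.Aria",
            "Magpie-Multilingual.FR-FR.Pascal",
            "Magpie-Multilingual.ZH-CN.Mia",
        ]
    }
}

# Illustrative helper: the locale is the middle dot-separated component
# of each voice name (model.LOCALE.speaker).
def voices_for_locale(listing, locale):
    matches = []
    for entry in listing.values():
        for name in entry.get("voices", []):
            parts = name.split(".")
            if len(parts) == 3 and parts[1].lower() == locale.lower():
                matches.append(name)
    return matches

print(voices_for_locale(listing, "en-US"))
# ['Magpie-Multilingual.EN-US.Jason', 'Magpie-Multilingual.EN-US.Aria']
```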
Different applications need different voices (gender, language, accent). Knowing how to discover and select voices is essential for integrating TTS into your application.
Step 4: Synthesize Speech#
Choose the TTS model you deployed in Step 1 and follow the instructions to synthesize speech.
The Magpie TTS Multilingual model supports text-to-speech in multiple languages.
Ensure that you have deployed the Magpie TTS Multilingual model by referring to the Supported Models section.
The following sections show how to use the model with a sample Python client and curl commands for the gRPC and HTTP APIs, respectively.
List available models and voices
python3 python-clients/scripts/tts/talk.py \
--server 0.0.0.0:50051 \
--list-voices
curl -sS http://localhost:9000/v1/audio/list_voices | jq
The output is piped to the jq command to format the JSON for readability.
The response lists the available voices for each supported language. The output below is truncated for brevity.
{
"en-US,es-US,fr-FR,de-DE,zh-CN": {
"voices": [
"Magpie-Multilingual.EN-US.Jason",
"Magpie-Multilingual.EN-US.Aria",
"Magpie-Multilingual.EN-US.Leo",
...
"Magpie-Multilingual.DE-DE.Leo",
"Magpie-Multilingual.DE-DE.Aria",
...
"Magpie-Multilingual.ZH-CN.Mia",
"Magpie-Multilingual.ZH-CN.Aria"
]
}
}
Synthesize speech with Offline API
With the Offline API, the entire synthesized speech is returned to the client at once. The synthesized speech is saved in output.wav.
python3 python-clients/scripts/tts/talk.py --server 0.0.0.0:50051 \
--language-code en-US \
--text "Experience the future of speech AI with Riva, where every word comes to life with clarity and emotion." \
--voice Magpie-Multilingual.EN-US.Aria \
--output output.wav
curl -sS http://localhost:9000/v1/audio/synthesize --fail-with-body \
-F language=en-US \
-F text="Experience the future of speech AI with Riva, where every word comes to life with clarity and emotion." \
-F voice=Magpie-Multilingual.EN-US.Aria \
--output output.wav
You can intermix voices and languages to generate speech with different accents. For example, the following command synthesizes English speech with a French accent.
python3 python-clients/scripts/tts/talk.py --server 0.0.0.0:50051 \
--language-code en-US \
--text "Experience the future of speech AI with Riva, where every word comes to life with clarity and emotion." \
--voice Magpie-Multilingual.FR-FR.Pascal \
--output output.wav
Note
By default, gRPC limits incoming message size to 4 MB. As the Offline API returns synthesized speech in a single chunk, an error will occur if the synthesized speech exceeds this size. In such cases, we recommend using the Streaming API instead.
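As a rough back-of-envelope check, assuming 16-bit mono LPCM at the 22.05 kHz sample rate used elsewhere in this tutorial (and treating the limit as 4 MiB), the default cap corresponds to about 95 seconds of audio:

```python
# Back-of-envelope: how much 16-bit mono LPCM fits under the default cap?
GRPC_LIMIT_BYTES = 4 * 1024 * 1024  # default gRPC inbound message size limit
SAMPLE_RATE_HZ = 22050              # sample rate used in this tutorial
BYTES_PER_SAMPLE = 2                # 16-bit samples, mono

max_seconds = GRPC_LIMIT_BYTES / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
print(f"~{max_seconds:.0f} seconds of audio fit in one offline response")
# ~95 seconds of audio fit in one offline response
```

Text that synthesizes to longer than this must use the Streaming API.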
Synthesize speech with Streaming API
With the Streaming API, synthesized speech is returned in chunks as it is generated. The Streaming API is recommended for real-time applications that require the lowest latency.
Streaming synthesis using gRPC and the Realtime WebSocket API
python3 python-clients/scripts/tts/talk.py --server 0.0.0.0:50051 \
--language-code en-US \
--text "Experience the future of speech AI with Riva, where every word comes to life with clarity and emotion." \
--voice Magpie-Multilingual.EN-US.Aria \
--stream \
--output output.wav
curl -sS http://localhost:9000/v1/audio/synthesize_online --fail-with-body \
-F language=en-US \
-F text="Experience the future of speech AI with Riva, where every word comes to life with clarity and emotion." \
-F voice=Magpie-Multilingual.EN-US.Aria \
-F sample_rate_hz=22050 \
--output output.raw
The streaming HTTP API output is raw LPCM without a WAV header. You can use sox to prepend a WAV header and save the result as a WAV file.
sox -b 16 -e signed -c 1 -r 22050 output.raw output.wav
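If sox is not available, the Python standard library wave module can wrap the raw LPCM in a WAV container instead. A minimal sketch (the raw_to_wav helper is illustrative and assumes 16-bit mono output at the sample_rate_hz passed in the request):

```python
import wave

# Illustrative alternative to sox: wrap headerless 16-bit mono LPCM
# in a WAV container using only the Python standard library.
def raw_to_wav(raw_path, wav_path, sample_rate_hz=22050):
    with open(raw_path, "rb") as f:
        pcm = f.read()
    with wave.open(wav_path, "wb") as wav:
        wav.setnchannels(1)              # mono
        wav.setsampwidth(2)              # 16-bit samples
        wav.setframerate(sample_rate_hz)
        wav.writeframes(pcm)

# Usage: raw_to_wav("output.raw", "output.wav")
```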
python3 python-clients/scripts/tts/realtime_tts_client.py \
--server 0.0.0.0:9000 \
--language-code en-US \
--text "Experience the future of speech AI with Riva, where every word comes to life with clarity and emotion." \
--voice Magpie-Multilingual.EN-US.Aria \
--output output.wav
Realtime WebSocket API provides the lowest latency for interactive applications by streaming text data and receiving real-time speech audio output.
The Magpie TTS Zeroshot model supports text-to-speech in English using an audio prompt. Voice characteristics from the audio prompt are applied to the synthesized output speech. This model supports streaming and offline inference.
The following sections show how to use the model with a sample Python client and curl commands for the gRPC and HTTP APIs, respectively.
Make sure that you have deployed the Magpie TTS Zeroshot model by referring to the Supported Models section.
You can create an audio prompt using any voice recording application.
Guidelines for creating an effective audio prompt
The audio must be a 16-bit mono WAV file with a sample rate of 22.05 kHz or higher.
Aim for a duration of five seconds.
Trim silence from the beginning and end so that speech fills most of the prompt.
Record the prompt in a noise-free environment.
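Before sending your own recording, you can sanity-check it against these guidelines with the Python standard library wave module. A small sketch (the check_audio_prompt helper and its 3-10 second duration window are illustrative assumptions around "aim for five seconds", not limits enforced by the NIM):

```python
import wave

# Illustrative sanity check against the audio-prompt guidelines above.
# The 3-10 second duration window is an assumption, not a NIM limit.
def check_audio_prompt(path):
    problems = []
    with wave.open(path, "rb") as w:
        if w.getnchannels() != 1:
            problems.append("audio must be mono")
        if w.getsampwidth() != 2:
            problems.append("audio must be 16-bit")
        if w.getframerate() < 22050:
            problems.append("sample rate must be at least 22.05 kHz")
        duration = w.getnframes() / w.getframerate()
        if not 3.0 <= duration <= 10.0:
            problems.append(f"duration is {duration:.1f}s; aim for about five seconds")
    return problems
```

An empty list means the file matches the guidelines; otherwise each entry names a problem to fix before using the file as a prompt.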
The following commands use the sample audio prompt provided at python-clients/data/examples/sample_audio_prompt.wav. Synthesized speech will be saved in output.wav and the voice will have characteristics similar to those in the provided audio prompt. If you are using your own audio prompt, make sure to update the value passed to the --zero_shot_audio_prompt_file argument.
Synthesize speech with Offline API
With the Offline API, the entire synthesized speech is returned to the client at once.
python3 python-clients/scripts/tts/talk.py --server 0.0.0.0:50051 \
--language-code en-US \
--text "Experience the future of speech AI with Riva, where every word comes to life with clarity and emotion." \
--zero_shot_audio_prompt_file python-clients/data/examples/sample_audio_prompt.wav \
--output output.wav
curl -sS http://localhost:9000/v1/audio/synthesize --fail-with-body \
-F language=en-US \
-F text="Experience the future of speech AI with Riva, where every word comes to life with clarity and emotion." \
-F audio_prompt=@$HOME/python-clients/data/examples/sample_audio_prompt.wav \
--output output.wav
Note
The @ prefix is mandatory in the HTTP audio_prompt parameter; it instructs curl to upload the file contents.
Synthesize speech with Streaming API
With the Streaming API, synthesized speech is returned in chunks as it is generated. The Streaming API is recommended for real-time applications that require the lowest latency.
Streaming synthesis using gRPC and the Realtime WebSocket API
python3 python-clients/scripts/tts/talk.py --server 0.0.0.0:50051 \
--language-code en-US \
--text "Experience the future of speech AI with Riva, where every word comes to life with clarity and emotion." \
--zero_shot_audio_prompt_file python-clients/data/examples/sample_audio_prompt.wav \
--stream \
--output output.wav
curl -sS http://localhost:9000/v1/audio/synthesize_online --fail-with-body \
-F language=en-US \
-F text="Experience the future of speech AI with Riva, where every word comes to life with clarity and emotion." \
-F audio_prompt=@$HOME/python-clients/data/examples/sample_audio_prompt.wav \
--output output.wav
python3 python-clients/scripts/tts/realtime_tts_client.py \
--server 0.0.0.0:9000 \
--language-code en-US \
--text "Experience the future of speech AI with Riva, where every word comes to life with clarity and emotion." \
--zero_shot_audio_prompt_file python-clients/data/examples/sample_audio_prompt.wav \
--output output.wav
Note
The @ prefix is mandatory in the HTTP audio_prompt parameter; it instructs curl to upload the file contents.
The Magpie TTS Flow model supports text-to-speech in English using an audio prompt and its transcript text. Voice characteristics from the audio prompt are applied to the synthesized output speech. Unlike the Magpie TTS Zeroshot model, this model additionally requires the prompt transcript as input. This model supports only offline inference.
The following sections show how to use the model with a sample Python client and curl commands for the gRPC and HTTP APIs, respectively.
Make sure that you have deployed the Magpie TTS Flow model by referring to the Supported Models section.
You can create an audio prompt using any voice recording application.
Guidelines for creating an effective audio prompt
The audio must be a 16-bit mono WAV file with a sample rate of 22.05 kHz or higher.
Aim for a duration of five seconds.
Trim silence from the beginning and end so that speech fills most of the prompt.
Record the prompt in a noise-free environment.
The following commands use the sample audio prompt provided at python-clients/data/examples/sample_audio_prompt.wav. The synthesized speech is saved in output.wav, and the voice has characteristics similar to those in the provided audio prompt. If you are using your own audio prompt, make sure to update the values passed to the --zero_shot_audio_prompt_file and --zero_shot_transcript arguments.
python3 python-clients/scripts/tts/talk.py --server 0.0.0.0:50051 \
--language-code en-US \
--text "Experience the future of speech AI with Riva, where every word comes to life with clarity and emotion." \
--zero_shot_audio_prompt_file python-clients/data/examples/sample_audio_prompt.wav \
--zero_shot_transcript "I consent to use my voice to create a synthetic voice." \
--output output.wav
curl -sS http://localhost:9000/v1/audio/synthesize --fail-with-body \
-F language=en-US \
-F text="Experience the future of speech AI with Riva, where every word comes to life with clarity and emotion." \
-F sample_rate_hz=22050 \
-F audio_prompt="@$HOME/python-clients/data/examples/sample_audio_prompt.wav" \
-F audio_prompt_transcript="I consent to use my voice to create a synthetic voice." \
--output output.wav
Note
The Magpie TTS Flow model supports only offline APIs.
The @ prefix is mandatory in the HTTP audio_prompt parameter; it instructs curl to upload the file contents.
What You Learned#
In this tutorial, you have learned the following:
Deployed a TTS NIM using the same workflow as ASR: set CONTAINER_ID and NIM_TAGS_SELECTOR, start the container, and check readiness.
Discovered voices using --list-voices and the HTTP voice listing endpoint, and learned that voices are identified by model, locale, and name.
Compared offline and streaming synthesis. Offline returns the full audio in one response and is simplest to use. Streaming returns audio in chunks, providing lower latency and handling longer text without hitting the gRPC message size limit.
Used three protocols: gRPC (primary, supports both modes), HTTP/REST (simple integration), and WebSocket (browser-friendly streaming).
Next Steps#
NMT Tutorial: Continue the learning path by deploying a translation NIM. You will learn how to control translation output with exclusion tags and custom dictionaries.
TTS Developer Guide: Explore TTS capabilities in depth, including SSML, custom pronunciation, and phoneme support.
Deploy TTS with Helm: Kubernetes deployment for production.