Deploy and Run the TTS NIM Microservice#
Deploy a TTS model as a NIM container and run speech synthesis through gRPC, HTTP, or WebSocket.
For model details, GPU requirements, and supported voices, refer to the TTS support matrix.
Prerequisites#
Completed prerequisites and NGC access setup.
Installed the NVIDIA Riva Python client.
NGC_API_KEY exported in your terminal.
Deploy the NIM Container#
Each TTS model has its own container image. Set CONTAINER_ID and NIM_TAGS_SELECTOR based on the model you want to deploy.
Magpie TTS Multilingual: 7 languages, streaming + offline, multi-voice with emotional styles.
export CONTAINER_ID=magpie-tts-multilingual
export NIM_TAGS_SELECTOR=name=magpie-tts-multilingual
docker run -it --rm --name=$CONTAINER_ID \
--runtime=nvidia \
--gpus '"device=0"' \
--shm-size=8GB \
-e NGC_API_KEY \
-e NIM_HTTP_API_PORT=9000 \
-e NIM_GRPC_API_PORT=50051 \
-p 9000:9000 \
-p 50051:50051 \
-e NIM_TAGS_SELECTOR \
nvcr.io/nim/nvidia/$CONTAINER_ID:latest
The default model profile is batch_size=8. To select a different batch size, append it to NIM_TAGS_SELECTOR:
export NIM_TAGS_SELECTOR="name=magpie-tts-multilingual,batch_size=32"
Refer to the support matrix for available batch sizes and their GPU memory requirements.
Magpie TTS Zeroshot: English only, streaming + offline, voice cloning from an audio prompt.
export CONTAINER_ID=magpie-tts-zeroshot
export NIM_TAGS_SELECTOR=name=magpie-tts-zeroshot
docker run -it --rm --name=$CONTAINER_ID \
--runtime=nvidia \
--gpus '"device=0"' \
--shm-size=8GB \
-e NGC_API_KEY \
-e NIM_HTTP_API_PORT=9000 \
-e NIM_GRPC_API_PORT=50051 \
-p 9000:9000 \
-p 50051:50051 \
-e NIM_TAGS_SELECTOR \
nvcr.io/nim/nvidia/$CONTAINER_ID:latest
Note
Access to Magpie TTS Zeroshot is restricted. Apply for access.
Magpie TTS Flow: English only, offline only, voice cloning with an audio prompt and transcript.
export CONTAINER_ID=magpie-tts-flow
export NIM_TAGS_SELECTOR=name=magpie-tts-flow
docker run -it --rm --name=$CONTAINER_ID \
--runtime=nvidia \
--gpus '"device=0"' \
--shm-size=8GB \
-e NGC_API_KEY \
-e NIM_HTTP_API_PORT=9000 \
-e NIM_GRPC_API_PORT=50051 \
-p 9000:9000 \
-p 50051:50051 \
-e NIM_TAGS_SELECTOR \
nvcr.io/nim/nvidia/$CONTAINER_ID:latest
Note
Access to Magpie TTS Flow is restricted. Apply for access.
On first startup, the container downloads the model from NGC, which can take up to 30 minutes depending on network speed. A pre-built TensorRT engine is downloaded when one is available for the target GPU; otherwise, the container builds an optimized engine from the RMIR model on the fly, which further extends startup time.
Tip
Mount a local cache directory to avoid repeated downloads. See Model Caching.
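As a sketch of that tip, the deployment command can mount a host directory over the container's model cache so restarts reuse the downloaded model. The in-container path /opt/nim/.cache follows the common NIM convention; verify it against the Model Caching page before relying on it.

```shell
# Persist downloaded models across container restarts by mounting a host
# directory over the NIM cache path (assumed here to be /opt/nim/.cache).
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"

docker run -it --rm --name=$CONTAINER_ID \
    --runtime=nvidia \
    --gpus '"device=0"' \
    --shm-size=8GB \
    -e NGC_API_KEY \
    -e NIM_HTTP_API_PORT=9000 \
    -e NIM_GRPC_API_PORT=50051 \
    -p 9000:9000 \
    -p 50051:50051 \
    -e NIM_TAGS_SELECTOR \
    -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
    nvcr.io/nim/nvidia/$CONTAINER_ID:latest
```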
Verify Readiness#
Wait for the container to finish model setup, then check the health endpoint.
curl -X 'GET' 'http://localhost:9000/v1/health/ready'
Expected response:
{"status":"ready"}
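Because first startup can take tens of minutes, it can be convenient to poll the endpoint in a loop rather than retry by hand. A minimal sketch (the URL and 30-minute timeout simply mirror the defaults on this page):

```shell
# Poll the readiness endpoint until the service reports ready or the
# timeout (in seconds) expires. Returns non-zero on timeout.
wait_for_ready() {
    local url=$1 timeout=${2:-1800} elapsed=0
    until curl -sf "$url" | grep -q '"status":"ready"'; do
        sleep 5
        elapsed=$((elapsed + 5))
        if [ "$elapsed" -ge "$timeout" ]; then
            echo "timed out waiting for $url" >&2
            return 1
        fi
    done
    echo "ready"
}
```

With the deployment above, `wait_for_ready http://localhost:9000/v1/health/ready` blocks until the container finishes model setup.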
Run Speech Synthesis#
List Available Voices#
Before synthesizing, discover which voices the deployed model serves.
python3 python-clients/scripts/tts/talk.py \
--server 0.0.0.0:50051 \
--list-voices
curl -sS http://localhost:9000/v1/audio/list_voices | jq
The output lists voice names grouped by language. Use these names in the --voice parameter. For voice naming details, refer to Voices and Emotional Styles.
Offline Synthesis#
Offline synthesis generates the complete audio and returns it in a single response. The output is saved to output.wav.
python3 python-clients/scripts/tts/talk.py --server 0.0.0.0:50051 \
--language-code en-US \
--text "Deploy and run speech synthesis with NVIDIA TTS NIM." \
--voice Magpie-Multilingual.EN-US.Aria \
--output output.wav
curl -sS http://localhost:9000/v1/audio/synthesize --fail-with-body \
-F language=en-US \
-F text="Deploy and run speech synthesis with NVIDIA TTS NIM." \
-F voice=Magpie-Multilingual.EN-US.Aria \
--output output.wav
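After either command completes, you can sanity-check the result without listening to it. A small sketch that only inspects the header magic bytes, not the audio itself:

```shell
# Confirm the file starts with the RIFF magic (bytes 1-4) and the WAVE
# form type (bytes 9-12) of a WAV header.
check_wav() {
    [ "$(head -c 4 "$1")" = "RIFF" ] && \
    [ "$(tail -c +9 "$1" | head -c 4)" = "WAVE" ]
}
```

For example, `check_wav output.wav && echo OK` prints OK only if the file carries a WAV header.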
Note
gRPC limits message size to 4 MB by default. If the synthesized audio exceeds this limit, use streaming synthesis instead.
Streaming Synthesis#
Streaming synthesis returns audio in chunks as they are generated, providing lower time-to-first-audio.
python3 python-clients/scripts/tts/talk.py --server 0.0.0.0:50051 \
--language-code en-US \
--text "Deploy and run speech synthesis with NVIDIA TTS NIM." \
--voice Magpie-Multilingual.EN-US.Aria \
--stream \
--output output.wav
The streaming HTTP API returns raw LPCM audio without a WAV header. Use sox to convert the output.
curl -sS http://localhost:9000/v1/audio/synthesize_online --fail-with-body \
-F language=en-US \
-F text="Deploy and run speech synthesis with NVIDIA TTS NIM." \
-F voice=Magpie-Multilingual.EN-US.Aria \
-F sample_rate_hz=22050 \
--output output.raw
sox -b 16 -e signed -c 1 -r 22050 output.raw output.wav
The WebSocket realtime API provides the lowest latency for interactive applications.
python3 python-clients/scripts/tts/realtime_tts_client.py \
--server localhost:9000 \
--language-code en-US \
--text "Deploy and run speech synthesis with NVIDIA TTS NIM." \
--voice Magpie-Multilingual.EN-US.Aria \
--output output.wav
Client Parameters Reference#
talk.py (gRPC)#
--text and --list-voices are mutually exclusive.
| Parameter | Description | Default |
|---|---|---|
| --server | gRPC server address and port. | |
| --text | Text to synthesize. | – |
| --list-voices | List available voices, then exit. | – |
| --voice | Voice name to use. If omitted, the server selects the first available voice. | – |
| --language-code | Language code (for example, en-US). | |
| --output | Output WAV file path. | |
| --stream | Enable streaming synthesis. | |
| --sample-rate-hz | Output audio sample rate. | |
| | Output encoding. | |
| --custom-dictionary | Path to a custom pronunciation dictionary file. | – |
| | Path to audio prompt for voice cloning (3–10 s). | – |
| | Voice cloning quality (1–40). | |
| | Transcript of the audio prompt (Magpie TTS Flow only). | – |
| --play-audio | Play synthesized audio through speakers. | |
realtime_tts_client.py (WebSocket)#
| Parameter | Description | Default |
|---|---|---|
| --server | WebSocket server address and port. | |
| --text | Direct text input to synthesize. | – |
| | Path to a text file for batch synthesis. | – |
| --list-voices | List available voices, then exit. | – |
| --voice | Voice name to use. | – |
| --language-code | Language code. | |
| --output | Output WAV file path. | – |
| | Number of concurrent synthesis connections. | |
| | Path to audio prompt for voice cloning. | – |
| | Voice cloning quality (1–40). | |
| | Transcript of the audio prompt. | – |
| | Path to custom pronunciation dictionary. | – |
Helm Deployment#
For Kubernetes deployment, create a custom-values.yaml file:
image:
  repository: nvcr.io/nim/nvidia/magpie-tts-multilingual
  pullPolicy: IfNotPresent
  tag: latest
nim:
  ngcAPISecret: ngc-api
imagePullSecrets:
  - name: ngc-secret
envVars:
  NIM_TAGS_SELECTOR: name=magpie-tts-multilingual
Replace the repository and NIM_TAGS_SELECTOR values for other models:
| Model | Repository | NIM_TAGS_SELECTOR |
|---|---|---|
| Magpie TTS Multilingual | nvcr.io/nim/nvidia/magpie-tts-multilingual | name=magpie-tts-multilingual |
| Magpie TTS Zeroshot | nvcr.io/nim/nvidia/magpie-tts-zeroshot | name=magpie-tts-zeroshot |
| Magpie TTS Flow | nvcr.io/nim/nvidia/magpie-tts-flow | name=magpie-tts-flow |
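The values file references two Kubernetes secrets, ngc-api and ngc-secret, which must exist before the chart is installed. A sketch, assuming the standard NGC registry credentials and the secret key name NGC_API_KEY; the chart reference is left as a placeholder, so take the exact chart name from Deploying with Helm:

```shell
# Registry pull secret referenced by imagePullSecrets.
kubectl create secret docker-registry ngc-secret \
    --docker-server=nvcr.io \
    --docker-username='$oauthtoken' \
    --docker-password="$NGC_API_KEY"

# API-key secret referenced by nim.ngcAPISecret (key name is an assumption).
kubectl create secret generic ngc-api \
    --from-literal=NGC_API_KEY="$NGC_API_KEY"

# Install with the custom values; <chart> is a placeholder for the chart
# reference given in the Helm deployment guide.
helm install magpie-tts <chart> -f custom-values.yaml
```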
For complete Helm instructions, refer to Deploying with Helm.
Next Steps#
Voices and Emotional Styles: Voice naming convention and emotional variants.
Voice Cloning: Clone a voice from a reference audio recording.
Customizing TTS Models: SSML tags and custom pronunciation dictionaries.
Batch Synthesis: Synthesize from text files with parallel processing.
TTS Troubleshooting: Common issues and solutions.