Deploy and Run the TTS NIM Microservice#

Deploy a TTS model as a NIM container and run speech synthesis through gRPC, HTTP, or WebSocket.

For model details, GPU requirements, and supported voices, refer to the TTS support matrix.

Prerequisites#

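The deployment commands below pass `NGC_API_KEY` into the container and pull images from `nvcr.io`, so both must be set up first. A minimal setup sketch (the key value is a placeholder; generate a key from your NGC account):

```shell
# Export your NGC API key (placeholder value).
export NGC_API_KEY=<your-ngc-api-key>

# Log Docker in to the NVIDIA container registry.
# The username is literally "$oauthtoken"; the API key is the password.
echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin
```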
Deploy the NIM Container#

Each TTS model has its own container image. Set CONTAINER_ID and NIM_TAGS_SELECTOR based on the model you want to deploy.

Magpie TTS Multilingual: 7 languages, streaming + offline, multi-voice with emotional styles.

export CONTAINER_ID=magpie-tts-multilingual
export NIM_TAGS_SELECTOR=name=magpie-tts-multilingual

docker run -it --rm --name=$CONTAINER_ID \
  --runtime=nvidia \
  --gpus '"device=0"' \
  --shm-size=8GB \
  -e NGC_API_KEY \
  -e NIM_HTTP_API_PORT=9000 \
  -e NIM_GRPC_API_PORT=50051 \
  -p 9000:9000 \
  -p 50051:50051 \
  -e NIM_TAGS_SELECTOR \
  nvcr.io/nim/nvidia/$CONTAINER_ID:latest

The default model profile is batch_size=8. To select a different batch size, append it to NIM_TAGS_SELECTOR:

export NIM_TAGS_SELECTOR="name=magpie-tts-multilingual,batch_size=32"

Refer to the support matrix for available batch sizes and their GPU memory requirements.

Magpie TTS Zeroshot: English only, streaming + offline, voice cloning from an audio prompt.

export CONTAINER_ID=magpie-tts-zeroshot
export NIM_TAGS_SELECTOR=name=magpie-tts-zeroshot

docker run -it --rm --name=$CONTAINER_ID \
  --runtime=nvidia \
  --gpus '"device=0"' \
  --shm-size=8GB \
  -e NGC_API_KEY \
  -e NIM_HTTP_API_PORT=9000 \
  -e NIM_GRPC_API_PORT=50051 \
  -p 9000:9000 \
  -p 50051:50051 \
  -e NIM_TAGS_SELECTOR \
  nvcr.io/nim/nvidia/$CONTAINER_ID:latest

Note

Access to Magpie TTS Zeroshot is restricted. Apply for access.

Magpie TTS Flow: English only, offline only, voice cloning with an audio prompt and transcript.

export CONTAINER_ID=magpie-tts-flow
export NIM_TAGS_SELECTOR=name=magpie-tts-flow

docker run -it --rm --name=$CONTAINER_ID \
  --runtime=nvidia \
  --gpus '"device=0"' \
  --shm-size=8GB \
  -e NGC_API_KEY \
  -e NIM_HTTP_API_PORT=9000 \
  -e NIM_GRPC_API_PORT=50051 \
  -p 9000:9000 \
  -p 50051:50051 \
  -e NIM_TAGS_SELECTOR \
  nvcr.io/nim/nvidia/$CONTAINER_ID:latest

Note

Access to Magpie TTS Flow is restricted. Apply for access.

On first startup, the container downloads the model from NGC, which can take up to 30 minutes depending on network speed. When a pre-built TensorRT engine is available for the target GPU, it is downloaded directly; otherwise, the container generates an optimized engine from the RMIR model on the fly, which extends startup time further.

Tip

Mount a local cache directory to avoid repeated downloads. See Model Caching.
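As a sketch of that tip: create a host directory and bind-mount it into the container. The host path is an example, and `/opt/nim/.cache` is the assumed in-container cache location — confirm both in Model Caching.

```shell
# Host directory that persists downloaded models between container runs.
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"

# Add this flag to the docker run commands above
# (/opt/nim/.cache is the assumed in-container cache path):
#   -v "$LOCAL_NIM_CACHE:/opt/nim/.cache"
```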

Verify Readiness#

Wait for the container to finish model setup, then check the health endpoint.

curl -X 'GET' 'http://localhost:9000/v1/health/ready'

Expected response:

{"status":"ready"}
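Because model setup can take tens of minutes on first startup, a polling loop is more convenient than re-running curl by hand. A minimal sketch:

```shell
# Poll the readiness endpoint every 10 seconds, for up to 30 minutes.
for _ in $(seq 1 180); do
  if curl -sf http://localhost:9000/v1/health/ready | grep -q ready; then
    echo "TTS NIM is ready"
    break
  fi
  sleep 10
done
```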

Run Speech Synthesis#

List Available Voices#

Before synthesizing, discover which voices the deployed model serves.

Using the gRPC Python client:

python3 python-clients/scripts/tts/talk.py \
  --server 0.0.0.0:50051 \
  --list-voices

Using the HTTP API:

curl -sS http://localhost:9000/v1/audio/list_voices | jq

The output lists voice names grouped by language. Use these names in the --voice parameter. For voice naming details, refer to Voices and Emotional Styles.

Offline Synthesis#

Offline synthesis generates the complete audio and returns it in a single response. The output is saved to output.wav.

Using the gRPC Python client:

python3 python-clients/scripts/tts/talk.py --server 0.0.0.0:50051 \
  --language-code en-US \
  --text "Deploy and run speech synthesis with NVIDIA TTS NIM." \
  --voice Magpie-Multilingual.EN-US.Aria \
  --output output.wav

Using the HTTP API:

curl -sS http://localhost:9000/v1/audio/synthesize --fail-with-body \
  -F language=en-US \
  -F text="Deploy and run speech synthesis with NVIDIA TTS NIM." \
  -F voice=Magpie-Multilingual.EN-US.Aria \
  --output output.wav

Note

gRPC limits message size to 4 MB by default. If the synthesized audio exceeds this limit, use streaming synthesis instead.

Streaming Synthesis#

Streaming synthesis returns audio in chunks as they are generated, providing lower time-to-first-audio.

python3 python-clients/scripts/tts/talk.py --server 0.0.0.0:50051 \
  --language-code en-US \
  --text "Deploy and run speech synthesis with NVIDIA TTS NIM." \
  --voice Magpie-Multilingual.EN-US.Aria \
  --stream \
  --output output.wav

The streaming HTTP API returns raw LPCM audio without a WAV header. Use sox to wrap the raw samples in a WAV container.

curl -sS http://localhost:9000/v1/audio/synthesize_online --fail-with-body \
  -F language=en-US \
  -F text="Deploy and run speech synthesis with NVIDIA TTS NIM." \
  -F voice=Magpie-Multilingual.EN-US.Aria \
  -F sample_rate_hz=22050 \
  --output output.raw
sox -b 16 -e signed -c 1 -r 22050 output.raw output.wav

The WebSocket realtime API provides the lowest latency for interactive applications.

python3 python-clients/scripts/tts/realtime_tts_client.py \
  --server localhost:9000 \
  --language-code en-US \
  --text "Deploy and run speech synthesis with NVIDIA TTS NIM." \
  --voice Magpie-Multilingual.EN-US.Aria \
  --output output.wav

Client Parameters Reference#

talk.py (gRPC)#

--text and --list-voices are mutually exclusive.

| Parameter | Description | Default |
|---|---|---|
| `--server` | gRPC server address and port. | `0.0.0.0:50051` |
| `--text` | Text to synthesize. | |
| `--list-voices` | List available voices, then exit. | |
| `--voice` | Voice name to use. If omitted, the server selects the first available voice. | |
| `--language-code` | Language code (for example, `en-US`). | `en-US` |
| `--output` / `-o` | Output WAV file path. | `output.wav` |
| `--stream` | Enable streaming synthesis. | `false` |
| `--sample-rate-hz` | Output audio sample rate. | `44100` |
| `--encoding` | Output encoding (`LINEAR_PCM` or `OGGOPUS`). | `LINEAR_PCM` |
| `--custom-dictionary` | Path to a custom pronunciation dictionary file. | |
| `--zero_shot_audio_prompt_file` | Path to audio prompt for voice cloning (3–10 s). | |
| `--zero_shot_quality` | Voice cloning quality (1–40). | `20` |
| `--zero_shot_transcript` | Transcript of the audio prompt (Magpie TTS Flow only). | |
| `--play-audio` | Play synthesized audio through speakers. | `false` |
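The zero-shot flags in the table combine with the usual synthesis flags. A voice-cloning sketch, assuming magpie-tts-zeroshot is deployed and prompt.wav is a 3–10 s sample of the target voice (both file names are examples):

```shell
python3 python-clients/scripts/tts/talk.py --server 0.0.0.0:50051 \
  --language-code en-US \
  --text "Cloned-voice synthesis with NVIDIA TTS NIM." \
  --zero_shot_audio_prompt_file prompt.wav \
  --zero_shot_quality 20 \
  --output cloned.wav
```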

realtime_tts_client.py (WebSocket)#

| Parameter | Description | Default |
|---|---|---|
| `--server` | WebSocket server address and port. | `localhost:9000` |
| `--text` | Direct text input to synthesize. | |
| `--input-file` | Path to a text file for batch synthesis. | |
| `--list-voices` | List available voices, then exit. | |
| `--voice` | Voice name to use. | |
| `--language-code` | Language code. | `en-US` |
| `--output` / `-o` | Output WAV file path. | |
| `--num-parallel-requests` | Number of concurrent synthesis connections. | `1` |
| `--zero-shot-audio-prompt-file` | Path to audio prompt for voice cloning. | |
| `--zero-shot-prompt-quality` | Voice cloning quality (1–40). | `20` |
| `--zero-shot-audio-prompt-transcript` | Transcript of the audio prompt. | |
| `--custom-dictionary` | Path to custom pronunciation dictionary. | |
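For throughput testing, `--input-file` and `--num-parallel-requests` can be combined. A sketch, assuming lines.txt holds the text to synthesize (the file name is an example):

```shell
python3 python-clients/scripts/tts/realtime_tts_client.py \
  --server localhost:9000 \
  --language-code en-US \
  --input-file lines.txt \
  --voice Magpie-Multilingual.EN-US.Aria \
  --num-parallel-requests 4 \
  --output output.wav
```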

Helm Deployment#

For Kubernetes deployment, create a custom-values.yaml file:

image:
  repository: nvcr.io/nim/nvidia/magpie-tts-multilingual
  pullPolicy: IfNotPresent
  tag: latest
nim:
  ngcAPISecret: ngc-api
imagePullSecrets:
  - name: ngc-secret
envVars:
  NIM_TAGS_SELECTOR: name=magpie-tts-multilingual

Replace the repository and NIM_TAGS_SELECTOR values for other models:

| Model | Repository | `NIM_TAGS_SELECTOR` |
|---|---|---|
| Magpie TTS Multilingual | `nvcr.io/nim/nvidia/magpie-tts-multilingual` | `name=magpie-tts-multilingual` |
| Magpie TTS Zeroshot | `nvcr.io/nim/nvidia/magpie-tts-zeroshot` | `name=magpie-tts-zeroshot` |
| Magpie TTS Flow | `nvcr.io/nim/nvidia/magpie-tts-flow` | `name=magpie-tts-flow` |

For complete Helm instructions, refer to Deploying with Helm.
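The values file above references two Kubernetes secrets, `ngc-api` and `ngc-secret`. A deployment sketch; the chart reference is a placeholder to take from the Helm deployment guide, and the `NGC_API_KEY` secret key name is an assumption to verify against the chart's documentation:

```shell
# Secret holding the NGC API key, referenced by nim.ngcAPISecret
# (the key name NGC_API_KEY is an assumption; check the chart docs).
kubectl create secret generic ngc-api --from-literal=NGC_API_KEY="$NGC_API_KEY"

# Image pull secret for nvcr.io, referenced by imagePullSecrets.
kubectl create secret docker-registry ngc-secret \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password="$NGC_API_KEY"

# Install with the custom values; <chart-ref> is a placeholder for the
# chart location given in Deploying with Helm.
helm install tts-nim <chart-ref> -f custom-values.yaml
```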

Next Steps#