Deploy and Run the TTS NIM Microservice#

Deploy a TTS model as a NIM container and run speech synthesis through gRPC, HTTP, or WebSocket.

For model details, GPU requirements, and supported voices, refer to the TTS support matrix.

Prerequisites#

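The deployment commands below pass `NGC_API_KEY` into the container and pull images from `nvcr.io`, so both must be set up first. A minimal setup sketch (the key value is a placeholder; generate a key from your NGC account):

```shell
# Export your NGC API key (placeholder value).
export NGC_API_KEY=<your-ngc-api-key>

# Log Docker in to the NVIDIA container registry.
# The username is literally "$oauthtoken"; the API key is the password.
echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin
```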
Deploy the NIM Container#

Each TTS model has its own container image. Set CONTAINER_ID and NIM_TAGS_SELECTOR based on the model you want to deploy.

Magpie TTS Multilingual: 7 languages, streaming + offline, multi-voice with emotional styles.

export CONTAINER_ID=magpie-tts-multilingual
export NIM_TAGS_SELECTOR=name=magpie-tts-multilingual

docker run -it --rm --name=$CONTAINER_ID \
  --runtime=nvidia \
  --gpus '"device=0"' \
  --shm-size=8GB \
  -e NGC_API_KEY \
  -e NIM_HTTP_API_PORT=9000 \
  -e NIM_GRPC_API_PORT=50051 \
  -p 9000:9000 \
  -p 50051:50051 \
  -e NIM_TAGS_SELECTOR \
  nvcr.io/nim/nvidia/$CONTAINER_ID:latest

The default model profile is batch_size=8. To select a different batch size, append it to NIM_TAGS_SELECTOR:

export NIM_TAGS_SELECTOR="name=magpie-tts-multilingual,batch_size=32"

Refer to the support matrix for available batch sizes and their GPU memory requirements.

Magpie TTS Zeroshot: English only, streaming + offline, voice cloning from an audio prompt.

export CONTAINER_ID=magpie-tts-zeroshot
export NIM_TAGS_SELECTOR=name=magpie-tts-zeroshot

docker run -it --rm --name=$CONTAINER_ID \
  --runtime=nvidia \
  --gpus '"device=0"' \
  --shm-size=8GB \
  -e NGC_API_KEY \
  -e NIM_HTTP_API_PORT=9000 \
  -e NIM_GRPC_API_PORT=50051 \
  -p 9000:9000 \
  -p 50051:50051 \
  -e NIM_TAGS_SELECTOR \
  nvcr.io/nim/nvidia/$CONTAINER_ID:latest

Note

Access to Magpie TTS Zeroshot is restricted. Apply for access.

Magpie TTS Flow: English only, offline only, voice cloning with an audio prompt and transcript.

export CONTAINER_ID=magpie-tts-flow
export NIM_TAGS_SELECTOR=name=magpie-tts-flow

docker run -it --rm --name=$CONTAINER_ID \
  --runtime=nvidia \
  --gpus '"device=0"' \
  --shm-size=8GB \
  -e NGC_API_KEY \
  -e NIM_HTTP_API_PORT=9000 \
  -e NIM_GRPC_API_PORT=50051 \
  -p 9000:9000 \
  -p 50051:50051 \
  -e NIM_TAGS_SELECTOR \
  nvcr.io/nim/nvidia/$CONTAINER_ID:latest

Note

Access to Magpie TTS Flow is restricted. Apply for access.

On first startup, the container downloads the model from NGC, which can take up to 30 minutes depending on network speed. When a pre-built TensorRT engine is available for the target GPU, it is downloaded directly; otherwise, the container generates an optimized engine from the RMIR model on the fly, which extends startup time further.

Tip

Mount a local cache directory to avoid repeated downloads. See Model Caching.
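As a sketch of that tip: create a host directory and bind-mount it into the container. The host path is an example, and `/opt/nim/.cache` is the assumed in-container cache location — confirm both in Model Caching.

```shell
# Host directory that persists downloaded models between container runs.
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"

# Add this flag to the docker run commands above
# (/opt/nim/.cache is the assumed in-container cache path):
#   -v "$LOCAL_NIM_CACHE:/opt/nim/.cache"
```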

Verify Readiness#

Wait for the container to finish model setup, then check the health endpoint.

curl -X 'GET' 'http://localhost:9000/v1/health/ready'

Expected response:

{"status":"ready"}
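Because model setup can take tens of minutes on first startup, a polling loop is more convenient than re-running curl by hand. A minimal sketch:

```shell
# Poll the readiness endpoint every 10 seconds, for up to 30 minutes.
for _ in $(seq 1 180); do
  if curl -sf http://localhost:9000/v1/health/ready | grep -q ready; then
    echo "TTS NIM is ready"
    break
  fi
  sleep 10
done
```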

Run Speech Synthesis#

List Available Voices#

Before synthesizing, discover which voices the deployed model serves.

Using the gRPC Python client:

python3 python-clients/scripts/tts/talk.py \
  --server 0.0.0.0:50051 \
  --list-voices

Using the HTTP API:

curl -sS http://localhost:9000/v1/audio/list_voices | jq

The output lists voice names grouped by language. Use these names in the --voice parameter. For voice naming details, refer to Voices and Emotional Styles.

Offline Synthesis#

Offline synthesis generates the complete audio and returns it in a single response. The output is saved to output.wav.

Using the gRPC Python client:

python3 python-clients/scripts/tts/talk.py --server 0.0.0.0:50051 \
  --language-code en-US \
  --text "Deploy and run speech synthesis with NVIDIA TTS NIM." \
  --voice Magpie-Multilingual.EN-US.Aria \
  --output output.wav

Using the HTTP API:

curl -sS http://localhost:9000/v1/audio/synthesize --fail-with-body \
  -F language=en-US \
  -F text="Deploy and run speech synthesis with NVIDIA TTS NIM." \
  -F voice=Magpie-Multilingual.EN-US.Aria \
  --output output.wav

Note

gRPC limits message size to 4 MB by default. If the synthesized audio exceeds this limit, use streaming synthesis instead.

Streaming Synthesis#

Streaming synthesis returns audio in chunks as they are generated, providing lower time-to-first-audio.

python3 python-clients/scripts/tts/talk.py --server 0.0.0.0:50051 \
  --language-code en-US \
  --text "Deploy and run speech synthesis with NVIDIA TTS NIM." \
  --voice Magpie-Multilingual.EN-US.Aria \
  --stream \
  --output output.wav

The streaming HTTP API returns raw LPCM audio without a WAV header. Use sox to wrap the raw samples in a WAV container.

curl -sS http://localhost:9000/v1/audio/synthesize_online --fail-with-body \
  -F language=en-US \
  -F text="Deploy and run speech synthesis with NVIDIA TTS NIM." \
  -F voice=Magpie-Multilingual.EN-US.Aria \
  -F sample_rate_hz=22050 \
  --output output.raw
sox -b 16 -e signed -c 1 -r 22050 output.raw output.wav

The WebSocket realtime API provides the lowest latency for interactive applications.

python3 python-clients/scripts/tts/realtime_tts_client.py \
  --server localhost:9000 \
  --language-code en-US \
  --text "Deploy and run speech synthesis with NVIDIA TTS NIM." \
  --voice Magpie-Multilingual.EN-US.Aria \
  --output output.wav

Client Parameters Reference#

talk.py (gRPC)#

--text and --list-voices are mutually exclusive.

| Parameter | Description | Default |
|---|---|---|
| `--server` | gRPC server address and port. | `0.0.0.0:50051` |
| `--text` | Text to synthesize. | |
| `--list-voices` | List available voices, then exit. | |
| `--voice` | Voice name to use. If omitted, the server selects the first available voice. | |
| `--language-code` | Language code (for example, `en-US`). | `en-US` |
| `--output` / `-o` | Output WAV file path. | `output.wav` |
| `--stream` | Enable streaming synthesis. | `false` |
| `--sample-rate-hz` | Output audio sample rate. | `44100` |
| `--encoding` | Output encoding (`LINEAR_PCM` or `OGGOPUS`). | `LINEAR_PCM` |
| `--custom-dictionary` | Path to a custom pronunciation dictionary file. | |
| `--zero_shot_audio_prompt_file` | Path to audio prompt for voice cloning (3–10 s). | |
| `--zero_shot_quality` | Voice cloning quality (1–40). | `20` |
| `--zero_shot_transcript` | Transcript of the audio prompt (Magpie TTS Flow only). | |
| `--play-audio` | Play synthesized audio through speakers. | `false` |
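The zero-shot flags in the table combine with the usual synthesis flags. A voice-cloning sketch, assuming magpie-tts-zeroshot is deployed and prompt.wav is a 3–10 s sample of the target voice (both file names are examples):

```shell
python3 python-clients/scripts/tts/talk.py --server 0.0.0.0:50051 \
  --language-code en-US \
  --text "Cloned-voice synthesis with NVIDIA TTS NIM." \
  --zero_shot_audio_prompt_file prompt.wav \
  --zero_shot_quality 20 \
  --output cloned.wav
```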

realtime_tts_client.py (WebSocket)#

| Parameter | Description | Default |
|---|---|---|
| `--server` | WebSocket server address and port. | `localhost:9000` |
| `--text` | Direct text input to synthesize. | |
| `--input-file` | Path to a text file for batch synthesis. | |
| `--list-voices` | List available voices, then exit. | |
| `--voice` | Voice name to use. | |
| `--language-code` | Language code. | `en-US` |
| `--output` / `-o` | Output WAV file path. | |
| `--num-parallel-requests` | Number of concurrent synthesis connections. | `1` |
| `--zero-shot-audio-prompt-file` | Path to audio prompt for voice cloning. | |
| `--zero-shot-prompt-quality` | Voice cloning quality (1–40). | `20` |
| `--zero-shot-audio-prompt-transcript` | Transcript of the audio prompt. | |
| `--custom-dictionary` | Path to custom pronunciation dictionary. | |
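For throughput testing, `--input-file` and `--num-parallel-requests` can be combined. A sketch, assuming lines.txt holds the text to synthesize (the file name is an example):

```shell
python3 python-clients/scripts/tts/realtime_tts_client.py \
  --server localhost:9000 \
  --language-code en-US \
  --input-file lines.txt \
  --voice Magpie-Multilingual.EN-US.Aria \
  --num-parallel-requests 4 \
  --output output.wav
```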

Helm Deployment#

For Kubernetes deployment, create a custom-values.yaml file:

image:
  repository: nvcr.io/nim/nvidia/magpie-tts-multilingual
  pullPolicy: IfNotPresent
  tag: latest
nim:
  ngcAPISecret: ngc-api
imagePullSecrets:
  - name: ngc-secret
envVars:
  NIM_TAGS_SELECTOR: name=magpie-tts-multilingual

Replace the repository and NIM_TAGS_SELECTOR values for other models:

| Model | Repository | `NIM_TAGS_SELECTOR` |
|---|---|---|
| Magpie TTS Multilingual | `nvcr.io/nim/nvidia/magpie-tts-multilingual` | `name=magpie-tts-multilingual` |
| Magpie TTS Zeroshot | `nvcr.io/nim/nvidia/magpie-tts-zeroshot` | `name=magpie-tts-zeroshot` |
| Magpie TTS Flow | `nvcr.io/nim/nvidia/magpie-tts-flow` | `name=magpie-tts-flow` |

For complete Helm instructions, refer to Deploying with Helm.
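The values file above references two Kubernetes secrets, `ngc-api` and `ngc-secret`. A deployment sketch; the chart reference is a placeholder to take from the Helm deployment guide, and the `NGC_API_KEY` secret key name is an assumption to verify against the chart's documentation:

```shell
# Secret holding the NGC API key, referenced by nim.ngcAPISecret
# (the key name NGC_API_KEY is an assumption; check the chart docs).
kubectl create secret generic ngc-api --from-literal=NGC_API_KEY="$NGC_API_KEY"

# Image pull secret for nvcr.io, referenced by imagePullSecrets.
kubectl create secret docker-registry ngc-secret \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password="$NGC_API_KEY"

# Install with the custom values; <chart-ref> is a placeholder for the
# chart location given in Deploying with Helm.
helm install tts-nim <chart-ref> -f custom-values.yaml
```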

Next Steps#