Batch Synthesis from Text Files#

The WebSocket realtime client (realtime_tts_client.py) supports synthesizing speech from text files and processing multiple lines in parallel. This is useful for converting large text datasets, generating audio for multiple prompts, or benchmarking throughput.

Prerequisites#

A deployed TTS NIM microservice. Refer to the TTS tutorial for deployment steps.
An installed NVIDIA Riva Python client.

Prepare a Text File#

The realtime client accepts two input file formats.

Plain Text Format#

Each line becomes a separate synthesis request. Empty lines are skipped.

Welcome to NVIDIA speech synthesis.
This is the second line of text to synthesize.
Each line produces a separate audio file.

Pipe-Separated Format#

Each line contains an identifier and text separated by a pipe (|). The client extracts the text portion after the pipe.

audio_001|Welcome to NVIDIA speech synthesis.
audio_002|This is the second line of text to synthesize.
audio_003|Each line produces a separate audio file.

This format is common in speech dataset pipelines where each line corresponds to a labeled audio sample.
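The two formats can be handled by one small parsing routine. The following is a minimal sketch of the documented behavior, not the client's actual code: parse_input_lines is a hypothetical helper, and splitting on the first pipe (rather than the last) is an assumption.

```python
def parse_input_lines(lines):
    """Return the text to synthesize for each non-empty line.

    Hypothetical helper that mirrors the documented behavior;
    it is not taken from realtime_tts_client.py.
    """
    texts = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # empty lines are skipped
        if "|" in line:
            # Pipe-separated format: keep only the text after the pipe.
            # Assumption: the split happens at the first pipe.
            _, _, line = line.partition("|")
            line = line.strip()
        if line:
            texts.append(line)
    return texts


sample = [
    "audio_001|Welcome to NVIDIA speech synthesis.\n",
    "\n",
    "Each line produces a separate audio file.\n",
]
print(parse_input_lines(sample))
```

Running a check like this on a dataset before synthesis is a cheap way to confirm how many requests a file will generate.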

Synthesize from a Text File#

Pass the file path with --input-file. Each non-empty line is synthesized independently.

python3 python-clients/scripts/tts/realtime_tts_client.py \
    --server localhost:9000 \
    --language-code en-US \
    --voice Magpie-Multilingual.EN-US.Aria \
    --input-file input.txt \
    --output output.wav

With the default of one parallel request (--num-parallel-requests 1), the client processes lines sequentially. Each line produces a separate WAV file whose name carries a zero-based numeric index:

output0.wav – first line
output1.wav – second line
output2.wav – third line

The index is inserted immediately before the file extension of the --output filename.
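To predict which files a run will create, for example when collecting results in a later pipeline step, the naming rule can be reproduced as follows. This is a sketch that assumes the rule is exactly "base name, index, extension"; indexed_output_path is a hypothetical helper, not part of the client.

```python
from pathlib import Path


def indexed_output_path(output, index):
    """Place a zero-based index between the base name and the extension."""
    path = Path(output)
    return str(path.with_name(f"{path.stem}{index}{path.suffix}"))


# Expected names for a three-line input file and --output output.wav
for i in range(3):
    print(indexed_output_path("output.wav", i))
```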

Process Lines in Parallel#

Use --num-parallel-requests to synthesize multiple lines concurrently. The client opens multiple WebSocket connections and limits concurrency with a semaphore.

python3 python-clients/scripts/tts/realtime_tts_client.py \
    --server localhost:9000 \
    --language-code en-US \
    --voice Magpie-Multilingual.EN-US.Aria \
    --input-file input.txt \
    --num-parallel-requests 4 \
    --output output.wav

This processes up to 4 lines simultaneously, reducing total synthesis time for large files.

Note

Each parallel request opens a separate WebSocket connection. Set --num-parallel-requests based on your server capacity and GPU memory. Higher values increase throughput but also increase GPU memory and CPU usage.
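The semaphore pattern described in this section can be illustrated with asyncio. The sketch below is not the client's implementation: fake_synthesize is a hypothetical stand-in for one WebSocket request, and synthesize_all is an illustrative name. It only shows how a semaphore caps the number of requests in flight.

```python
import asyncio


async def fake_synthesize(text):
    """Hypothetical stand-in for one WebSocket synthesis request."""
    await asyncio.sleep(0.01)  # simulate network and inference latency
    return f"audio for: {text}"


async def synthesize_all(texts, num_parallel_requests=4):
    # The semaphore allows at most num_parallel_requests workers
    # to be inside the "async with" block at the same time.
    semaphore = asyncio.Semaphore(num_parallel_requests)

    async def worker(text):
        async with semaphore:
            return await fake_synthesize(text)

    # gather returns results in the same order as the input texts.
    return await asyncio.gather(*(worker(t) for t in texts))


results = asyncio.run(
    synthesize_all(["one", "two", "three"], num_parallel_requests=2)
)
print(results)
```

With this pattern, every line is scheduled up front, but only the configured number of requests run concurrently; the rest wait for a slot.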

Combine with Voice Cloning#

Batch synthesis works with zero-shot voice cloning. Add the audio prompt flags alongside --input-file.

python3 python-clients/scripts/tts/realtime_tts_client.py \
    --server localhost:9000 \
    --language-code en-US \
    --input-file input.txt \
    --zero-shot-audio-prompt-file prompt.wav \
    --num-parallel-requests 2 \
    --output output.wav

Combine with a Custom Dictionary#

Apply custom pronunciations to all lines in the batch by passing a dictionary file.

python3 python-clients/scripts/tts/realtime_tts_client.py \
    --server localhost:9000 \
    --language-code en-US \
    --voice Magpie-Multilingual.EN-US.Aria \
    --input-file input.txt \
    --custom-dictionary custom_dict.txt \
    --output output.wav

Refer to Customizing TTS Models for dictionary format details.

Play Audio in Real Time#

Use --play-audio to play each synthesized result through the system speakers as it completes. This works with or without --output.

python3 python-clients/scripts/tts/realtime_tts_client.py \
    --server localhost:9000 \
    --language-code en-US \
    --text "Play this audio through the speakers." \
    --play-audio

Note

Real-time playback requires PyAudio. Install it with pip install pyaudio.
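Before adding --play-audio to a script, it can help to confirm that PyAudio is importable in the current environment. The following sketch uses only the Python standard library; pyaudio_available is a hypothetical helper name.

```python
import importlib.util


def pyaudio_available():
    """Return True if the pyaudio module can be found, without importing it."""
    return importlib.util.find_spec("pyaudio") is not None


if pyaudio_available():
    print("PyAudio found: --play-audio can be used.")
else:
    print("PyAudio not found: install it with 'pip install pyaudio'.")
```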

Reference: Realtime Client Flags#

Flag	Default	Description
`--input-file`	–	Path to a text file (plain or pipe-separated)
`--num-parallel-requests`	`1`	Number of concurrent WebSocket connections
`--output` / `-o`	–	Output WAV file path (indexed for multiple lines)
`--play-audio`	`false`	Play audio through system speakers
`--encoding`	`LINEAR_PCM`	Output encoding (`LINEAR_PCM` or `OGGOPUS`)
`--sample-rate-hz`	`44100`	Output audio sample rate
`--debug`	`false`	Enable debug logging
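When batch runs are launched from another program, the flags in the table above can be assembled into an argument list. This is a sketch under stated assumptions: build_command is a hypothetical helper, the script path and default values are copied from the examples on this page, and only flags shown on this page are used.

```python
import shlex


def build_command(
    input_file,
    output,
    num_parallel_requests=1,
    server="localhost:9000",
    language_code="en-US",
    voice=None,
    script="python-clients/scripts/tts/realtime_tts_client.py",
):
    """Build an argv list for the realtime client (hypothetical helper)."""
    cmd = [
        "python3", script,
        "--server", server,
        "--language-code", language_code,
        "--input-file", input_file,
        "--num-parallel-requests", str(num_parallel_requests),
        "--output", output,
    ]
    if voice is not None:
        cmd += ["--voice", voice]
    return cmd


cmd = build_command(
    "input.txt",
    "output.wav",
    num_parallel_requests=4,
    voice="Magpie-Multilingual.EN-US.Aria",
)
# Print a shell-quoted form for logging; pass the list itself to
# subprocess.run(cmd, check=True) to launch the client.
print(shlex.join(cmd))
```

Passing the list to subprocess.run rather than a single string avoids shell quoting problems with file names that contain spaces.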