Cloning a Voice with Zero-Shot TTS

The Magpie TTS Zeroshot and Magpie TTS Flow models synthesize speech that matches the voice characteristics of a short reference audio recording. This guide covers how to prepare an audio prompt, tune quality settings, and synthesize cloned speech through gRPC, HTTP, and WebSocket.

Prerequisites

A deployed TTS NIM running either the Magpie TTS Zeroshot or Magpie TTS Flow model. Refer to the TTS tutorial for deployment steps.
An installation of the NVIDIA Riva Python client.

Note

Both models require access approval. Request access through this form.

Choose a Model

Capability	Magpie TTS Zeroshot	Magpie TTS Flow
Inference modes	Streaming + Offline	Offline only
Audio prompt	Required	Required
Transcript of prompt	Not used	Required
Quality parameter	1–40 (default: 20)	Not applicable
Built-in voices	11 (Male/Female with emotions)	16 (Male/Female with emotions)
GPU memory	4.8 GB	5.1 GB

Use Zeroshot when you need streaming output or do not have a transcript of the audio prompt. Use Flow when you have a transcript and want higher-fidelity voice matching in offline mode.

Prepare the Audio Prompt

The audio prompt is the reference recording whose voice characteristics the model will replicate.

Requirements:

Format: 16-bit mono WAV
Sample rate: 22.05 kHz or higher
Duration: 3–10 seconds (aim for approximately 5 seconds)
Content: Clear speech with minimal background noise
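
The format, sample rate, and duration requirements above can be checked before sending a prompt to the server. The following is a minimal sketch that uses only the Python standard library; the function name check_prompt is hypothetical and not part of the Riva client. The wave module reads PCM WAV files only, so other encodings raise wave.Error instead of returning a list.

```python
import wave

MIN_RATE_HZ = 22050   # "22.05 kHz or higher"
MIN_SECONDS = 3.0     # lower bound of the recommended duration
MAX_SECONDS = 10.0    # upper bound of the recommended duration


def check_prompt(path):
    """Return a list of problems found in the WAV file at `path` (empty if none)."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_width = wav.getsampwidth()  # bytes per sample
        rate = wav.getframerate()
        frames = wav.getnframes()

    problems = []
    if channels != 1:
        problems.append(f"expected mono, found {channels} channels")
    if sample_width != 2:
        problems.append(f"expected 16-bit samples, found {8 * sample_width}-bit")
    if rate < MIN_RATE_HZ:
        problems.append(f"sample rate {rate} Hz is below {MIN_RATE_HZ} Hz")
    duration = frames / rate
    if not MIN_SECONDS <= duration <= MAX_SECONDS:
        problems.append(f"duration {duration:.1f} s is outside 3-10 s")
    return problems
```

Calling check_prompt("prompt.wav") returns an empty list when the file meets all three measurable requirements. It cannot judge speech clarity or background noise.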

Tips for higher quality:

Trim silence from the beginning and end so speech fills most of the prompt.
Record in a quiet, echo-free environment.
Use consistent volume throughout the recording.
Avoid music, sound effects, or overlapping speakers.
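
The first tip, trimming silence, can be approximated with a simple amplitude threshold. The sketch below is one possible approach and not a tool shipped with the NIM: the function name trim_silence and the default threshold of 500 (on the 16-bit scale of -32768 to 32767) are assumptions that you will likely need to adjust for your recordings.

```python
import array
import sys
import wave


def trim_silence(src_path, dst_path, threshold=500):
    """Copy a 16-bit mono WAV, dropping leading and trailing samples
    whose absolute amplitude is below `threshold`.
    Returns the duration of the trimmed audio in seconds."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        if params.nchannels != 1 or params.sampwidth != 2:
            raise ValueError("expected a 16-bit mono WAV file")
        samples = array.array("h", src.readframes(params.nframes))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian

    # Indices of every sample loud enough to count as speech.
    loud = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not loud:
        raise ValueError("no samples at or above the threshold")
    trimmed = samples[loud[0]:loud[-1] + 1]
    if sys.byteorder == "big":
        trimmed.byteswap()

    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(params.framerate)
        dst.writeframes(trimmed.tobytes())
    return len(trimmed) / params.framerate
```

A fixed threshold is crude: it keeps any click or noise burst that exceeds the limit, so listen to the trimmed file before using it as a prompt.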

Synthesize with Magpie TTS Zeroshot

Offline (gRPC)

python3 python-clients/scripts/tts/talk.py --server 0.0.0.0:50051 \
    --language-code en-US \
    --text "This speech uses a cloned voice from my audio prompt." \
    --zero_shot_audio_prompt_file prompt.wav \
    --output output.wav

Offline (HTTP)

curl -sS http://localhost:9000/v1/audio/synthesize --fail-with-body \
    -F language=en-US \
    -F text="This speech uses a cloned voice from my audio prompt." \
    -F audio_prompt=@prompt.wav \
    --output output.wav

Note

The @ prefix on the file path is required by curl for file uploads.
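
To show what the curl -F fields and the @ upload amount to, the following sketch builds the same multipart/form-data request with the Python standard library. The endpoint and the field names language, text, and audio_prompt come from the curl command above; the function name build_synthesize_request and the application/octet-stream content type for the file part are assumptions made for illustration.

```python
import urllib.request
import uuid


def build_synthesize_request(text, prompt_bytes, language="en-US",
                             url="http://localhost:9000/v1/audio/synthesize"):
    """Build a multipart/form-data POST that mirrors the curl -F fields."""
    boundary = uuid.uuid4().hex
    parts = []
    # Plain text fields: language and text.
    for name, value in (("language", language), ("text", text)):
        field = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n"
        )
        parts.append(field.encode("utf-8"))
    # File field: the raw bytes of the audio prompt (what "@prompt.wav" uploads).
    file_header = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="audio_prompt"; filename="prompt.wav"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    )
    parts.append(file_header.encode("utf-8") + prompt_bytes + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return urllib.request.Request(
        url,
        data=b"".join(parts),
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


# Usage against a running server (not executed here):
# with open("prompt.wav", "rb") as f:
#     request = build_synthesize_request("Cloned voice test.", f.read())
# with urllib.request.urlopen(request) as response, open("output.wav", "wb") as out:
#     out.write(response.read())
```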

Streaming (gRPC)

python3 python-clients/scripts/tts/talk.py --server 0.0.0.0:50051 \
    --language-code en-US \
    --text "This speech uses a cloned voice from my audio prompt." \
    --zero_shot_audio_prompt_file prompt.wav \
    --stream \
    --output output.wav

Streaming (WebSocket)

The WebSocket client uses hyphens instead of underscores for argument names.

python3 python-clients/scripts/tts/realtime_tts_client.py \
    --server localhost:9000 \
    --language-code en-US \
    --text "This speech uses a cloned voice from my audio prompt." \
    --zero-shot-audio-prompt-file prompt.wav \
    --output output.wav

Synthesize with Magpie TTS Flow

Magpie TTS Flow requires a transcript of the audio prompt in addition to the audio file. This model supports offline inference only.

Offline (gRPC)

python3 python-clients/scripts/tts/talk.py --server 0.0.0.0:50051 \
    --language-code en-US \
    --text "This speech uses a cloned voice from my audio prompt." \
    --zero_shot_audio_prompt_file prompt.wav \
    --zero_shot_transcript "The exact words spoken in the audio prompt file." \
    --output output.wav

Offline (HTTP)

curl -sS http://localhost:9000/v1/audio/synthesize --fail-with-body \
    -F language=en-US \
    -F text="This speech uses a cloned voice from my audio prompt." \
    -F audio_prompt=@prompt.wav \
    -F audio_prompt_transcript="The exact words spoken in the audio prompt file." \
    --output output.wav

Important

The transcript value (--zero_shot_transcript for gRPC, audio_prompt_transcript for HTTP) must match the spoken content of the audio prompt exactly. Mismatched transcripts degrade voice cloning quality.

Tune Quality Settings

The Magpie TTS Zeroshot model accepts a --zero_shot_quality parameter (range: 1–40, default: 20) that controls the trade-off between synthesis speed and voice similarity.

Lower values (1–10): Faster synthesis, lower voice fidelity.
Default (20): Balanced quality and speed.
Higher values (21–40): Slower synthesis, closer voice match.

python3 python-clients/scripts/tts/talk.py --server 0.0.0.0:50051 \
    --language-code en-US \
    --text "Higher quality voice cloning." \
    --zero_shot_audio_prompt_file prompt.wav \
    --zero_shot_quality 30 \
    --output output.wav

For the WebSocket client, the equivalent flag is --zero-shot-prompt-quality.
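
Because the best setting depends on how much latency you can accept, it can help to synthesize the same text at several quality values and compare the results by ear. The sketch below only builds the talk.py argument lists, using the flags shown above; the function name quality_sweep_commands, the chosen quality values, and the output file naming are assumptions.

```python
def quality_sweep_commands(prompt_file, text, qualities=(10, 20, 30, 40),
                           server="0.0.0.0:50051"):
    """Return one talk.py argument list per quality value,
    each writing to its own output file."""
    commands = []
    for quality in qualities:
        if not 1 <= quality <= 40:
            raise ValueError(f"quality {quality} is outside the documented 1-40 range")
        commands.append([
            "python3", "python-clients/scripts/tts/talk.py",
            "--server", server,
            "--language-code", "en-US",
            "--text", text,
            "--zero_shot_audio_prompt_file", prompt_file,
            "--zero_shot_quality", str(quality),
            "--output", f"output_q{quality}.wav",
        ])
    return commands


# Usage against a running server (not executed here):
# import subprocess
# for command in quality_sweep_commands("prompt.wav", "Higher quality voice cloning."):
#     subprocess.run(command, check=True)
```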

Use Built-In Voices Instead

Both models also include built-in voices that do not require an audio prompt. Specify a voice name with the --voice flag instead of an audio prompt file.

Magpie TTS Zeroshot

python3 python-clients/scripts/tts/talk.py --server 0.0.0.0:50051 \
    --language-code en-US \
    --text "Using a built-in voice." \
    --voice Magpie-ZeroShot.Female-1 \
    --output output.wav

Available voices: Magpie-ZeroShot.Female-1 (default), Female-Neutral, Female-Angry, Female-Fearful, Female-Calm, Female-Happy, Male-1, Male-Calm, Male-Neutral, Male-Angry, Male-Fearful.

Magpie TTS Flow

python3 python-clients/scripts/tts/talk.py --server 0.0.0.0:50051 \
    --language-code en-US \
    --text "Using a built-in voice." \
    --voice English-US-Magpie-Flow.Female-1 \
    --output output.wav

Available voices: English-US-Magpie-Flow.Female-1 (default), Female.Calm, Female.Fearful, Female.Happy, Female.Neutral, Female.Angry, Female.Disgusted, Female.Sad, Male-1, Male.Calm, Male.Fearful, Male.Happy, Male.Neutral, Male.Angry, Male.Disgusted, Male.Sad.

For the complete voice list, refer to the TTS support matrix.

Key Differences Between Scripts

The two Python client scripts use different argument names and connect to different server ports:

Parameter	`talk.py` (gRPC)	`realtime_tts_client.py` (WebSocket)
Audio prompt	`--zero_shot_audio_prompt_file`	`--zero-shot-audio-prompt-file`
Transcript	`--zero_shot_transcript`	`--zero-shot-audio-prompt-transcript`
Quality	`--zero_shot_quality`	`--zero-shot-prompt-quality`
Server	`--server 0.0.0.0:50051`	`--server localhost:9000`
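
If you drive both scripts from your own tooling, a lookup table avoids mixing up the two spellings. The following sketch encodes the flag names from the table above; the dictionary FLAGS, the function zero_shot_args, and the option keys are hypothetical names chosen for this example.

```python
# Flag spellings taken from the table above, keyed by script name.
FLAGS = {
    "talk.py": {
        "audio_prompt": "--zero_shot_audio_prompt_file",
        "transcript": "--zero_shot_transcript",
        "quality": "--zero_shot_quality",
    },
    "realtime_tts_client.py": {
        "audio_prompt": "--zero-shot-audio-prompt-file",
        "transcript": "--zero-shot-audio-prompt-transcript",
        "quality": "--zero-shot-prompt-quality",
    },
}


def zero_shot_args(script, **options):
    """Translate keyword options into the flag spelling used by `script`.
    Options set to None are skipped."""
    flags = FLAGS[script]
    args = []
    for name, value in options.items():
        if value is not None:
            args += [flags[name], str(value)]
    return args
```

A table is used instead of replacing underscores with hyphens because the transcript and quality flags differ by more than punctuation between the two scripts.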

Limitations

Magpie TTS Flow does not support streaming synthesis. Only offline mode is available.
Streaming with talk.py (synthesize_online) does not support the --zero_shot_transcript parameter. Use offline mode for Magpie TTS Flow.
Audio prompts shorter than 3 seconds or longer than 10 seconds can produce lower-quality results.
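
These limitations can be checked in client code before a voice cloning request is sent. The sketch below is illustrative only: the function check_request, its parameters, and the short model labels "zeroshot" and "flow" are assumptions, not identifiers used by the NIM or the Riva client.

```python
def check_request(model, stream=False, transcript=None, prompt_seconds=None):
    """Check a planned voice cloning request against the documented limitations.
    Returns (errors, warnings) as two lists of strings."""
    if model not in ("zeroshot", "flow"):
        raise ValueError("model must be 'zeroshot' or 'flow'")
    errors, warnings = [], []
    if model == "flow":
        if stream:
            errors.append("Magpie TTS Flow supports offline synthesis only")
        if not transcript:
            errors.append("Magpie TTS Flow requires a transcript of the audio prompt")
    elif transcript:
        warnings.append("Magpie TTS Zeroshot does not use the prompt transcript")
    if prompt_seconds is not None and not 3 <= prompt_seconds <= 10:
        warnings.append("audio prompts outside 3-10 seconds can reduce quality")
    return errors, warnings
```

For example, check_request("flow", stream=True) reports two errors (streaming and the missing transcript), while check_request("zeroshot", stream=True, prompt_seconds=5) reports none.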