Cloning a Voice with Zero-Shot TTS#
The Magpie TTS Zeroshot and Magpie TTS Flow models synthesize speech that matches the voice characteristics of a short reference audio recording. This guide covers how to prepare an audio prompt, tune quality settings, and synthesize cloned speech through gRPC, HTTP, and WebSocket.
Prerequisites#
A deployed TTS NIM running either the Magpie TTS Zeroshot or Magpie TTS Flow model. Refer to the TTS tutorial for deployment steps.
An installed copy of the NVIDIA Riva Python client.
Note
Both models require access approval. Request access through this form.
Choose a Model#
| Capability | Magpie TTS Zeroshot | Magpie TTS Flow |
|---|---|---|
| Inference modes | Streaming + Offline | Offline only |
| Audio prompt | Required | Required |
| Transcript of prompt | Not used | Required |
| Quality parameter | 1–40 (default: 20) | Not applicable |
| Built-in voices | 11 (Male/Female with emotions) | 16 (Male/Female with emotions) |
| GPU memory | 4.8 GB | 5.1 GB |
Use Zeroshot when you need streaming output or do not have a transcript of the audio prompt. Use Flow when you have a transcript and want higher-fidelity voice matching in offline mode.
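The selection guidance above can be written as a small decision helper. This is only a sketch of the rule as stated in this guide; the function name `choose_model` and its two boolean inputs are hypothetical and not part of any Riva client.

```python
def choose_model(needs_streaming: bool, has_transcript: bool) -> str:
    """Return the model name suggested by the guidance in this section."""
    # Flow is offline-only and requires a transcript of the audio prompt.
    if has_transcript and not needs_streaming:
        return "Magpie TTS Flow"
    # Zeroshot covers streaming output and prompts without a transcript.
    return "Magpie TTS Zeroshot"


print(choose_model(needs_streaming=True, has_transcript=True))
print(choose_model(needs_streaming=False, has_transcript=True))
```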
Prepare the Audio Prompt#
The audio prompt is the reference recording whose voice characteristics the model will replicate.
Requirements:
Format: 16-bit mono WAV
Sample rate: 22.05 kHz or higher
Duration: 3–10 seconds (aim for approximately 5 seconds)
Content: Clear speech with minimal background noise
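The format, sample rate, and duration requirements can be checked before sending a prompt to the server. The following is a minimal sketch that uses only Python's standard `wave` module; the function name `check_prompt` is hypothetical, and the check cannot judge speech clarity or background noise.

```python
import wave


def check_prompt(path: str) -> list[str]:
    """Return a list of problems found in the WAV file at `path` (empty if none)."""
    problems = []
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_width = wav.getsampwidth()  # bytes per sample
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
    if channels != 1:
        problems.append(f"expected mono, found {channels} channels")
    if sample_width != 2:
        problems.append(f"expected 16-bit samples, found {sample_width * 8}-bit")
    if rate < 22050:
        problems.append(f"sample rate {rate} Hz is below 22050 Hz")
    if not 3.0 <= duration <= 10.0:
        problems.append(f"duration {duration:.2f} s is outside the 3-10 s range")
    return problems
```

Calling `check_prompt("prompt.wav")` returns an empty list when the file meets all four checks. Files that are not PCM WAV cause `wave.open` to raise `wave.Error`.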
Tips for higher quality:
Trim silence from the beginning and end so speech fills most of the prompt.
Record in a quiet, echo-free environment.
Use consistent volume throughout the recording.
Avoid music, sound effects, or overlapping speakers.
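The first tip, trimming silence, can be approximated with the standard library alone. This is a naive sketch under stated assumptions: the input is a 16-bit mono WAV, "silence" means samples whose absolute amplitude is at or below a hypothetical `threshold` of 500, and the function name `trim_silence` is invented for illustration. A dedicated audio editor or library will usually do a better job.

```python
import array
import sys
import wave


def trim_silence(src: str, dst: str, threshold: int = 500) -> float:
    """Write `src` to `dst` without leading/trailing quiet samples.

    Returns the duration of the trimmed audio in seconds.
    """
    with wave.open(src, "rb") as wav:
        params = wav.getparams()
        if params.nchannels != 1 or params.sampwidth != 2:
            raise ValueError("expected a 16-bit mono WAV file")
        samples = array.array("h", wav.readframes(params.nframes))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian

    # Indices of every sample louder than the threshold.
    loud = [i for i, s in enumerate(samples) if abs(s) > threshold]
    if not loud:
        raise ValueError("no sample exceeds the threshold")
    trimmed = samples[loud[0]:loud[-1] + 1]

    if sys.byteorder == "big":
        trimmed.byteswap()  # convert back to little-endian before writing
    with wave.open(dst, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)
        out.setframerate(params.framerate)
        out.writeframes(trimmed.tobytes())
    return len(trimmed) / params.framerate
```

The threshold is a starting point only; a noisier recording needs a higher value, and cutting at exact sample boundaries can leave an audible click.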
Synthesize with Magpie TTS Zeroshot#
Offline (gRPC)#
python3 python-clients/scripts/tts/talk.py --server 0.0.0.0:50051 \
--language-code en-US \
--text "This speech uses a cloned voice from my audio prompt." \
--zero_shot_audio_prompt_file prompt.wav \
--output output.wav
Offline (HTTP)#
curl -sS http://localhost:9000/v1/audio/synthesize --fail-with-body \
-F language=en-US \
-F text="This speech uses a cloned voice from my audio prompt." \
-F audio_prompt=@prompt.wav \
--output output.wav
Note
The @ prefix on the file path is required by curl for file uploads.
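The same HTTP request can be sent from Python. The sketch below uses only the standard library and mirrors the form fields of the curl command above (`language`, `text`, `audio_prompt`). The helper names `build_multipart` and `synthesize` are hypothetical, and the `application/octet-stream` content type for the file part is an assumption rather than a documented server requirement.

```python
import os
import urllib.request
import uuid


def build_multipart(fields: dict[str, str],
                    files: dict[str, tuple[str, bytes]]) -> tuple[bytes, str]:
    """Encode fields and files as multipart/form-data; return (body, content_type)."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        header = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
        )
        parts.append(header.encode("utf-8") + value.encode("utf-8") + b"\r\n")
    for name, (filename, data) in files.items():
        header = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"; filename="{filename}"\r\n'
            "Content-Type: application/octet-stream\r\n\r\n"
        )
        parts.append(header.encode("utf-8") + data + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def synthesize(url: str, text: str, prompt_path: str, out_path: str) -> None:
    """POST a synthesis request and write the response body to `out_path`."""
    with open(prompt_path, "rb") as f:
        prompt = f.read()
    body, content_type = build_multipart(
        {"language": "en-US", "text": text},
        {"audio_prompt": (os.path.basename(prompt_path), prompt)},
    )
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}
    )
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as out:
        out.write(response.read())
```

With a server running as in the curl example, a call would look like `synthesize("http://localhost:9000/v1/audio/synthesize", "Hello.", "prompt.wav", "output.wav")`.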
Streaming (gRPC)#
python3 python-clients/scripts/tts/talk.py --server 0.0.0.0:50051 \
--language-code en-US \
--text "This speech uses a cloned voice from my audio prompt." \
--zero_shot_audio_prompt_file prompt.wav \
--stream \
--output output.wav
Streaming (WebSocket)#
The WebSocket client uses hyphens instead of underscores for argument names.
python3 python-clients/scripts/tts/realtime_tts_client.py \
--server localhost:9000 \
--language-code en-US \
--text "This speech uses a cloned voice from my audio prompt." \
--zero-shot-audio-prompt-file prompt.wav \
--output output.wav
Synthesize with Magpie TTS Flow#
Magpie TTS Flow requires a transcript of the audio prompt in addition to the audio file. This model supports offline inference only.
Offline (gRPC)#
python3 python-clients/scripts/tts/talk.py --server 0.0.0.0:50051 \
--language-code en-US \
--text "This speech uses a cloned voice from my audio prompt." \
--zero_shot_audio_prompt_file prompt.wav \
--zero_shot_transcript "The exact words spoken in the audio prompt file." \
--output output.wav
Offline (HTTP)#
curl -sS http://localhost:9000/v1/audio/synthesize --fail-with-body \
-F language=en-US \
-F text="This speech uses a cloned voice from my audio prompt." \
-F audio_prompt=@prompt.wav \
-F audio_prompt_transcript="The exact words spoken in the audio prompt file." \
--output output.wav
Important
The transcript (--zero_shot_transcript for gRPC, audio_prompt_transcript for HTTP) must match the spoken content of the audio prompt exactly. Mismatched transcripts degrade voice cloning quality.
Tune Quality Settings#
The Magpie TTS Zeroshot model accepts a --zero_shot_quality parameter (range: 1–40, default: 20) that controls the trade-off between synthesis speed and voice similarity.
Lower values (1–10): Faster synthesis, lower voice fidelity.
Default (20): Balanced quality and speed.
Higher values (21–40): Slower synthesis, closer voice match.
python3 python-clients/scripts/tts/talk.py --server 0.0.0.0:50051 \
--language-code en-US \
--text "Higher quality voice cloning." \
--zero_shot_audio_prompt_file prompt.wav \
--zero_shot_quality 30 \
--output output.wav
For the WebSocket client, the equivalent flag is --zero-shot-prompt-quality.
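When calling talk.py from another program, it can help to validate the quality value before launching the script. The sketch below builds the argument list for the gRPC command shown in this section; the function name `talk_command` is hypothetical, and the server address and output path are simply copied from the examples in this guide.

```python
import shlex


def talk_command(text: str, prompt_file: str,
                 quality: int = 20, stream: bool = False) -> list[str]:
    """Build the talk.py command line, rejecting out-of-range quality values."""
    if not 1 <= quality <= 40:
        raise ValueError("quality must be between 1 and 40")
    args = [
        "python3", "python-clients/scripts/tts/talk.py",
        "--server", "0.0.0.0:50051",
        "--language-code", "en-US",
        "--text", text,
        "--zero_shot_audio_prompt_file", prompt_file,
        "--zero_shot_quality", str(quality),
    ]
    if stream:
        args.append("--stream")
    args += ["--output", "output.wav"]
    return args


# Print the command in a copy-pasteable, shell-quoted form.
print(shlex.join(talk_command("Higher quality voice cloning.", "prompt.wav", quality=30)))
```

The returned list can be passed directly to `subprocess.run(args, check=True)`, which avoids shell quoting issues in the text argument.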
Use Built-In Voices Instead#
Both models also include built-in voices that do not require an audio prompt. Specify a voice name with the --voice flag instead of an audio prompt file.
python3 python-clients/scripts/tts/talk.py --server 0.0.0.0:50051 \
--language-code en-US \
--text "Using a built-in voice." \
--voice Magpie-ZeroShot.Female-1 \
--output output.wav
Available voices: Magpie-ZeroShot.Female-1 (default), Female-Neutral, Female-Angry, Female-Fearful, Female-Calm, Female-Happy, Male-1, Male-Calm, Male-Neutral, Male-Angry, Male-Fearful.
python3 python-clients/scripts/tts/talk.py --server 0.0.0.0:50051 \
--language-code en-US \
--text "Using a built-in voice." \
--voice English-US-Magpie-Flow.Female-1 \
--output output.wav
Available voices: English-US-Magpie-Flow.Female-1 (default), Female.Calm, Female.Fearful, Female.Happy, Female.Neutral, Female.Angry, Female.Disgusted, Female.Sad, Male-1, Male.Calm, Male.Fearful, Male.Happy, Male.Neutral, Male.Angry, Male.Disgusted, Male.Sad.
For the complete voice list, refer to the TTS support matrix.
Key Differences Between Scripts#
The two Python client scripts use different argument naming conventions:
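| Parameter | talk.py | realtime_tts_client.py |
|---|---|---|
| Audio prompt | --zero_shot_audio_prompt_file | --zero-shot-audio-prompt-file |
| Transcript | --zero_shot_transcript | |
| Quality | --zero_shot_quality | --zero-shot-prompt-quality |
| Server | --server 0.0.0.0:50051 | --server localhost:9000 |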
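Because the quality flag is not a plain underscore-to-hyphen rename, an explicit lookup is safer than string replacement when converting a talk.py command for the WebSocket client. The sketch below covers only the two flag pairs named in this guide; the names `TALK_TO_WEBSOCKET` and `to_websocket_args` are hypothetical.

```python
# talk.py flag -> realtime_tts_client.py flag, limited to pairs named in this guide.
TALK_TO_WEBSOCKET = {
    "--zero_shot_audio_prompt_file": "--zero-shot-audio-prompt-file",
    "--zero_shot_quality": "--zero-shot-prompt-quality",
}


def to_websocket_args(talk_args: list[str]) -> list[str]:
    """Rename known zero-shot flags; leave values and other flags unchanged."""
    converted = []
    for arg in talk_args:
        if arg.startswith("--zero_shot_") and arg not in TALK_TO_WEBSOCKET:
            # Refuse to guess a name that this mapping does not list.
            raise ValueError(f"{arg} is not in the mapping")
        converted.append(TALK_TO_WEBSOCKET.get(arg, arg))
    return converted


print(to_websocket_args(["--zero_shot_audio_prompt_file", "prompt.wav",
                         "--zero_shot_quality", "30"]))
```

The function does not touch values, so the `--server` address still has to be changed by hand from the gRPC port to the WebSocket port.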
Limitations#
Magpie TTS Flow does not support streaming synthesis. Only offline mode is available.
Streaming with talk.py (synthesize_online) does not support the --zero_shot_transcript parameter. Use offline mode for Magpie TTS Flow.
Audio prompts shorter than 3 seconds or longer than 10 seconds can produce lower quality results.
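The mode and transcript limitations above can be checked before a request is sent. This is an illustrative sketch only: the function name `check_request` and the model labels `"zeroshot"` and `"flow"` are hypothetical and are not identifiers used by the NIM or the Riva client.

```python
def check_request(model: str, stream: bool, has_transcript: bool) -> list[str]:
    """Return the limitations a planned request would violate (empty if none)."""
    problems = []
    if model == "flow":
        if stream:
            problems.append("Magpie TTS Flow supports offline synthesis only")
        if not has_transcript:
            problems.append("Magpie TTS Flow requires a transcript of the audio prompt")
    if stream and has_transcript:
        problems.append("streaming with talk.py does not accept --zero_shot_transcript")
    return problems


print(check_request("flow", stream=True, has_transcript=True))
print(check_request("zeroshot", stream=True, has_transcript=False))
```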