Cloning a Voice with Zero-Shot TTS#

The Magpie TTS Zeroshot and Magpie TTS Flow models synthesize speech that matches the voice characteristics of a short reference audio recording. This guide covers how to prepare an audio prompt, tune quality settings, and synthesize cloned speech through gRPC, HTTP, and WebSocket.

Prerequisites#

Note

Both models require access approval. Request access through this form.

Choose a Model#

| Capability | Magpie TTS Zeroshot | Magpie TTS Flow |
|---|---|---|
| Inference modes | Streaming + Offline | Offline only |
| Audio prompt | Required | Required |
| Transcript of prompt | Not used | Required |
| Quality parameter | 1–40 (default: 20) | Not applicable |
| Built-in voices | 11 (Male/Female with emotions) | 16 (Male/Female with emotions) |
| GPU memory | 4.8 GB | 5.1 GB |

Use Zeroshot when you need streaming output or do not have a transcript of the audio prompt. Use Flow when you have a transcript and want higher-fidelity voice matching in offline mode.
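The selection rule above can be written as a small helper. This is only an illustrative sketch: `choose_model` is a hypothetical function, not part of any client library, and it returns the model names as they are written in this guide.

```python
def choose_model(needs_streaming: bool, has_transcript: bool) -> str:
    """Pick a model following the guidance in this section."""
    # Flow is offline-only and needs a transcript of the audio prompt,
    # so either missing condition falls back to Zeroshot.
    if needs_streaming or not has_transcript:
        return "Magpie TTS Zeroshot"
    return "Magpie TTS Flow"
```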

Prepare the Audio Prompt#

The audio prompt is the reference recording whose voice characteristics the model will replicate.

Requirements:

  • Format: 16-bit mono WAV

  • Sample rate: 22.05 kHz or higher

  • Duration: 3–10 seconds (aim for approximately 5 seconds)

  • Content: Clear speech with minimal background noise
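The format, sample rate, and duration requirements can be checked before a prompt is sent to the server. The following is a minimal sketch that uses only the Python standard library; `check_prompt` and its constants are hypothetical names introduced for illustration, and the check cannot assess speech clarity or background noise.

```python
import wave
from typing import List

MIN_RATE_HZ = 22050                   # 22.05 kHz or higher
MIN_SECONDS, MAX_SECONDS = 3.0, 10.0  # recommended prompt duration


def check_prompt(path: str) -> List[str]:
    """Return a list of problems; an empty list means the file passes."""
    # wave.open raises wave.Error for files that are not PCM WAV.
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_width = wav.getsampwidth()  # bytes per sample
        rate = wav.getframerate()
        frames = wav.getnframes()

    problems = []
    if channels != 1:
        problems.append(f"expected mono, found {channels} channels")
    if sample_width != 2:
        problems.append(f"expected 16-bit samples, found {8 * sample_width}-bit")
    if rate < MIN_RATE_HZ:
        problems.append(f"sample rate {rate} Hz is below {MIN_RATE_HZ} Hz")
    duration = frames / rate if rate else 0.0
    if not MIN_SECONDS <= duration <= MAX_SECONDS:
        problems.append(f"duration {duration:.2f} s is outside 3-10 s")
    return problems
```

Calling `check_prompt("prompt.wav")` returns an empty list when the file meets all three measurable requirements.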

Tips for higher quality:

  • Trim silence from the beginning and end so speech fills most of the prompt.

  • Record in a quiet, echo-free environment.

  • Use consistent volume throughout the recording.

  • Avoid music, sound effects, or overlapping speakers.
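The first tip, trimming silence from both ends, can be approximated with a simple amplitude threshold. This sketch assumes a 16-bit mono WAV input; `trim_silence` is a hypothetical helper, and the default threshold of 500 (out of a maximum of 32767) is an arbitrary starting point that will likely need tuning for a given recording.

```python
import array
import sys
import wave


def trim_silence(src: str, dst: str, threshold: int = 500) -> float:
    """Write src to dst without leading/trailing quiet samples.

    Returns the duration of the trimmed audio in seconds.
    """
    with wave.open(src, "rb") as wav:
        params = wav.getparams()
        if params.nchannels != 1 or params.sampwidth != 2:
            raise ValueError("expected a 16-bit mono WAV file")
        samples = array.array("h")
        samples.frombytes(wav.readframes(params.nframes))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian

    # Indices of every sample louder than the threshold.
    loud = [i for i, s in enumerate(samples) if abs(s) > threshold]
    if not loud:
        raise ValueError("no sample exceeds the threshold")
    trimmed = samples[loud[0]:loud[-1] + 1]

    if sys.byteorder == "big":
        trimmed.byteswap()  # back to little-endian for writing
    with wave.open(dst, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)
        out.setframerate(params.framerate)
        out.writeframes(trimmed.tobytes())
    return len(trimmed) / params.framerate
```

The returned duration makes it easy to confirm that the trimmed prompt still falls within the 3–10 second range.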

Synthesize with Magpie TTS Zeroshot#

Offline (gRPC)#

python3 python-clients/scripts/tts/talk.py --server 0.0.0.0:50051 \
    --language-code en-US \
    --text "This speech uses a cloned voice from my audio prompt." \
    --zero_shot_audio_prompt_file prompt.wav \
    --output output.wav

Offline (HTTP)#

curl -sS http://localhost:9000/v1/audio/synthesize --fail-with-body \
    -F language=en-US \
    -F text="This speech uses a cloned voice from my audio prompt." \
    -F audio_prompt=@prompt.wav \
    --output output.wav

Note

The @ prefix on the file path is required by curl for file uploads.
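The same HTTP request can be issued from Python. The sketch below builds the multipart/form-data body by hand with the standard library, reusing the endpoint and the `language`, `text`, and `audio_prompt` field names from the curl command. The helper names `build_multipart` and `synthesize` are hypothetical, and the `application/octet-stream` part type is an assumption about what the server accepts.

```python
import urllib.request
import uuid


def build_multipart(fields: dict, files: dict) -> tuple:
    """Encode form fields and files as (body_bytes, content_type).

    files maps a field name to a (filename, data_bytes) pair.
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f'{value}\r\n'.encode("utf-8")
        )
    for name, (filename, data) in files.items():
        header = (
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"; '
            f'filename="{filename}"\r\n'
            'Content-Type: application/octet-stream\r\n\r\n'
        ).encode("utf-8")
        parts.append(header + data + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def synthesize(text: str, prompt_path: str, out_path: str,
               url: str = "http://localhost:9000/v1/audio/synthesize",
               language: str = "en-US") -> None:
    """POST the text and audio prompt, then save the returned audio."""
    with open(prompt_path, "rb") as f:
        prompt = f.read()
    body, content_type = build_multipart(
        {"language": language, "text": text},
        {"audio_prompt": ("prompt.wav", prompt)},
    )
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    # urlopen raises urllib.error.HTTPError on 4xx/5xx responses.
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as out:
        out.write(response.read())
```

With a server running, `synthesize("Hello from a cloned voice.", "prompt.wav", "output.wav")` would mirror the curl example.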

Streaming (gRPC)#

python3 python-clients/scripts/tts/talk.py --server 0.0.0.0:50051 \
    --language-code en-US \
    --text "This speech uses a cloned voice from my audio prompt." \
    --zero_shot_audio_prompt_file prompt.wav \
    --stream \
    --output output.wav

Streaming (WebSocket)#

The WebSocket client uses hyphens instead of underscores for argument names.

python3 python-clients/scripts/tts/realtime_tts_client.py \
    --server localhost:9000 \
    --language-code en-US \
    --text "This speech uses a cloned voice from my audio prompt." \
    --zero-shot-audio-prompt-file prompt.wav \
    --output output.wav

Synthesize with Magpie TTS Flow#

Magpie TTS Flow requires a transcript of the audio prompt in addition to the audio file. This model supports offline inference only.

Offline (gRPC)#

python3 python-clients/scripts/tts/talk.py --server 0.0.0.0:50051 \
    --language-code en-US \
    --text "This speech uses a cloned voice from my audio prompt." \
    --zero_shot_audio_prompt_file prompt.wav \
    --zero_shot_transcript "The exact words spoken in the audio prompt file." \
    --output output.wav

Offline (HTTP)#

curl -sS http://localhost:9000/v1/audio/synthesize --fail-with-body \
    -F language=en-US \
    -F text="This speech uses a cloned voice from my audio prompt." \
    -F audio_prompt=@prompt.wav \
    -F audio_prompt_transcript="The exact words spoken in the audio prompt file." \
    --output output.wav

Important

The transcript (--zero_shot_transcript for gRPC, audio_prompt_transcript for HTTP) must match the spoken content of the audio prompt exactly. Mismatched transcripts degrade voice cloning quality.

Tune Quality Settings#

The Magpie TTS Zeroshot model accepts a --zero_shot_quality parameter (range: 1–40, default: 20) that controls the trade-off between synthesis speed and voice similarity.

  • Lower values (1–10): Faster synthesis, lower voice fidelity.

  • Default (20): Balanced quality and speed.

  • Higher values (21–40): Slower synthesis, closer voice match.

python3 python-clients/scripts/tts/talk.py --server 0.0.0.0:50051 \
    --language-code en-US \
    --text "Higher quality voice cloning." \
    --zero_shot_audio_prompt_file prompt.wav \
    --zero_shot_quality 30 \
    --output output.wav

For the WebSocket client, the equivalent flag is --zero-shot-prompt-quality.
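To compare settings by ear, it can help to synthesize the same sentence at several quality values. The sketch below only builds the talk.py command lines and rejects values outside the documented 1–40 range; `talk_command` and `sweep_commands` are hypothetical helpers, and the output file naming is an arbitrary choice.

```python
from typing import List, Sequence

QUALITY_MIN, QUALITY_MAX = 1, 40  # documented range for --zero_shot_quality


def talk_command(text: str, prompt_file: str, quality: int = 20,
                 output: str = "output.wav",
                 server: str = "0.0.0.0:50051") -> List[str]:
    """Build the argument list for one talk.py invocation."""
    if not QUALITY_MIN <= quality <= QUALITY_MAX:
        raise ValueError(f"quality must be between {QUALITY_MIN} and {QUALITY_MAX}")
    return [
        "python3", "python-clients/scripts/tts/talk.py",
        "--server", server,
        "--language-code", "en-US",
        "--text", text,
        "--zero_shot_audio_prompt_file", prompt_file,
        "--zero_shot_quality", str(quality),
        "--output", output,
    ]


def sweep_commands(text: str, prompt_file: str,
                   qualities: Sequence[int] = (10, 20, 30, 40)) -> List[List[str]]:
    """One command per quality value, each writing to its own file."""
    return [
        talk_command(text, prompt_file, q, output=f"output_q{q}.wav")
        for q in qualities
    ]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`; wrapping the call with `time.perf_counter()` gives a rough measure of the speed side of the trade-off.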

Use Built-In Voices Instead#

Both models also include built-in voices that do not require an audio prompt. Specify a voice name with the --voice flag instead of an audio prompt file.

python3 python-clients/scripts/tts/talk.py --server 0.0.0.0:50051 \
    --language-code en-US \
    --text "Using a built-in voice." \
    --voice Magpie-ZeroShot.Female-1 \
    --output output.wav

Available Magpie TTS Zeroshot voices: Magpie-ZeroShot.Female-1 (default), Female-Neutral, Female-Angry, Female-Fearful, Female-Calm, Female-Happy, Male-1, Male-Calm, Male-Neutral, Male-Angry, Male-Fearful.

python3 python-clients/scripts/tts/talk.py --server 0.0.0.0:50051 \
    --language-code en-US \
    --text "Using a built-in voice." \
    --voice English-US-Magpie-Flow.Female-1 \
    --output output.wav

Available Magpie TTS Flow voices: English-US-Magpie-Flow.Female-1 (default), Female.Calm, Female.Fearful, Female.Happy, Female.Neutral, Female.Angry, Female.Disgusted, Female.Sad, Male-1, Male.Calm, Male.Fearful, Male.Happy, Male.Neutral, Male.Angry, Male.Disgusted, Male.Sad.

For the complete voice list, refer to the TTS support matrix.

Key Differences Between Scripts#

The two Python client scripts use different argument naming conventions:

| Parameter | talk.py (gRPC) | realtime_tts_client.py (WebSocket) |
|---|---|---|
| Audio prompt | --zero_shot_audio_prompt_file | --zero-shot-audio-prompt-file |
| Transcript | --zero_shot_transcript | --zero-shot-audio-prompt-transcript |
| Quality | --zero_shot_quality | --zero-shot-prompt-quality |
| Server | --server 0.0.0.0:50051 | --server localhost:9000 |
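When the same options need to be passed to both scripts, the flag differences listed above can be captured in a lookup table. This is a sketch only: `TALK_TO_REALTIME` and `to_realtime_args` are hypothetical names, the mapping covers just the three zero-shot flags named in this guide, and the server address is not rewritten because the two clients connect to different ports.

```python
from typing import List

# talk.py flag -> realtime_tts_client.py flag
TALK_TO_REALTIME = {
    "--zero_shot_audio_prompt_file": "--zero-shot-audio-prompt-file",
    "--zero_shot_transcript": "--zero-shot-audio-prompt-transcript",
    "--zero_shot_quality": "--zero-shot-prompt-quality",
}


def to_realtime_args(talk_args: List[str]) -> List[str]:
    """Rename zero-shot flags; every other argument passes through unchanged."""
    return [TALK_TO_REALTIME.get(arg, arg) for arg in talk_args]
```

For example, `to_realtime_args(["--zero_shot_quality", "30"])` yields `["--zero-shot-prompt-quality", "30"]`.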

Limitations#

  • Magpie TTS Flow does not support streaming synthesis. Only offline mode is available.

  • Streaming with talk.py (synthesize_online) does not support the --zero_shot_transcript parameter. Use offline mode for Magpie TTS Flow.

  • Audio prompts shorter than 3 seconds or longer than 10 seconds can produce lower-quality results.