The following sections demonstrate the available TTS models through the gRPC and HTTP APIs, using a sample Python client and curl commands, respectively.
The Magpie TTS Multilingual model supports text-to-speech in multiple languages.
Ensure that you have deployed the Magpie TTS Multilingual model by referring to the Supported Models section.
The following sections show how to use the model with a sample Python client and curl commands for the gRPC and HTTP APIs, respectively.
List available models and voices
python3 python-clients/scripts/tts/talk.py \
--server 0.0.0.0:50051 \
--list-voices
curl -sS http://localhost:9000/v1/audio/list_voices | jq
The output is piped to the jq command to format the JSON for readability.
You will see output listing the voices for each supported language. The output below is truncated for brevity.
{
  "en-US,es-US,fr-FR,de-DE": {
    "voices": [
      "Magpie-Multilingual.EN-US.Sofia",
      "Magpie-Multilingual.EN-US.Ray",
      ...
      "Magpie-Multilingual.DE-DE.Leo",
      "Magpie-Multilingual.DE-DE.Aria"
    ]
  }
}
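The same endpoint can be queried from Python. A minimal sketch using the requests library, against the same localhost:9000 address as the curl command above:

import requests

# Query the list_voices endpoint shown in the curl command above.
resp = requests.get("http://localhost:9000/v1/audio/list_voices", timeout=10)
resp.raise_for_status()

# The response maps a group of language codes to its available voices.
for languages, entry in resp.json().items():
    print(languages)
    for voice in entry["voices"]:
        print(f"  {voice}")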
Synthesize speech with Offline API
With the Offline API, the entire synthesized speech is returned to the client at once. The synthesized speech is saved in output.wav.
python3 python-clients/scripts/tts/talk.py --server 0.0.0.0:50051 \
--language-code en-US \
--text "Experience the future of speech AI with Riva, where every word comes to life with clarity and emotion." \
--voice Magpie-Multilingual.EN-US.Sofia \
--output output.wav
curl -sS http://localhost:9000/v1/audio/synthesize --fail-with-body \
-F language=en-US \
-F text="Experience the future of speech AI with Riva, where every word comes to life with clarity and emotion." \
-F voice=Magpie-Multilingual.EN-US.Sofia \
--output output.wav
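The same offline request can be issued programmatically over gRPC. Below is a minimal sketch assuming the nvidia-riva-client Python package; the riva.client calls follow that package's published interface, but treat the exact names as assumptions rather than a definitive implementation:

import wave

import riva.client

# Connect to the same gRPC endpoint used by talk.py above.
auth = riva.client.Auth(uri="0.0.0.0:50051")
tts = riva.client.SpeechSynthesisService(auth)

# Offline call: the full audio arrives in a single response.
resp = tts.synthesize(
    text="Experience the future of speech AI with Riva, where every word comes to life with clarity and emotion.",
    voice_name="Magpie-Multilingual.EN-US.Sofia",
    language_code="en-US",
    sample_rate_hz=22050,
)

# resp.audio holds 16-bit mono LPCM; wrap it in a WAV container.
with wave.open("output.wav", "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)   # 16-bit samples
    out.setframerate(22050)
    out.writeframes(resp.audio)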
It is possible to intermix voices and languages to generate speech with different accents. For example, the following command synthesizes English speech with a French accent.
python3 python-clients/scripts/tts/talk.py --server 0.0.0.0:50051 \
--language-code en-US \
--text "Experience the future of speech AI with Riva, where every word comes to life with clarity and emotion." \
--voice Magpie-Multilingual.FR-FR.Pascal \
--output output.wav
Note
By default, gRPC limits incoming message size to 4 MB. As the Offline API returns synthesized speech in a single chunk, an error will occur if the synthesized speech exceeds this size. In such cases, we recommend using the Streaming API instead.
Synthesize speech with Streaming API
With the Streaming API, synthesized speech is returned in chunks as they are generated. The Streaming API is recommended for real-time applications that require the lowest latency.
python3 python-clients/scripts/tts/talk.py --server 0.0.0.0:50051 \
--language-code en-US \
--text "Experience the future of speech AI with Riva, where every word comes to life with clarity and emotion." \
--voice Magpie-Multilingual.EN-US.Sofia \
--stream \
--output output.wav
curl -sS http://localhost:9000/v1/audio/synthesize_online --fail-with-body \
-F language=en-US \
-F text="Experience the future of speech AI with Riva, where every word comes to life with clarity and emotion." \
-F voice=Magpie-Multilingual.EN-US.Sofia \
-F sample_rate_hz=22050 \
--output output.raw
The streaming HTTP API returns raw LPCM audio without a WAV header. A tool such as sox can prepend a WAV header and save the result as a WAV file.
sox -b 16 -e signed -c 1 -r 22050 output.raw output.wav
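The streaming gRPC API can be consumed the same way from Python. A minimal sketch, again assuming the nvidia-riva-client package; because the wave module writes a proper WAV header on close, no sox post-processing is needed:

import wave

import riva.client

auth = riva.client.Auth(uri="0.0.0.0:50051")
tts = riva.client.SpeechSynthesisService(auth)

# Streaming call: audio chunks arrive as they are synthesized.
responses = tts.synthesize_online(
    text="Experience the future of speech AI with Riva, where every word comes to life with clarity and emotion.",
    voice_name="Magpie-Multilingual.EN-US.Sofia",
    language_code="en-US",
    sample_rate_hz=22050,
)

with wave.open("output.wav", "wb") as out:
    out.setnchannels(1)   # mono
    out.setsampwidth(2)   # 16-bit samples
    out.setframerate(22050)
    for chunk in responses:
        out.writeframes(chunk.audio)  # write each chunk as it arrives

The same wave calls can also wrap an existing output.raw file, as a Python alternative to the sox command above.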
The Magpie TTS Zeroshot model supports text-to-speech in English using an audio prompt. Voice characteristics from the audio prompt are applied to the synthesized output speech. This model supports both streaming and offline inference.
The following sections show how to use the model with a sample Python client and curl commands for the gRPC and HTTP APIs, respectively.
Make sure you have deployed the Magpie TTS Zeroshot model by referring to the Supported Models section.
You can create an audio prompt using any voice recording application.
Guidelines for creating an effective audio prompt:
- The audio must be a 16-bit mono WAV file with a sample rate of 22.05 kHz or higher.
- Aim for a duration of about five seconds.
- Trim silence from the beginning and end so that speech fills most of the prompt.
- Record the prompt in a noise-free environment.
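You can check a recording against these guidelines with Python's standard wave module. A small sketch; my_prompt.wav is a placeholder for your own file:

import wave

# Verify an audio prompt against the guidelines above.
with wave.open("my_prompt.wav", "rb") as f:  # placeholder filename
    assert f.getnchannels() == 1, "prompt must be mono"
    assert f.getsampwidth() == 2, "prompt must be 16-bit"
    assert f.getframerate() >= 22050, "sample rate must be 22.05 kHz or higher"
    duration = f.getnframes() / f.getframerate()
    print(f"duration: {duration:.2f} s (aim for about five seconds)")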
The following commands use the sample audio prompt provided at python-clients/data/examples/sample_audio_prompt.wav. The synthesized speech is saved in output.wav, and the voice has characteristics similar to those in the provided audio prompt. If you are using your own audio prompt, make sure to update the value passed to the --zero_shot_audio_prompt_file argument.
Synthesize speech with Offline API
With the Offline API, the entire synthesized speech is returned to the client at once.
python3 python-clients/scripts/tts/talk.py --server 0.0.0.0:50051 \
--language-code en-US \
--text "Experience the future of speech AI with Riva, where every word comes to life with clarity and emotion." \
--zero_shot_audio_prompt_file python-clients/data/examples/sample_audio_prompt.wav \
--output output.wav
curl -sS http://localhost:9000/v1/audio/synthesize --fail-with-body \
-F language=en-US \
-F text="Experience the future of speech AI with Riva, where every word comes to life with clarity and emotion." \
-F audio_prompt=@$HOME/python-clients/data/examples/sample_audio_prompt.wav \
--output output.wav
Note
The @ prefix in the audio_prompt parameter is mandatory; per curl's -F syntax, it tells curl to upload the contents of the named file rather than the literal string.
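The offline zero-shot request can also be made over gRPC from Python. A sketch assuming the nvidia-riva-client package; the audio_prompt_file parameter name is an assumption mirroring the --zero_shot_audio_prompt_file flag of talk.py:

import wave

import riva.client

auth = riva.client.Auth(uri="0.0.0.0:50051")
tts = riva.client.SpeechSynthesisService(auth)

resp = tts.synthesize(
    text="Experience the future of speech AI with Riva, where every word comes to life with clarity and emotion.",
    language_code="en-US",
    sample_rate_hz=22050,
    # Assumed parameter; same sample prompt as the talk.py example above.
    audio_prompt_file="python-clients/data/examples/sample_audio_prompt.wav",
)

with wave.open("output.wav", "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)
    out.setframerate(22050)
    out.writeframes(resp.audio)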
Synthesize speech with Streaming API
With the Streaming API, synthesized speech is returned in chunks as they are generated. The Streaming API is recommended for real-time applications that require the lowest latency.
python3 python-clients/scripts/tts/talk.py --server 0.0.0.0:50051 \
--language-code en-US \
--text "Experience the future of speech AI with Riva, where every word comes to life with clarity and emotion." \
--zero_shot_audio_prompt_file python-clients/data/examples/sample_audio_prompt.wav \
--stream \
--output output.wav
curl -sS http://localhost:9000/v1/audio/synthesize_online --fail-with-body \
-F language=en-US \
-F text="Experience the future of speech AI with Riva, where every word comes to life with clarity and emotion." \
-F audio_prompt=@$HOME/python-clients/data/examples/sample_audio_prompt.wav \
--output output.wav
Note
The @ prefix in the audio_prompt parameter is mandatory; per curl's -F syntax, it tells curl to upload the contents of the named file rather than the literal string.
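The streaming HTTP request can also be sent with the Python requests library, mirroring the multipart fields of the curl command above. As with the earlier streaming example, the endpoint returns raw LPCM, so this sketch saves to output.raw:

import requests

with open("python-clients/data/examples/sample_audio_prompt.wav", "rb") as prompt:
    resp = requests.post(
        "http://localhost:9000/v1/audio/synthesize_online",
        data={
            "language": "en-US",
            "text": "Experience the future of speech AI with Riva, where every word comes to life with clarity and emotion.",
        },
        files={"audio_prompt": prompt},  # requests uploads the file contents
        stream=True,  # consume the response incrementally
        timeout=120,
    )
    resp.raise_for_status()
    with open("output.raw", "wb") as out:
        for chunk in resp.iter_content(chunk_size=4096):
            out.write(chunk)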
The Magpie TTS Flow model supports text-to-speech in English using an audio prompt and the prompt's transcript text. Voice characteristics from the audio prompt are applied to the synthesized output speech. Compared to the Magpie TTS Zeroshot model, this model additionally requires the prompt transcript text as input. This model supports only offline inference.
The following sections show how to use the model with a sample Python client and curl commands for the gRPC and HTTP APIs, respectively.
Make sure you have deployed the Magpie TTS Flow model by referring to the Supported Models section.
You can create an audio prompt using any voice recording application.
Guidelines for creating an effective audio prompt:
- The audio must be a 16-bit mono WAV file with a sample rate of 22.05 kHz or higher.
- Aim for a duration of about five seconds.
- Trim silence from the beginning and end so that speech fills most of the prompt.
- Record the prompt in a noise-free environment.
The following commands use the sample audio prompt provided at python-clients/data/examples/sample_audio_prompt.wav. The synthesized speech is saved in output.wav, and the voice has characteristics similar to those in the provided audio prompt. If you are using your own audio prompt, make sure to update the values passed to the --zero_shot_audio_prompt_file and --zero_shot_transcript arguments.
python3 python-clients/scripts/tts/talk.py --server 0.0.0.0:50051 \
--language-code en-US \
--text "Experience the future of speech AI with Riva, where every word comes to life with clarity and emotion." \
--zero_shot_audio_prompt_file python-clients/data/examples/sample_audio_prompt.wav \
--zero_shot_transcript "I consent to use my voice to create a synthetic voice." \
--output output.wav
curl -sS http://localhost:9000/v1/audio/synthesize --fail-with-body \
-F language=en-US \
-F text="Experience the future of speech AI with Riva, where every word comes to life with clarity and emotion." \
-F sample_rate_hz=22050 \
-F audio_prompt="@$HOME/python-clients/data/examples/sample_audio_prompt.wav" \
-F audio_prompt_transcript="I consent to use my voice to create a synthetic voice." \
--output output.wav
Note
The Magpie TTS Flow model supports only offline APIs.
The @ prefix in the audio_prompt parameter is mandatory; per curl's -F syntax, it tells curl to upload the contents of the named file rather than the literal string.
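The same offline multipart request can be sent from Python with the requests library, mirroring the fields of the curl command above:

import requests

with open("python-clients/data/examples/sample_audio_prompt.wav", "rb") as prompt:
    resp = requests.post(
        "http://localhost:9000/v1/audio/synthesize",
        data={
            "language": "en-US",
            "text": "Experience the future of speech AI with Riva, where every word comes to life with clarity and emotion.",
            "sample_rate_hz": "22050",
            "audio_prompt_transcript": "I consent to use my voice to create a synthetic voice.",
        },
        files={"audio_prompt": prompt},  # requests uploads the file contents
        timeout=120,
    )
    resp.raise_for_status()
    with open("output.wav", "wb") as out:
        out.write(resp.content)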
The Fastpitch HifiGAN TTS model supports text-to-speech only in English (en-US).
Ensure that you have deployed the Fastpitch HifiGAN TTS model by referring to the Supported Models section.
List available models and voices
python3 python-clients/scripts/tts/talk.py \
--server 0.0.0.0:50051 \
--list-voices
Synthesize speech with Offline API
With the Offline API, the entire synthesized speech is returned to the client at once. The synthesized speech is saved in output.wav.
python3 python-clients/scripts/tts/talk.py --server 0.0.0.0:50051 \
--language-code en-US \
--text "Experience the future of speech AI with Riva, where every word comes to life with clarity and emotion." \
--output output.wav
Note
By default, gRPC limits incoming message size to 4 MB. As the Offline API returns synthesized speech in a single chunk, an error will occur if the synthesized speech exceeds this size. In such cases, we recommend using the Streaming API instead.
Synthesize speech with Streaming API
With the Streaming API, synthesized speech is returned in chunks as they are generated. The Streaming API is recommended for real-time applications that require the lowest latency.
python3 python-clients/scripts/tts/talk.py --server 0.0.0.0:50051 \
--language-code en-US \
--text "Experience the future of speech AI with Riva, where every word comes to life with clarity and emotion." \
--stream \
--output output.wav