TTS Zero Shot#
Riva brings zero-shot TTS capabilities with P-Flow, a fast and efficient flow-based TTS model that can adapt to a new voice from as little as 3 seconds of audio. P-Flow uses a speech-prompted text encoder for speaker adaptation and a flow-matching generative decoder for high-quality, fast speech synthesis.
Note
The zero-shot Riva TTS model is currently in limited early access.
OOTB Voices#
| Language | Model | Dataset | G2P | Gender | Voices |
|---|---|---|---|---|---|
| en-US | P-Flow HiFi-GAN | English-US | IPA | Multi-speaker | |
Setting up the TTS Python API parameters#
sample_rate_hz = 44100
req = {
    "language_code": "en-US",
    "encoding": riva.client.AudioEncoding.LINEAR_PCM,  # LINEAR_PCM and OGGOPUS encodings are supported
    "sample_rate_hz": sample_rate_hz,  # Generate 44.1 kHz audio
    "voice_name": "English-US-Pflow.Female.neutral",  # The name of the voice to generate
    "audio_prompt_encoding": riva.client.AudioEncoding.LINEAR_PCM,  # LINEAR_PCM and OGGOPUS encodings are supported
    "quality": 20,  # Number of decoder iterations while generating mels
    "audio_prompt_file": ""  # Path to the file containing the speech prompt
}
Understanding TTS Python API parameters#
Riva TTS supports a number of options while making a text-to-speech request to the gRPC endpoint, as shown above. Let’s learn more about these parameters:
language_code
- Language code of the generated audio.

encoding
- Type of audio encoding to generate. LINEAR_PCM and OGGOPUS encodings are supported.

sample_rate_hz
- Sample rate of the generated audio. Depends on the output audio device and is usually 22.05 kHz or 44.1 kHz.

voice_name
- Voice used to synthesize the audio. Currently, Riva offers two OOTB default voices with emotions.

audio_prompt_encoding
- Type of audio encoding of the speech prompt. LINEAR_PCM and OGGOPUS encodings are supported.

quality
- Number of decoder iterations for the mel generation, in the range 1-40.

audio_prompt_file
- Path to the audio file containing the speech prompt. If both a voice name and a speech prompt are passed, the speech prompt is used.
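The quality range and the prompt-versus-voice precedence rule above can be sketched in plain Python. The helper names here are illustrative only and are not part of the Riva client API:

```python
def resolve_voice_source(req):
    """Illustrative helper mirroring the documented precedence rule:
    if both "voice_name" and a non-empty "audio_prompt_file" are set,
    the speech prompt wins; otherwise the named voice is used."""
    if req.get("audio_prompt_file"):
        return ("speech_prompt", req["audio_prompt_file"])
    return ("voice_name", req.get("voice_name", ""))

def validate_quality(quality):
    """Illustrative check: quality must lie in the documented 1-40 range."""
    if not 1 <= quality <= 40:
        raise ValueError(f"quality must be in [1, 40], got {quality}")
    return quality

# A request with both fields set resolves to the speech prompt:
req = {"voice_name": "English-US-Pflow.Female.neutral",
       "audio_prompt_file": "prompt.wav", "quality": 20}
print(resolve_voice_source(req))       # ('speech_prompt', 'prompt.wav')
print(validate_quality(req["quality"]))  # 20
```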
Make a gRPC request to the Riva server#
For batch inference mode, use synthesize. Results are returned when the entire audio is synthesized.
import numpy as np
import IPython.display as ipd

# riva_tts is an instance of riva.client.SpeechSynthesisService
req["text"] = "Is it recognize speech or wreck a nice beach?"
resp = riva_tts.synthesize(**req)
audio_samples = np.frombuffer(resp.audio, dtype=np.int16)
ipd.Audio(audio_samples, rate=sample_rate_hz)
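Outside a notebook, the raw 16-bit PCM bytes in `resp.audio` can be written straight to a WAV file with the standard library. This sketch substitutes a synthetic sine tone for `resp.audio`, and the `write_wav` helper name is ours, not part of the Riva client:

```python
import math
import struct
import wave

def write_wav(path, pcm_bytes, sample_rate_hz=44100):
    """Write mono 16-bit little-endian PCM bytes to a WAV file."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)   # mono
        wf.setsampwidth(2)   # 16-bit samples
        wf.setframerate(sample_rate_hz)
        wf.writeframes(pcm_bytes)

# Stand-in for resp.audio: 0.1 s of a 440 Hz tone as int16 PCM.
sample_rate_hz = 44100
samples = [int(0.3 * 32767 * math.sin(2 * math.pi * 440 * n / sample_rate_hz))
           for n in range(sample_rate_hz // 10)]
pcm = struct.pack("<%dh" % len(samples), *samples)
write_wav("output.wav", pcm, sample_rate_hz)
```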
For online inference, use synthesize_online. Results are returned in chunks as they are synthesized.
req["text"] = "Is it recognize speech or wreck a nice beach?"
resp = riva_tts.synthesize_online(**req)
empty = np.array([])
for i, rep in enumerate(resp):
    audio_samples = np.frombuffer(rep.audio, dtype=np.int16) / (2**15)
    print("Chunk: ", i)
    ipd.display(ipd.Audio(audio_samples, rate=44100))
    empty = np.concatenate((empty, audio_samples))
print("Final synthesis:")
ipd.display(ipd.Audio(empty, rate=44100))
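The division by `2**15` in the loop converts 16-bit PCM samples to floats in [-1.0, 1.0). The same scaling can be demonstrated with only the standard library (the `pcm16_to_float` helper is illustrative, not part of the Riva client):

```python
import struct

def pcm16_to_float(pcm_bytes):
    """Convert little-endian int16 PCM bytes to floats in [-1.0, 1.0)."""
    n = len(pcm_bytes) // 2
    ints = struct.unpack("<%dh" % n, pcm_bytes)
    return [s / 2**15 for s in ints]

chunk = struct.pack("<4h", 0, 16384, -32768, 32767)
print(pcm16_to_float(chunk))  # [0.0, 0.5, -1.0, 0.999969482421875]
```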
Binary Client Example#
The binary clients shipped in the Docker image can also be used as follows:
Binary TTS Client Example#
riva_tts_client --text="I had a dream yesterday." --audio_file=/opt/riva/wav/output.wav --zero_shot_audio_prompt=<Path to the audio file> --zero_shot_quality=20
Alternatively, use the voice_name parameter to select one of the OOTB voices. If both a voice name and a speech prompt are passed, the speech prompt is used.
riva_tts_client --text="I had a dream yesterday." --audio_file=/opt/riva/wav/output.wav --voice_name="English-US-Pflow.Female.neutral"
Binary TTS Performance Client Example#
The binary TTS performance client applies the same speech prompt to all input queries in the text_file:
riva_tts_perf_client --text_file=/work/test_files/tts/ljs_audio_text_test_filelist_small.txt --voice_name="English-US-Pflow.Female.neutral"