TTS Zero Shot#

Riva provides zero-shot TTS capabilities through P-Flow, a fast and efficient flow-based TTS model that can adapt to a new voice with as little as 3 seconds of audio. P-Flow uses a speech-prompted text encoder for speaker adaptation and a flow-matching generative decoder for high-quality, fast speech synthesis.

Note

The Zero Shot Riva TTS model is currently under limited early access.

OOTB Voices#

| Language | Model | Dataset | G2P | Gender | Voices |
|---|---|---|---|---|---|
| en-US | P-Flow HiFi-GAN | English-US | IPA | Multi-speaker | English-US-Pflow-Beta.Female.neutral, English-US-Pflow-Beta.Female.angry, English-US-Pflow-Beta.Female.fearful, English-US-Pflow-Beta.Female.personal, English-US-Pflow-Beta.Female.disgusted, English-US-Pflow-Beta.Female.calm, English-US-Pflow-Beta.Female.happy, English-US-Pflow-Beta.Female.sad, English-US-Pflow-Beta.Male.calm, English-US-Pflow-Beta.Male.happy, English-US-Pflow-Beta.Male.sad, English-US-Pflow-Beta.Male.disgusted, English-US-Pflow-Beta.Male.neutral, English-US-Pflow-Beta.Male.angry, English-US-Pflow-Beta.Male.fearful, English-US-Pflow-Beta.Male.personal |
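The voice names listed above follow a visible pattern: a fixed `English-US-Pflow-Beta` prefix, then the gender, then the emotion. As a small convenience sketch, the names can be composed and checked as follows. The helper `ootb_voice_name` is hypothetical and not part of the Riva client; the allowed values are simply those that appear in the list above.

```python
# Hypothetical helper; not part of riva.client.
VOICE_PREFIX = "English-US-Pflow-Beta"
GENDERS = ("Female", "Male")
EMOTIONS = ("neutral", "angry", "fearful", "personal",
            "disgusted", "calm", "happy", "sad")


def ootb_voice_name(gender: str, emotion: str) -> str:
    """Compose an OOTB voice name such as 'English-US-Pflow-Beta.Female.calm'."""
    if gender not in GENDERS:
        raise ValueError(f"unknown gender {gender!r}; expected one of {GENDERS}")
    if emotion not in EMOTIONS:
        raise ValueError(f"unknown emotion {emotion!r}; expected one of {EMOTIONS}")
    return f"{VOICE_PREFIX}.{gender}.{emotion}"


# Every gender/emotion combination listed above.
ALL_VOICES = [ootb_voice_name(g, e) for g in GENDERS for e in EMOTIONS]
```

The resulting string can be passed as the `voice_name` request parameter.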

Setting up the TTS Python API parameters#

import numpy as np
import IPython.display as ipd
import riva.client

# Connect to the Riva server (the URI is an assumption; adjust it for your deployment)
auth = riva.client.Auth(uri="localhost:50051")
riva_tts = riva.client.SpeechSynthesisService(auth)

sample_rate_hz = 44100
req = {
        "language_code"  : "en-US",
        "encoding"       : riva.client.AudioEncoding.LINEAR_PCM,         # LINEAR_PCM and OGGOPUS encodings are supported
        "sample_rate_hz" : sample_rate_hz,                               # Generate 44.1 kHz audio
        "voice_name"     : "English-US-Pflow-Beta.Female.neutral",       # The name of the voice to generate
        "audio_prompt_encoding" : riva.client.AudioEncoding.LINEAR_PCM,  # LINEAR_PCM and OGGOPUS encodings are supported
        "quality"  : 20,                                                 # Number of decoder iterations used to generate mels (1-40)
        "audio_prompt_file" : ""                                         # Path to the file containing the speech prompt
}

Understanding TTS Python API parameters#

Riva TTS supports a number of options when making a text-to-speech request to the gRPC endpoint, as shown above. Let’s learn more about these parameters:

  • language_code - Language code of the generated audio

  • encoding - Type of audio encoding to generate. LINEAR_PCM and OGGOPUS encodings are supported.

  • sample_rate_hz - Sample rate of the generated audio. Depends on the output audio device and is usually 22.05 kHz or 44.1 kHz.

  • voice_name - Voice used to synthesize the audio. Currently, Riva offers two OOTB default voices (female and male), each with several emotion variants.

  • audio_prompt_encoding - Type of audio encoding of the speech prompt. LINEAR_PCM and OGGOPUS encodings are supported.

  • quality - Number of decoder iterations used for mel generation. Valid range: 1 to 40.

  • audio_prompt_file - Path to the speech prompt audio file. If both a voice name and a speech prompt are passed, the speech prompt is used.
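The constraints in the list above (quality between 1 and 40, and the speech prompt taking precedence over the voice name) can be checked on the client side before a request is sent. The sketch below is illustrative only: `check_zero_shot_params` is a hypothetical helper, not a Riva API. It assumes that an empty `audio_prompt_file` means no prompt was supplied, and it treats a request with neither a voice name nor a prompt as an error, although the server may behave differently.

```python
def check_zero_shot_params(quality: int, voice_name: str = "",
                           audio_prompt_file: str = "") -> str:
    """Validate request values and report which voice source takes effect.

    Returns "audio_prompt" or "voice_name".
    """
    if not 1 <= quality <= 40:
        raise ValueError(f"quality must be in the range 1-40, got {quality}")
    if audio_prompt_file:
        # The speech prompt wins when both a prompt and a voice name are set.
        return "audio_prompt"
    if voice_name:
        return "voice_name"
    # Assumption of this sketch: require at least one voice source.
    raise ValueError("pass either voice_name or audio_prompt_file")
```

For example, calling `check_zero_shot_params(20, voice_name="English-US-Pflow-Beta.Female.neutral", audio_prompt_file="prompt.wav")` returns `"audio_prompt"`, because the prompt takes precedence.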

Make a gRPC request to the Riva server#

For batch inference mode, use synthesize. Results are returned when the entire audio is synthesized.

req["text"] = "Is it recognize speech or wreck a nice beach?"
resp = riva_tts.synthesize(**req)
audio_samples = np.frombuffer(resp.audio, dtype=np.int16)
ipd.Audio(audio_samples, rate=sample_rate_hz)
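In a script rather than a notebook, the synthesized audio can be written to disk instead of played inline. A minimal sketch using Python's standard `wave` module follows. It assumes that `resp.audio` holds raw 16-bit mono LINEAR_PCM samples; the mono assumption and the `save_pcm_as_wav` helper name are illustrative and not taken from the Riva documentation.

```python
import wave


def save_pcm_as_wav(path: str, pcm_bytes: bytes, sample_rate_hz: int,
                    channels: int = 1) -> None:
    """Wrap raw 16-bit PCM bytes in a WAV container."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(sample_rate_hz)
        wav_file.writeframes(pcm_bytes)


# Usage with the batch response (names from the example above):
# save_pcm_as_wav("output.wav", resp.audio, sample_rate_hz)
```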

For online inference, use synthesize_online. Results are returned in chunks as they are synthesized.

req["text"] = "Is it recognize speech or wreck a nice beach?"
resp = riva_tts.synthesize_online(**req)
full_audio = np.array([])
for i, rep in enumerate(resp):
    # Convert 16-bit PCM samples to floats in [-1.0, 1.0)
    audio_samples = np.frombuffer(rep.audio, dtype=np.int16) / (2**15)
    print("Chunk: ", i)
    ipd.display(ipd.Audio(audio_samples, rate=sample_rate_hz))
    # Accumulate the chunks to play back the complete utterance afterwards
    full_audio = np.concatenate((full_audio, audio_samples))

print("Final synthesis:")
ipd.display(ipd.Audio(full_audio, rate=sample_rate_hz))
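Because chunks arrive as they are synthesized, streaming is mainly useful for reducing the time until the first audio is available. The following sketch measures that. It works with any iterable of objects that have an `audio` bytes attribute, as the streaming responses above do. `stream_stats` is a hypothetical helper, latency is measured from the moment iteration starts, and the duration calculation assumes 16-bit mono PCM.

```python
import time
from typing import Iterable, NamedTuple


class StreamStats(NamedTuple):
    first_chunk_latency_s: float  # time until the first chunk arrived
    num_chunks: int
    audio_duration_s: float       # length of the synthesized audio


def stream_stats(responses: Iterable, sample_rate_hz: int) -> StreamStats:
    """Consume a stream of chunks and summarize latency and audio length."""
    start = time.perf_counter()
    first_latency = float("nan")  # stays NaN if the stream yields nothing
    num_chunks = 0
    total_bytes = 0
    for rep in responses:
        if num_chunks == 0:
            first_latency = time.perf_counter() - start
        num_chunks += 1
        total_bytes += len(rep.audio)
    # Assumes 16-bit mono PCM: 2 bytes per sample.
    duration = total_bytes / (2 * sample_rate_hz)
    return StreamStats(first_latency, num_chunks, duration)


# Usage (names from the example above):
# stats = stream_stats(riva_tts.synthesize_online(**req), sample_rate_hz)
```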

Binary Client Example#

Binary clients supported in the Docker image can also be used as follows:

Binary TTS Client Example#

riva_tts_client --text="I had a dream yesterday." --audio_file=/opt/riva/wav/output.wav --zero_shot_audio_prompt=<Path to the audio file> --zero_shot_quality=20

Alternatively, use the voice_name parameter to select any of the OOTB voices. If both a voice name and a speech prompt are passed, the speech prompt is used.

riva_tts_client --text="I had a dream yesterday." --audio_file=/opt/riva/wav/output.wav --voice_name="English-US-Pflow-Beta.Female.neutral"
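When the binary client is driven from a script, the command line can be assembled programmatically. The sketch below uses only the flags shown in the commands above. `build_tts_client_cmd` is a hypothetical helper, and as a design choice it omits `--voice_name` when a speech prompt is given, since the prompt would take precedence anyway.

```python
from typing import List, Optional


def build_tts_client_cmd(text: str, audio_file: str,
                         voice_name: Optional[str] = None,
                         audio_prompt: Optional[str] = None,
                         quality: Optional[int] = None) -> List[str]:
    """Build an argument list for riva_tts_client."""
    cmd = ["riva_tts_client", f"--text={text}", f"--audio_file={audio_file}"]
    if audio_prompt:
        # The speech prompt takes precedence, so the voice name is not passed.
        cmd.append(f"--zero_shot_audio_prompt={audio_prompt}")
    elif voice_name:
        cmd.append(f"--voice_name={voice_name}")
    if quality is not None:
        cmd.append(f"--zero_shot_quality={quality}")
    return cmd


# Usage (requires the Riva client binaries on PATH):
# import subprocess
# subprocess.run(build_tts_client_cmd("I had a dream yesterday.",
#                                     "/opt/riva/wav/output.wav",
#                                     voice_name="English-US-Pflow-Beta.Female.neutral"),
#                check=True)
```

Passing the arguments as a list avoids shell quoting issues with text that contains spaces or punctuation.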

Binary TTS Performance Client Example#

The binary TTS performance client applies the same speech prompt to all input queries in the file passed with --text_file.

riva_tts_perf_client --text_file=/work/test_files/tts/ljs_audio_text_test_filelist_small.txt --voice_name="English-US-Pflow-Beta.Female.neutral"