TTS Zero Shot#
Riva brings zero-shot TTS capabilities with P-Flow, a fast and efficient flow-based TTS model that can adapt to a new voice from as little as 3 seconds of audio. P-Flow uses a speech-prompted text encoder for speaker adaptation and a flow-matching generative decoder for high-quality, fast speech synthesis.
Note
The zero-shot Riva TTS model is currently in limited early access.
OOTB Voices#
| Language | Model | Dataset | G2P | Gender | Voices |
|---|---|---|---|---|---|
| en-US | P-Flow HiFi-GAN | English-US | IPA | Multi-speaker | |
Setting up the TTS Python API parameters#
sample_rate_hz = 44100
req = {
    "language_code": "en-US",
    "encoding": riva.client.AudioEncoding.LINEAR_PCM,  # LINEAR_PCM and OGGOPUS encodings are supported
    "sample_rate_hz": sample_rate_hz,  # Generate 44.1 kHz audio
    "voice_name": "English-US-Pflow.Female.neutral",  # The name of the voice to generate
    "audio_prompt_encoding": riva.client.AudioEncoding.LINEAR_PCM,  # LINEAR_PCM and OGGOPUS encodings are supported
    "quality": 20,  # Number of decoder iterations while generating mels
    "audio_prompt_file": ""  # Path to the file containing the speech prompt
}
Understanding TTS Python API parameters#
Riva TTS supports a number of options while making a text-to-speech request to the gRPC endpoint, as shown above. Let’s learn more about these parameters:
language_code
- Language code of the generated audio.

encoding
- Type of audio encoding to generate. LINEAR_PCM and OGGOPUS encodings are supported.

sample_rate_hz
- Sample rate of the generated audio. Depends on the output audio device and is usually 22.05 kHz or 44.1 kHz.

voice_name
- Voice used to synthesize the audio. Currently, Riva offers two OOTB default voices with emotions.

audio_prompt_encoding
- Type of audio encoding of the speech prompt. LINEAR_PCM and OGGOPUS encodings are supported.

quality
- Number of decoder iterations for the mel generation, in the range 1-40.

audio_prompt_file
- Path to the audio file containing the speech prompt. If both a voice name and a speech prompt are passed, the speech prompt is used.
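The quality range and the prompt-versus-voice precedence rule above can be sketched in plain Python. The helper names here are illustrative only and are not part of the Riva client API:

```python
def resolve_voice_source(req):
    """Illustrative helper mirroring the documented precedence rule:
    if both "voice_name" and a non-empty "audio_prompt_file" are set,
    the speech prompt wins; otherwise the named voice is used."""
    if req.get("audio_prompt_file"):
        return ("speech_prompt", req["audio_prompt_file"])
    return ("voice_name", req.get("voice_name", ""))

def validate_quality(quality):
    """Illustrative check: quality must lie in the documented 1-40 range."""
    if not 1 <= quality <= 40:
        raise ValueError(f"quality must be in [1, 40], got {quality}")
    return quality

# A request with both fields set resolves to the speech prompt:
req = {"voice_name": "English-US-Pflow.Female.neutral",
       "audio_prompt_file": "prompt.wav", "quality": 20}
print(resolve_voice_source(req))       # ('speech_prompt', 'prompt.wav')
print(validate_quality(req["quality"]))  # 20
```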
Make a gRPC request to the Riva server#
For batch inference mode, use synthesize. Results are returned when the entire audio is synthesized.
import numpy as np
import IPython.display as ipd

# riva_tts is an instance of riva.client.SpeechSynthesisService
req["text"] = "Is it recognize speech or wreck a nice beach?"
resp = riva_tts.synthesize(**req)
audio_samples = np.frombuffer(resp.audio, dtype=np.int16)
ipd.Audio(audio_samples, rate=sample_rate_hz)
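Outside a notebook, the raw 16-bit PCM bytes in `resp.audio` can be written straight to a WAV file with the standard library. This sketch substitutes a synthetic sine tone for `resp.audio`, and the `write_wav` helper name is ours, not part of the Riva client:

```python
import math
import struct
import wave

def write_wav(path, pcm_bytes, sample_rate_hz=44100):
    """Write mono 16-bit little-endian PCM bytes to a WAV file."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)   # mono
        wf.setsampwidth(2)   # 16-bit samples
        wf.setframerate(sample_rate_hz)
        wf.writeframes(pcm_bytes)

# Stand-in for resp.audio: 0.1 s of a 440 Hz tone as int16 PCM.
sample_rate_hz = 44100
samples = [int(0.3 * 32767 * math.sin(2 * math.pi * 440 * n / sample_rate_hz))
           for n in range(sample_rate_hz // 10)]
pcm = struct.pack("<%dh" % len(samples), *samples)
write_wav("output.wav", pcm, sample_rate_hz)
```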
For online inference, use synthesize_online. Results are returned in chunks as they are synthesized.
req["text"] = "Is it recognize speech or wreck a nice beach?"
resp = riva_tts.synthesize_online(**req)
empty = np.array([])
for i, rep in enumerate(resp):
    audio_samples = np.frombuffer(rep.audio, dtype=np.int16) / (2**15)
    print("Chunk: ", i)
    ipd.display(ipd.Audio(audio_samples, rate=44100))
    empty = np.concatenate((empty, audio_samples))
print("Final synthesis:")
ipd.display(ipd.Audio(empty, rate=44100))
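The division by `2**15` in the loop converts 16-bit PCM samples to floats in [-1.0, 1.0). The same scaling can be demonstrated with only the standard library (the `pcm16_to_float` helper is illustrative, not part of the Riva client):

```python
import struct

def pcm16_to_float(pcm_bytes):
    """Convert little-endian int16 PCM bytes to floats in [-1.0, 1.0)."""
    n = len(pcm_bytes) // 2
    ints = struct.unpack("<%dh" % n, pcm_bytes)
    return [s / 2**15 for s in ints]

chunk = struct.pack("<4h", 0, 16384, -32768, 32767)
print(pcm16_to_float(chunk))  # [0.0, 0.5, -1.0, 0.999969482421875]
```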
Binary Client Example#
The binary clients shipped in the Docker image can also be used as follows:
Binary TTS Client Example#
riva_tts_client --text="I had a dream yesterday." --audio_file=/opt/riva/wav/output.wav --zero_shot_audio_prompt=<Path to the audio file> --zero_shot_quality=20
Alternatively, use the voice_name parameter to select one of the OOTB voices. If both a voice name and a speech prompt are passed, the speech prompt is used.
riva_tts_client --text="I had a dream yesterday." --audio_file=/opt/riva/wav/output.wav --voice_name="English-US-Pflow.Female.neutral"
Binary TTS Performance Client Example#
The binary TTS performance client applies the same speech prompt to all input queries in the text_file:
riva_tts_perf_client --text_file=/work/test_files/tts/ljs_audio_text_test_filelist_small.txt --voice_name="English-US-Pflow.Female.neutral"