How do I use Riva TTS APIs with out-of-the-box models?#

This tutorial walks you through the basics of Riva’s TTS services, specifically covering how to use Riva TTS APIs with OOTB (out-of-the-box) models.

NVIDIA Riva Overview#

NVIDIA Riva is a GPU-accelerated SDK for building Speech AI applications that are customized for your use case and deliver real-time performance.
Riva offers a rich set of speech and natural language understanding services such as:

  • Automated speech recognition (ASR)

  • Text-to-Speech synthesis (TTS)

  • A collection of natural language processing (NLP) services, such as named entity recognition (NER), punctuation, and intent classification.

In this tutorial, we will interact with the text-to-speech synthesis (TTS) APIs and customize Riva TTS audio output with SSML.

For more information about Riva, please refer to the Riva developer website.

Basics: Generating Speech with Riva TTS APIs#

The Riva TTS service is based on a two-stage pipeline: the first model generates a mel spectrogram from the input text, and the second model generates speech audio from that mel spectrogram. This pipeline forms a text-to-speech system that enables you to synthesize natural-sounding speech from raw transcripts without any additional information such as patterns or rhythms of speech.

Riva provides two state-of-the-art voices (one male and one female) for English, which can easily be deployed with the Riva Quick Start scripts. Riva also supports easy customization of TTS in various ways to meet your specific needs.
Subsequent Riva releases will include features such as model registration to support multiple languages/voices with the same API, and support for resampling to alternative sampling rates.
Refer to the Riva TTS documentation for more information.

Now, let’s generate audio using Riva APIs with an OOTB English TTS pipeline.

Requirements and setup#

  1. Start the Riva server. Follow the instructions in the Riva Quick Start Guide to deploy OOTB TTS models on the Riva server before running this tutorial. By default, only the English models are deployed.

  2. Install the additional Python libraries to run this tutorial. Run the following commands to install the libraries:

# We need numpy to read the output from the Riva TTS request.
!pip install numpy
!pip install nvidia-riva-client

Import Riva client libraries#

We first import some required libraries, including the Riva client libraries.

import numpy as np
import IPython.display as ipd
import riva.client

Create Riva clients and connect to the Riva server#

The following URI assumes a local deployment of the Riva server on the default port. If the server is deployed on a different host, or through a Helm chart on Kubernetes, use the appropriate URI.

auth = riva.client.Auth(uri='localhost:50051')

riva_tts = riva.client.SpeechSynthesisService(auth)
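If your Riva server is deployed with TLS enabled, the connection can be secured through the same Auth object. Below is a minimal sketch, assuming the use_ssl and ssl_cert parameters of riva.client.Auth from the nvidia-riva-client package; the host and certificate path are placeholders:

# Hypothetical secured deployment; replace the URI and certificate path with your own.
auth = riva.client.Auth(
    uri='riva.example.com:50051',
    use_ssl=True,                # use a TLS channel instead of an insecure one
    ssl_cert='/path/to/ca.crt',  # CA certificate used to verify the server
)
riva_tts = riva.client.SpeechSynthesisService(auth)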

TTS modes#

Riva TTS supports both streaming and batch inference modes. In batch mode, audio is not returned until the full audio sequence for the requested text has been generated, which can achieve higher throughput. With a streaming request, audio chunks are returned as soon as they are generated, significantly reducing the latency (as measured by time to first audio) for large requests.

Set up the TTS API parameters#

sample_rate_hz = 44100
req = {
        "language_code"  : "en-US",
        "encoding"       : riva.client.AudioEncoding.LINEAR_PCM,    # LINEAR_PCM and OGGOPUS encodings are supported
        "sample_rate_hz" : sample_rate_hz,                          # Generate 44.1 kHz audio
        "voice_name"     : "English-US.Female-1"                    # The name of the voice to generate
}

Understanding TTS API parameters#

Riva TTS supports a number of options while making a text-to-speech request to the gRPC endpoint, as shown above. Let’s learn more about these parameters:

  • language_code - Language of the generated audio. en-US represents English (US) and is currently the only language supported OOTB.

  • encoding - Type of audio encoding to generate. LINEAR_PCM and OGGOPUS encodings are supported.

  • sample_rate_hz - Sample rate of the generated audio. Typical values are 22050 Hz and 44100 Hz; higher rates give higher-fidelity output.

  • voice_name - Voice used to synthesize the audio. Currently, Riva offers two OOTB voices (English-US.Female-1, English-US.Male-1).
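These parameters can be combined differently per request. For instance, here is a hypothetical variant of the request above that asks for Opus-encoded audio from the male voice at 22.05 kHz (the encoding and voice names come from the list above):

# Hypothetical variant request: Opus-encoded audio from the OOTB male voice.
req_male = {
        "language_code"  : "en-US",
        "encoding"       : riva.client.AudioEncoding.OGGOPUS,       # the other supported encoding
        "sample_rate_hz" : 22050,                                   # Generate 22.05 kHz audio
        "voice_name"     : "English-US.Male-1"                      # the OOTB male voice
}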

Make a gRPC request to the Riva server#

For batch inference mode, use synthesize. Results are returned when the entire audio is synthesized.

req["text"] = "Is it recognize speech or wreck a nice beach?"
resp = riva_tts.synthesize(**req)
audio_samples = np.frombuffer(resp.audio, dtype=np.int16)
ipd.Audio(audio_samples, rate=sample_rate_hz)
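Because resp.audio holds the raw 16-bit LINEAR_PCM bytes, you can also persist the result outside the notebook. Here is a minimal sketch using Python's standard wave module (the output filename is arbitrary, and this only applies to the LINEAR_PCM encoding used above):

import wave

# Write the raw LINEAR_PCM bytes from the response to a mono, 16-bit WAV file.
with wave.open("riva_tts_output.wav", "wb") as out_wav:
    out_wav.setnchannels(1)               # Riva TTS returns single-channel audio
    out_wav.setsampwidth(2)               # 16-bit (2-byte) samples for LINEAR_PCM
    out_wav.setframerate(sample_rate_hz)  # must match the sample rate requested above
    out_wav.writeframes(resp.audio)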

For online inference, use synthesize_online. Results are returned in chunks as they are synthesized.

req["text"] = "Is it recognize speech or wreck a nice beach?"
resp = riva_tts.synthesize_online(**req)
empty = np.array([])
for i, rep in enumerate(resp):
    audio_samples = np.frombuffer(rep.audio, dtype=np.int16) / (2**15)
    print("Chunk: ",i)
    ipd.display(ipd.Audio(audio_samples, rate=44100))
    empty = np.concatenate((empty, audio_samples))

print("Final synthesis:")
ipd.display(ipd.Audio(empty, rate=44100))
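To observe the latency difference between the two modes described earlier, you can time both calls on the same text. The following is a rough sketch using Python's standard time module; the absolute numbers depend on your hardware and deployment:

import time

req["text"] = "Is it recognize speech or wreck a nice beach?"

# Batch: a single response, so time to first audio equals total time.
start = time.time()
resp = riva_tts.synthesize(**req)
print(f"Batch total time: {time.time() - start:.3f}s")

# Streaming: the first chunk typically arrives well before synthesis finishes.
start = time.time()
first_chunk_latency = None
for chunk in riva_tts.synthesize_online(**req):
    if first_chunk_latency is None:
        first_chunk_latency = time.time() - start
print(f"Streaming time to first audio: {first_chunk_latency:.3f}s, total: {time.time() - start:.3f}s")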

Customizing Riva TTS audio output with SSML#

The Speech Synthesis Markup Language (SSML) specification defines a markup for directing the performance of the virtual speaker. Riva supports portions of SSML, allowing you to adjust the pitch, rate, and pronunciation of the generated audio.

All SSML inputs must be valid XML documents and use the <speak> root tag. All invalid XML, and all valid XML with a different root tag, is treated as raw input text.
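For example, both of the following strings can be assigned to req["text"], but only the first is interpreted as SSML:

# Well-formed XML with the <speak> root tag: parsed as SSML markup.
req["text"] = "<speak>Hello <prosody rate='150%'>world</prosody></speak>"

# No <speak> root tag: the whole string is treated as raw input text,
# so the markup is synthesized as ordinary text rather than interpreted.
req["text"] = "<prosody rate='150%'>Hello world</prosody>"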

Riva TTS supports the following SSML tags:

  • The prosody tag, which supports attributes rate, pitch, and volume, through which we can control the rate, pitch, and volume of the generated audio.

  • The phoneme tag, which allows us to control the pronunciation of the generated audio.

  • The sub tag, which allows us to replace the pronunciation of the specified word or phrase with a different word or phrase.

Let’s look at customization of Riva TTS with these SSML tags in some detail.

Customizing rate, pitch, and volume with the prosody tag#

Pitch Attribute#

Riva supports an additive relative change to the pitch. The pitch attribute has a range of [-3, 3] or [-150, 150] Hz. Values outside this range result in an error being logged and no audio returned.

When using an absolute value that doesn't end in Hz, the pitch is shifted by that value multiplied by the speaker's pitch standard deviation, as defined in the model configs. For the pretrained checkpoint trained on LJSpeech, the standard deviation was 52.185 Hz. For example, a pitch shift of 1.25 results in an upward pitch shift of 1.25 × 52.185 ≈ 65.23 Hz.

Riva also supports the following tags as per the SSML specs: x-low, low, medium, high, x-high, and default.

The pitch attribute is expressed in the following formats:

  • pitch="1"

  • pitch="95hZ"

  • pitch="+1.8"

  • pitch="-0.65"

  • pitch="+75Hz"

  • pitch="-84.5Hz"

  • pitch="high"

  • pitch="default"

For the pretrained Female-1 checkpoint, the standard deviation is 53.33 Hz. For the pretrained Male-1 checkpoint, the standard deviation is 47.15 Hz.

The pitch attribute does not support st and % changes.

Pitch is handled differently in FastPitch compared to RadTTS. While both models accept both pitch formats, internally, FastPitch uses normalized pitch and RadTTS uses unnormalized pitch. If a TTS request uses a RadTTS model and the pitch attribute was supplied in the [-3, 3] format, Riva converts it, using the model's pitch standard deviation, into an unnormalized pitch shift. If a TTS request uses a FastPitch model and the pitch attribute was supplied in the [-150, 150] Hz format, Riva converts it, using the model's pitch standard deviation, into a normalized pitch shift. In cases where Riva cannot determine the pitch standard deviation from the NeMo model config, a default value of 59.02 Hz is used.
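The conversion itself is simple arithmetic. The following is a rough illustration of the behavior described above, using the standard-deviation values quoted in this section (not Riva's internal code):

# Pitch standard deviations quoted above, in Hz.
FEMALE_1_STD_HZ = 53.33
MALE_1_STD_HZ   = 47.15
DEFAULT_STD_HZ  = 59.02   # used when the NeMo model config does not define one

def normalized_to_hz(shift, std_hz):
    # Convert a [-3, 3] normalized pitch shift to Hz (the direction used for RadTTS).
    return shift * std_hz

def hz_to_normalized(shift_hz, std_hz):
    # Convert a [-150, 150] Hz pitch shift to normalized units (the direction used for FastPitch).
    return shift_hz / std_hz

print(normalized_to_hz(1.25, 52.185))   # ~65.23 Hz, matching the LJSpeech example above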

Rate Attribute#

Riva supports a percentage relative change to the rate. The rate attribute has a range of [25%, 250%]. Values outside this range result in an error being logged and no audio returned. Riva also supports the following tags as per the SSML specs: x-low, low, medium, high, x-high, and default.

The rate attribute is expressed in the following formats:

  • rate="35%"

  • rate="+200%"

  • rate="low"

  • rate="default"

Volume Attribute#

Riva supports the volume attribute as described in the SSML specs. The volume attribute supports a range of [-13, 8] dB. Values outside this range result in an error being logged and no audio returned. The tags silent, x-soft, soft, medium, loud, x-loud, and default are supported.

The volume attribute is expressed in the following formats:

  • volume="+1dB"

  • volume="-5.7dB"

  • volume="x-loud"

  • volume="default"

Emotion Attribute#

Riva supports emotion mixing in beta with the emotion attribute as described in the SSML specs. The emotion attribute overrides the default subvoice emotion in the request and supports a floating-point mixing weight in the range [0.0, 1.0]. The mixing weight tags xlow, low, medium, very, and extreme are also supported. Currently, emotion mixing is only supported by the RadTTS++ model.

When an emotion is selected, it is mixed with neutral according to the specified weight, which quantizes the emotion. For example, happy with a mixing weight of 0.5 is extreme happy mixed with neutral in a 1:1 ratio, yielding happy:0.5.

The emotion attribute is expressed in the following formats:

  • emotion="sad:1.0,fearful:0.7"

  • emotion="happy:extreme,calm:low"

Let’s look at an example showing the pitch, rate, and volume customizations for Riva TTS:

"""
    Raw text is "Today is a sunny day. But it might rain tomorrow."
    We are updating this raw text with SSML:
    1. Envelope raw text in '<speak>' tags as is required for SSML
    2. Add '<prosody>' tag with 'pitch' attribute set to '2.5'
    3. Add '<prosody>' tag with 'rate' attribute set to 'high'
    4. Add '<volume>' tag with 'volume' attribute set to '+1dB'
"""
raw_text = "Today is a sunny day. But it might rain tomorrow."
ssml_text = """<speak><prosody pitch='2.5'>Today is a sunny day</prosody>. <prosody rate='high' volume='+1dB'>But it might rain tomorrow.</prosody></speak>"""
print("Raw Text: ", raw_text)
print("SSML Text: ", ssml_text)


req["text"] = ssml_text
# Request to Riva TTS to synthesize audio
resp = riva_tts.synthesize(**req)

# Playing the generated audio from Riva TTS request
audio_samples = np.frombuffer(resp.audio, dtype=np.int16)
ipd.display(ipd.Audio(audio_samples, rate=sample_rate_hz))

Expected results if you run the tutorial:#

<prosody pitch='2.5'>Today is a sunny day</prosody>. <prosody rate='high' volume='+1dB'>But it might rain tomorrow.</prosody>

Note#

If the audio controls are not visible throughout the notebook, open the notebook in GitHub dev or view it in the Riva docs.

Here are more examples showing the effects of changes in pitch, rate, volume, and emotion attribute values on the generated audio:

# SSML texts we want to try
ssml_texts = [
  """<speak>This is a normal sentence</speak>""",
  """<speak><prosody pitch="0." rate="100%">This is also a normal sentence</prosody></speak>""",
  """<speak><prosody rate="200%">This is a fast sentence</prosody></speak>""",
  """<speak><prosody rate="60%">This is a slow sentence</prosody></speak>""",
  """<speak><prosody pitch="+1.0">Now, I'm speaking a bit higher</prosody></speak>""",
  """<speak><prosody pitch="-0.5">And now, I'm speaking a bit lower</prosody></speak>""",
  """<speak>S S M L supports <prosody pitch="-1">nested tags. So I can speak <prosody rate="150%">faster</prosody>, <prosody rate="75%">or slower</prosody>, as desired.</prosody></speak>""",
  """<speak><prosody volume='x-soft'>I'm speaking softly.</prosody><prosody volume='x-loud'> And now, This is loud.</prosody></speak>""",
]

# Loop through the 'ssml_texts' list and synthesize audio with Riva TTS for each entry in 'ssml_texts'
for ssml_text in ssml_texts:
    req["text"] = ssml_text
    resp = riva_tts.synthesize(**req)
    audio_samples = np.frombuffer(resp.audio, dtype=np.int16)
    print("SSML Text: ", ssml_text)
    ipd.display(ipd.Audio(audio_samples, rate=sample_rate_hz))
    print("--------------------------------------------")

Expected results if you run the tutorial:#

This is a normal sentence

<prosody pitch="0." rate="100%">This is also a normal sentence</prosody>

<prosody rate="200%">This is a fast sentence</prosody>

<prosody rate="60%">This is a slow sentence</prosody>

<prosody pitch="+1.0">Now, I'm speaking a bit higher</prosody>

<prosody pitch="-0.5">And now, I'm speaking a bit lower</prosody>

S S M L supports <prosody pitch="-1">nested tags. So I can speak <prosody rate="150%">faster</prosody>, <prosody rate="75%">or slower</prosody>, as desired.</prosody>

<prosody volume='x-soft'>I'm speaking softly.</prosody><prosody volume='x-loud'> And now, This is loud.</prosody>

# Note: This code segment uses the beta RadTTS++ model, which supports emotion mixing. With other models, the emotion attribute is ignored except when set via voice_name.

req_emotion = {
        "language_code"  : "en-US",
        "encoding"       : riva.client.AudioEncoding.LINEAR_PCM,    # LINEAR_PCM and OGGOPUS encodings are supported
        "sample_rate_hz" : sample_rate_hz,                          # Generate 44.1 kHz audio
        "voice_name"     : "English-US-RadTTSpp.Male.happy"         # The name of the voice to generate
}

ssml_text="""<speak> I am happy.<prosody emotion="sad:very"> And now, I am sad.</prosody><prosody emotion="angry:extreme"> This makes me angry.</prosody><prosody emotion="calm:extreme"> And now, I am calm.</prosody></speak>"""
print("SSML Text: ", ssml_text)


req_emotion["text"] = ssml_text
# Request to Riva TTS to synthesize audio
resp = riva_tts.synthesize(**req_emotion)

# Playing the generated audio from Riva TTS request
audio_samples = np.frombuffer(resp.audio, dtype=np.int16)
ipd.display(ipd.Audio(audio_samples, rate=sample_rate_hz))

Expected results if you run the tutorial:#

I am happy.<prosody emotion="sad:very"> And now, I am sad.</prosody><prosody emotion="angry:extreme"> This makes me angry.</prosody><prosody emotion="calm:extreme"> And now, I am calm.</prosody>

Customizing pronunciation with the phoneme tag#

We can use the phoneme tag to override the pronunciation of words from the predicted pronunciation. For a given word or sequence of words, use the ph attribute to provide an explicit pronunciation, and the alphabet attribute to provide the phone set.

Starting with the Riva 2.8.0 release, ipa will be the only supported pronunciation alphabet for TTS models. Older Riva models only support x-arpabet.

IPA#

For the full list of supported ipa phonemes, refer to the Riva TTS Phoneme Support page.

Arpabet#

The full list of phonemes in the CMUdict can be found at cmudict.phone. The list of supported symbols with stress can be found at cmudict.symbols. For a mapping of these phones to English sounds, refer to the ARPABET Wikipedia page.

Let’s look at an example showing this custom pronunciation for Riva TTS:

# Setting up Riva TTS request with SynthesizeSpeechRequest
"""
    Raw text is "You say tomato, I say tomato."
    We are updating this raw text with SSML:
    1. Envelope the raw text in '<speak>' tags, as is required for SSML
    2. For each occurrence of "tomato", add a '<phoneme>' tag with the 'alphabet' attribute set to 'ipa'
       and the 'ph' attribute set to a custom pronunciation (an equivalent 'x-arpabet' version for older
       models is shown commented out below)

"""
raw_text = "You say tomato, I say tomato."
ssml_text = '<speak>You say <phoneme alphabet="ipa" ph="təˈmeɪˌtoʊ">tomato</phoneme>, I say <phoneme alphabet="ipa" ph="təˈmɑˌtoʊ">tomato</phoneme>.</speak>'
# Older arpabet version
# ssml_text = '<speak>You say <phoneme alphabet="x-arpabet" ph="{@T}{@AH0}{@M}{@EY1}{@T}{@OW2}">tomato</phoneme>, I say <phoneme alphabet="x-arpabet" ph="{@T}{@AH0}{@M}{@AA1}{@T}{@OW2}">tomato</phoneme>.</speak>'

print("Raw Text: ", raw_text)
print("SSML Text: ", ssml_text)

req["text"] = ssml_text
# Request to Riva TTS to synthesize audio
resp = riva_tts.synthesize(**req)

# Playing the generated audio from Riva TTS request
audio_samples = np.frombuffer(resp.audio, dtype=np.int16)
ipd.display(ipd.Audio(audio_samples, rate=sample_rate_hz))

Expected results if you run the tutorial:#

You say <phoneme alphabet="ipa" ph="təˈmeɪˌtoʊ">tomato</phoneme>, I say <phoneme alphabet="ipa" ph="təˈmɑˌtoʊ">tomato</phoneme>.

Here are more examples showing the customization of pronunciation in generated audio:

# SSML texts we want to try
ssml_texts = [
  """<speak>You say <phoneme ph="ˈdeɪtə">data</phoneme>, I say <phoneme ph="ˈdætə">data</phoneme>.</speak>""",
  """<speak>Some people say <phoneme ph="ˈɹut">route</phoneme> and some say <phoneme ph="ˈɹaʊt">route</phoneme>.</speak>""",
]
# Older arpabet version
# ssml_texts = [
#   """<speak>You say <phoneme alphabet="x-arpabet" ph="{@D}{@EY1}{@T}{@AH0}">data</phoneme>, I say <phoneme alphabet="x-arpabet" ph="{@D}{@AE1}{@T}{@AH0}">data</phoneme>.</speak>""",
#   """<speak>Some people say <phoneme alphabet="x-arpabet" ph="{@R}{@UW1}{@T}">route</phoneme> and some say <phoneme alphabet="x-arpabet" ph="{@R}{@AW1}{@T}">route</phoneme>.</speak>""",
# ]

# Loop through the 'ssml_texts' list and synthesize audio with Riva TTS for each entry in 'ssml_texts'
for ssml_text in ssml_texts:
    req["text"] = ssml_text
    resp = riva_tts.synthesize(**req)
    audio_samples = np.frombuffer(resp.audio, dtype=np.int16)
    print("SSML Text: ", ssml_text)
    ipd.display(ipd.Audio(audio_samples, rate=sample_rate_hz))
    print("--------------------------------------------")

Expected results if you run the tutorial:#

You say <phoneme ph="ˈdeɪtə">data</phoneme>, I say <phoneme ph="ˈdætə">data</phoneme>.

Some people say <phoneme ph="ˈɹut">route</phoneme> and some say <phoneme ph="ˈɹaʊt">route</phoneme>.

Replacing pronunciation with the sub tag#

We can use the sub tag to replace the pronunciation of the specified word or phrase with a different word or phrase. You specify the substitute pronunciation with the alias attribute.

# Setting up Riva TTS request with SynthesizeSpeechRequest
"""
    Raw text is "WWW is known as the web."
    We are updating this raw text with SSML:
    1. Envelope the raw text in '<speak>' tags, as is required for SSML
    2. Add a '<sub>' tag with the 'alias' attribute set to replace WWW with 'World Wide Web'

"""
raw_text = "WWW is known as the web."
ssml_text = '<speak><sub alias="World Wide Web">WWW</sub> is known as the web.</speak>'

print("Raw Text: ", raw_text)
print("SSML Text: ", ssml_text)

req["text"] = ssml_text
# Request to Riva TTS to synthesize audio
resp = riva_tts.synthesize(**req)
# Playing the generated audio from Riva TTS request
audio_samples = np.frombuffer(resp.audio, dtype=np.int16)
ipd.display(ipd.Audio(audio_samples, rate=sample_rate_hz))

Expected results if you run the tutorial:#

<sub alias="World Wide Web">WWW</sub> is known as the web.

Emphasize words with the emphasis tag#

Use the emphasis tag to emphasize words. Use the riva-build command with the enable_emphasis_tag, start_of_emphasis_token, and end_of_emphasis_token tags to enable the emphasis feature. The emphasis tag should be used on a per-word basis. If a word ends with punctuation, only the word is emphasized, not the punctuation.

Limitation#

The emphasis tag is training-data dependent and is available only in the English-US model. Models trained without the emphasis tag in the training data will not produce emphasized speech. Input text containing more than one word wrapped by the emphasis tag is invalid input. Whitespace wrapped inside the emphasis tag is also invalid input.

Warning: The emphasis tag feature does not support nesting of other SSML tags inside it. The emphasis tag does not support the level attribute.

# Setting up Riva TTS request with SynthesizeSpeechRequest
"""
    Raw texts are "I would love to try that." and "Wow! That's really cool."
    We are updating these raw texts with SSML:
    1. Envelope each raw text in '<speak>' tags, as is required for SSML
    2. Add an '<emphasis>' tag around the word to emphasize

"""
ssml_texts = [
   """<speak>I would <emphasis>love</emphasis> to try that.</speak>""",
   """<speak><emphasis>Wow!</emphasis> That's really cool.</speak>"""
]

for ssml_text in ssml_texts:
    req["text"] = ssml_text
    resp = riva_tts.synthesize(**req)
    # Playing the generated audio from Riva TTS request
    audio_samples = np.frombuffer(resp.audio, dtype=np.int16)
    print("SSML Text: ", ssml_text)
    ipd.display(ipd.Audio(audio_samples, rate=sample_rate_hz))
    print("--------------------------------------------")

Expected results if you run the tutorial:#

I would <emphasis>love</emphasis> to try that.

<emphasis>Wow!</emphasis> That's really cool.