How do I customize Riva TTS audio output with SSML?#

This tutorial walks you through some of the advanced features for customizing Riva TTS audio output with Speech Synthesis Markup Language (SSML).

NVIDIA Riva Overview#

NVIDIA Riva is a GPU-accelerated SDK for building Speech AI applications that are customized for your use case and deliver real-time performance.
Riva offers a rich set of speech and natural language understanding services such as:

  • Automated speech recognition (ASR)

  • Text-to-Speech synthesis (TTS)

  • A collection of natural language processing (NLP) services, such as named entity recognition (NER), punctuation, and intent classification.

In this tutorial, we will customize Riva TTS audio output with SSML.
To understand the basics of Riva TTS APIs, refer to How do I use Riva TTS APIs with out-of-the-box models?.

For more information about Riva, refer to the Riva developer documentation.

Customizing Riva TTS audio output with SSML#

Speech Synthesis Markup Language (SSML) is a markup language for directing the performance of the virtual speaker. Riva supports portions of SSML, allowing you to adjust the pitch, rate, and pronunciation of the generated audio.
SSML support is available only for the FastPitch model at this time. The FastPitch model must be exported using NeMo>=1.5.1 and the nemo2riva>=1.8.0 tool.

All SSML inputs must be valid XML documents that use the <speak> root tag. Input that is not valid XML, or that is valid XML with a different root tag, is treated as raw input text.
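
Because input that fails this rule is treated as raw text rather than rejected, it can help to check a string on the client before sending it. The following is a minimal sketch using Python's standard `xml.etree.ElementTree` module. The helper name `is_valid_ssml` is hypothetical and is not part of the Riva client library; it checks only that the input is well-formed XML with a `speak` root, not that Riva supports every tag inside it.

```python
import xml.etree.ElementTree as ET


def is_valid_ssml(text: str) -> bool:
    """Return True if `text` is well-formed XML whose root element is <speak>.

    Hypothetical client-side pre-check; not part of the Riva client library.
    """
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        # Not well-formed XML (unclosed tag, plain text, empty string, ...)
        return False
    return root.tag == "speak"


print(is_valid_ssml("<speak>Hello there.</speak>"))  # well-formed, <speak> root
print(is_valid_ssml("<speak>Hello there."))          # missing closing tag
print(is_valid_ssml("<p>Hello there.</p>"))          # different root tag
```

Running a check like this before each request makes it easier to notice when the server would read the markup aloud as plain text.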

Riva TTS supports two SSML tags:

  • The prosody tag, which supports two attributes, rate and pitch, for controlling the rate and pitch of the generated audio.

  • The phoneme tag, which allows us to control the pronunciation of the generated audio.

Let’s look at customization of Riva TTS with these SSML tags in some detail.

Requirements and setup#

  1. Start the Riva Speech Skills server.
    Follow the instructions in the Riva Quick Start Guide to deploy the out-of-the-box (OOTB) TTS models on the Riva Speech Skills server before running this tutorial.

  2. Install the Riva Client library.
    Follow the steps in the Requirements and setup for the Riva Client to install the Riva Client library.

  3. Install the additional Python libraries to run this tutorial.
    Run the following commands to install the libraries:

# We need numpy to read the output from Riva TTS request
!pip install numpy

Import Riva Client Libraries#

Let’s first import some required libraries, including the Riva Client libraries:

import numpy as np
import IPython.display as ipd
import grpc

import riva.client

Create Riva Clients and connect to the Riva Speech API server#

The URI below assumes that the Riva Speech API server is deployed locally on the default port. If the server is deployed on a different host, or through a Helm chart on Kubernetes, use the appropriate URI.

auth = riva.client.Auth(uri='localhost:50051')

riva_tts = riva.client.SpeechSynthesisService(auth)

Customizing rate and pitch with the prosody tag#

Pitch Attribute#

Riva supports an additive relative change to the pitch. The pitch attribute has a range of [-3, 3]; values outside this range result in an error being logged and no audio being returned. The resulting pitch shift is the attribute value multiplied by the speaker's pitch standard deviation, as measured when the FastPitch model was trained. For the pretrained checkpoint that was trained on LJSpeech, the standard deviation was 52.185 Hz, so a pitch value of 1.25 shifts the pitch up by 1.25 × 52.185 ≈ 65.23 Hz. Riva also supports the named prosody values from the SSML specification: x-low, low, medium, high, x-high, and default.

The pitch attribute is expressed in the following formats:

  • pitch="1"

  • pitch="+1.8"

  • pitch="-0.65"

  • pitch="high"

  • pitch="default"

For the pretrained Female-1 checkpoint, the standard deviation is 53.33 Hz. For the pretrained Male-1 checkpoint, the standard deviation is 47.15 Hz.

The pitch attribute does not support Hz, st, and % changes. Support is planned for a future Riva release.
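
To make the arithmetic concrete, here is a small sketch that converts a numeric pitch attribute value into the expected shift in Hz, using the standard deviations quoted above. The function name `pitch_shift_hz` and the `PITCH_STD_HZ` table are illustrative only; the range check mirrors the documented [-3, 3] limit but runs on the client, so it is not a substitute for the server's own validation.

```python
# Pitch standard deviations in Hz, as quoted in this tutorial.
PITCH_STD_HZ = {
    "LJSpeech": 52.185,
    "Female-1": 53.33,
    "Male-1": 47.15,
}


def pitch_shift_hz(pitch: float, std_hz: float) -> float:
    """Return the pitch shift in Hz for a numeric `pitch` attribute value.

    Illustrative helper: shift = attribute value * speaker pitch standard deviation.
    """
    if not -3.0 <= pitch <= 3.0:
        raise ValueError(f"pitch must be within [-3, 3], got {pitch}")
    return pitch * std_hz


# The worked example from the text: 1.25 * 52.185
print(round(pitch_shift_hz(1.25, PITCH_STD_HZ["LJSpeech"]), 2))  # 65.23
```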

Rate Attribute#

Riva supports a percentage relative change to the rate. The rate attribute has a range of [25%, 250%]; values outside this range result in an error being logged and no audio being returned. Riva also supports named prosody values: x-low, low, medium, high, x-high, and default.

The rate attribute is expressed in the following formats:

  • rate="35%"

  • rate="+200%"

  • rate="low"

  • rate="default"
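
Since the pitch and rate values are plain attribute strings, SSML requests can also be assembled programmatically instead of written by hand. The sketch below uses `escape` and `quoteattr` from Python's standard `xml.sax.saxutils` module so that characters such as `&` in the text do not break the XML. The helpers `prosody` and `speak` are hypothetical names introduced for illustration; they pass attribute values through unchanged and do not validate them against the ranges described above.

```python
from xml.sax.saxutils import escape, quoteattr


def prosody(text, pitch=None, rate=None):
    """Wrap `text` in a <prosody> tag with optional pitch and rate attributes.

    Hypothetical helper: attribute values are passed through as given,
    so supply them in one of the formats listed above.
    """
    attrs = ""
    if pitch is not None:
        attrs += f" pitch={quoteattr(str(pitch))}"
    if rate is not None:
        attrs += f" rate={quoteattr(str(rate))}"
    # escape() protects &, < and > in the spoken text
    return f"<prosody{attrs}>{escape(text)}</prosody>"


def speak(*fragments):
    """Join already-built SSML fragments inside the required <speak> root."""
    return "<speak>" + "".join(fragments) + "</speak>"


ssml_text = speak(
    prosody("Today is a sunny day", pitch="2.5"),
    ". ",
    prosody("But it might rain tomorrow.", rate="high"),
)
print(ssml_text)
```

The resulting string can be passed as the `text` argument of a synthesis request in the same way as a hand-written SSML string.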

Let’s look at an example showing these pitch and rate customizations for Riva TTS:

"""
    Raw text is "Today is a sunny day. But it might rain tomorrow."
    We are updating this raw text with SSML:
    1. Envelope raw text in '<speak>' tags as is required for SSML
    2. Add '<prosody>' tag with 'pitch' attribute set to '2.5'
    3. Add '<prosody>' tag with 'rate' attribute set to 'high'
"""
raw_text = "Today is a sunny day. But it might rain tomorrow."
ssml_text = """<speak><prosody pitch='2.5'>Today is a sunny day</prosody>. <prosody rate='high'>But it might rain tomorrow.</prosody></speak>"""
print("Raw Text: ", raw_text)
print("SSML Text: ", ssml_text)


sample_rate_hz = 44100
# Request to Riva TTS to synthesize audio
resp = riva_tts.synthesize(
    text=ssml_text,
    voice_name="English-US-Female-1",
    language_code="en-US",
    encoding=riva.client.AudioEncoding.LINEAR_PCM,
    sample_rate_hz=sample_rate_hz,
)

# Playing the generated audio from Riva TTS request
audio_samples = np.frombuffer(resp.audio, dtype=np.int16)
ipd.display(ipd.Audio(audio_samples, rate=sample_rate_hz))
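
Outside a notebook, the same audio bytes can be written to a WAV file instead of being played inline. The following sketch uses Python's standard `wave` module. The helper name `save_pcm16_as_wav` is hypothetical, and it assumes the response holds single-channel, 16-bit linear PCM, which matches the `np.int16` decoding used above but should be confirmed for your deployment. A second of silence stands in for `resp.audio` so that the snippet runs without a server.

```python
import os
import tempfile
import wave


def save_pcm16_as_wav(audio_bytes, path, sample_rate_hz, channels=1):
    """Write raw 16-bit linear PCM bytes to a WAV file.

    Hypothetical helper; assumes mono audio unless `channels` says otherwise.
    """
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(sample_rate_hz)
        wav_file.writeframes(audio_bytes)


# Stand-in for resp.audio: one second of 16-bit silence at 44.1 kHz
demo_audio = bytes(2 * 44100)
demo_path = os.path.join(tempfile.gettempdir(), "riva_tts_demo.wav")
save_pcm16_as_wav(demo_audio, demo_path, 44100)
print("Wrote", demo_path)

# With a real response, the call would look like:
# save_pcm16_as_wav(resp.audio, "output.wav", sample_rate_hz)
```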

Here are more examples showing the effects of changes in the pitch and rate attribute values on the generated audio:

# SSML texts we want to try
ssml_texts = [
  """<speak>This is a normal sentence</speak>""",
  """<speak><prosody pitch="0." rate="100%">This is also a normal sentence</prosody></speak>""",
  """<speak><prosody rate="200%">This is a fast sentence</prosody></speak>""",
  """<speak><prosody rate="60%">This is a slow sentence</prosody></speak>""",
  """<speak><prosody pitch="+1.0">Now, I'm speaking a bit higher</prosody></speak>""",
  """<speak><prosody pitch="-0.5">And now, I'm speaking a bit lower</prosody></speak>""",
  """<speak>S S M L supports <prosody pitch="-1">nested tags. So I can speak <prosody rate="150%">faster</prosody>, <prosody rate="75%">or slower</prosody>, as desired.</prosody></speak>""",
]

sample_rate_hz = 44100
# Loop through the 'ssml_texts' list and synthesize audio with Riva TTS for each entry
for ssml_text in ssml_texts:
    resp = riva_tts.synthesize(
        text=ssml_text,
        voice_name="English-US-Female-1",
        language_code="en-US",
        encoding=riva.client.AudioEncoding.LINEAR_PCM,
        sample_rate_hz=sample_rate_hz,
    )
    audio_samples = np.frombuffer(resp.audio, dtype=np.int16)
    print("SSML Text: ", ssml_text)
    ipd.display(ipd.Audio(audio_samples, rate=sample_rate_hz))
    print("--------------------------------------------")

Customizing pronunciation with the phoneme tag#

We can use the phoneme tag to override the predicted pronunciation of words. For a given word or sequence of words, use the ph attribute to provide an explicit pronunciation, and the alphabet attribute to provide the phone set.
Currently, only x-arpabet is supported for pronunciation dictionaries based on CMUdict. IPA support will be added soon.

The full list of phonemes in the CMUdict can be found at cmudict.phone. The list of supported symbols with stress can be found at cmudict.symbols. For a mapping of these phones to English sounds, refer to the ARPABET Wikipedia page.

Let’s look at an example showing this custom pronunciation for Riva TTS:

# Setting up the SSML text for the Riva TTS request
"""
    Raw text is "You say tomato, I say tomato."
    We are updating this raw text with SSML:
    1. Envelope raw text in '<speak>' tags as is required for SSML
    2. For a substring in the raw text, add '<phoneme>' tags with 'alphabet' attribute set to 'x-arpabet' 
       (currently the only supported value) and 'ph' attribute set to a custom pronunciation based on CMUdict and ARPABET

"""
raw_text = "You say tomato, I say tomato."
ssml_text = '<speak>You say <phoneme alphabet="x-arpabet" ph="{@T}{@AH0}{@M}{@EY1}{@T}{@OW2}">tomato</phoneme>, I say <phoneme alphabet="x-arpabet" ph="{@T}{@AH0}{@M}{@AA1}{@T}{@OW2}">tomato</phoneme>.</speak>'

print("Raw Text: ", raw_text)
print("SSML Text: ", ssml_text)

sample_rate_hz = 44100
# Request to Riva TTS to synthesize audio
resp = riva_tts.synthesize(
    text=ssml_text,
    voice_name="English-US-Female-1",
    language_code="en-US",
    encoding=riva.client.AudioEncoding.LINEAR_PCM,
    sample_rate_hz=sample_rate_hz,
)

# Playing the generated audio from Riva TTS request
audio_samples = np.frombuffer(resp.audio, dtype=np.int16)
ipd.display(ipd.Audio(audio_samples, rate=sample_rate_hz))
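
The `ph` strings in the example above wrap each ARPABET symbol as `{@SYMBOL}`, which is tedious to type by hand. A small helper can build the tag from a list of symbols, as in the sketch below. The function names `arpabet_ph` and `phoneme` are hypothetical and not part of the Riva client library, and the helper does not check that each symbol exists in CMUdict.

```python
from xml.sax.saxutils import escape, quoteattr


def arpabet_ph(phones):
    """Join ARPABET symbols into the {@SYMBOL} notation used in this tutorial."""
    return "".join("{@" + phone + "}" for phone in phones)


def phoneme(word, phones):
    """Wrap `word` in a <phoneme> tag with an explicit x-arpabet pronunciation.

    Hypothetical helper; symbols are not validated against CMUdict.
    """
    ph = quoteattr(arpabet_ph(phones))
    return f'<phoneme alphabet="x-arpabet" ph={ph}>{escape(word)}</phoneme>'


# Rebuild the two pronunciations of "tomato" from the example above
first = phoneme("tomato", ["T", "AH0", "M", "EY1", "T", "OW2"])
second = phoneme("tomato", ["T", "AH0", "M", "AA1", "T", "OW2"])
ssml_text = f"<speak>You say {first}, I say {second}.</speak>"
print(ssml_text)
```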

Here are more examples showing the customization of pronunciation in generated audio:

# SSML texts we want to try
ssml_texts = [
  """<speak>Is it <phoneme alphabet="x-arpabet" ph="{@S}{@K}{@EH1}{@JH}{@UH0}{@L}">schedule</phoneme> or <phoneme alphabet="x-arpabet" ph="{@SH}{@EH1}{@JH}{@UW0}{@L}">schedule</phoneme>?</speak>""",
  """<speak>You say <phoneme alphabet="x-arpabet" ph="{@D}{@EY1}{@T}{@AH0}">data</phoneme>, I say <phoneme alphabet="x-arpabet" ph="{@D}{@AE1}{@T}{@AH0}">data</phoneme>.</speak>""",
  """<speak>Some people say <phoneme alphabet="x-arpabet" ph="{@R}{@UW1}{@T}">route</phoneme> and some say <phoneme alphabet="x-arpabet" ph="{@R}{@AW1}{@T}">route</phoneme>.</speak>""",
]

# Loop through the 'ssml_texts' list and synthesize audio with Riva TTS for each entry
sample_rate_hz = 44100
for ssml_text in ssml_texts:
    resp = riva_tts.synthesize(
        text=ssml_text,
        voice_name="English-US-Female-1",
        language_code="en-US",
        encoding=riva.client.AudioEncoding.LINEAR_PCM,
        sample_rate_hz=sample_rate_hz,
    )
    audio_samples = np.frombuffer(resp.audio, dtype=np.int16)
    print("SSML Text: ", ssml_text)
    ipd.display(ipd.Audio(audio_samples, rate=sample_rate_hz))
    print("--------------------------------------------")

Information about customizing Riva TTS with SSML can also be found in the documentation here.

Go deeper into Riva capabilities#

Additional Riva tutorials#

Check out more Riva TTS (and ASR) tutorials here. These tutorials provide a deeper understanding of the advanced features of Riva TTS, including customizing TTS for your specific needs.

Sample applications#

Riva comes with various sample applications. They demonstrate how to use the APIs to build applications such as a chatbot, a domain-specific speech recognition system, or a keyword (entity) recognition system, or simply how Riva scales out to handle massive numbers of simultaneous requests (SpeechSquad).
Refer to the Sample Application section in the Riva developer documentation for more information.

Riva Automated Speech Recognition (ASR)#

Riva’s ASR offering comes with OOTB pipelines for English, German, Spanish, Russian, and Mandarin. It can be used in streaming or batch inference modes and is easily deployed using the Riva Quick Start scripts. Follow this link to better understand Riva’s ASR capabilities. Explore how to use Riva ASR APIs with the OOTB models with this Riva ASR tutorial.

Additional resources#

For more information about each of the APIs and their functionalities, refer to the documentation.