Getting started with TTS in Python

This notebook walks through the basics of the Riva Speech Skills TTS service.

Overview

NVIDIA Riva is a GPU-accelerated SDK for building Speech AI applications that are customized for your use case and deliver real-time performance.
Riva offers a rich set of speech and natural language understanding services such as:

  • Automated speech recognition (ASR)

  • Text-to-Speech synthesis (TTS)

  • A collection of natural language processing (NLP) services, such as named entity recognition (NER), punctuation restoration, and intent classification.

In this notebook, we will focus on interacting with the Text-to-Speech synthesis (TTS) APIs.

For more detailed information on Riva, please refer to the Riva developer documentation.

Requirements and setup

To execute this notebook, please follow the setup steps in the README.

Import Riva client libraries

We first import some required libraries, including the Riva client libraries.

import numpy as np
import IPython.display as ipd
import grpc

import riva_api.riva_tts_pb2 as rtts
import riva_api.riva_tts_pb2_grpc as rtts_srv
import riva_api.riva_audio_pb2 as ra

Create Riva clients and connect to Riva Speech API server

The URI below assumes a local deployment of the Riva Speech API server on the default port. If the server is deployed on a different host or via the Helm chart on Kubernetes, use the appropriate URI.

channel = grpc.insecure_channel('localhost:50051')

riva_tts = rtts_srv.RivaSpeechSynthesisStub(channel)

Batch Mode TTS

Riva TTS supports both streaming and batch inference modes. In batch mode, no audio is returned until the full audio sequence for the requested text has been generated; this mode can achieve higher throughput. In streaming mode, audio chunks are returned as soon as they are generated, significantly reducing latency (as measured by time to first audio) for long requests.
Let us look at an example of batch-mode TTS API usage in this section.

Subsequent releases will add features, including model registration to support multiple languages and voices with the same API, as well as support for resampling to alternative sample rates.
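Before looking at batch mode, here is a sketch of how streaming mode could be consumed. It assumes the stub's streaming method (named `SynthesizeOnline` in the Riva proto) returns an iterator of responses, each carrying a chunk of PCM audio in its `audio` field; the helper below simply concatenates those chunks into one sample array.

```python
import numpy as np

def collect_streaming_audio(responses):
    """Concatenate the audio chunks from a streaming TTS response
    iterator into a single int16 sample array."""
    chunks = [np.frombuffer(resp.audio, dtype=np.int16) for resp in responses]
    return np.concatenate(chunks) if chunks else np.array([], dtype=np.int16)

# With a live server, usage would look roughly like:
#   responses = riva_tts.SynthesizeOnline(req)
#   audio_samples = collect_streaming_audio(responses)
```

In a real application you would process each chunk as it arrives (e.g., start playback immediately) rather than waiting to concatenate them, which is where the latency benefit comes from.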

Make a gRPC request to the Riva Speech API server

Now let us make a gRPC request to the Riva Speech API server for TTS in batch inference mode.

req = rtts.SynthesizeSpeechRequest(
    text = "Is it recognize speech or wreck a nice beach?",
    language_code = "en-US",
    encoding = ra.AudioEncoding.LINEAR_PCM,    # currently only LINEAR_PCM is supported
    sample_rate_hz = 44100,                    # generate 44.1 kHz audio
    voice_name = "English-US-Female-1"         # default Riva deployments have 2 options: 
                                               # `English-US-Female-1` or `English-US-Male-1`
)

resp = riva_tts.Synthesize(req)
audio_samples = np.frombuffer(resp.audio, dtype=np.int16)
ipd.Audio(audio_samples, rate=req.sample_rate_hz)
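Beyond playing the audio inline, you may want to persist it. Since the response contains raw 16-bit mono PCM, it can be written to a playable WAV file with Python's standard `wave` module. This helper is a sketch (not part of the Riva client); `resp` and `req` refer to the request and response above.

```python
import wave

def save_pcm_to_wav(pcm_bytes, path, sample_rate_hz=44100):
    """Write raw 16-bit mono PCM bytes to a WAV file."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)         # mono audio
        wf.setsampwidth(2)         # 16-bit samples -> 2 bytes each
        wf.setframerate(sample_rate_hz)
        wf.writeframes(pcm_bytes)

# e.g. save_pcm_to_wav(resp.audio, "tts_output.wav", req.sample_rate_hz)
```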

Understanding TTS API parameters

Riva TTS supports a number of options while making a text-to-speech request to the gRPC endpoint, as shown above. Let’s learn more about these parameters:

  • language_code - Language of the generated audio. “en-US” represents US English.

  • encoding - Type of audio encoding to generate. One of LINEAR_PCM, FLAC, MULAW, or ALAW.

  • sample_rate_hz - Sample rate of the generated audio, in hertz; typically 22050 Hz or 44100 Hz.

  • voice_name - Voice used to synthesize the audio. Currently, Riva offers two voices: English-US-Female-1 and English-US-Male-1.
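For LINEAR_PCM, each mono sample occupies 2 bytes, so the sample rate and the size of the returned buffer together determine the audio duration. A small helper (not part of the Riva API) makes the relationship concrete:

```python
def pcm_duration_seconds(audio_bytes, sample_rate_hz):
    """Duration of mono 16-bit LINEAR_PCM audio: 2 bytes per sample."""
    num_samples = len(audio_bytes) // 2
    return num_samples / sample_rate_hz

# One second of 44.1 kHz 16-bit mono audio occupies 88200 bytes:
# pcm_duration_seconds(b"\x00" * 88200, 44100) -> 1.0
```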

Speech Synthesis Markup Language (SSML)

Riva TTS has support for some SSML tags and attributes. Notably, there is support for:

  • prosody tag

    • rate attribute

    • pitch attribute

  • phoneme tag

Please refer to the Riva developer documentation for a detailed description of how these SSML tags and attributes interact with the TTS system.

We provide the following examples as guidance:

req = rtts.SynthesizeSpeechRequest()
req.language_code = "en-US"
req.encoding = ra.AudioEncoding.LINEAR_PCM 
req.sample_rate_hz = 44100
req.voice_name = "English-US-Female-1"

texts = [
  """<speak>This is a normal sentence</speak>""",
  """<speak><prosody pitch="0." rate="100%">This is also a normal sentence</prosody></speak>""",
  """<speak><prosody rate="200%">This is a fast sentence</prosody></speak>""",
  """<speak><prosody pitch="1.0">Now, I'm speaking a bit higher</prosody></speak>""",
  """<speak>You say <phoneme alphabet="x-arpabet" ph="{@T}{@AH0}{@M}{@EY1}{@T}{@OW2}">tomato</phoneme>, I say <phoneme alphabet="x-arpabet" ph="{@T}{@AH0}{@M}{@AA1}{@T}{@OW2}">tomato</phoneme></speak>""",
  """<speak>S S M L supports <prosody pitch="-1">nested tags. So I can speak <prosody rate="150%">faster</prosody>, <prosody rate="75%">or slower</prosody>, as desired.</prosody></speak>""",
]

for t in texts:
    req.text = t
    resp = riva_tts.Synthesize(req)
    audio_samples = np.frombuffer(resp.audio, dtype=np.int16)
    print(t)
    ipd.display(ipd.Audio(audio_samples, rate=req.sample_rate_hz))
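Since SSML payloads are XML, text containing characters such as `&` or `<` must be escaped before being wrapped in tags. The helper below is a sketch, not part of the Riva client; it uses the standard library's `xml.sax.saxutils.escape` and the same `speak`/`prosody` tags as the examples above.

```python
from xml.sax.saxutils import escape

def make_prosody_ssml(text, rate=None, pitch=None):
    """Wrap plain text in <speak>/<prosody> tags, escaping XML special characters."""
    attrs = ""
    if rate is not None:
        attrs += f' rate="{rate}"'
    if pitch is not None:
        attrs += f' pitch="{pitch}"'
    body = escape(text)              # escapes &, <, >
    if attrs:
        body = f"<prosody{attrs}>{body}</prosody>"
    return f"<speak>{body}</speak>"

# make_prosody_ssml("R & D", rate="150%")
# -> '<speak><prosody rate="150%">R &amp; D</prosody></speak>'
```

The resulting string can be assigned to `req.text` exactly like the hand-written examples above.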

Go deeper into Riva capabilities

Now that you have a basic introduction to the Riva APIs, you may want to try out the following:

Sample apps

Riva comes with various sample apps that demonstrate how to use the APIs to build interesting applications, such as a chatbot, a domain-specific speech recognition or keyword (entity) recognition system, or scaling out to handle a massive number of simultaneous requests (SpeechSquad). Have a look at the Sample Application section in the Riva developer documentation for all the sample apps.

Fine-tune a domain-specific speech model and deploy it with Riva

Train the latest state-of-the-art speech and natural language processing models on your own data using NeMo or the Transfer Learning ToolKit, and deploy them on Riva using the Riva ServiceMaker tool.

Further resources

Explore the details of each of the APIs and their functionalities in the docs.