How do I use Riva TTS APIs with out-of-the-box models?#

This tutorial walks you through the basics of the Riva Speech Skills TTS service, specifically how to use the Riva TTS APIs with out-of-the-box models.

NVIDIA Riva Overview#

NVIDIA Riva is a GPU-accelerated SDK for building Speech AI applications that are customized for your use case and deliver real-time performance.
Riva offers a rich set of speech and natural language understanding services such as:

  • Automated speech recognition (ASR)

  • Text-to-Speech synthesis (TTS)

  • A collection of natural language processing (NLP) services, such as named entity recognition (NER), punctuation, and intent classification.

In this tutorial, we will interact with the text-to-speech synthesis (TTS) APIs.

For more information about Riva, please refer to the Riva developer documentation.

Speech generation with Riva TTS APIs#

The Riva TTS service is based on a two-stage pipeline: Riva first generates a mel spectrogram from the input text using the first model, then generates speech from that spectrogram using the second model. This pipeline forms a text-to-speech system that enables you to synthesize natural-sounding speech from raw transcripts without any additional information such as patterns or rhythms of speech.

Riva provides two state-of-the-art voices (one male and one female) for English that can easily be deployed with the Riva Quick Start scripts. Riva also supports customizing TTS in various ways to meet your specific needs.
Subsequent Riva releases will include features such as model registration to support multiple languages/voices with the same API and support for resampling to alternative sampling rates.
Refer to the Riva TTS documentation for more information.

Now, let’s generate audio using Riva APIs with an OOTB (out-of-the-box) English TTS pipeline.

Requirements and setup#

  1. Start the Riva Speech Skills server.
    Follow the instructions in the Riva Quick Start Guide to deploy OOTB TTS models on the Riva Speech Skills server before running this tutorial. By default, only the English models are deployed.

  2. Install the Riva Client library.
    Follow the steps in the Requirements and setup for the Riva Client to install the Riva Client library.

  3. Install the additional Python libraries to run this tutorial.
    Run the following commands to install the libraries:

# We need numpy to read the output from the Riva TTS request.
!pip install numpy

Import Riva client libraries#

We first import the required libraries, including the Riva client library.

import numpy as np
import IPython.display as ipd
import riva.client

Create Riva clients and connect to Riva Speech API server#

The URI below assumes a local deployment of the Riva Speech API server on the default port. If the server is deployed on a different host or through a Helm chart on Kubernetes, use the appropriate URI instead.

auth = riva.client.Auth(uri='localhost:50051')

riva_tts = riva.client.SpeechSynthesisService(auth)
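
If the server is not reachable at the given URI, requests fail only when they are made. A quick pre-flight check can be sketched with the standard library alone. The helper `is_reachable` below is hypothetical and not part of the Riva client; it only confirms that a TCP connection to `host:port` can be opened, not that the Riva service behind it is healthy.

```python
import socket


def is_reachable(uri, timeout_s=2.0):
    """Return True if a TCP connection to 'host:port' can be opened."""
    # Split on the last colon so that the port is always the final component.
    host, _, port = uri.rpartition(":")
    try:
        with socket.create_connection((host, int(port)), timeout=timeout_s):
            return True
    except (OSError, ValueError):
        # OSError covers refused connections, timeouts, and DNS failures;
        # ValueError covers a missing or non-numeric port.
        return False


# Same URI as the local deployment assumed in this tutorial.
print(is_reachable("localhost:50051"))
```

A result of `False` suggests checking the host, port, and server status before creating the client.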

Batch mode TTS#

Riva TTS supports both streaming and batch inference modes. In batch mode, audio is not returned until the full audio sequence for the requested text has been generated, which can achieve higher throughput. In streaming mode, audio chunks are returned as soon as they are generated, significantly reducing latency (measured as time to first audio) for large requests.
Let’s take a look at an example showing batch mode TTS API usage:

Make a gRPC request to the Riva Speech API server#

Now let us make a gRPC request to the Riva Speech server for TTS in batch inference mode.

sample_rate_hz = 44100
resp = riva_tts.synthesize(
    text="Is it recognize speech or wreck a nice beach?",
    language_code="en-US",
    encoding=riva.client.AudioEncoding.LINEAR_PCM,  # Currently only LINEAR_PCM is supported
    sample_rate_hz=sample_rate_hz,                  # Generate 44.1 kHz audio
    voice_name="English-US-Female-1",               # The name of the voice to generate
)
# Interpret the returned raw bytes as 16-bit PCM samples and play them in the notebook.
audio_samples = np.frombuffer(resp.audio, dtype=np.int16)
ipd.Audio(audio_samples, rate=sample_rate_hz)
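
For comparison with the batch request above, streaming mode can be sketched as follows. This is a minimal sketch under assumptions: it assumes the Python client exposes a streaming counterpart named `synthesize_online` that yields responses, each carrying a chunk of audio bytes in `.audio`, so check the client version you have installed. The helper `collect_stream` is hypothetical and works on any iterable of such responses, and the stand-in generator `fake_stream` exists only so the block runs without a server.

```python
import time
from types import SimpleNamespace


def collect_stream(responses):
    """Concatenate streamed audio chunks and record time to first audio.

    `responses` is any iterable of objects with an `audio` bytes attribute.
    Returns (all_audio_bytes, seconds_until_first_chunk or None if empty).
    """
    start = time.perf_counter()
    first_chunk_latency = None
    chunks = []
    for resp in responses:
        if first_chunk_latency is None:
            first_chunk_latency = time.perf_counter() - start
        chunks.append(resp.audio)
    return b"".join(chunks), first_chunk_latency


def fake_stream():
    """Stand-in for a streaming response iterator (illustration only)."""
    for chunk in (b"\x00\x01", b"\x02\x03", b"\x04\x05"):
        yield SimpleNamespace(audio=chunk)


audio_bytes, latency = collect_stream(fake_stream())
print(len(audio_bytes), "bytes received")

# With a running server, the call might look roughly like this
# (assumption: `synthesize_online` accepts the same arguments as `synthesize`):
# responses = riva_tts.synthesize_online(
#     text="Is it recognize speech or wreck a nice beach?",
#     language_code="en-US",
#     encoding=riva.client.AudioEncoding.LINEAR_PCM,
#     sample_rate_hz=44100,
#     voice_name="English-US-Female-1",
# )
# audio_bytes, latency = collect_stream(responses)
```

Because chunks are consumed as they arrive, the first chunk could be played while later ones are still being generated, which is where the latency benefit comes from.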

Understanding TTS API parameters#

Riva TTS supports a number of options while making a text-to-speech request to the gRPC endpoint, as shown above. Let’s learn more about these parameters:

  • language_code - Language of the generated audio. "en-US" represents English (US) and is currently the only language supported OOTB.

  • encoding - Type of audio encoding to generate. Currently only LINEAR_PCM is supported.

  • sample_rate_hz - Sample rate of the generated audio, in hertz. Typical values are 22050 (22.05 kHz) or 44100 (44.1 kHz); choose the rate that your playback device or downstream application expects.

  • voice_name - Voice used to synthesize the audio. Currently, Riva offers two OOTB voices (English-US-Female-1, English-US-Male-1).
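
To make the `encoding` and `sample_rate_hz` parameters concrete: LINEAR_PCM audio is a sequence of raw samples with no header, so the sample rate is needed to play or store it correctly. The sketch below writes such bytes to a WAV file with Python's standard `wave` module and derives the clip duration. It assumes 16-bit mono samples, matching the `np.int16` interpretation used in the request example; the helper name `save_pcm_as_wav`, the file name, and the silent placeholder audio are illustrative only.

```python
import os
import tempfile
import wave


def save_pcm_as_wav(pcm_bytes, path, sample_rate_hz, sample_width=2, channels=1):
    """Write raw PCM bytes to a WAV file and return the duration in seconds.

    Defaults assume 16-bit (2-byte) mono samples.
    """
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width)
        wav_file.setframerate(sample_rate_hz)
        wav_file.writeframes(pcm_bytes)
    # One frame holds one sample per channel.
    num_frames = len(pcm_bytes) // (sample_width * channels)
    return num_frames / sample_rate_hz


sample_rate_hz = 44100
# Placeholder audio: one second of 16-bit silence (2 bytes per sample).
pcm_bytes = bytes(2 * sample_rate_hz)
wav_path = os.path.join(tempfile.gettempdir(), "riva_tts_example.wav")
duration_s = save_pcm_as_wav(pcm_bytes, wav_path, sample_rate_hz)
print(f"Wrote {wav_path} ({duration_s:.2f} s)")
```

With a real response, `resp.audio` would take the place of `pcm_bytes`, and the same `sample_rate_hz` passed in the request should be used when writing the file.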

Go deeper into Riva capabilities#

Now that you have a basic introduction to the Riva TTS APIs, you can explore the following resources.

Additional Riva tutorials#

Check out more Riva TTS (and ASR) tutorials here to understand how to use some of the advanced features of Riva TTS, including customizing TTS for your specific needs.

Sample applications#

Riva comes with various sample applications. They demonstrate how to use the APIs to build applications such as a chatbot, a domain-specific speech recognition or keyword (entity) recognition system, or simply how Riva scales out to handle massive numbers of simultaneous requests. Refer to SpeechSquad for more information.
Refer to the Sample Application section in the Riva developer documentation for more information.

Riva Automated Speech Recognition (ASR)#

Riva’s ASR offering comes with OOTB pipelines for English, German, Spanish, Russian, and Mandarin. It can be used in streaming or batch inference modes and is easily deployed using the Riva Quick Start scripts. Follow this link to better understand Riva’s ASR capabilities. Explore how to use Riva ASR APIs with the OOTB models with this Riva ASR tutorial.

Additional resources#

For more information about each of the APIs and their functionalities, refer to the documentation.