<img src="http://developer.download.nvidia.com/notebooks/dlsw-notebooks/riva_tts_tts-python-basics/nvidia_logo.png" style="width: 90px; float: right;">

# Riva TTS NIM Tutorial

This tutorial walks you through the features of Riva TTS NIM and shows how to use its APIs in a Python application. Riva TTS NIM exposes gRPC APIs that serve both offline and online use cases.

## Prerequisites

1. Deploy the Riva TTS NIM by following the Riva TTS NIM documentation. This tutorial uses a `Magpie-Multilingual` en-US voice.
2. Install the Riva Python client library:
    ```bash
    sudo apt-get install python3-pip
    pip install nvidia-riva-client
    ```

#### Import Riva Client Libraries

Begin by importing some of the necessary libraries, including the Riva Client libraries.

In [None]:
# Import required libraries
import io
import wave
from pathlib import Path
import riva.client
import IPython.display as ipd

#### Create Riva clients and connect to the Riva server

The following URI assumes a local deployment of the Riva TTS NIM on the default port. If the server is deployed on a different host, or through a Helm chart on Kubernetes, use the appropriate URI.

In [None]:
auth = riva.client.Auth(uri='0.0.0.0:50051')
tts_service = riva.client.SpeechSynthesisService(auth)

Get the list of supported TTS languages and voices.

In [None]:
request = riva.client.proto.riva_tts_pb2.RivaSynthesisConfigRequest()
response = tts_service.stub.GetRivaSynthesisConfig(request)

tts_models = dict()
for model_config in response.model_config:
    language_code = model_config.parameters['language_code']
    voice_name = model_config.parameters['voice_name']
    # Drop anything after ':' in each comma-separated subvoice entry
    subvoices = [voice.split(':')[0] for voice in model_config.parameters['subvoices'].split(',')]
    full_voice_names = [voice_name + "." + subvoice for subvoice in subvoices]
    # Extend instead of assigning so models that share a language code are not overwritten
    tts_models.setdefault(language_code, []).extend(full_voice_names)

print(tts_models)
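
The parsing step above can be exercised without a running server. The sketch below applies the same logic to plain dictionaries that stand in for `model_config.parameters`. The function name `collect_voices` and the sample values are illustrative assumptions, not output captured from a real deployment. The helper merges voices when two models report the same language code.

```python
def collect_voices(model_parameters):
    """Map each language code to the full voice names built from its models.

    `model_parameters` is an iterable of dict-like objects with the keys
    'language_code', 'voice_name', and 'subvoices' (hypothetical stand-ins
    for `model_config.parameters`).
    """
    voices_by_language = {}
    for params in model_parameters:
        # Keep only the part before ':' in each comma-separated subvoice entry.
        subvoices = [entry.split(":")[0] for entry in params["subvoices"].split(",")]
        full_names = [f"{params['voice_name']}.{subvoice}" for subvoice in subvoices]
        # Extend rather than overwrite so models sharing a language are merged.
        voices_by_language.setdefault(params["language_code"], []).extend(full_names)
    return voices_by_language


# Illustrative input only; real values come from GetRivaSynthesisConfig.
sample_parameters = [
    {"language_code": "en-US", "voice_name": "Voice-A", "subvoices": "Female.Neutral:0,Male.Neutral:1"},
    {"language_code": "en-US", "voice_name": "Voice-B", "subvoices": "Female.Calm:0"},
]
voices = collect_voices(sample_parameters)
print(voices)
```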

### TTS modes

Riva TTS supports both streaming and offline inference modes. In offline mode, audio is not returned until the full audio sequence for the requested text is generated. In streaming mode, audio chunks are returned as soon as they are generated, significantly reducing the latency (as measured by time to first audio) for large requests.
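
To make the latency difference concrete, the following sketch measures time to first audio for any iterable of audio chunks. The `fake_stream` generator is a hypothetical stand-in for a streaming response and simply sleeps between chunks; with a real server, the iterable returned by the streaming call would take its place.

```python
import time


def time_to_first_chunk(chunks):
    """Return (seconds until the first chunk arrived, all audio bytes)."""
    start = time.perf_counter()
    first_chunk_latency = None
    audio = bytearray()
    for chunk in chunks:
        if first_chunk_latency is None:
            first_chunk_latency = time.perf_counter() - start
        audio.extend(chunk)
    return first_chunk_latency, bytes(audio)


def fake_stream(num_chunks=3, delay_s=0.05, chunk_size=4):
    """Hypothetical stand-in for a streaming response: one chunk per delay."""
    for _ in range(num_chunks):
        time.sleep(delay_s)
        yield b"\x00" * chunk_size


start = time.perf_counter()
latency, audio = time_to_first_chunk(fake_stream())
total = time.perf_counter() - start
print(f"first chunk after {latency:.3f}s, all audio after {total:.3f}s, {len(audio)} bytes")
```

In offline mode the caller waits for roughly the total time before hearing anything, whereas in streaming mode playback can begin after the first-chunk latency.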

#### Set up the TTS API parameters

Let's synthesize audio for English (en-US) text with one of the available voices.

In [None]:
request = riva.client.proto.riva_tts_pb2.SynthesizeSpeechRequest(
    language_code = "en-US",
    encoding = riva.client.AudioEncoding.LINEAR_PCM,
    sample_rate_hz = 44100,
    voice_name = "Magpie-Multilingual.EN-US.Female.Neutral"
)

#### Understanding TTS API parameters

Riva TTS accepts several options when making a text-to-speech request to the gRPC endpoint, as shown above. Let's learn more about these parameters:
- ``language_code`` - Language of the generated audio. ``en-US`` represents English (US) and is currently the only language supported out of the box.
- ``encoding`` - Type of audio encoding to generate. ``LINEAR_PCM`` and ``OGGOPUS`` encodings are supported.
- ``sample_rate_hz`` - Sample rate of the generated audio in hertz, for example ``22050`` or ``44100``. This tutorial requests ``44100``.
- ``voice_name`` - Voice used to synthesize the audio. Use one of the voice names returned by the configuration query above.

Create a utility function to save the synthesized audio to a WAV file on disk.

In [None]:
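def save_audio_to_file(filename, audio, rate):
    """Write raw PCM bytes to a mono, 16-bit WAV file at the given sample rate."""
    with wave.open(filename, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes per sample (16-bit)
        wav_file.setframerate(rate)
        wav_file.writeframes(audio)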
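
As a quick sanity check that does not need a server, the sketch below writes one second of silence in the same format (mono, 16-bit PCM) and reads the header back with the standard `wave` module. The file name `silence_check.wav` and the helper names are arbitrary choices for this example.

```python
import os
import tempfile
import wave


def write_mono_pcm16(filename, audio, rate):
    """Same layout as the tutorial's helper: mono, 16-bit samples."""
    with wave.open(filename, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(rate)
        wav_file.writeframes(audio)


def wav_duration_seconds(filename):
    """Duration derived from the WAV header: frames / frame rate."""
    with wave.open(filename, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


rate = 44100
silence = b"\x00\x00" * rate  # 44100 samples * 2 bytes each = one second
path = os.path.join(tempfile.gettempdir(), "silence_check.wav")
write_mono_pcm16(path, silence, rate)
duration = wav_duration_seconds(path)
print(duration)
```

If the reported duration does not match the expected length of the audio, the sample rate passed to the helper probably differs from the rate requested in `sample_rate_hz`.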

#### Make a gRPC Request to Riva Server

For offline inference mode, use `Synthesize`. Results are returned when the entire audio is synthesized.

In [None]:
request.text = "Is it recognize speech or wreck a nice beach?"
response = tts_service.stub.Synthesize(request)

save_audio_to_file("output.wav", response.audio, 44100)
ipd.Audio("output.wav")


For online (streaming) inference, use `SynthesizeOnline`. Results are returned in chunks as they are synthesized.

In [None]:
request.text = "Is it recognize speech or wreck a nice beach?"
responses = tts_service.stub.SynthesizeOnline(request)
bytes_buffer = io.BytesIO()
for response in responses:
    bytes_buffer.write(response.audio)

save_audio_to_file("output_online.wav", bytes_buffer.getvalue(), 44100)
ipd.Audio("output_online.wav")
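
Buffering every chunk in memory is fine for short utterances. For longer text, one alternative is to write each chunk to the WAV file as it arrives. The sketch below assumes each chunk is raw mono 16-bit PCM, as in the example above; the function name `stream_to_wav` and the `fake_chunks` generator are hypothetical, with `fake_chunks` standing in for the `audio` fields of the streaming responses.

```python
import os
import tempfile
import wave


def stream_to_wav(filename, chunks, rate):
    """Write PCM chunks to a WAV file as they arrive; return bytes written."""
    total_bytes = 0
    with wave.open(filename, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(rate)
        for chunk in chunks:
            # The wave module keeps the header's size fields up to date.
            wav_file.writeframes(chunk)
            total_bytes += len(chunk)
    return total_bytes


fake_chunks = (b"\x00\x00" * 1000 for _ in range(5))  # 5 chunks of 1000 samples
path = os.path.join(tempfile.gettempdir(), "streamed_check.wav")
written = stream_to_wav(path, fake_chunks, 44100)

with wave.open(path, "rb") as wav_file:
    frames = wav_file.getnframes()
print(written, frames)
```

With a live server, the generator expression `(response.audio for response in responses)` could be passed as `chunks`.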

## Customizing Riva TTS audio output with SSML

Speech Synthesis Markup Language (SSML) is a markup language for directing the performance of the virtual speaker. Riva supports portions of SSML; the tag covered in this tutorial lets you adjust the pronunciation of the generated audio.

Every SSML input must be a valid XML document that uses the `<speak>` root tag. Input that is not valid XML, or that is valid XML with a different root tag, is treated as raw input text.
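
This rule can be approximated on the client side before sending a request. The following sketch uses Python's standard XML parser to check that a string is well-formed XML with a `speak` root. The function name `looks_like_ssml` is hypothetical, and the server's own parser may be stricter or more lenient than this check.

```python
import xml.etree.ElementTree as ET


def looks_like_ssml(text):
    """Return True if `text` is well-formed XML whose root tag is <speak>."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return False  # not well-formed XML: would be treated as raw text
    return root.tag == "speak"


candidates = [
    "<speak>Hello there.</speak>",
    "<say>Hello there.</say>",
    "<speak>Unclosed tag",
    "Plain text with no markup",
]
for candidate in candidates:
    print(looks_like_ssml(candidate), candidate)
```

A check like this helps catch a missing closing tag, which would otherwise cause the markup itself to be read aloud as raw text.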

Riva TTS supports the following SSML tag:

- The ``phoneme`` tag, which controls the pronunciation of the generated audio.

Let's look at customizing Riva TTS with this tag in some detail.

### Customizing pronunciation with the `phoneme` tag

We can use the `phoneme` tag to override the model's predicted pronunciation of a word. For a given word or sequence of words, use the `ph` attribute to provide an explicit pronunciation, and the `alphabet` attribute to name the phone set.

`ipa` is the only pronunciation alphabet that Riva TTS supports for TTS models.
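
When SSML strings are assembled programmatically, escaping matters: a stray `&` or `<` in the text would make the document invalid XML, and it would then be treated as raw text. The sketch below builds a `<speak>` document from plain strings and `(word, ipa)` pairs using only the standard library. The function name `build_ssml` and its input format are assumptions made for this example, not part of the Riva client.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(parts):
    """Build an SSML string from plain strings and (word, ipa) tuples."""
    pieces = []
    for part in parts:
        if isinstance(part, tuple):
            word, ipa = part
            # quoteattr adds the surrounding quotes and escapes the value.
            pieces.append(f'<phoneme alphabet="ipa" ph={quoteattr(ipa)}>{escape(word)}</phoneme>')
        else:
            pieces.append(escape(part))  # escape &, <, > in plain text
    return "<speak>" + "".join(pieces) + "</speak>"


ssml = build_ssml(["You say ", ("data", "ˈdeɪtə"), ", I say ", ("data", "ˈdætə"), "."])
print(ssml)
```

The resulting string can be assigned to `request.text` in the same way as the hand-written examples in this section.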

#### IPA
For the full list of supported `ipa` phonemes, refer to the [Riva TTS Phoneme Support](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/tts/tts-phones.html) page.

#### Arpabet
Because `ipa` is the only supported alphabet, the following ARPABET resources are for reference only. The full list of phonemes in CMUdict can be found at [cmudict.phones](https://github.com/cmusphinx/cmudict/blob/master/cmudict.phones). The list of supported symbols with stress can be found at [cmudict.symbols](https://github.com/cmusphinx/cmudict/blob/master/cmudict.symbols). For a mapping of these phones to English sounds, refer to the [ARPABET Wikipedia page](https://en.wikipedia.org/wiki/ARPABET).

The following examples customize the pronunciation of the generated audio using the `phoneme` tag.

In [None]:
# SSML text examples with the phoneme tag
"""
Instructions for using the phoneme tag:
1. Wrap the raw text in '<speak>' tags, as required for SSML
2. For a substring in the raw text, add '<phoneme>' tags with the 'alphabet' attribute set to 'ipa'
       (currently the only supported value) and the 'ph' attribute set to a custom IPA pronunciation
"""
ssml_texts = [
  """<speak>You say <phoneme alphabet='ipa' ph='təˈmeɪˌtoʊ'>tomato</phoneme>, I say <phoneme alphabet='ipa' ph='təˈmɑˌtoʊ'>tomato</phoneme>.</speak>""",
  """<speak>You say <phoneme ph="ˈdeɪtə">data</phoneme>, I say <phoneme ph="ˈdætə">data</phoneme>.</speak>""",
  """<speak>Some people say <phoneme ph="ˈɹut">route</phoneme> and some say <phoneme ph="ˈɹaʊt">route</phoneme>.</speak>""",
]

# Loop through the 'ssml_texts' list and synthesize audio with Riva TTS for each entry
for i, ssml_text in enumerate(ssml_texts):
    request.text = ssml_text
    response = tts_service.stub.Synthesize(request)
    save_audio_to_file(f"output_ssml_{i}.wav", response.audio, 44100)
    print(f"Synthesized audio for SSML Text: {ssml_text}")
    ipd.display(ipd.Audio(f"output_ssml_{i}.wav", rate=44100))
