<img src="http://developer.download.nvidia.com/notebooks/dlsw-notebooks/riva_tts_tts-python-basics/nvidia_logo.png" style="width: 90px; float: right;">

# Riva TTS NIM Tutorial

This tutorial walks you through the features of Riva TTS NIM and shows how to use its APIs in a Python application. Riva TTS NIM exposes a gRPC API that serves both offline and online (streaming) use cases.

## Prerequisites

1. Deploy Riva TTS NIM with the Magpie TTS Multilingual model by following the Riva TTS NIM documentation.
2. Install the Riva Python client library.

    ```bash
    sudo apt-get install python3-pip
    pip install -U nvidia-riva-client
    ```
3. Clone the Git repository at https://github.com/nvidia-riva/tutorials for audio samples. The repository is assumed to be cloned in the `$HOME` directory.

    ```bash
    cd $HOME
    git clone https://github.com/nvidia-riva/tutorials.git
    ```
   

## Import Riva Client Libraries

Import the necessary libraries, including the Riva Client libraries.

In [1]:
# Import required libraries
import io
import json
import wave
from pathlib import Path
import riva.client
import IPython.display as ipd

Create utility functions.

In [2]:
# save synthesized audio to wav file on disk
def save_audio_to_file(filename, audio, output_sample_rate):
    with wave.open(filename, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(output_sample_rate)
        wav_file.writeframes(audio)

# List the available voices, grouped by language code
def list_voices(tts_service):
    request = riva.client.proto.riva_tts_pb2.RivaSynthesisConfigRequest()
    response = tts_service.stub.GetRivaSynthesisConfig(request)

    tts_models = dict()
    for model_config in response.model_config:
        language_code = model_config.parameters['language_code']
        voice_name = model_config.parameters['voice_name']
        subvoices = [voice.split(':')[0] for voice in model_config.parameters['subvoices'].split(',')]
        full_voice_names = [voice_name + "." + subvoice for subvoice in subvoices]
        # Extend rather than assign, so models that share a language code do not overwrite each other
        tts_models.setdefault(language_code, []).extend(full_voice_names)

    print(json.dumps(tts_models, indent=4))
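
The WAV layout used by `save_audio_to_file` (one channel, two bytes per sample) can be checked without a running server. The following is a minimal sketch that uses only the standard library; the helper names `make_sine_pcm16`, `write_wav`, and `read_wav_info` are hypothetical and not part of the Riva client. It writes a generated test tone with the same settings and reads the header back.

```python
import math
import struct
import tempfile
import wave
from pathlib import Path


def make_sine_pcm16(frequency_hz, duration_s, sample_rate):
    """Return mono 16-bit little-endian PCM bytes for a sine tone."""
    n_samples = int(duration_s * sample_rate)
    samples = (
        int(0.3 * 32767 * math.sin(2 * math.pi * frequency_hz * i / sample_rate))
        for i in range(n_samples)
    )
    return b"".join(struct.pack("<h", s) for s in samples)


def write_wav(filename, audio, sample_rate):
    """Write raw PCM with the same layout as save_audio_to_file."""
    with wave.open(str(filename), "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 16-bit samples
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(audio)


def read_wav_info(filename):
    """Return (channels, sample_width_bytes, sample_rate, n_frames)."""
    with wave.open(str(filename), "rb") as wav_file:
        return (
            wav_file.getnchannels(),
            wav_file.getsampwidth(),
            wav_file.getframerate(),
            wav_file.getnframes(),
        )


sample_rate = 44100
pcm = make_sine_pcm16(440.0, 0.5, sample_rate)
with tempfile.TemporaryDirectory() as tmp_dir:
    wav_path = Path(tmp_dir) / "sine_check.wav"
    write_wav(wav_path, pcm, sample_rate)
    info = read_wav_info(wav_path)
print(info)  # (1, 2, 44100, 22050)
```

If the sample rate passed to the writer differs from the rate the audio was synthesized at, the file still opens but plays back at the wrong speed and pitch, so keep the two values in sync.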

### Inference Modes

Riva TTS supports both streaming and offline inference modes. In offline mode, response audio is returned only after the full audio sequence for the requested text is generated. In streaming or online mode, response audio is received in chunks as it is generated. This significantly reduces the latency for large requests, particularly the time to first audio.

The following sections demonstrate how to use the models available in Riva TTS NIM.

## Synthesize using Magpie TTS Multilingual Model

This section assumes that you have deployed the **Magpie TTS Multilingual** model. Refer to the [Riva TTS NIM Getting Started](https://docs.nvidia.com/nim/riva/tts/latest/getting-started.html) documentation for deployment instructions.

The **Magpie TTS Multilingual** model supports both offline and online inference modes.

Create a Riva client and query the supported languages and voices.

In [None]:
auth = riva.client.Auth(uri='0.0.0.0:50051')
tts_service = riva.client.SpeechSynthesisService(auth)
list_voices(tts_service)
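
The string handling inside `list_voices` can be tried on its own. The sketch below assumes, as the utility function above does, that each model reports a `voice_name` and a comma-separated `subvoices` string of `name:id` pairs. The parameter dictionaries are made-up placeholders, not real server output, and the helper names are hypothetical.

```python
import json


def build_voice_names(voice_name, subvoices):
    """Turn a 'name:id,name:id' string into ['<voice_name>.<name>', ...]."""
    names = [
        entry.split(":")[0].strip()
        for entry in subvoices.split(",")
        if entry.strip()
    ]
    return [f"{voice_name}.{name}" for name in names]


def group_voices_by_language(model_parameters):
    """Collect full voice names per language code without overwriting."""
    voices = {}
    for params in model_parameters:
        full_names = build_voice_names(
            params["voice_name"], params.get("subvoices", "")
        )
        voices.setdefault(params["language_code"], []).extend(full_names)
    return voices


# Hypothetical parameters for illustration only; a real deployment
# reports its own language codes and voice names.
sample_parameters = [
    {"language_code": "en-US", "voice_name": "Example-Voice", "subvoices": "Alpha:0,Beta:1"},
    {"language_code": "en-US", "voice_name": "Other-Voice", "subvoices": "Gamma:0"},
    {"language_code": "es-US", "voice_name": "Example-Voice", "subvoices": "Delta:0"},
]

voices = group_voices_by_language(sample_parameters)
print(json.dumps(voices, indent=4))
```

The full names produced this way follow the `<voice_name>.<subvoice>` pattern that the `voice_name` field of a synthesis request expects.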

Perform offline inference using the `Synthesize` API. The response is returned once the entire audio is synthesized, and the output is saved to `output.wav`.

In [None]:
output_sample_rate = 44100
request = riva.client.proto.riva_tts_pb2.SynthesizeSpeechRequest(
    text = "Experience the future of speech AI with Riva, where every word comes to life with clarity and emotion.",
    language_code = "en-US",
    encoding = riva.client.AudioEncoding.LINEAR_PCM,
    sample_rate_hz = output_sample_rate,
    voice_name = "Magpie-Multilingual.EN-US.Sofia" # Change according to available voices
)

response = tts_service.stub.Synthesize(request)

save_audio_to_file("output.wav", response.audio, output_sample_rate)
ipd.Audio("output.wav")
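
Because the response carries raw `LINEAR_PCM` samples, the length of the returned audio follows directly from its byte count. The sketch below works through that arithmetic, assuming mono 16-bit samples (the same layout `save_audio_to_file` writes); the function name is hypothetical.

```python
def pcm_duration_seconds(num_bytes, sample_rate_hz, channels=1, sample_width_bytes=2):
    """Duration of raw PCM audio: bytes / (rate * channels * bytes per sample)."""
    bytes_per_second = sample_rate_hz * channels * sample_width_bytes
    return num_bytes / bytes_per_second


# 88,200 bytes of mono 16-bit audio at 44.1 kHz is exactly one second.
one_second = pcm_duration_seconds(88_200, 44_100)
print(one_second)  # 1.0
```

With a real response, `pcm_duration_seconds(len(response.audio), output_sample_rate)` gives the duration of the synthesized clip.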

Perform online (streaming) inference using the `SynthesizeOnline` API. Responses are received as soon as each audio chunk is synthesized. The cell below reuses the `request` object created for offline inference.

In [None]:
responses = tts_service.stub.SynthesizeOnline(request)
bytes_buffer = io.BytesIO()
for response in responses:
    bytes_buffer.write(response.audio)

save_audio_to_file("output_online.wav", bytes_buffer.getvalue(), output_sample_rate)
ipd.Audio("output_online.wav")
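
The latency benefit of streaming shows up as the time to the first audio chunk. The following sketch measures it for any iterable of byte chunks. It is not tied to the Riva client: `fake_audio_stream` is a hypothetical stand-in for a streaming response so that the example runs without a server.

```python
import io
import time


def collect_stream(chunks):
    """Concatenate audio chunks and time the first and last arrival.

    Returns (audio_bytes, seconds_to_first_chunk, total_seconds).
    seconds_to_first_chunk is None if the stream yields nothing.
    """
    start = time.perf_counter()
    first_chunk_s = None
    buffer = io.BytesIO()
    for chunk in chunks:
        if first_chunk_s is None:
            first_chunk_s = time.perf_counter() - start
        buffer.write(chunk)
    total_s = time.perf_counter() - start
    return buffer.getvalue(), first_chunk_s, total_s


def fake_audio_stream(num_chunks=3, chunk_bytes=4, delay_s=0.01):
    """Hypothetical stand-in for a streaming response: yields silent chunks."""
    for _ in range(num_chunks):
        time.sleep(delay_s)
        yield b"\x00" * chunk_bytes


audio, first_chunk_s, total_s = collect_stream(fake_audio_stream())
print(f"{len(audio)} bytes, first chunk after {first_chunk_s:.3f}s, total {total_s:.3f}s")
```

To time a real request, pass the audio of each response instead, for example `collect_stream(r.audio for r in tts_service.stub.SynthesizeOnline(request))`.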

Refer to the [Riva TTS NIM API Reference](https://docs.nvidia.com/nim/riva/tts/latest/protos.html) for more details about API usage.

## Synthesize using Magpie TTS Zeroshot Model

This section assumes that you have deployed the **Magpie TTS Zeroshot** model. Please refer to the [Riva TTS NIM Getting Started](https://docs.nvidia.com/nim/riva/tts/latest/getting-started.html) documentation for deployment instructions.

The **Magpie TTS Zeroshot** model supports text to speech using an input text and an audio prompt. Voice characteristics from the audio prompt are applied to the synthesized output speech. The following sections demonstrate the model's capability using the sample Python client.


Create a Riva client and query the supported languages and voices.

In [None]:
auth = riva.client.Auth(uri='0.0.0.0:50051')
tts_service = riva.client.SpeechSynthesisService(auth)
list_voices(tts_service)

Load the audio prompt to be used to synthesize speech.

In [None]:
zero_shot_audio_prompt_file = Path("~/tutorials/audio_samples/tts_samples/sample_audio_prompt.wav").expanduser() # Path to the audio prompt file
with zero_shot_audio_prompt_file.open('rb') as f:
  audio_prompt_data = f.read()
# f.read() returns b"" (never None) for an empty file, so test for emptiness
if not audio_prompt_data:
  raise ValueError("Audio prompt data is empty. Please check the file path and content.")

zero_shot_data = riva.client.proto.riva_tts_pb2.ZeroShotData(
  audio_prompt = audio_prompt_data,
  quality = 32,
)
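
Before sending an audio prompt, it can help to inspect its format. The following sketch reads the header of a WAV file with the standard-library `wave` module, which handles uncompressed PCM WAV files. The helper `describe_wav` is hypothetical, and the demo runs on a generated silent file rather than on the tutorial's sample prompt.

```python
import tempfile
import wave
from pathlib import Path


def describe_wav(path):
    """Return channel count, sample width, sample rate, and duration of a WAV file."""
    with wave.open(str(path), "rb") as wav_file:
        frames = wav_file.getnframes()
        rate = wav_file.getframerate()
        return {
            "channels": wav_file.getnchannels(),
            "sample_width_bytes": wav_file.getsampwidth(),
            "sample_rate_hz": rate,
            "duration_s": frames / rate,
        }


# Demo: write three seconds of 16-bit mono silence at 22.05 kHz, then describe it.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo_prompt.wav"
    with wave.open(str(demo_path), "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(22050)
        wav_file.writeframes(b"\x00\x00" * 22050 * 3)
    demo_info = describe_wav(demo_path)
print(demo_info)
```

Calling `describe_wav(zero_shot_audio_prompt_file)` reports the same properties for the prompt loaded above. Check the values against the prompt requirements in the Riva TTS NIM documentation.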

Perform online (streaming) inference using the `SynthesizeOnline` API. Responses are received as soon as each audio chunk is synthesized.

In [None]:
output_sample_rate = 44100
request = riva.client.proto.riva_tts_pb2.SynthesizeSpeechRequest(
  language_code = "en-US",
  encoding = riva.client.AudioEncoding.LINEAR_PCM,
  sample_rate_hz = output_sample_rate,
  text = "Experience the future of speech AI with Riva, where every word comes to life with clarity and emotion.",
  zero_shot_data = zero_shot_data,
)

responses = tts_service.stub.SynthesizeOnline(request)
bytes_buffer = io.BytesIO()
for response in responses:
    bytes_buffer.write(response.audio)

save_audio_to_file("output_magpie_zero_shot_online.wav", bytes_buffer.getvalue(), output_sample_rate)
ipd.Audio("output_magpie_zero_shot_online.wav")

Refer to the [Riva TTS NIM API Reference](https://docs.nvidia.com/nim/riva/tts/latest/protos.html) for more details about API usage.

## Synthesize using Magpie TTS Flow Model

This section assumes that you have deployed the **Magpie TTS Flow** model. Refer to the [Riva TTS NIM Getting Started](https://docs.nvidia.com/nim/riva/tts/latest/getting-started.html) documentation for deployment instructions.

The **Magpie TTS Flow** model supports text to speech using an input text and an audio prompt. Voice characteristics from the audio prompt are applied to the synthesized output speech. The following sections demonstrate the model's capability using the sample Python client.


Create a Riva client and query the supported languages and voices.

In [None]:
auth = riva.client.Auth(uri='0.0.0.0:50051')
tts_service = riva.client.SpeechSynthesisService(auth)
list_voices(tts_service)

The Magpie TTS Flow model supports only the offline API.

The following code loads the audio prompt to be used for speech synthesis and performs offline inference using the `Synthesize` API. The response is returned once the entire audio is synthesized.

In [None]:
zero_shot_audio_prompt_file = Path("~/tutorials/audio_samples/tts_samples/sample_audio_prompt.wav").expanduser() # Path to the audio prompt file
with zero_shot_audio_prompt_file.open('rb') as f:
  audio_prompt_data = f.read()
# f.read() returns b"" (never None) for an empty file, so test for emptiness
if not audio_prompt_data:
  raise ValueError("Audio prompt data is empty. Please check the file path and content.")

zero_shot_data = riva.client.proto.riva_tts_pb2.ZeroShotData(
  audio_prompt = audio_prompt_data,
  quality = 32,
  transcript = "I consent to use my voice to create a synthetic voice."
)

output_sample_rate = 44100 # Defined here so this cell also runs in a fresh session
request = riva.client.proto.riva_tts_pb2.SynthesizeSpeechRequest(
  language_code = "en-US",
  encoding = riva.client.AudioEncoding.LINEAR_PCM,
  sample_rate_hz = output_sample_rate,
  text = "Experience the future of speech AI with Riva, where every word comes to life with clarity and emotion.",
  zero_shot_data = zero_shot_data,
)
response = tts_service.stub.Synthesize(request)
save_audio_to_file("output_magpie_flow.wav", response.audio, output_sample_rate)
ipd.Audio("output_magpie_flow.wav")

## Customizing Riva TTS audio output with SSML

Speech Synthesis Markup Language (SSML) is a markup language for directing the performance of the virtual speaker. Riva supports portions of SSML, allowing you to adjust pitch, rate, and pronunciation of the generated audio.

All SSML inputs must be valid XML documents and use the `<speak>` root tag. Input that is not valid XML, or that is valid XML with a different root tag, is treated as raw input text.
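
Since malformed SSML is silently read as plain text, a local check before sending a request can catch mistakes early. The sketch below mirrors the rule stated above using the standard-library XML parser. It is an approximation for illustration, not the server's actual parsing logic, and `is_ssml` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def is_ssml(text):
    """Return True if text is well-formed XML whose root element is <speak>."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        # Not well-formed XML: by the rule above, this would be read as raw text.
        return False
    return root.tag == "speak"


candidates = [
    "<speak>Hello there.</speak>",   # valid XML, speak root
    "<p>Hello there.</p>",           # valid XML, different root
    "<speak>Hello there.",           # unclosed tag, not valid XML
    "Hello there.",                  # plain text
]
results = [is_ssml(text) for text in candidates]
print(results)  # [True, False, False, False]
```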

Riva TTS supports the following SSML tags:

- The `phoneme` tag, which controls the pronunciation of the generated audio.

The following section looks at this customization in more detail.

### Customizing pronunciation with the `phoneme` tag

Use the `phoneme` tag to override the predicted pronunciation of a word. For a given word or sequence of words, use the `ph` attribute to provide an explicit pronunciation and the `alphabet` attribute to specify the phone set.

`ipa` is the only pronunciation alphabet that Riva TTS supports for TTS models.

#### IPA
For the full list of supported `ipa` phonemes, refer to the [Riva TTS Phoneme Support](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/tts/tts-phones.html) page.

#### Arpabet
The ARPABET resources below are listed for reference; as stated above, `ipa` is the only alphabet supported for these TTS models. The full list of phonemes in the CMUdict can be found at [cmudict.phones](https://github.com/cmusphinx/cmudict/blob/master/cmudict.phones). The list of supported symbols with stress can be found at [cmudict.symbols](https://github.com/cmusphinx/cmudict/blob/master/cmudict.symbols). For a mapping of these phones to English sounds, refer to the [ARPABET Wikipedia page](https://en.wikipedia.org/wiki/ARPABET).

The following example customizes the pronunciation of the generated audio using the `phoneme` tag.

In [None]:
# SSML text example with the phoneme tag
"""
Instructions for using the phoneme tag:
1. Enclose the raw text in '<speak>' tags, as required for SSML
2. For a substring in the raw text, add '<phoneme>' tags with the 'alphabet' attribute set to 'ipa'
       (currently the only supported value) and the 'ph' attribute set to a custom IPA pronunciation
"""
ssml_texts = [
  """<speak>You say <phoneme alphabet='ipa' ph='təˈmeɪˌtoʊ'>tomato</phoneme>, I say <phoneme alphabet='ipa' ph='təˈmɑˌtoʊ'>tomato</phoneme>.</speak>""",
]

# Loop through 'ssml_texts' and synthesize audio with Riva TTS for each entry.
# This reuses the 'request' object created in the previous section.
for i, ssml_text in enumerate(ssml_texts):
    request.text = ssml_text
    response = tts_service.stub.Synthesize(request)
    save_audio_to_file(f"output_ssml_{i}.wav", response.audio, 44100)
    print(f"Synthesized audio for SSML Text: {ssml_text}")
    ipd.display(ipd.Audio(f"output_ssml_{i}.wav", rate=44100))
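
Writing SSML strings by hand makes it easy to break the XML with an unescaped `&`, `<`, or quote. The following is a minimal sketch that builds the same kind of string programmatically with standard-library escaping; `phoneme` and `speak` are hypothetical helpers, not part of the Riva client.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def phoneme(word, ipa):
    """Wrap a word in a phoneme tag with an IPA pronunciation."""
    # quoteattr adds the surrounding quotes and escapes the attribute value.
    return f'<phoneme alphabet="ipa" ph={quoteattr(ipa)}>{escape(word)}</phoneme>'


def speak(*parts):
    """Join already-escaped fragments under the required <speak> root."""
    return "<speak>" + "".join(parts) + "</speak>"


ssml = speak(
    escape("You say "),
    phoneme("tomato", "təˈmeɪˌtoʊ"),
    escape(", I say "),
    phoneme("tomato", "təˈmɑˌtoʊ"),
    escape("."),
)

# Parse the result back to confirm it is well-formed and carries both pronunciations.
root = ET.fromstring(ssml)
pronunciations = [element.get("ph") for element in root.iter("phoneme")]
print(ssml)
print(pronunciations)
```

The resulting string can be assigned to `request.text` in the same way as the entries of `ssml_texts`.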
