Customization#

Customizing using SSML#

Speech Synthesis Markup Language (SSML) is a markup language for directing the performance of the virtual speaker. Riva supports a subset of SSML, allowing you to adjust the pitch, rate, volume, and pronunciation of the generated audio.

This section provides examples of how to customize Riva TTS with the following SSML tags:

  • The prosody tag, whose rate, pitch, and volume attributes control the rate, pitch, and volume of the generated audio.

  • The phoneme tag, which controls the pronunciation of the generated audio.

  • The sub tag, which replaces the pronunciation of the specified word or phrase with that of a different word or phrase.

The following table lists the SSML tags supported by each model.

Model                      Prosody tag    Phoneme tag    Sub tag

Magpie TTS Multilingual

Magpie TTS Zero Shot

Magpie Flow

Fastpitch HifiGAN en-US

Note

All SSML inputs must be valid XML documents and use the <speak> root tag. Input that is not valid XML, or that is valid XML with a different root tag, is treated as raw input text.
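The rule in this note can be checked on the client side before a request is sent. The following is a minimal sketch, not Riva's own parser: the helper name is_ssml is hypothetical, and the root tag is assumed to be <speak>, as used in the examples in this section.

```python
import xml.etree.ElementTree as ET


def is_ssml(text: str, root_tag: str = "speak") -> bool:
    """Return True if text is well-formed XML whose root element is root_tag.

    Hypothetical helper: it mirrors the documented rule only and does not
    check whether the tags inside the root are supported by a given model.
    """
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        # Not well-formed XML, so it would be treated as raw input text.
        return False
    return root.tag == root_tag


print(is_ssml("<speak>Hello <prosody rate='low'>world</prosody></speak>"))
print(is_ssml("<p>Hello world</p>"))   # valid XML, different root tag
print(is_ssml("Hello world"))          # plain text, not XML
```

A check like this only predicts whether the input is likely to be interpreted as SSML or read out as raw text; the server remains the authority.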

Rate Attribute#

Riva supports changing the rate by a relative percentage. The rate attribute has a range of [25%, 250%]. For values outside this range, Riva logs an error and returns no audio. Riva also supports the following keyword values as per the SSML specs: x-low, low, medium, high, x-high, and default.

The rate attribute is expressed in the following formats:

  • rate="35%"

  • rate="+200%"

  • rate="low"

Pitch Attribute#

Riva supports an additive relative change to the pitch. The pitch attribute has a range of [-3, 3] for unitless values or [-150, 150] Hz for values given in hertz. For values outside this range, Riva logs an error and returns no audio.

When the value is a number that does not end in Hz, the pitch is shifted by that value multiplied by the speaker's pitch standard deviation, as defined in the model configs. For the pretrained checkpoint trained on LJSpeech, the standard deviation is 52.185 Hz. For example, a pitch value of 1.25 raises the pitch by 1.25 × 52.185 ≈ 65.23 Hz.
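The arithmetic above can be written as a one-line conversion. This is a sketch of the documented formula only; the function name pitch_shift_hz is hypothetical, and the default standard deviation is the LJSpeech value quoted above.

```python
def pitch_shift_hz(shift: float, pitch_std_hz: float = 52.185) -> float:
    """Convert a unitless pitch value into the resulting shift in Hz.

    shift is the unitless pitch attribute value; pitch_std_hz is the
    speaker's pitch standard deviation from the model config.
    """
    return shift * pitch_std_hz


# Reproduces the worked example: 1.25 * 52.185 is approximately 65.23 Hz.
print(round(pitch_shift_hz(1.25), 2))  # 65.23
```

Passing a different pitch_std_hz shows how the same unitless value produces a different shift in Hz for a different speaker.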

Riva also supports the following tags as per the SSML specs: x-low, low, medium, high, x-high, and default.

The pitch attribute is expressed in the following formats:

  • pitch="1"

  • pitch="95hZ"

  • pitch="+1.8"

  • pitch="-0.65"

  • pitch="+75Hz"

  • pitch="-84.5Hz"

  • pitch="high"

For the pretrained Female-1 checkpoint, the standard deviation is 53.33 Hz. For the pretrained Male-1 checkpoint, the standard deviation is 47.15 Hz.

Note

The pitch attribute does not support changes expressed in semitones (st) or as a percentage (%).

Volume Attribute#

Riva supports the volume attribute as described in the SSML specs. The volume attribute supports a range of [-13, 8] dB. For values outside this range, Riva logs an error and returns no audio. The keyword values silent, x-soft, soft, medium, loud, x-loud, and default are also supported.

The volume attribute is expressed in the following formats:

  • volume="+1dB"

  • volume="-5.7dB"

  • volume="x-loud"
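Because an out-of-range volume yields no audio, it can help to check the value before sending a request. The sketch below encodes only the range and keywords stated above. The helper name is_valid_volume is hypothetical, and the assumption that numeric values must carry an exact "dB" suffix is inferred from the listed formats rather than confirmed by Riva.

```python
import re

# Keyword values listed in this section.
VOLUME_KEYWORDS = {"silent", "x-soft", "soft", "medium", "loud", "x-loud", "default"}

# Assumption: a signed or unsigned decimal number followed by "dB".
_DB_PATTERN = re.compile(r"^([+-]?\d+(?:\.\d+)?)dB$")


def is_valid_volume(value: str) -> bool:
    """Return True if value is a listed keyword or lies within [-13, 8] dB."""
    if value in VOLUME_KEYWORDS:
        return True
    match = _DB_PATTERN.match(value)
    if match is None:
        return False
    return -13.0 <= float(match.group(1)) <= 8.0


for candidate in ("+1dB", "-5.7dB", "x-loud", "+9dB"):
    print(candidate, is_valid_volume(candidate))
```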

Examples#

Customizing rate, pitch, and volume with the prosody tag

python3 python-clients/scripts/tts/talk.py --server 0.0.0.0:50051 \
    --text "<speak><prosody pitch='2.5'>Today is a sunny day</prosody>. <prosody rate='high' volume='+1dB'>But it might rain tomorrow.</prosody></speak>" --language-code en-US
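When the SSML string is built in a program instead of typed on the command line, escaping the text and quoting the attributes keeps the document valid XML. The following is a minimal sketch using only the Python standard library; the helper names prosody and speak are hypothetical and are not part of the Riva client.

```python
from xml.sax.saxutils import escape, quoteattr


def prosody(text, rate=None, pitch=None, volume=None):
    """Wrap text in a prosody tag, emitting only the attributes that are set."""
    attrs = {"rate": rate, "pitch": pitch, "volume": volume}
    attr_text = "".join(
        f" {name}={quoteattr(str(value))}"
        for name, value in attrs.items()
        if value is not None
    )
    # escape() protects characters such as & and < inside the spoken text.
    return f"<prosody{attr_text}>{escape(text)}</prosody>"


def speak(*parts):
    """Join already-formed SSML fragments under a single speak root tag."""
    return "<speak>" + "".join(parts) + "</speak>"


ssml = speak(
    prosody("Today is a sunny day", pitch="2.5"),
    ". ",
    prosody("But it might rain tomorrow.", rate="high", volume="+1dB"),
)
print(ssml)
```

The printed string can then be passed as the --text argument of the command above.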

Customizing pronunciation with the phoneme tag

python3 python-clients/scripts/tts/talk.py --server 0.0.0.0:50051 --text "<speak>You say <phoneme alphabet='ipa' ph='təˈmeɪˌtoʊ'>tomato</phoneme>, I say <phoneme alphabet='ipa' ph='təˈmɑˌtoʊ'>tomato</phoneme>.</speak>" --language-code en-US

Replacing pronunciation with the sub tag

python3 python-clients/scripts/tts/talk.py --server 0.0.0.0:50051 --text "<speak><sub alias='World Wide Web'>WWW</sub> is known as the web.</speak>" --language-code en-US

The synthesized audio file output.wav will contain the resulting speech with SSML attributes applied.

Customizing using Custom Pronunciation Dictionary#

Riva TTS supports custom pronunciation through a text-based dictionary that maps words to their desired phonetic representations. Each entry consists of a word (grapheme) followed by its pronunciation (phoneme), separated by exactly two spaces. Put each entry on its own line; a single dictionary file can hold any number of entries. Multi-word grapheme/phoneme pairs should be split into individual lines, one word per line. To use the dictionary, specify the file with the custom_dictionary parameter in your client request configuration. Refer to Phoneme Support for a list of supported phonemes.

Example#

python3 python-clients/scripts/tts/talk.py --server 0.0.0.0:50051 --text "Today is a sunny day, a great day to eat fresh tomato" --language-code en-US --custom-dictionary custom_dict.txt

The custom dictionary file custom_dict.txt should contain the following text:

sunny  ˈsʌnɪ
tomato  ˈtɑˌməʊ
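Because the separator must be exactly two spaces, generating the file from a mapping avoids formatting slips. This is a minimal sketch of the format described in this section; the function name format_custom_dictionary is hypothetical, and it does not check the phonemes against the supported phoneme list.

```python
from typing import Mapping


def format_custom_dictionary(entries: Mapping[str, str]) -> str:
    """Render word-to-phoneme pairs as dictionary text, one entry per line."""
    lines = []
    for grapheme, phoneme in entries.items():
        if not grapheme or not phoneme:
            raise ValueError("each entry needs both a word and a pronunciation")
        if any(ch.isspace() for ch in grapheme + phoneme):
            # Multi-word pairs should be split into one entry per word.
            raise ValueError(f"split multi-word entry into single words: {grapheme!r}")
        # Exactly two spaces separate the grapheme from the phoneme.
        lines.append(f"{grapheme}  {phoneme}")
    return "\n".join(lines) + "\n"


content = format_custom_dictionary({"sunny": "ˈsʌnɪ", "tomato": "ˈtɑˌməʊ"})
print(content, end="")
```

Writing content to custom_dict.txt with UTF-8 encoding produces the file shown above.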

The synthesized audio file output.wav will contain the resulting speech with the custom pronunciation applied.