Customizing using SSML#

Speech Synthesis Markup Language (SSML) is a markup language for directing the performance of the virtual speaker. Riva supports portions of SSML, allowing you to adjust the pitch, rate, and pronunciation of the generated audio.

This section provides examples of how to customize Riva TTS with the following SSML tags:

The prosody tag, whose rate, pitch, and volume attributes control the speaking rate, pitch, and volume of the generated audio.
The phoneme tag, which allows us to control the pronunciation of the generated audio.
The sub tag, which allows us to replace the pronunciation of the specified word or phrase with a different word or phrase.

The following table lists which SSML tags each model supports.

Model	Prosody tag	Phoneme tag	Sub tag
Magpie TTS Multilingual	❌	✅	❌
Magpie TTS Zero Shot	❌	✅	❌
Magpie Flow	❌	❌	❌
Fastpitch HifiGAN en-US	✅	✅	✅

Note

All SSML inputs must be valid XML documents and use the <speak> root tag. Input that is not valid XML, or that is valid XML with a different root tag, is treated as raw input text.
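
The rule in this note can be checked on the client side before sending a request. The following is a minimal sketch, assuming the root tag is <speak> (as in the examples in this section) and that no XML namespace is declared on it. The helper name is_ssml is hypothetical, and the check only mirrors the rule as stated here, not Riva's actual parser.

```python
import xml.etree.ElementTree as ET


def is_ssml(text: str) -> bool:
    """Return True if text is well-formed XML whose root element is <speak>.

    Hypothetical helper: it approximates the documented rule and is not
    part of the Riva client library.
    """
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        # Not valid XML, so it would be treated as raw input text.
        return False
    # Valid XML with a different root tag is also treated as raw text.
    return root.tag == "speak"


if __name__ == "__main__":
    for candidate in ("<speak>Hello</speak>", "<p>Hello</p>", "Hello"):
        print(repr(candidate), is_ssml(candidate))
```

A check like this helps catch the case where a malformed tag silently causes the markup to be read aloud as plain text.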

Rate Attribute#

Riva supports a percentage relative change to the rate. The rate attribute has a range of [25%, 250%]. Values outside this range result in an error being logged and no audio returned. Riva also supports the following tags as per the SSML specs: x-low, low, medium, high, x-high, and default.

The rate attribute is expressed in the following formats:

rate="35%"
rate="+200%"
rate="low"

Pitch Attribute#

Riva supports an additive relative change to the pitch. The pitch attribute has a range of [-3, 3] for unitless values, or [-150, 150] Hz for values given in hertz. Values outside this range result in an error being logged and no audio returned.

When a pitch value does not end in Hz, the pitch is shifted by that value multiplied by the speaker's pitch standard deviation, as defined in the model configs. For the pretrained checkpoint trained on LJSpeech, the standard deviation is 52.185 Hz. For example, a pitch value of 1.25 shifts the pitch up by 1.25 × 52.185 ≈ 65.23 Hz.

Riva also supports the following tags as per the SSML specs: x-low, low, medium, high, x-high, and default.

The pitch attribute is expressed in the following formats:

pitch="1"
pitch="95hZ"
pitch="+1.8"
pitch="-0.65"
pitch="+75Hz"
pitch="-84.5Hz"
pitch="high"

For the pretrained Female-1 checkpoint, the standard deviation is 53.33 Hz. For the pretrained Male-1 checkpoint, the standard deviation is 47.15 Hz.
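
The conversion from a unitless pitch value to a shift in Hz can be worked through for each checkpoint mentioned above. This is a small illustrative sketch: the dictionary and the function pitch_shift_hz are hypothetical names, and the range check simply restates the documented [-3, 3] limit rather than reproducing Riva's internal validation.

```python
# Pitch standard deviations (Hz) quoted in this section.
PITCH_STD_HZ = {
    "LJSpeech": 52.185,
    "Female-1": 53.33,
    "Male-1": 47.15,
}


def pitch_shift_hz(value: float, std_hz: float) -> float:
    """Convert a unitless pitch value into a shift in Hz.

    Hypothetical helper: shift = value * speaker pitch standard deviation.
    """
    if not -3.0 <= value <= 3.0:
        raise ValueError(f"pitch value {value} is outside [-3, 3]")
    return value * std_hz


if __name__ == "__main__":
    for name, std in PITCH_STD_HZ.items():
        shift = pitch_shift_hz(1.25, std)
        print(f"{name}: pitch=1.25 shifts by {shift:.2f} Hz")
```

The same unitless value therefore produces a slightly different shift in Hz for each voice, because each checkpoint has its own standard deviation.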

Note

The pitch attribute does not support st and % changes.

Volume Attribute#

Riva supports the volume attribute as described in the SSML specs. The volume attribute supports a range of [-13, 8] dB. Values outside this range result in an error being logged and no audio returned. The tags silent, x-soft, soft, medium, loud, x-loud, and default are also supported.

The volume attribute is expressed in the following formats:

volume="+1dB"
volume="-5.7dB"
volume="x-loud"
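
Because out-of-range values produce no audio, it can be convenient to validate a volume value before building the request. The sketch below assumes the value is either one of the named levels or a signed or unsigned decimal number followed by exactly "dB". The function is_valid_volume is a hypothetical helper, and Riva's own parsing may be more or less permissive than this.

```python
import re

# Named volume levels listed in this section.
VOLUME_LABELS = {"silent", "x-soft", "soft", "medium", "loud", "x-loud", "default"}

# Assumed format: optional sign, decimal number, literal "dB" suffix.
_DB_PATTERN = re.compile(r"^([+-]?\d+(?:\.\d+)?)dB$")


def is_valid_volume(value: str) -> bool:
    """Return True if value is a named level or a dB value within [-13, 8]."""
    if value in VOLUME_LABELS:
        return True
    match = _DB_PATTERN.match(value)
    if match is None:
        return False
    return -13.0 <= float(match.group(1)) <= 8.0


if __name__ == "__main__":
    for candidate in ("+1dB", "-5.7dB", "x-loud", "+9dB"):
        print(candidate, is_valid_volume(candidate))
```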

Examples#

Customizing rate, pitch, and volume with the prosody tag

python3 python-clients/scripts/tts/talk.py --server 0.0.0.0:50051 \
    --text "<speak><prosody pitch='2.5'>Today is a sunny day</prosody>. <prosody rate='high' volume='+1dB'>But it might rain tomorrow.</prosody></speak>" --language-code en-US

Customizing pronunciation with the phoneme tag

python3 python-clients/scripts/tts/talk.py --server 0.0.0.0:50051 --text "<speak>You say <phoneme alphabet='ipa' ph='təˈmeɪˌtoʊ'>tomato</phoneme>, I say <phoneme alphabet='ipa' ph='təˈmɑˌtoʊ'>tomato</phoneme>.</speak>" --language-code en-US

Replacing pronunciation with the sub tag

python3 python-clients/scripts/tts/talk.py --server 0.0.0.0:50051 --text "<speak><sub alias='World Wide Web'>WWW</sub> is known as the web.</speak>" --language-code en-US

The synthesized audio file output.wav will contain the resulting speech with SSML attributes applied.
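
SSML strings like the ones in the commands above can also be generated programmatically, which keeps the markup well-formed and escapes characters such as & in the spoken text. The following is a minimal sketch using only the Python standard library. The function build_prosody_ssml is a hypothetical helper, not part of the Riva client, and it only constructs the string; it does not contact a server.

```python
import xml.etree.ElementTree as ET


def build_prosody_ssml(text: str, **attrs: str) -> str:
    """Wrap text in <speak><prosody ...>...</prosody></speak>.

    Hypothetical helper: attrs are emitted as prosody attributes
    (for example rate, pitch, volume) in the order given.
    """
    speak = ET.Element("speak")
    prosody = ET.SubElement(speak, "prosody", attrs)
    prosody.text = text  # special characters are escaped on serialization
    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    ssml = build_prosody_ssml("Today is a sunny day", pitch="2.5")
    print(ssml)
    print(build_prosody_ssml("Rock & roll", rate="high", volume="+1dB"))
```

The resulting string can be passed as the --text argument of talk.py in place of a hand-written one.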