Controllable TTS (SSML)#

Riva supports portions of the Speech Synthesis Markup Language (SSML) specification. SSML is a markup language for directing the performance of the virtual speaker. Using SSML, you can adjust pitch, rate, and volume, and override pronunciation through the phoneme tag.

Only the FastPitch model is supported at this time. The FastPitch model must be exported using NeMo>=1.5.1 and the nemo2riva>=1.8.0 tool. Every SSML input must be a valid XML document that uses the <speak> root tag. Invalid XML, and valid XML with a different root tag, is treated as raw input text. Riva currently supports the following:

Prosody#

Pitch Attribute#

Riva supports an additive relative change to the pitch. The pitch attribute has a range of [-3, 3]. Values outside this range result in an error being logged and no audio being returned. The resulting pitch shift is the attribute value multiplied by the speaker’s pitch standard deviation, as measured when the FastPitch model was trained. For the pretrained checkpoint that was trained on LJSpeech, the standard deviation was 52.185 Hz. For example, a pitch attribute of 1.25 shifts the pitch up by 1.25 * 52.185 ≈ 65.23 Hz. Riva also supports the named pitch values from the SSML specification: x-low, low, medium, high, x-high, and default.

The pitch attribute is expressed in the following formats:

  • pitch="1"

  • pitch="+1.8"

  • pitch="-0.65"

  • pitch="high"

  • pitch="default"

For the pretrained Female-1 voice, the standard deviation is 53.33 Hz.

For the pretrained Male-1 voice, the standard deviation is 47.15 Hz.
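The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the stated formula (attribute value times the voice's pitch standard deviation); the function name pitch_shift_hz is hypothetical and is not part of any Riva API.

```python
PITCH_MIN, PITCH_MAX = -3.0, 3.0  # numeric range documented for the pitch attribute


def pitch_shift_hz(pitch: float, std_hz: float) -> float:
    """Return the approximate pitch shift in Hz for a numeric pitch attribute.

    Hypothetical helper: it only mirrors the documented formula
    shift = pitch * (speaker pitch standard deviation).
    """
    if not PITCH_MIN <= pitch <= PITCH_MAX:
        raise ValueError(f"pitch {pitch} is outside [{PITCH_MIN}, {PITCH_MAX}]")
    return pitch * std_hz


print(round(pitch_shift_hz(1.25, 52.185), 2))  # 65.23 (LJSpeech checkpoint)
print(pitch_shift_hz(-2.0, 47.15))             # -94.3 (Male-1 voice)
```

The same attribute value therefore produces a different shift in Hz for each voice, because each voice has its own standard deviation.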

Warning

The pitch attribute does not support changes expressed in Hz, st (semitones), or %.

Rate Attribute#

Riva supports a percentage-based relative change to the rate. The rate attribute has a range of [25%, 250%]. Values outside this range result in an error being logged and no audio being returned. Riva also supports the named rate values x-low, low, medium, high, x-high, and default.

The rate attribute is expressed in the following formats:

  • rate="35%"

  • rate="+200%"

  • rate="low"

  • rate="default"

Volume Attribute#

Riva supports the volume attribute as described in the SSML specs. The volume attribute supports a range of [-13, 8] dB. Values outside this range result in an error being logged and no audio being returned. The named values silent, x-soft, soft, medium, loud, x-loud, and default are also supported.

The volume attribute is expressed in the following formats:

  • volume="+1dB"

  • volume="-5.7dB"

  • volume="x-loud"

  • volume="default"
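Because an out-of-range value returns no audio, it can help to check the volume attribute on the client before sending a request. The following is a minimal sketch of such a check, assuming the accepted syntax is exactly the formats listed above (a signed or unsigned decimal followed by dB, or one of the named values). The helper is_valid_volume is hypothetical and does not reproduce Riva's own parser.

```python
import re

# Named values listed in the Volume Attribute section.
VOLUME_KEYWORDS = {"silent", "x-soft", "soft", "medium", "loud", "x-loud", "default"}
# Assumed numeric syntax: optional sign, decimal number, "dB" suffix.
VOLUME_DB = re.compile(r"([+-]?\d+(?:\.\d+)?)dB")


def is_valid_volume(value: str) -> bool:
    """Client-side pre-check of a volume attribute value (hypothetical helper)."""
    if value in VOLUME_KEYWORDS:
        return True
    match = VOLUME_DB.fullmatch(value)
    # The documented range is [-13, 8] dB, inclusive.
    return match is not None and -13.0 <= float(match.group(1)) <= 8.0


print(is_valid_volume("+1dB"))    # True
print(is_valid_volume("x-loud"))  # True
print(is_valid_volume("9dB"))     # False: above the 8 dB limit
```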

Examples#

<speak>
    <prosody pitch="1.0">Now I'm speaking a</prosody>
    <prosody pitch="2.0">bit</prosody>
    higher.
</speak>

<speak><prosody rate="200%">This is a fast sentence.</prosody></speak>
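When SSML is assembled from user text, an unescaped character such as & makes the input invalid XML, and it is then treated as raw text. A minimal sketch of one way to avoid this, using only the Python standard library: build the document with xml.etree.ElementTree so that escaping is handled automatically, then parse it back to confirm that it is well formed and has a <speak> root. The helper names prosody_ssml and is_ssml are hypothetical, and the check only approximates the rule described on this page.

```python
import xml.etree.ElementTree as ET


def prosody_ssml(text: str, **attrs: str) -> str:
    """Wrap text in <speak><prosody ...>; ElementTree escapes special characters."""
    speak = ET.Element("speak")
    prosody = ET.SubElement(speak, "prosody", attrs)
    prosody.text = text
    return ET.tostring(speak, encoding="unicode")


def is_ssml(text: str) -> bool:
    """Return True if text is well-formed XML whose root element is <speak>."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return False  # not well-formed XML
    return root.tag == "speak"


fast = prosody_ssml("This is a fast sentence.", rate="200%")
print(fast)
# <speak><prosody rate="200%">This is a fast sentence.</prosody></speak>
print(is_ssml(prosody_ssml("Tom & Jerry", pitch="high")))  # True: & was escaped
print(is_ssml("<speak>Tom & Jerry</speak>"))               # False: raw ampersand
```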

Phoneme#

Use the phoneme tag to override the predicted pronunciation of words. For a given word or sequence of words, provide an explicit pronunciation by setting the ph attribute, and specify the phone set used with the alphabet attribute. Currently, only x-arpabet is supported, for pronunciation dictionaries based on CMUdict. IPA support will be added soon.

The full list of phonemes in the CMUdict can be found at cmudict/cmudict.phones. The list of supported symbols with stress can be found at cmudict/cmudict.symbols. For a mapping of these phones to English sounds, refer to the ARPABET article on Wikipedia.

Examples#

<speak>
    Same thing!
    <phoneme alphabet="x-arpabet" ph="{@T}{@AH0}{@M}{@EY1}{@T}{@OW2}">tomato</phoneme>,
    <phoneme alphabet="x-arpabet" ph="{@T}{@AH0}{@M}{@AA1}{@T}{@OW2}">tomato</phoneme>.
</speak>
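The ph strings in the example follow a regular pattern: each ARPABET symbol is wrapped as {@SYMBOL}. The sketch below generates such a tag from a list of symbols. It assumes the format is exactly what the example shows; the helpers arpabet_ph and phoneme_tag are hypothetical and do not check the symbols against cmudict.symbols.

```python
from xml.sax.saxutils import escape, quoteattr


def arpabet_ph(phones):
    """Join ARPABET symbols into the {@SYMBOL} form used in the example above."""
    return "".join("{@" + phone + "}" for phone in phones)


def phoneme_tag(word, phones):
    """Build a <phoneme> element for one word (hypothetical helper)."""
    ph = quoteattr(arpabet_ph(phones))  # adds the surrounding quotes
    return f'<phoneme alphabet="x-arpabet" ph={ph}>{escape(word)}</phoneme>'


print(phoneme_tag("tomato", ["T", "AH0", "M", "EY1", "T", "OW2"]))
# <phoneme alphabet="x-arpabet" ph="{@T}{@AH0}{@M}{@EY1}{@T}{@OW2}">tomato</phoneme>
```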

Sub#

Use the sub tag to replace the pronunciation of the specified word or phrase with that of a different word or phrase. Specify the text to be spoken instead with the alias attribute.

Examples#

<speak>
    <sub alias="World Wide Web">WWW</sub>
</speak>
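For text that contains many abbreviations, the sub tags can be generated from a lookup table. The following is a minimal sketch under simple assumptions: words are separated by whitespace, and only exact whole-word matches are replaced. The function apply_aliases and the alias table are hypothetical examples.

```python
from xml.sax.saxutils import escape, quoteattr


def apply_aliases(text: str, aliases: dict) -> str:
    """Wrap every whitespace-separated word found in aliases in a <sub> tag."""
    parts = []
    for word in text.split():
        if word in aliases:
            alias = quoteattr(aliases[word])  # quoted and escaped attribute value
            parts.append(f"<sub alias={alias}>{escape(word)}</sub>")
        else:
            parts.append(escape(word))  # keep the result valid XML
    return "<speak>" + " ".join(parts) + "</speak>"


print(apply_aliases("Visit the WWW today", {"WWW": "World Wide Web"}))
# <speak>Visit the <sub alias="World Wide Web">WWW</sub> today</speak>
```

A word with attached punctuation, such as "WWW," with a trailing comma, does not match in this sketch; a fuller version would need to split punctuation first.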

Emphasis#

Use the emphasis tag to emphasize words. Use riva-build with enable_emphasis_tag, start_of_emphasis_token, and end_of_emphasis_token to enable the emphasis feature.

Examples#

<speak>
    <emphasis>Hello</emphasis> World!
</speak>

The emphasis tag should be used on a per-word basis. If the word ends with a punctuation mark, only the word is emphasized, not the punctuation.

Limitation#

The emphasis tag depends on the training data and is available only in a few models. Models trained without the emphasis tag in their training data do not produce emphasized speech. Input text with more than one word wrapped in a single emphasis tag is invalid. A space inside the emphasis tag also makes the input invalid.

Warning

The emphasis tag feature does not support nesting of other SSML tags inside it. The emphasis tag does not support the level attribute.
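The restrictions above (one word per tag, no spaces, no nested tags, no level attribute) can be checked before a request is sent. The following is a minimal sketch of such a check with the Python standard library; emphasis_problems is a hypothetical helper that encodes only the rules stated on this page, not Riva's actual validation.

```python
import xml.etree.ElementTree as ET


def emphasis_problems(ssml: str) -> list:
    """Return a list of rule violations found in <emphasis> tags."""
    problems = []
    root = ET.fromstring(ssml)  # raises ParseError if the input is not valid XML
    for elem in root.iter("emphasis"):
        if "level" in elem.attrib:
            problems.append("level attribute is not supported")
        if len(elem) > 0:
            problems.append("nested tags are not supported")
            continue  # the text checks below assume plain text content
        text = elem.text or ""
        if not text or any(ch.isspace() for ch in text):
            problems.append("content must be a single word with no spaces")
    return problems


print(emphasis_problems("<speak><emphasis>Hello</emphasis> World!</speak>"))
# []
print(emphasis_problems("<speak><emphasis>Hello World</emphasis></speak>"))
# ['content must be a single word with no spaces']
```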

Try It Now#

For SSML examples with sample audio, refer to the TTS Tutorials section.