Controllable TTS (SSML)#

Riva supports portions of the Speech Synthesis Markup Language (SSML) specification. SSML is a markup language for directing the performance of the virtual speaker. Using SSML, you can adjust pitch, rate, and volume through the prosody tag, and override pronunciation through the phoneme tag.

Only the FastPitch model is supported at this time. The FastPitch model must be exported using NeMo>=1.5.1 and the nemo2riva>=1.8.0 tool. Every SSML input must be a valid XML document with <speak> as its root tag. Input that is not valid XML, or that is valid XML with a different root tag, is treated as raw input text. Riva currently supports the following:
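The rule above (valid XML with a <speak> root, otherwise raw text) can be checked on the client before a request is sent. The following is a minimal sketch using Python's standard library; the helper name is_ssml is hypothetical and is not part of any Riva API, and Riva's own parser may be stricter or more lenient than this check.

```python
import xml.etree.ElementTree as ET


def is_ssml(text: str) -> bool:
    """Return True if text is well-formed XML whose root tag is <speak>.

    Hypothetical client-side helper; it only approximates the rule
    described in the documentation.
    """
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        # Not well-formed XML: the input would be treated as raw text.
        return False
    # Well-formed XML with another root tag is also treated as raw text.
    return root.tag == "speak"


print(is_ssml("<speak>Hello <prosody rate='200%'>world</prosody></speak>"))  # True
print(is_ssml("<p>Hello world</p>"))                                         # False
print(is_ssml("Hello & goodbye"))                                            # False
```

A check like this is useful mainly for catching markup mistakes early, because a malformed document is otherwise synthesized as literal text, tags included.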

Prosody#

Pitch Attribute#

Riva supports an additive relative change to the pitch. The pitch attribute has a range of [-3, 3]. Values outside this range result in an error being logged and no audio being returned. The resulting pitch shift is the attribute value multiplied by the speaker’s pitch standard deviation, as measured when the FastPitch model was trained. For the pretrained checkpoint that was trained on LJSpeech, the standard deviation is 52.185 Hz. For example, a pitch value of 1.25 shifts the pitch up by 1.25 × 52.185 ≈ 65.23 Hz. Riva also supports the prosody tags defined in the SSML specification: x-low, low, medium, high, x-high, and default.

The pitch attribute is expressed in the following formats:

  • pitch="1"

  • pitch="+1.8"

  • pitch="-0.65"

  • pitch="high"

  • pitch="default"

For the pretrained Female-1 checkpoint, the standard deviation is 53.33 Hz.

For the pretrained Male-1 checkpoint, the standard deviation is 47.15 Hz.
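The relationship between the pitch attribute and the shift in Hz can be sketched as a short calculation. The standard deviations below are the ones quoted in this section; the dictionary and the function name pitch_shift_hz are hypothetical and only illustrate the arithmetic.

```python
# Pitch standard deviations in Hz, as quoted in this section.
PITCH_STD_HZ = {"LJSpeech": 52.185, "Female-1": 53.33, "Male-1": 47.15}


def pitch_shift_hz(pitch: float, std_hz: float) -> float:
    """Return the shift in Hz for a numeric pitch attribute value.

    Hypothetical helper: it mirrors the documented range check and
    multiplies the value by the speaker's pitch standard deviation.
    """
    if not -3.0 <= pitch <= 3.0:
        raise ValueError(f"pitch {pitch} is outside the supported range [-3, 3]")
    return pitch * std_hz


print(f"{pitch_shift_hz(1.25, PITCH_STD_HZ['LJSpeech']):.2f}")  # 65.23
print(f"{pitch_shift_hz(1.0, PITCH_STD_HZ['Female-1']):.2f}")   # 53.33
```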

Warning

The pitch attribute does not support changes expressed in Hz, st (semitones), or %. Support is planned for a future Riva release.

Rate Attribute#

Riva supports a percentage-based relative change to the rate. The rate attribute has a range of [25%, 250%]. Values outside this range result in an error being logged and no audio being returned. Riva also supports the prosody tags as per the SSML specs. Prosody tags x-low, low, medium, high, x-high, and default are supported.

The rate attribute is expressed in the following formats:

  • rate="35%"

  • rate="+200%"

  • rate="low"

  • rate="default"
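Because an out-of-range rate produces no audio, it can help to validate the value before building the request. The sketch below is an assumption-laden illustration: the function name is_valid_rate is hypothetical, the accepted labels are the ones listed in this section, and a leading "+" is assumed to be ignored so that the number is compared directly against [25, 250].

```python
import re

# Labels as listed in this section of the documentation.
RATE_LABELS = {"x-low", "low", "medium", "high", "x-high", "default"}

# Optional leading "+", then a number, then "%" (assumed syntax).
_PERCENT = re.compile(r"^\+?(\d+(?:\.\d+)?)%$")


def is_valid_rate(value: str) -> bool:
    """Return True if value is a listed label or a percentage in [25, 250]."""
    if value in RATE_LABELS:
        return True
    match = _PERCENT.match(value)
    if match is None:
        return False
    return 25.0 <= float(match.group(1)) <= 250.0


print(is_valid_rate("35%"))    # True
print(is_valid_rate("+200%"))  # True
print(is_valid_rate("300%"))   # False
```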

Volume Attribute#

Riva supports the volume attribute as described in the SSML specs. The volume attribute supports a range of [-13, 8]dB. Values outside this range result in an error being logged and no audio returned. Tags silent, x-soft, soft, medium, loud, x-loud, and default are supported.

The volume attribute is expressed in the following formats:

  • volume="+1dB"

  • volume="-5.7dB"

  • volume="x-loud"

  • volume="default"
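The volume formats above can be checked in the same spirit. This is a minimal sketch, assuming the value is either one of the listed labels or a signed number followed by dB; the name is_valid_volume is hypothetical and not part of Riva.

```python
import re

# Labels as listed in this section of the documentation.
VOLUME_LABELS = {"silent", "x-soft", "soft", "medium", "loud", "x-loud", "default"}

# Optional sign, a number, then the "dB" suffix (assumed syntax).
_DECIBELS = re.compile(r"^([+-]?\d+(?:\.\d+)?)dB$")


def is_valid_volume(value: str) -> bool:
    """Return True if value is a listed label or lies within [-13, 8] dB."""
    if value in VOLUME_LABELS:
        return True
    match = _DECIBELS.match(value)
    if match is None:
        return False
    return -13.0 <= float(match.group(1)) <= 8.0


print(is_valid_volume("+1dB"))    # True
print(is_valid_volume("-5.7dB"))  # True
print(is_valid_volume("+9dB"))    # False
```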

Examples#

<speak>
    <prosody pitch="1.0">Now I'm speaking a</prosody>
    <prosody pitch="2.0">bit</prosody>
    higher.
</speak>

<speak><prosody rate="200%">This is a fast sentence.</prosody></speak>
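When SSML is assembled in application code, escaping the text content keeps the document valid XML, which matters because invalid XML is treated as raw text. The following sketch builds prosody markup with Python's standard library; the helpers prosody and speak are hypothetical names introduced for illustration only.

```python
from xml.sax.saxutils import escape, quoteattr


def prosody(text: str, **attrs: str) -> str:
    """Wrap text in a <prosody> element with the given attributes."""
    rendered = "".join(f" {name}={quoteattr(value)}" for name, value in attrs.items())
    # escape() replaces &, <, and > so user text cannot break the markup.
    return f"<prosody{rendered}>{escape(text)}</prosody>"


def speak(*parts: str) -> str:
    """Join already-rendered fragments under the required <speak> root."""
    return "<speak>" + "".join(parts) + "</speak>"


print(speak(prosody("This is a fast sentence.", rate="200%")))
# <speak><prosody rate="200%">This is a fast sentence.</prosody></speak>
```

The resulting string would then be passed as the text of a synthesis request in place of plain input text.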

Phoneme#

Use the phoneme tag to override the predicted pronunciation of words. For a given word or sequence of words, provide an explicit pronunciation with the ph attribute and specify the phone set with the alphabet attribute. Currently, only x-arpabet is supported, for pronunciation dictionaries based on CMUdict. IPA support will be added soon.

The full list of phonemes in the CMUdict can be found at https://github.com/cmusphinx/cmudict/blob/master/cmudict.phones. The list of supported symbols with stress can be found at https://github.com/cmusphinx/cmudict/blob/master/cmudict.symbols. For a mapping of these phones to English sounds, refer to the ARPABET Wikipedia page at https://en.wikipedia.org/wiki/ARPABET.

Examples#

<speak>
    Same thing!
    <phoneme alphabet="x-arpabet" ph="{@T}{@AH0}{@M}{@EY1}{@T}{@OW2}">tomato</phoneme>,
    <phoneme alphabet="x-arpabet" ph="{@T}{@AH0}{@M}{@AA1}{@T}{@OW2}">tomato</phoneme>.
</speak>
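The ph attribute in the example above wraps each ARPABET symbol as {@SYMBOL}. A small helper can generate that string from a list of symbols, which is less error-prone than typing the braces by hand. This is a sketch only: the function names are hypothetical, and the symbols are not checked against the CMUdict symbol list.

```python
from xml.sax.saxutils import escape, quoteattr


def arpabet_ph(symbols: list[str]) -> str:
    """Render ARPABET symbols in the {@SYMBOL} form used by the ph attribute."""
    return "".join("{@" + symbol + "}" for symbol in symbols)


def phoneme(word: str, symbols: list[str]) -> str:
    """Build a <phoneme> element that overrides the pronunciation of word."""
    ph = arpabet_ph(symbols)
    return f'<phoneme alphabet="x-arpabet" ph={quoteattr(ph)}>{escape(word)}</phoneme>'


print(phoneme("tomato", ["T", "AH0", "M", "EY1", "T", "OW2"]))
# <phoneme alphabet="x-arpabet" ph="{@T}{@AH0}{@M}{@EY1}{@T}{@OW2}">tomato</phoneme>
```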

Try It Now#

For SSML examples with sample audio, refer to the TTS Tutorials section.