Customizing TTS Models#

SSML Customization#

Speech Synthesis Markup Language (SSML) is an XML-based markup for controlling how the synthesized voice speaks the input text. The TTS NIM microservice supports a subset of SSML that lets you override the pronunciation of specific words.

The following SSML tag is supported:

  • <phoneme>: Overrides pronunciation for specific words.

SSML Support by Model#

Model                      Phoneme    Custom Dictionary
Magpie TTS Multilingual
Magpie TTS Zeroshot
Magpie TTS Flow

Note

Every SSML input must be a valid XML document whose root tag is <speak>. Input that is not valid XML, or that is valid XML with a different root tag, is treated as raw text.
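Because malformed input is read as raw text rather than rejected, a local check before sending a request can help catch mistakes. The following is a minimal sketch using only the Python standard library; the helper name looks_like_ssml is hypothetical, and the server's actual parser may be stricter or more lenient than this check.

```python
import xml.etree.ElementTree as ET


def looks_like_ssml(text: str) -> bool:
    """Return True if text parses as XML and its root element is <speak>.

    Assumption: this mirrors the rule in the note above; it is not the
    server's own validation logic.
    """
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        # Not parseable as XML, so it would be handled as raw text.
        return False
    return root.tag == "speak"


print(looks_like_ssml("<speak>Hello <phoneme alphabet='ipa' ph='wɝld'>world</phoneme></speak>"))
print(looks_like_ssml("<speak>Unclosed root tag"))
print(looks_like_ssml("<p>Different root tag</p>"))
```

The first call prints True; the other two print False, which corresponds to the cases the note says are treated as raw text.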

Example#

Customize pronunciation with the phoneme tag:

python3 python-clients/scripts/tts/talk.py --server 0.0.0.0:50051 \
    --text "<speak>You say <phoneme alphabet='ipa' ph='təˈmeɪˌtoʊ'>tomato</phoneme>, I say <phoneme alphabet='ipa' ph='təˈmɑˌtoʊ'>tomato</phoneme>.</speak>" \
    --language-code en-US

The synthesized audio file output.wav contains the resulting speech with the phoneme overrides applied.
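When the text or the IPA string comes from user input, building the SSML programmatically avoids quoting and escaping mistakes. The sketch below assembles the same sentence with the Python standard library; the helper names phoneme and build_ssml are hypothetical and not part of the client scripts.

```python
import xml.etree.ElementTree as ET


def phoneme(word: str, ipa: str) -> ET.Element:
    """Build a <phoneme> element that overrides the pronunciation of word."""
    element = ET.Element("phoneme", {"alphabet": "ipa", "ph": ipa})
    element.text = word
    return element


def build_ssml(parts) -> str:
    """Join plain strings and <phoneme> elements under a <speak> root."""
    speak = ET.Element("speak")
    last = None  # most recently appended child element, if any
    for part in parts:
        if isinstance(part, str):
            # Text before the first child is stored in .text; text after a
            # child is stored in that child's .tail.
            if last is None:
                speak.text = (speak.text or "") + part
            else:
                last.tail = (last.tail or "") + part
        else:
            speak.append(part)
            last = part
    # encoding="unicode" returns a str without an XML declaration.
    return ET.tostring(speak, encoding="unicode")


ssml = build_ssml([
    "You say ",
    phoneme("tomato", "təˈmeɪˌtoʊ"),
    ", I say ",
    phoneme("tomato", "təˈmɑˌtoʊ"),
    ".",
])
print(ssml)
```

The resulting string can be passed as the --text argument. Special characters such as & in the plain-text parts are escaped automatically.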

Custom Pronunciation Dictionary#

The TTS NIM microservice supports custom pronunciation through a text-based dictionary that maps words (graphemes) to IPA phonetic representations (phonemes). Use the --custom-dictionary flag to pass the dictionary file to the client.

Dictionary format:

  • Each line contains a word followed by its pronunciation, separated by exactly two spaces.

  • Give each word its own line. Split a multi-word phrase into one entry per word.

  • Refer to Phoneme Support for the list of supported IPA phonemes.
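The two-space separator is easy to get wrong in an editor, so a small pre-flight check on each dictionary line can be useful. The following sketch applies the format rules above; the function name parse_dictionary_line is hypothetical, and it does not verify that the pronunciation uses only supported IPA phonemes.

```python
def parse_dictionary_line(line: str) -> tuple:
    """Split 'word<two spaces>pronunciation' into (word, pronunciation).

    Raises ValueError when the line does not follow the format rules
    listed above. Assumption: the word itself contains no spaces.
    """
    entry = line.rstrip("\n")
    word, separator, pronunciation = entry.partition("  ")
    if not separator or not word or not pronunciation:
        raise ValueError(f"expected 'word<2 spaces>pronunciation': {entry!r}")
    if " " in word:
        # Multi-word entries must be split into one line per word.
        raise ValueError(f"entry must be a single word: {entry!r}")
    if pronunciation != pronunciation.strip():
        # Extra whitespace means the separator was not exactly two spaces.
        raise ValueError(f"separator must be exactly two spaces: {entry!r}")
    return word, pronunciation


print(parse_dictionary_line("sunny  ˈsʌnɪ"))
```

A line with a single space, three spaces, or a multi-word entry raises ValueError, so the problem surfaces before the file is passed to --custom-dictionary.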

Example#

python3 python-clients/scripts/tts/talk.py --server 0.0.0.0:50051 \
    --text "Today is a sunny day, a great day to eat fresh tomato" \
    --language-code en-US \
    --custom-dictionary custom_dict.txt

The custom dictionary file custom_dict.txt contains word-to-phoneme mappings:

sunny  ˈsʌnɪ
tomato  ˈtɑˌməʊ

The synthesized audio file output.wav contains the resulting speech with the custom pronunciations applied.

Emotion Exaggeration#

Chatterbox TTS Multilingual accepts a per-request exaggeration_factor parameter that controls how pronounced the emotional prosody is. Pass it through the --custom-configuration flag, which forwards comma-separated key:value pairs to the underlying model.

Parameter            Type   Range        Default
exaggeration_factor  float  0.25 to 2.0  0.5

Model Support#

Note

Values outside [0.25, 2.0] are rejected by the server with INVALID_ARGUMENT (gRPC) or HTTP 400 (Bad Request, invalid custom_configuration).
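A client can apply the same bounds locally to fail fast instead of waiting for the server to reject the request. This is a minimal sketch that assumes the range is inclusive at both ends, as the bracket notation suggests; the constant and function names are hypothetical.

```python
# Bounds taken from the documented range [0.25, 2.0].
EXAGGERATION_MIN = 0.25
EXAGGERATION_MAX = 2.0


def check_exaggeration_factor(value: float) -> float:
    """Return value unchanged if it lies in the documented range.

    Raises ValueError otherwise. This is a client-side convenience only;
    the server remains the authority on what it accepts.
    """
    if not (EXAGGERATION_MIN <= value <= EXAGGERATION_MAX):
        raise ValueError(
            f"exaggeration_factor must be in "
            f"[{EXAGGERATION_MIN}, {EXAGGERATION_MAX}], got {value}"
        )
    return value


print(check_exaggeration_factor(1.5))
```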

Example#

python3 python-clients/scripts/tts/talk.py --server 0.0.0.0:50051 \
    --text "I cannot believe this just happened!" \
    --language-code en-US \
    --voice Chatterbox-Multilingual.en-US.Male \
    --custom-configuration "exaggeration_factor:1.5"

To pass multiple parameters at once, separate key:value pairs with commas:

... --custom-configuration "exaggeration_factor:1.5,key2:value2"
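When the parameters are held in a dictionary, the flag value can be assembled programmatically. The sketch below is an illustration only: build_custom_configuration is a hypothetical helper, key2 is the placeholder from the example above, and rejecting commas and colons inside keys or values is an assumption, because the format has no documented escaping.

```python
def build_custom_configuration(params: dict) -> str:
    """Serialize params as comma-separated key:value pairs."""
    pairs = []
    for key, value in params.items():
        key_text, value_text = str(key), str(value)
        for text in (key_text, value_text):
            # Assumption: the delimiters cannot be escaped, so refuse them.
            if "," in text or ":" in text:
                raise ValueError(f"',' and ':' are not allowed in {text!r}")
        pairs.append(f"{key_text}:{value_text}")
    return ",".join(pairs)


config = build_custom_configuration({"exaggeration_factor": 1.5, "key2": "value2"})
print(config)
```

The printed string matches the value shown in the example above and can be passed to --custom-configuration as is.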