Text To Speech

Isaac SDK Text to Speech generates life-like speech from any text using deep learning networks. It can be used to develop voice UI applications. This pipeline runs natively on desktop systems and Jetson Xavier platforms and delivers low latency streaming performance. This allows applications to generate arbitrary audio messages on-the-fly instead of playing back pre-recorded audio files.

The Text to Speech pipeline is built from three codelets: Text To Mel, Torch Inference, and Tensor To Audio Decoder. The first component, Text To Mel, accepts a text string. The last component, Tensor To Audio Decoder, publishes audio samples, which can be sent to Audio Playback.

Text To Mel

The Text to Mel codelet receives text as input and generates the corresponding Mel spectrogram as output. It uses the NVIDIA implementation of the Tacotron-2 deep learning network. The model maps a sequence of characters to a sequence of Mel spectrogram frames.

This codelet runs the model in streaming mode. Instead of publishing the full Mel spectrogram of the input, it publishes partial Mel spectrograms as they are generated. This helps reduce the overall latency as the partial data can be used by other components while the next set of partial data is being generated by TextToMel. This design effectively works as a pipeline and can allow real-time speech synthesis.
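The streaming behaviour described above can be pictured with a small Python sketch. This is not the Isaac SDK implementation: `fake_mel_frames`, `stream_partial`, and the chunk size of 4 frames are hypothetical names and values chosen for illustration, and plain lists stand in for Mel spectrogram tensors. The point is only that a downstream consumer can start working on the first chunk while later chunks are still being produced.

```python
from typing import Iterator, List

Frame = List[float]  # one Mel frame; a stand-in for a tensor column


def fake_mel_frames(text: str, n_mels: int = 3) -> Iterator[Frame]:
    """Hypothetical stand-in for the model: one dummy frame per character."""
    for ch in text:
        yield [float(ord(ch))] * n_mels


def stream_partial(text: str, chunk_size: int = 4) -> Iterator[List[Frame]]:
    """Yield partial spectrograms of up to chunk_size frames as they are ready."""
    chunk: List[Frame] = []
    for frame in fake_mel_frames(text):
        chunk.append(frame)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:  # flush the final, possibly shorter, chunk
        yield chunk


if __name__ == "__main__":
    for i, part in enumerate(stream_partial("Hello world.")):
        print(f"partial {i}: {len(part)} frames")
```

Because `stream_partial` is a generator, each partial result is handed over as soon as it is complete, which mirrors how the partial Mel spectrograms can be consumed by the next component before the full sentence has been processed.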


Isaac SDK Text to Speech has the following limitations:

  • The codelet accepts input text written as plain letters or in ARPAbet notation (for well-defined pronunciation).

    Example 1 (plain letters): The quick brown fox jumps over the lazy dog.

    Example 2 (ARPAbet): {HH AH0 L OW1} {AY1} {W AA1 N T} {T UW1} {G OW1} {T UW1} {S AE1 N T AH0} {K L AE1 R AH0} {AH0 N D} {SH AA1 P} {TH IH1 NG Z} {AE1 T} {EH2 L AH0 M EH1 N T AH0 L}.

  • Upper-case and lower-case letters, period (.), comma (,), and space are the only supported characters. Numbers and special characters are not supported.

  • In ARPAbet, 39 phonemes and 3 lexical stress markers (0, 1, 2) for vowels are supported. This is the same phoneme set used in the CMU Pronouncing Dictionary. In ARPAbet notation, the phonemes of a word are separated by spaces and each word is enclosed in curly brackets ({, }).

  • Sentences should be grammatically correct, including punctuation, for more accurate pronunciation.

  • Each input message can contain only one sentence. More sentences per message can lead to faulty pronunciations.

  • As with other deep learning models running on a GPU, a warm-up iteration is required. The warm-up iteration (the first input message) initializes the GPU kernels used by the model and therefore takes significantly longer to process.

  • Pronunciation of long words, such as claustrophobia and antidisestablishmentarianism, can go wrong towards the end of the word due to the model being optimized to run on device. It is recommended to avoid such words or use ARPAbet notation for them.

  • Acronyms should be spelled as individual characters using ARPAbet. For example, “BMW” has to be spelled “{B IY} {EH M} {D AH B AH L Y UW}”.

  • Proper nouns might be pronounced differently from what their spelling suggests. It is therefore recommended to use ARPAbet notation for proper nouns and uncommon words. For example, Nvidia can be written as {EH0 N V IH1 D IY0 AH0}.
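Several of the limitations above can be checked before a message is sent to the codelet. The following Python sketch is one possible pre-check, not part of Isaac SDK: `check_message` and both regular expressions are hypothetical. It only verifies the shape of ARPAbet words (one or two upper-case letters with an optional stress digit); it does not check that each phoneme belongs to the 39-phoneme set, and its one-sentence test is a simple heuristic based on periods.

```python
import re

# ARPAbet word: phonemes of 1-2 upper-case letters with an optional stress
# digit (0, 1, 2), separated by single spaces and enclosed in curly brackets.
ARPABET_WORD = re.compile(r"\{[A-Z]{1,2}[012]?(?: [A-Z]{1,2}[012]?)*\}")
# Plain text: upper/lower-case letters, period, comma, and space only.
PLAIN_TEXT = re.compile(r"[A-Za-z., ]*")


def check_message(text: str) -> list:
    """Return a list of problems found in one input message (empty if none)."""
    problems = []
    # Replace each well-formed ARPAbet word with a placeholder letter,
    # then check what is left against the plain-text character set.
    remainder = ARPABET_WORD.sub("x", text)
    if not PLAIN_TEXT.fullmatch(remainder):
        bad = sorted(set(re.sub(r"[A-Za-z., ]", "", remainder)))
        problems.append(f"unsupported characters: {''.join(bad)}")
    # A period followed by more text suggests more than one sentence.
    if re.search(r"\.\s*\S", remainder.rstrip()):
        problems.append("more than one sentence")
    return problems


if __name__ == "__main__":
    print(check_message("The quick brown fox jumps over the lazy dog."))
    print(check_message("{B IY} {EH M} {D AH B AH L Y UW}"))
    print(check_message("I have 2 dogs."))
```

A caller could split longer text into single sentences and spell out numbers before running this check, since both multi-sentence messages and digits are reported as problems.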

Configuration Parameters

Parameter Description Default Type

The session timeout value determines the maximum time allowed for a streaming session within a tick before it is terminated. If a session is terminated, the remaining part of the Mel spectrogram is not generated. The next input text message is processed normally in the next tick with a new session.

This value indirectly constrains the maximum length of a sentence that can be converted to speech. A larger value should be used for longer sentences. With a smaller value, only partial speech is generated for longer sentences.

25.0 double
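The effect of the session timeout on long sentences can be illustrated with a short Python sketch. It is a simplified model under stated assumptions, not the codelet's logic: `generate_with_timeout` is a hypothetical function, the per-frame cost of 0.125 seconds is invented for the example, and a simulated clock replaces wall-clock time so the result is deterministic.

```python
from typing import Iterator


def generate_with_timeout(
    n_frames: int, seconds_per_frame: float, timeout: float = 25.0
) -> Iterator[int]:
    """Yield frame indices until done or until the session budget is spent.

    Uses a simulated clock; a real implementation would measure
    wall-clock time instead.
    """
    elapsed = 0.0
    for index in range(n_frames):
        elapsed += seconds_per_frame  # simulated cost of producing one frame
        if elapsed > timeout:
            return  # session terminated: remaining frames are never generated
        yield index


if __name__ == "__main__":
    short = list(generate_with_timeout(n_frames=100, seconds_per_frame=0.125))
    long = list(generate_with_timeout(n_frames=400, seconds_per_frame=0.125))
    print(len(short), len(long))  # 100 200
```

In this model the short input completes, while the long input is cut off once the 25-second budget is used up, which is why a larger timeout is needed for longer sentences.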


Message Proto Type Name
Input ChatMessageProto text
Output TensorListProto mel_spectrogram

Platforms: Desktop, Jetson Xavier

WaveGlow Vocoder (Torch Inference)

The vocoder converts the Mel spectrograms (frequency domain) into corresponding natural-sounding audio samples in the time domain. The NVIDIA WaveGlow network is used as the vocoder to synthesize speech from Mel spectrograms. The pre-trained WaveGlow model is loaded into the Torch Inference codelet for execution.

Tensor To Audio Decoder

The Tensor to Audio Decoder codelet repackages the audio samples from TensorListProto to AudioDataProto. The audio data is passed on from the input to output without any modification. Only the metadata required in AudioDataProto is added from the configuration parameters.

Configuration Parameters

Parameter Description Default Type
sample_rate Sample rate of the audio received 22050 int
num_channels Number of channels of the audio received 1 int


Message Proto Type Name
Input TensorListProto tensors
Output AudioDataProto audio
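The repackaging step can be sketched in Python as follows. The `AudioData` class and `tensor_to_audio` function are hypothetical stand-ins for AudioDataProto and the codelet; only the parameter names `sample_rate` and `num_channels` and their defaults come from the table above. The sketch assumes samples are interleaved across channels when computing a duration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AudioData:
    """Hypothetical stand-in for AudioDataProto: samples plus metadata."""
    samples: List[float]
    sample_rate: int
    num_channels: int

    @property
    def duration_seconds(self) -> float:
        # Assumes samples are interleaved across channels.
        return len(self.samples) / (self.sample_rate * self.num_channels)


def tensor_to_audio(samples: List[float], sample_rate: int = 22050,
                    num_channels: int = 1) -> AudioData:
    """Attach configured metadata; the samples themselves are not modified."""
    return AudioData(samples=samples, sample_rate=sample_rate,
                     num_channels=num_channels)


if __name__ == "__main__":
    audio = tensor_to_audio([0.0] * 22050)
    print(audio.duration_seconds)  # 1.0
```

With the default configuration, 22050 mono samples correspond to one second of audio, so the metadata must match what the vocoder actually produces for playback to sound correct.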

The text_to_speech Sample Application

The text_to_speech sample application demonstrates the end-to-end pipeline of the text to speech feature.

The application has Send Text, Text to Mel, Vocoder (Torch Inference), Tensor to Audio Decoder, and Audio Playback components connected in that order. The Send Text component publishes a configured list of sentences in a loop with a 6-second interval between sentences. The Text to Mel component, configured with a timeout of 25 seconds, generates partial Mel spectrograms for these input sentences. The Torch Inference component with the WaveGlow model converts these partial Mel spectrograms into audio samples. This component is configured to process unread buffered messages that are up to 8 seconds old. The audio samples generated by the WaveGlow Torch Inference component are converted from TensorListProto to AudioDataProto by the Tensor to Audio Decoder component. The AudioDataProto messages are played on the speaker by the Audio Playback component.
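The 8-second limit on buffered messages can be read as a simple age filter. The Python sketch below shows that reading only; `select_fresh` and the `(timestamp, payload)` message shape are hypothetical and do not reflect how the Torch Inference component is actually configured or implemented.

```python
from typing import List, Tuple

Message = Tuple[float, str]  # (acquisition time in seconds, payload)


def select_fresh(buffer: List[Message], now: float,
                 max_age: float = 8.0) -> List[Message]:
    """Keep buffered messages that are at most max_age seconds old."""
    return [(t, payload) for t, payload in buffer if now - t <= max_age]


if __name__ == "__main__":
    buffer = [(0.0, "a"), (3.0, "b"), (9.5, "c")]
    print(select_fresh(buffer, now=10.0))  # [(3.0, 'b'), (9.5, 'c')]
```

Under this interpretation, partial Mel spectrograms that have waited longer than the limit are skipped rather than vocoded late.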

To use the application, connect a speaker or headphones to the host or device and set it as the default audio output device in the system settings. The configured text list is then played as synthesized speech through the audio output device.

Since this application runs two heavy deep learning networks, it is recommended to set Jetson Xavier to MaxN mode and max clock speed (as shown below) before launching the application on it.

ubuntu@Jetson_Xavier:~$ sudo nvpmodel -m 0
ubuntu@Jetson_Xavier:~$ sudo jetson_clocks

Platforms: Desktop, Jetson Xavier

Hardware: Any speaker/headphones