Performance#

Evaluation Process#

This section presents latency and throughput numbers for the Riva text-to-speech (TTS) service on different GPUs. Performance of the TTS service was measured for varying numbers of parallel streams. Each parallel stream performed 20 iterations over 10 input strings from the LJSpeech dataset. Each stream sends a request to the Riva server and waits until all audio chunks have been received before sending another request. Latency to the first audio chunk, latency between successive audio chunks, and throughput were measured. The following diagram shows how the latencies are measured.

Figure: Schematic diagram of latencies measured by the Riva streaming TTS client

The FastPitch and HiFi-GAN models were tested.

Performance was measured with the Riva TTS performance client, riva_tts_perf_client, which is provided in the Riva image. The source code of the client is available at https://github.com/nvidia-riva/cpp-clients.
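
The per-stream timing logic amounts to timestamping each received audio chunk. The following minimal Python sketch illustrates it; stream_chunks is a hypothetical generator standing in for the client's streaming call, not part of the Riva API:

import time

def measure_stream(stream_chunks, text):
    # Time one streaming TTS request: latency to the first chunk,
    # then the gaps between successive chunks.
    t_request = time.monotonic()
    first_chunk_latency = None
    inter_chunk_latencies = []
    t_prev = t_request
    for chunk in stream_chunks(text):  # blocks until the next chunk arrives
        now = time.monotonic()
        if first_chunk_latency is None:
            first_chunk_latency = now - t_request  # latency to first audio
        else:
            inter_chunk_latencies.append(now - t_prev)  # latency between chunks
        t_prev = now
    return first_chunk_latency, inter_chunk_latencies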

The following command was used to generate the tables below:

riva_tts_perf_client \
    --num_parallel_requests=<num_streams> \
    --voice_name=English-US.Female-1 \
    --num_iterations=<20*num_streams> \
    --online=true \
    --text_file=$test_file \
    --write_output_audio=false

where test_file is the path to the ljs_audio_text_test_filelist_small.txt file.

Results#

Latencies to first audio chunk, latencies between audio chunks, and throughput are reported in the following tables. Throughput is measured in RTFX (duration of audio generated / computation time).
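
As a quick illustration of the RTFX definition (the numbers here are made up, chosen only to land near the single-stream result in the table below):

def rtfx(num_samples, sample_rate_hz, computation_time_s):
    # RTFX = duration of generated audio / wall-clock computation time.
    audio_duration_s = num_samples / sample_rate_hz
    return audio_duration_s / computation_time_s

# 10 s of audio (441,000 samples at an assumed 44.1 kHz sample rate)
# generated in 66 ms of compute time gives an RTFX of roughly 151.
print(rtfx(num_samples=441_000, sample_rate_hz=44_100, computation_time_s=0.066))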

Note

--num_iterations equals 100 for Xavier AGX, Xavier NX, and Orin AGX and 20 for all other measurements.

Note

The values in the tables are averages over 3 trials, rounded to the last significant digit as determined by the standard deviation over those 3 trials. If a standard deviation is less than 0.001 of the average, the corresponding value is rounded as if the standard deviation were 0.001 of the value.
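
A small sketch of this rounding rule as we read it (round_to_std is an illustrative helper, not part of any Riva tooling):

import math

def round_to_std(avg, std):
    # Round avg to the decimal place of the leading digit of std.
    # Standard deviations below 0.001 of the average are floored there.
    std = max(std, 0.001 * abs(avg))
    digits = -int(math.floor(math.log10(std)))
    return round(avg, digits)

print(round_to_std(150.83, 0.4))  # -> 150.8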

For specifications of the hardware on which these measurements were collected, refer to the Hardware Specifications section and to the cloud instance descriptions for AWS and GCP. Note that:

  • results on AWS and GCP were computed using Riva 2.4.0

  • results on-prem were computed using Riva 2.15.0

| # of streams | First chunk, avg (ms) | First chunk, p90 (ms) | First chunk, p95 (ms) | First chunk, p99 (ms) | Inter-chunk, avg (ms) | Inter-chunk, p90 (ms) | Inter-chunk, p95 (ms) | Inter-chunk, p99 (ms) | Throughput (RTFX) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 22 | 24.2 | 25 | 25.3 | 2.84 | 3.1 | 3.15 | 4.02 | 150.8 |
| 4 | 40 | 50 | 60 | 70 | 5 | 8 | 9 | 12 | 340 |
| 8 | 63 | 84 | 90 | 100 | 8 | 12 | 14 | 18 | 420 |
| 16 | 120 | 143 | 154 | 200 | 14.3 | 17.8 | 19.4 | 23 | 460 |
| 32 | 323 | 340 | 355 | 390 | 14.5 | 17.9 | 19.9 | 23.9 | 440 |

On-Prem Hardware Specifications#

| Component | Specification |
|---|---|
| GPU | NVIDIA DGX A100 40 GB |
| CPU model | AMD EPYC 7742 64-Core Processor |
| CPU thread(s) per core | 2 |
| CPU socket(s) | 2 |
| CPU core(s) per socket | 64 |
| CPU NUMA node(s) | 8 |
| CPU frequency boost | enabled |
| CPU max MHz | 2250 |
| CPU min MHz | 1500 |
| RAM model | Micron DDR4 36ASF8G72PZ-3G2B2 3200 MHz |
| RAM configured memory speed | 2933 MT/s |
| RAM size | 32 × 64 GB (2048 GB total) |

Performance Considerations#

When the server is under high load, requests might time out: the server does not start inference on a new request until a previous request has been generated in full and its inference slot is freed. This design maximizes throughput for the TTS service and allows for real-time interaction.
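
If a client fans out many requests at once, one way to avoid such timeouts is to apply backpressure client-side by capping the number of in-flight requests. A minimal sketch (MAX_IN_FLIGHT is a tuning knob chosen by the client, not a documented Riva limit, and request_fn stands in for a blocking synthesis call):

import threading

MAX_IN_FLIGHT = 8
slots = threading.BoundedSemaphore(MAX_IN_FLIGHT)

def synthesize_with_backpressure(request_fn, text):
    # Block until one of the MAX_IN_FLIGHT slots frees up, then send the request.
    with slots:
        return request_fn(text)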

Model Accuracy#

Riva evaluates TTS model accuracy using an automated approach that leverages Automatic Speech Recognition (ASR). The process works as follows:

  1. The TTS model generates synthetic speech from input text

  2. This generated audio is then passed through an ASR system

  3. The ASR transcription is compared with the original input text using Character Error Rate (CER)

The Character Error Rate measures the percentage of characters that differ between the original text and the ASR transcription of the synthesized speech. A lower CER indicates better TTS quality, as it means the synthesized speech was clear enough for ASR to accurately transcribe it back to the original text.
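
For reference, CER is the character-level Levenshtein edit distance divided by the length of the reference text. A minimal sketch (not the evaluation harness used for the numbers below):

def cer(reference, hypothesis):
    # Dynamic-programming edit distance over characters,
    # normalized by the reference length.
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, start=1):
        cur = [i]
        for j, h in enumerate(hypothesis, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (r != h)))    # substitution
        prev = cur
    return prev[-1] / len(reference)

print(cer("the cat sat", "the cat sad"))  # 1 edit / 11 chars ≈ 0.09 (9% CER)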

The following table shows the Character Error Rate (CER) scores for the Multilingual Magpie model.

| Model | Language | Dataset | CER % ⬇️ | ASR model used |
|---|---|---|---|---|
| Multilingual | English | subset of LibriTTS dev set | 1.0 | stt_en_conformer_transducer_large |
| Multilingual | Spanish | CML Spanish test set | 1.1 | whisper-large-v3 |
| Multilingual | French | CML French test set | 3.9 | whisper-large-v3 |

Note

Metrics were computed on a subset of the dev-clean split of LibriTTS for English and on the CML datasets for French and Spanish. From the available samples, we selected a subset such that every speaker had at least 5 utterances of at least 5 seconds each. The reported metrics are averages over multiple iterations, which makes the evaluation more reliable.