Text-To-Speech (Latest)

Performance

This section presents latency and throughput measurements for the Riva text-to-speech (TTS) service on different GPUs. Performance of the TTS service was measured for varying numbers of parallel streams. Each parallel stream performed 20 iterations over 10 input strings from the LJSpeech dataset. Each stream sends a request to the Riva server and waits for all audio chunks to be received before sending another request. Latency to the first audio chunk, latency between successive audio chunks, and throughput were measured. The following diagram shows how the latencies are measured.

[Figure: riva-tts-latency.png, illustrating latency to the first audio chunk and latency between successive audio chunks]
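
As a complement to the diagram, the following sketch shows how latency to the first audio chunk, latency between successive audio chunks, and a per-stream RTFX value can be derived from chunk arrival times. It is illustrative only: stream_audio_chunks is a hypothetical stand-in for a Riva streaming synthesis call (it is not part of the Riva client API), and the 16-bit PCM output at 22.05 kHz is an assumption.

    import time

    def measure_stream(stream_audio_chunks, sample_rate_hz=22050):
        """Collect latency and throughput statistics for one synthesis request.

        stream_audio_chunks is a hypothetical callable that yields raw audio
        chunks (bytes of 16-bit PCM) for a single request, standing in for a
        Riva streaming TTS response iterator.
        """
        request_sent = time.perf_counter()
        first_chunk_latency = None
        inter_chunk_latencies = []
        audio_samples = 0

        prev_chunk_time = request_sent
        for chunk in stream_audio_chunks():
            now = time.perf_counter()
            if first_chunk_latency is None:
                # Latency to first audio chunk: request sent -> first chunk received.
                first_chunk_latency = now - request_sent
            else:
                # Latency between successive audio chunks.
                inter_chunk_latencies.append(now - prev_chunk_time)
            prev_chunk_time = now
            audio_samples += len(chunk) // 2  # 2 bytes per 16-bit sample (assumed format)

        compute_time = prev_chunk_time - request_sent
        audio_duration = audio_samples / sample_rate_hz
        rtfx = audio_duration / compute_time  # RTFX for this single stream
        return first_chunk_latency, inter_chunk_latencies, rtfx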

The FastPitch and HiFi-GAN models were tested.

The Riva TTS performance client, riva_tts_perf_client, which is provided in the Riva image, was used to measure performance. The source code of the client is available at https://github.com/nvidia-riva/cpp-clients.

The following command was used to generate the tables below:

riva_tts_perf_client \
    --num_parallel_requests=<num_streams> \
    --voice_name=English-US.Female-1 \
    --num_iterations=<20*num_streams> \
    --online=true \
    --text_file=$test_file \
    --write_output_audio=false

where test_file is the path to the ljs_audio_text_test_filelist_small.txt file and <num_streams> is the number of parallel streams.

Latencies to first audio chunk, latencies between audio chunks, and throughput are reported in the following tables. Throughput (duration of audio generated / computation time) is measured in RTFX.
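
As an illustration of this definition (with made-up numbers, not values from the tables):

    # Hypothetical illustration of the RTFX definition, not data from the tables.
    audio_duration_s = 10.0  # seconds of audio generated
    compute_time_s = 0.05    # wall-clock time spent generating it

    rtfx = audio_duration_s / compute_time_s
    print(rtfx)  # 200.0 -> audio is generated 200x faster than real time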

Note

The values in the tables are averages over three trials. Each value is rounded to the last significant digit determined by the standard deviation calculated over the three trials. If the standard deviation is less than 0.001 of the average, the value is rounded as if the standard deviation were 0.001 of the average.
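
One way to implement this rounding rule is sketched below; round_to_std is illustrative and may differ in detail from the procedure actually used to produce the tables.

    import math

    def round_to_std(value, std):
        """Round value to the last significant digit implied by std.

        Sketch of the rounding rule described in the note above.
        """
        # Treat very small deviations as 0.1% of the value, as described above.
        std = max(std, 0.001 * abs(value))
        # Decimal position of the first significant digit of the standard deviation.
        digits = -int(math.floor(math.log10(std)))
        return round(value, digits)

    # Example: a mean of 22.31 ms with a standard deviation of 0.4 ms is reported as 22.3 ms.
    print(round_to_std(22.31, 0.4))  # 22.3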

For information about the hardware on which these measurements were collected, see the Hardware Specifications section below. Each of the following tables corresponds to one of the GPU configurations listed there.

# of streams | Latency to first audio (ms) | Latency between audio chunks (ms) | Throughput (RTFX)
             | avg    p90    p95    p99    | avg    p90    p95    p99          |
1            | 22     24.2   25     25.3   | 2.84   3.1    3.15   4.02         | 150.8
4            | 40     50     60     70     | 5      8      9      12           | 340
8            | 63     84     90     100    | 8      12     14     18           | 420
16           | 120    143    154    200    | 14.3   17.8   19.4   23           | 460
32           | 323    340    355    390    | 14.5   17.9   19.9   23.9         | 440

# of streams | Latency to first audio (ms) | Latency between audio chunks (ms) | Throughput (RTFX)
             | avg    p90    p95    p99    | avg    p90    p95    p99          |
1            | 17     19     19.3   20     | 2.5    3.035  3.08   3.16         | 185
4            | 30     42     50     60     | 4      6      7      9            | 430
8            | 60     80     80     90     | 6      10     11     14           | 500
16           | 100    120    130    2000   | 7.7    13     14.6   18.2         | 500
32           | 200    230    242    500    | 9.5    13     14.6   18.63        | 700

# of streams | Latency to first audio (ms) | Latency between audio chunks (ms) | Throughput (RTFX)
             | avg    p90    p95    p99    | avg    p90    p95    p99          |
1            | 21.5   24.3   24.7   25.5   | 2.4    3.3    3.5    4            | 162
4            | 40     55     60     70     | 5      7      8      10           | 300
8            | 60     80     86     100    | 6.8    10     11     13           | 440
16           | 100    122    133    170    | 9.7    14.4   16.4   21           | 600
32           | 300    310    320    2000   | 12     17     19.4   24           | 500

Hardware Specifications

GPU
  NVIDIA H100 80GB HBM3
CPU
  Model: Intel(R) Xeon(R) Platinum 8480CL
  Thread(s) per core: 2
  Socket(s): 2
  Core(s) per socket: 56
  NUMA node(s): 2
  CPU max MHz: 3800
  CPU min MHz: 800
RAM
  Model: Micron DDR5 MTC40F2046S1RC48BA1 4800MHz
  Configured Memory Speed: 4400 MT/s
  RAM Size: 32x64GB (2048GB Total)

GPU
  NVIDIA DGX A100 40 GB
CPU
  Model: AMD EPYC 7742 64-Core Processor
  Thread(s) per core: 2
  Socket(s): 2
  Core(s) per socket: 64
  NUMA node(s): 8
  Frequency boost: enabled
  CPU max MHz: 2250
  CPU min MHz: 1500
RAM
  Model: Micron DDR4 36ASF8G72PZ-3G2B2 3200MHz
  Configured Memory Speed: 2933 MT/s
  RAM Size: 32x64GB (2048GB Total)

GPU
  NVIDIA L40
CPU
  Model: AMD EPYC 7763 64-Core Processor
  Thread(s) per core: 1
  Socket(s): 2
  Core(s) per socket: 64
  NUMA node(s): 8
  Frequency boost: enabled
  CPU max MHz: 3529
  CPU min MHz: 1500
RAM
  Model: Samsung DDR4 M393A4K40DB3-CWE 3200MHz
  Configured Memory Speed: 3200 MT/s
  RAM Size: 16x32GB (512GB Total)

When the server is under high load, requests might time out because the server does not start inference for a new request until a previous request has been generated completely and its inference slot can be freed. This is done to maximize throughput for the TTS service and to allow for real-time interaction.
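
On the client side, a simple mitigation is to set a per-request deadline and retry with backoff when it expires. The sketch below is illustrative only: synthesize_stream is a hypothetical stand-in for a streaming synthesis call and is not part of the Riva client API, which surfaces an expired deadline as a gRPC error rather than TimeoutError.

    import time

    def synthesize_with_retry(synthesize_stream, text, deadline_s=10.0,
                              max_attempts=3, backoff_s=0.5):
        """Call a (hypothetical) streaming synthesis function with retries.

        synthesize_stream(text, timeout) is assumed to raise TimeoutError when
        the per-request deadline expires.
        """
        for attempt in range(max_attempts):
            try:
                return list(synthesize_stream(text, timeout=deadline_s))
            except TimeoutError:
                # Server is saturated; back off so queued requests can drain.
                time.sleep(backoff_s * (2 ** attempt))
        raise RuntimeError(f"Synthesis failed after {max_attempts} attempts")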
