Performance

Evaluation Process

This section presents latency and throughput measurements for the Riva text-to-speech (TTS) service on different GPUs. Performance was measured for varying numbers of parallel streams. Each parallel stream performed 20 iterations over 10 input strings from the LJSpeech dataset. Three metrics were recorded: latency to the first audio chunk, latency between successive audio chunks, and throughput. The tests used the FastPitch and HiFi-GAN models.

The command used to measure performance was:

riva_tts_perf_client \
    --num_parallel_requests=<num_streams> \
    --voice_name=English-US.Female-1 \
    --num_iterations=<20*num_streams> \
    --online=true \
    --text_file=$test_file \
    --write_output_audio=false

Here, test_file is the path to the ljs_audio_text_test_filelist_small.txt file, and <num_streams> is the number of parallel streams under test.
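To repeat the measurement for several stream counts, the command above can be wrapped in a small loop. The following is a minimal sketch, not part of the original procedure: the run_perf helper, the STREAM_COUNTS list, and the DRY_RUN switch are illustrative names introduced here. The flags are exactly those shown in the command above, and the default stream counts are the ones reported in the Results section.

```shell
#!/bin/sh
# Sketch: sweep riva_tts_perf_client over several parallel-stream counts.
# Assumptions: run_perf, STREAM_COUNTS, and DRY_RUN are hypothetical names
# used only for this example; the client flags come from the command above.

test_file="${test_file:-ljs_audio_text_test_filelist_small.txt}"
STREAM_COUNTS="${STREAM_COUNTS:-1 4 6 8 10 16 32}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print each command only, 0 = execute it

# run_perf NUM_STREAMS
# Builds the client command line with 20 iterations per stream.
run_perf() {
    num_streams="$1"
    num_iterations=$((20 * num_streams))
    set -- riva_tts_perf_client \
        --num_parallel_requests="$num_streams" \
        --voice_name=English-US.Female-1 \
        --num_iterations="$num_iterations" \
        --online=true \
        --text_file="$test_file" \
        --write_output_audio=false
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"   # show the command without running it
    else
        "$@"                 # run the client
    fi
}

for n in $STREAM_COUNTS; do
    run_perf "$n"
done
```

With DRY_RUN left at 1 the script only prints the commands, which makes it easy to check the computed --num_iterations values before running against a live server with DRY_RUN=0.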

Results

Latencies to first audio chunk, latencies between audio chunks, and throughput are reported in the following tables. Throughput is measured in RTFX (duration of audio generated / computation time).
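As a quick illustration of the RTFX definition, the sketch below divides a duration of generated audio by the computation time. The rtfx helper and the sample numbers (60 seconds of audio, 0.5 seconds of computation) are made up for this example and are not measurements.

```shell
#!/bin/sh
# Sketch: RTFX = duration of audio generated / computation time.
# Assumption: rtfx is a hypothetical helper; both arguments are in seconds.

# rtfx AUDIO_SECONDS COMPUTE_SECONDS
rtfx() {
    awk -v audio="$1" -v compute="$2" 'BEGIN {
        if (compute <= 0) {
            print "computation time must be positive" > "/dev/stderr"
            exit 1
        }
        printf "%.3f\n", audio / compute
    }'
}

# Illustrative values only: 60 s of audio generated in 0.5 s of computation.
rtfx 60 0.5   # prints 120.000
```

Read the other way around, an RTFX of 144.656 (the single-stream result) corresponds to roughly 6.9 ms of computation per second of generated audio.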

For specifications of the hardware on which these measurements were collected, refer to the Hardware Specifications section.

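| # of streams | Latency to first audio (s): avg | Latency to first audio (s): p90 | Latency to first audio (s): p95 | Latency to first audio (s): p99 | Latency between audio chunks (s): avg | Latency between audio chunks (s): p90 | Latency between audio chunks (s): p95 | Latency between audio chunks (s): p99 | Throughput (RTFX) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.021 | 0.022 | 0.023 | 0.026 | 0.003 | 0.004 | 0.004 | 0.006 | 144.656 |
| 4 | 0.037 | 0.049 | 0.053 | 0.066 | 0.006 | 0.009 | 0.011 | 0.015 | 335.866 |
| 6 | 0.046 | 0.063 | 0.069 | 0.085 | 0.007 | 0.012 | 0.013 | 0.018 | 394.761 |
| 8 | 0.056 | 0.074 | 0.081 | 0.092 | 0.009 | 0.014 | 0.017 | 0.021 | 420.926 |
| 10 | 0.059 | 0.078 | 0.085 | 0.100 | 0.010 | 0.015 | 0.017 | 0.022 | 433.471 |
| 16 | 0.134 | 0.174 | 0.194 | 0.271 | 0.014 | 0.020 | 0.022 | 0.027 | 425.258 |
| 32 | 0.339 | 0.384 | 0.399 | 0.427 | 0.015 | 0.021 | 0.023 | 0.028 | 437.319 |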
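One way to read the throughput column is to divide the aggregate RTFX by the number of streams, which gives a rough per-stream figure. The sketch below does this for the values reported above. The per_stream_rtfx helper is a hypothetical name, and the even split across streams is a simplifying assumption rather than something the measurements state.

```shell
#!/bin/sh
# Sketch: approximate per-stream RTFX = aggregate RTFX / number of streams.
# Assumptions: per_stream_rtfx is a hypothetical helper, and throughput is
# assumed to be shared evenly across streams.

# Reads "streams rtfx" pairs on stdin, prints "streams per_stream_rtfx".
per_stream_rtfx() {
    awk '{ printf "%d %.1f\n", $1, $2 / $1 }'
}

# Stream counts and aggregate RTFX values copied from the results above.
per_stream_rtfx <<'EOF'
1 144.656
4 335.866
6 394.761
8 420.926
10 433.471
16 425.258
32 437.319
EOF
```

Under that assumption, even at 32 streams each stream generates audio at about 13.7 times real time, while the aggregate throughput levels off at roughly 420 to 440 RTFX from 8 streams upward.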

Hardware Specifications

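| Component | Property | Value |
|---|---|---|
| GPU | | NVIDIA DGX A100 40 GB |
| CPU | Model | AMD EPYC 7742 64-Core Processor |
| CPU | Thread(s) per core | 2 |
| CPU | Socket(s) | 2 |
| CPU | Core(s) per socket | 64 |
| CPU | NUMA node(s) | 8 |
| CPU | Frequency boost | enabled |
| CPU | CPU max MHz | 2250 |
| CPU | CPU min MHz | 1500 |
| RAM | Model | Micron DDR4 36ASF8G72PZ-3G2B2 3200MHz |
| RAM | Configured Memory Speed | 2933 MT/s |
| RAM | RAM Size | 32x64GB (2048GB Total) |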

Performance Considerations

When the server is under high load, requests might time out, because the server does not start inference for a new request until a previous request has been fully generated and its inference slot has been freed. This behavior is intended to maximize throughput for the TTS service while still allowing real-time interaction.
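A client that occasionally hits such timeouts could retry the request after a pause. The sketch below shows a generic retry loop with exponential backoff. It is an illustration only, not a documented Riva feature: retry_with_backoff is a hypothetical helper, and the attempt count and delays are arbitrary. Retrying too aggressively adds load to an already busy server, so the delays should be chosen conservatively.

```shell
#!/bin/sh
# Sketch: retry a command with exponential backoff when it fails.
# Assumption: retry_with_backoff is a hypothetical helper; it treats any
# non-zero exit status (including a timeout) as a failed attempt.

# retry_with_backoff MAX_ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
retry_with_backoff() {
    max_attempts="$1"
    delay="$2"
    shift 2
    attempt=1
    while :; do
        "$@" && return 0                  # success: stop retrying
        if [ "$attempt" -ge "$max_attempts" ]; then
            return 1                      # give up after the last attempt
        fi
        sleep "$delay"
        delay=$((delay * 2))              # double the wait before each retry
        attempt=$((attempt + 1))
    done
}

# Example usage (commented out, requires a client command of your own):
# retry_with_backoff 3 1 my_tts_request_command
```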