Performance

Evaluation Process

This section presents latency and throughput numbers for the Riva text-to-speech (TTS) service on different GPUs. Performance was measured for varying numbers of parallel streams. Each parallel stream performed 20 iterations over 10 input strings from the LJSpeech dataset. Three metrics were measured: latency to first audio chunk, latency between successive audio chunks, and throughput. The FastPitch and HiFi-GAN models were tested.
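As an illustration of how the two latency metrics and their summary statistics can be derived from raw timestamps, here is a minimal Python sketch. It does not show how riva_tts_perf_client computes its numbers: the function names are hypothetical, and the nearest-rank percentile method is an assumption.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile (an assumed method, not taken from the Riva client)."""
    ordered = sorted(values)
    # Multiply before dividing so whole-number ranks are computed exactly.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def chunk_latencies(request_sent, chunk_arrivals):
    """Split one request's timestamps (seconds) into the two latency metrics.

    Returns (latency to first audio chunk, list of latencies between chunks).
    """
    first = chunk_arrivals[0] - request_sent
    between = [b - a for a, b in zip(chunk_arrivals, chunk_arrivals[1:])]
    return first, between


def summarize(samples):
    """Return the avg / p90 / p95 / p99 statistics for a list of latency samples."""
    return {
        "avg": mean(samples),
        "p90": percentile(samples, 90),
        "p95": percentile(samples, 95),
        "p99": percentile(samples, 99),
    }
```

For example, chunk_latencies(0.0, [0.5, 0.75, 1.0]) yields a first-chunk latency of 0.5 s and two inter-chunk latencies of 0.25 s each. Pooling such samples over all requests and passing them to summarize gives one row of latency statistics.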

The command used to measure performance was:

riva_tts_perf_client \
    --num_parallel_requests=<num_streams> \
    --voice_name=English-US.Female-1 \
    --num_iterations=<20*num_streams> \
    --online=true \
    --text_file=$test_file \
    --write_output_audio=false

Here, $test_file is the path to the ljs_audio_text_test_filelist_small.txt file.
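To repeat the measurement across several stream counts, the command line can be generated programmatically. The following Python sketch builds the argument list for each stream count that appears in the results tables, using only the flags shown in the command above. The helper names perf_command and sweep_commands are hypothetical, and the sketch only builds the command strings without running the client.

```python
import shlex

# Stream counts that appear in the results tables of this section.
STREAM_COUNTS = [1, 4, 6, 8, 10, 16, 32]


def perf_command(num_streams, test_file, iterations_per_stream=20):
    """Build the riva_tts_perf_client argument list for one stream count."""
    return [
        "riva_tts_perf_client",
        f"--num_parallel_requests={num_streams}",
        "--voice_name=English-US.Female-1",
        # Total iterations scale with the stream count (20 per stream).
        f"--num_iterations={iterations_per_stream * num_streams}",
        "--online=true",
        f"--text_file={test_file}",
        "--write_output_audio=false",
    ]


def sweep_commands(test_file):
    """Return one shell-quoted command string per stream count."""
    return [shlex.join(perf_command(n, test_file)) for n in STREAM_COUNTS]
```

Each returned string can be pasted into a shell, or the list from perf_command can be passed to subprocess.run on a machine where the client is installed.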

Results

Latencies to first audio chunk, latencies between audio chunks, and throughput are reported in the following tables. Throughput is measured in RTFX (duration of audio generated / computation time).
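The RTFX definition can be made concrete with a short Python sketch. The function names are hypothetical and the sample durations are illustrative, not measurements.

```python
def rtfx(audio_seconds, compute_seconds):
    """Throughput as RTFX: duration of audio generated / computation time."""
    if compute_seconds <= 0:
        raise ValueError("compute_seconds must be positive")
    return audio_seconds / compute_seconds


def compute_seconds_for(audio_seconds, rtfx_value):
    """Invert the definition: computation time needed at a given RTFX."""
    if rtfx_value <= 0:
        raise ValueError("rtfx_value must be positive")
    return audio_seconds / rtfx_value


# Illustrative numbers: 60 s of audio generated in 0.5 s of computation.
print(rtfx(60.0, 0.5))
```

An RTFX of 120 means the service generates audio 120 times faster than real time. Read the other way, at the RTFX of 144.656 reported in the first row of the first table, 10 s of audio corresponds to roughly 0.069 s of computation.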

For specifications of the hardware on which these measurements were collected, refer to the Hardware Specifications section.

# of streams	Latency to first audio (s)				Latency between audio chunks (s)				Throughput (RTFX)
	avg	p90	p95	p99	avg	p90	p95	p99
1	0.021	0.022	0.023	0.026	0.003	0.004	0.004	0.006	144.656
4	0.037	0.049	0.053	0.066	0.006	0.009	0.011	0.015	335.866
6	0.046	0.063	0.069	0.085	0.007	0.012	0.013	0.018	394.761
8	0.056	0.074	0.081	0.092	0.009	0.014	0.017	0.021	420.926
10	0.059	0.078	0.085	0.100	0.010	0.015	0.017	0.022	433.471
16	0.134	0.174	0.194	0.271	0.014	0.020	0.022	0.027	425.258
32	0.339	0.384	0.399	0.427	0.015	0.021	0.023	0.028	437.319

# of streams	Latency to first audio (s)				Latency between audio chunks (s)				Throughput (RTFX)
	avg	p90	p95	p99	avg	p90	p95	p99
1	0.022	0.024	0.024	0.028	0.004	0.005	0.005	0.006	126.690
4	0.044	0.058	0.063	0.072	0.007	0.014	0.016	0.018	267.410
6	0.064	0.087	0.094	0.105	0.009	0.017	0.019	0.024	291.877
8	0.082	0.109	0.119	0.135	0.011	0.020	0.023	0.031	309.784
10	0.091	0.123	0.134	0.152	0.013	0.022	0.026	0.035	318.238
16	0.196	0.249	0.266	0.327	0.014	0.028	0.033	0.043	331.539
32	0.427	0.516	0.544	0.603	0.019	0.031	0.036	0.045	348.603

# of streams	Latency to first audio (s)				Latency between audio chunks (s)				Throughput (RTFX)
	avg	p90	p95	p99	avg	p90	p95	p99
1	0.021	0.024	0.024	0.025	0.004	0.005	0.005	0.005	127.362
4	0.049	0.065	0.071	0.080	0.008	0.016	0.019	0.022	235.336
6	0.072	0.099	0.106	0.119	0.011	0.019	0.022	0.028	249.659
8	0.096	0.132	0.143	0.160	0.014	0.024	0.028	0.038	255.642
10	0.108	0.151	0.164	0.189	0.017	0.028	0.033	0.043	255.840
16	0.218	0.293	0.318	0.380	0.020	0.034	0.039	0.051	277.610
32	0.521	0.626	0.658	0.720	0.024	0.039	0.044	0.055	283.774

# of streams	Latency to first audio (s)				Latency between audio chunks (s)				Throughput (RTFX)
	avg	p90	p95	p99	avg	p90	p95	p99
1	0.024	0.026	0.027	0.027	0.005	0.006	0.006	0.006	104.242
4	0.055	0.076	0.082	0.095	0.009	0.015	0.017	0.024	215.233
6	0.080	0.112	0.125	0.150	0.012	0.019	0.022	0.030	227.392
8	0.108	0.153	0.165	0.184	0.015	0.024	0.027	0.034	232.032
10	0.119	0.164	0.175	0.196	0.018	0.027	0.030	0.037	234.513
16	0.238	0.310	0.333	0.398	0.022	0.033	0.037	0.045	253.833
32	0.562	0.654	0.680	0.736	0.026	0.036	0.041	0.050	263.735

# of streams	Latency to first audio (s)				Latency between audio chunks (s)				Throughput (RTFX)
	avg	p90	p95	p99	avg	p90	p95	p99
1	0.050	0.056	0.056	0.068	0.007	0.008	0.008	0.010	63.768
4	0.096	0.127	0.135	0.156	0.016	0.032	0.037	0.042	121.263
6	0.142	0.195	0.211	0.249	0.022	0.040	0.046	0.061	127.092
8	0.188	0.255	0.275	0.312	0.028	0.048	0.055	0.076	131.907
10	0.218	0.300	0.321	0.354	0.030	0.053	0.061	0.082	133.859
16	0.412	0.542	0.585	0.707	0.042	0.067	0.077	0.097	142.174
32	1.024	1.214	1.273	1.457	0.047	0.076	0.087	0.114	144.742

# of streams	Latency to first audio (s)				Latency between audio chunks (s)				Throughput (RTFX)
	avg	p90	p95	p99	avg	p90	p95	p99
1	0.065	0.067	0.068	0.069	0.019	0.020	0.020	0.021	18.106

# of streams	Latency to first audio (s)				Latency between audio chunks (s)				Throughput (RTFX)
	avg	p90	p95	p99	avg	p90	p95	p99
1	0.084	0.085	0.086	0.104	0.019	0.020	0.020	0.021	13.954

# of streams	Latency to first audio (s)				Latency between audio chunks (s)				Throughput (RTFX)
	avg	p90	p95	p99	avg	p90	p95	p99
1	0.035	0.038	0.040	0.051	0.006	0.007	0.008	0.015	36.728

Hardware Specifications

GPU
NVIDIA DGX A100 40 GB
CPU
Model	AMD EPYC 7742 64-Core Processor
Thread(s) per core	2
Socket(s)	2
Core(s) per socket	64
NUMA node(s)	8
Frequency boost	enabled
CPU max MHz	2250
CPU min MHz	1500
RAM
Model	Micron DDR4 36ASF8G72PZ-3G2B2 3200MHz
Configured Memory Speed	2933 MT/s
RAM Size	32x64GB (2048GB Total)

GPU
NVIDIA A30
CPU
Model	AMD EPYC 7742 64-Core Processor
Thread(s) per core	1
Socket(s)	2
Core(s) per socket	64
NUMA node(s)	2
Frequency boost	disabled
CPU max MHz	2250
CPU min MHz	1500
RAM
Model	Samsung DDR4 M393A4K40DB3-CWE 3200MHz
Configured Memory Speed	3200 MT/s
RAM Size	32x64GB (2048GB Total)

GPU
NVIDIA A10
CPU
Model	AMD EPYC 7763 64-Core Processor
Thread(s) per core	1
Socket(s)	2
Core(s) per socket	64
NUMA node(s)	8
Frequency boost	enabled
CPU max MHz	2450
CPU min MHz	1500
RAM
Model	Samsung DDR4 M393A4K40DB3-CWE 3200 MHz
Configured Memory Speed	3200 MT/s
RAM Size	16x32GB (512GB Total)

GPU
NVIDIA V100 SXM2 16 GB
CPU
Model	Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
Thread(s) per core	2
Socket(s)	2
Core(s) per socket	20
NUMA node(s)	2
CPU max MHz	3600
CPU min MHz	1200
RAM
Model	Micron DDR4 36ASF4G72PZ-2G6D1 2667MHz
Configured Memory Speed	2133 MT/s
RAM Size	16x32GB (512GB Total)

GPU
NVIDIA T4
CPU
Model	Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz
Thread(s) per core	2
Socket(s)	2
Core(s) per socket	18
NUMA node(s)	2
CPU max MHz	3900
CPU min MHz	1000
RAM
Model	Samsung DDR4 M393A2K43BB1-CTD 2666MHz
Configured Memory Speed	2666 MT/s
RAM Size	24x16GB (384GB Total)

Performance Considerations

When the server is under high load, requests might time out: the server does not start inference for a new request until a previous request has been completely generated and its inference slot has been freed. This is done to maximize throughput for the TTS service while still allowing for real-time interaction.
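The effect of slot-limited scheduling on waiting time can be illustrated with a toy simulation. This Python sketch is not a model of the actual Riva scheduler. It assumes a fixed number of inference slots, a constant generation time per request, and that all requests arrive at the same moment. The slot count, generation time, and timeout are hypothetical parameters.

```python
import heapq


def queue_waits(num_requests, num_slots, service_seconds):
    """Wait (seconds) before inference starts for requests all arriving at t=0.

    A request starts only when a slot has been freed by a fully generated
    previous request.
    """
    free_at = [0.0] * num_slots  # min-heap of times at which each slot is free
    heapq.heapify(free_at)
    waits = []
    for _ in range(num_requests):
        start = heapq.heappop(free_at)  # earliest slot to become available
        waits.append(start)
        heapq.heappush(free_at, start + service_seconds)
    return waits


def timed_out(waits, timeout_seconds):
    """Flag requests whose wait exceeds a (hypothetical) client timeout."""
    return [w > timeout_seconds for w in waits]
```

With 5 simultaneous requests, 2 slots, and 1 s of generation per request, the waits are 0, 0, 1, 1, and 2 s. With a 1.5 s client timeout, only the last request would time out. Under these assumptions the wait grows with the number of queued requests, which is consistent with the rise in latency to first audio at higher stream counts in the results tables.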
