Performance

Evaluation Process

This section presents latency and throughput numbers for the Riva text-to-speech (TTS) service on different GPUs. Performance of the TTS service was measured for varying numbers of parallel streams. Each parallel stream performed 20 iterations over 10 input strings from the LJSpeech dataset. Three quantities were measured: latency to the first audio chunk, latency between successive audio chunks, and throughput. Two model pipelines were tested: FastPitch with HiFi-GAN, and Tacotron 2 with WaveGlow.

The command used to measure performance was:

riva_tts_perf_client \
    --num_parallel_requests=<num_streams> \
    --voice_name=English-US-Female-1 \
    --num_iterations=<20*num_streams> \
    --online=true \
    --text_file=$test_file \
    --write_output_audio=false

where $test_file is the path to the file ljs_audio_text_test_filelist_small.txt.
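
As an illustration, the sweep over stream counts can be scripted. The Python sketch below only assembles the argument list for each run and prints it; it does not launch the client. The helper name build_perf_command and the STREAM_COUNTS list are assumptions made for this example (the counts are the ones that appear in the result tables), while the flags are exactly those shown in the command above.

```python
import shlex

# Stream counts that appear in the result tables (assumed sweep for this sketch).
STREAM_COUNTS = [1, 4, 6, 8, 10]


def build_perf_command(num_streams, test_file):
    """Return the riva_tts_perf_client argument list for one run.

    Hypothetical helper: it mirrors the documented command, with
    num_iterations set to 20 * num_streams.
    """
    return [
        "riva_tts_perf_client",
        f"--num_parallel_requests={num_streams}",
        "--voice_name=English-US-Female-1",
        f"--num_iterations={20 * num_streams}",
        "--online=true",
        f"--text_file={test_file}",
        "--write_output_audio=false",
    ]


if __name__ == "__main__":
    # Print one shell-quoted command line per stream count.
    for n in STREAM_COUNTS:
        cmd = build_perf_command(n, "ljs_audio_text_test_filelist_small.txt")
        print(shlex.join(cmd))
```

Passing the returned list to subprocess.run would execute a run, provided riva_tts_perf_client is on the PATH and a Riva server is reachable.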

Results

Latencies to first audio chunk, latencies between audio chunks, and throughput are reported in the following tables. Throughput is measured in RTFX (duration of audio generated / computation time).
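
To make the RTFX definition concrete, here is a minimal Python sketch of the ratio. The function name rtfx and the sample numbers are assumptions chosen for illustration; they are not measured values.

```python
def rtfx(audio_seconds, compute_seconds):
    """Seconds of audio generated per second of computation."""
    if compute_seconds <= 0:
        raise ValueError("compute_seconds must be positive")
    return audio_seconds / compute_seconds


# Hypothetical example: 60 s of audio synthesized in 0.5 s of computation.
print(rtfx(60.0, 0.5))  # 120.0
```

Under this definition, a value above 1 means audio is generated faster than real time, and a larger value means higher throughput.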

For specifications of the hardware on which these measurements were collected, refer to the Hardware Specifications section.

# of streams	Latency to first audio (s)				Latency between audio chunks (s)				Throughput (RTFX)
	avg	p90	p95	p99	avg	p90	p95	p99
1	0.020	0.022	0.022	0.023	0.003	0.004	0.004	0.006	161.731
4	0.036	0.048	0.054	0.064	0.005	0.008	0.010	0.012	372.505
6	0.042	0.059	0.065	0.076	0.006	0.010	0.011	0.014	483.976
8	0.055	0.075	0.080	0.092	0.007	0.011	0.013	0.016	527.248
10	0.062	0.084	0.087	0.097	0.007	0.012	0.014	0.017	530.111

# of streams	Latency to first audio (s)				Latency between audio chunks (s)				Throughput (RTFX)
	avg	p90	p95	p99	avg	p90	p95	p99
1	0.045	0.048	0.048	0.050	0.026	0.028	0.028	0.029	30.867
4	0.202	0.286	0.302	0.323	0.017	0.028	0.031	0.037	81.782
6	0.281	0.369	0.388	0.438	0.019	0.030	0.034	0.043	95.791
8	0.357	0.454	0.471	0.507	0.020	0.032	0.037	0.046	105.224
10	0.426	0.512	0.542	0.651	0.021	0.034	0.038	0.050	111.804

# of streams	Latency to first audio (s)				Latency between audio chunks (s)				Throughput (RTFX)
	avg	p90	p95	p99	avg	p90	p95	p99
1	0.032	0.036	0.037	0.040	0.003	0.005	0.005	0.006	121.527
4	0.047	0.061	0.067	0.079	0.006	0.010	0.012	0.017	302.308
6	0.063	0.086	0.092	0.110	0.008	0.015	0.017	0.020	348.049
8	0.084	0.110	0.116	0.132	0.009	0.017	0.020	0.024	367.742
10	0.095	0.125	0.131	0.149	0.010	0.018	0.021	0.025	371.032

# of streams	Latency to first audio (s)				Latency between audio chunks (s)				Throughput (RTFX)
	avg	p90	p95	p99	avg	p90	p95	p99
1	0.059	0.066	0.067	0.069	0.028	0.034	0.034	0.035	26.654
4	0.255	0.350	0.368	0.413	0.025	0.037	0.043	0.055	61.693
6	0.382	0.501	0.527	0.575	0.029	0.045	0.050	0.071	68.254
8	0.508	0.666	0.696	0.756	0.032	0.051	0.059	0.077	72.058
10	0.631	0.761	0.798	0.967	0.033	0.052	0.061	0.088	74.592

# of streams	Latency to first audio (s)				Latency between audio chunks (s)				Throughput (RTFX)
	avg	p90	p95	p99	avg	p90	p95	p99
1	0.025	0.028	0.028	0.029	0.005	0.006	0.006	0.006	113.982
4	0.050	0.073	0.082	0.096	0.008	0.014	0.017	0.023	260.055
6	0.082	0.120	0.136	0.161	0.010	0.021	0.025	0.029	262.642
8	0.121	0.167	0.180	0.210	0.012	0.024	0.027	0.033	265.144
10	0.141	0.193	0.211	0.239	0.012	0.025	0.028	0.034	272.279

# of streams	Latency to first audio (s)				Latency between audio chunks (s)				Throughput (RTFX)
	avg	p90	p95	p99	avg	p90	p95	p99
1	0.073	0.076	0.076	0.079	0.047	0.049	0.050	0.050	17.302
4	0.376	0.504	0.530	0.589	0.031	0.048	0.057	0.071	44.750
6	0.562	0.733	0.774	0.840	0.035	0.058	0.068	0.091	48.801
8	0.733	0.936	0.989	1.075	0.038	0.063	0.074	0.105	52.819
10	0.885	1.105	1.201	1.399	0.042	0.071	0.088	0.126	54.508

# of streams	Latency to first audio (s)				Latency between audio chunks (s)				Throughput (RTFX)
	avg	p90	p95	p99	avg	p90	p95	p99
1	0.055	0.061	0.062	0.068	0.005	0.007	0.008	0.010	73.647
4	0.087	0.108	0.115	0.138	0.012	0.022	0.026	0.032	158.731
6	0.121	0.159	0.171	0.201	0.017	0.033	0.037	0.045	169.916
8	0.161	0.211	0.226	0.263	0.020	0.039	0.043	0.053	177.233
10	0.190	0.253	0.273	0.312	0.022	0.041	0.047	0.058	178.790

# of streams	Latency to first audio (s)				Latency between audio chunks (s)				Throughput (RTFX)
	avg	p90	p95	p99	avg	p90	p95	p99
1	0.099	0.106	0.107	0.118	0.043	0.052	0.053	0.054	17.250
4	0.564	0.772	0.829	0.893	0.063	0.102	0.120	0.162	26.446
6	0.914	1.219	1.287	1.420	0.077	0.126	0.149	0.210	27.500
8	1.295	1.662	1.754	1.892	0.085	0.142	0.177	0.249	27.917
10	1.625	1.994	2.108	2.545	0.092	0.162	0.198	0.280	28.192

# of streams	Latency to first audio (s)				Latency between audio chunks (s)				Throughput (RTFX)
	avg	p90	p95	p99	avg	p90	p95	p99
1	0.065	0.067	0.068	0.069	0.019	0.020	0.020	0.021	18.106

# of streams	Latency to first audio (s)				Latency between audio chunks (s)				Throughput (RTFX)
	avg	p90	p95	p99	avg	p90	p95	p99
1	0.084	0.085	0.086	0.104	0.019	0.020	0.020	0.021	13.954

Hardware Specifications

GPU
NVIDIA DGX A100 40 GB
CPU
Model	AMD EPYC 7742 64-Core Processor
Thread(s) per core	2
Socket(s)	2
Core(s) per socket	64
NUMA node(s)	8
Frequency boost	enabled
CPU max MHz	2250
CPU min MHz	1500
RAM
Model	Micron DDR4 36ASF8G72PZ-3G2B2 3200MHz
Configured Memory Speed	2933 MT/s
RAM Size	32x64GB (2048GB Total)

GPU
NVIDIA A30
CPU
Model	AMD EPYC 7742 64-Core Processor
Thread(s) per core	1
Socket(s)	2
Core(s) per socket	64
NUMA node(s)	2
Frequency boost	disabled
CPU max MHz	2250
CPU min MHz	1500
RAM
Model	Samsung DDR4 M393A4K40DB3-CWE 3200MHz
Configured Memory Speed	3200 MT/s
RAM Size	32x64GB (2048GB Total)

GPU
NVIDIA V100 SXM2 16 GB
CPU
Model	Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
Thread(s) per core	2
Socket(s)	2
Core(s) per socket	20
NUMA node(s)	2
CPU max MHz	3600
CPU min MHz	1200
RAM
Model	Micron DDR4 36ASF4G72PZ-2G6D1 2667MHz
Configured Memory Speed	2133 MT/s
RAM Size	16x32GB (512GB Total)

GPU
NVIDIA T4
CPU
Model	Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz
Thread(s) per core	2
Socket(s)	2
Core(s) per socket	18
NUMA node(s)	2
CPU max MHz	3900
CPU min MHz	1000
RAM
Model	Samsung DDR4 M393A2K43BB1-CTD 2666MHz
Configured Memory Speed	2666 MT/s
RAM Size	24x16GB (384GB Total)

Performance Considerations

When the server is under high load, requests might time out, because the server does not start inference for a new request until a previous request has been completely generated and its inference slot has been freed. This is done to maximize throughput for the TTS service and allow for real-time interaction. NVIDIA does not recommend making more than 8-10 simultaneous requests with the models provided in Riva 2.0.0.
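
One way a client could stay within that limit is to cap the number of requests in flight. The Python sketch below does this with a bounded thread pool. It is an illustration under stated assumptions, not part of Riva: run_capped and fake_synthesize are hypothetical names, and fake_synthesize sleeps instead of contacting a server.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Lower end of the 8-10 range mentioned above (assumed default for this sketch).
MAX_IN_FLIGHT = 8


def fake_synthesize(text):
    """Hypothetical stand-in for a TTS request; sleeps instead of calling a server."""
    time.sleep(0.01)
    return len(text)


def run_capped(texts, synthesize, max_in_flight=MAX_IN_FLIGHT):
    """Call synthesize on every text with at most max_in_flight calls active at once.

    Results are returned in the same order as the inputs.
    """
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        return list(pool.map(synthesize, texts))


if __name__ == "__main__":
    sentences = [f"Sentence number {i}." for i in range(30)]
    results = run_capped(sentences, fake_synthesize)
    print(len(results))
```

In a real client, fake_synthesize would be replaced by the actual request function; the pool size is what bounds the number of simultaneous requests.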
