Performance#

Evaluation Process#

This section presents the latency and throughput numbers of the Riva text-to-speech (TTS) service on different GPUs. The performance of the TTS service was measured for a different number of parallel streams. Each parallel stream performed 20 iterations over 10 input strings from the LJSpeech dataset. Each stream sends a request to the Riva server and waits for all audio chunks to have been received before sending another request. We measured the latency to the first audio chunk, the latency between successive audio chunks, and the overall throughput.

The following diagram shows how the latencies are measured.

Schematic Diagram of Latencies Measured by Riva Streaming TTS Client

We used the Riva TTS performance client (riva_tts_perf_client, provided in the Riva image), to measure performance. You can find the client’s source code in the Riva C++ Clients.

The following command was used to generate the following tables:

riva_tts_perf_client \
    --num_parallel_requests=<num_streams> \
    --num_iterations=<20*num_streams> \
    --online=true \
    --text_file=$test_file \
    --write_output_audio=false

Where test_file is a path to the ljs_audio_text_test_filelist_small.txt file.

Results#

The following tables report the latencies to the first audio chunk, the latencies between audio chunks, and the throughput. We measure throughput in RTFX (duration of audio generated / computation time).

Note

The values in the tables are average values over three trials. The values in the table are rounded to the last significant digit according to the standard deviation calculated on three trials. If a standard deviation is less than 0.001 of the average, then the corresponding value is rounded as if the standard deviation equals 0.001 of the value.

For information about the hardware that collected these measurements, refer to the Hardware Specifications section.

On-Prem

A100

FastPitch HiFi-GAN en-US

# of streams	Latency to first audio (ms)				Latency between audio chunks (ms)				Throughput (RTFX)
	avg	p90	p95	p99	avg	p90	p95	p99
1	22	24.2	25	25.3	2.84	3.1	3.15	4.02	150.8
4	40	50	60	70	5	8	9	12	340
8	63	84	90	100	8	12	14	18	420
16	120	143	154	200	14.3	17.8	19.4	23	460
32	323	340	355	390	14.5	17.9	19.9	23.9	440

Magpie TTS Multilingual

# of streams	Latency to first audio (ms)				Latency between audio chunks (ms)				Throughput (RTFX)
	avg	p90	p95	p99	avg	p90	p95	p99
1	77	82	85	91	15	16	16	17	10.4274
2	84	96	97	98	16	17	17	18	20.429
4	88	98	103	108	17	18	18	19	35.485
8	106	113	115	118	22	23	24	24	57.592
16	128	149	153	160	31	34	35	36	92.004
32	192	206	210	229	49	53	54	56	123.642
64	305	364	385	418	99	112	115	121	152.836

Magpie TTS Zeroshot

# of streams	Latency to first audio (ms)				Latency between audio chunks (ms)				Throughput (RTFX)
	avg	p90	p95	p99	avg	p90	p95	p99
1	373	371	406	440	88	105	108	114	1.036
2	4080	7515	8111	8680	110	160	200	218	1.08

Magpie TTS Flow

# of streams	Throughput (RTFX)
1	3.94927
2	4.22968
4	4.32184
6	4.33257

H100

FastPitch HiFi-GAN en-US

# of streams	Latency to first audio (ms)				Latency between audio chunks (ms)				Throughput (RTFX)
	avg	p90	p95	p99	avg	p90	p95	p99
1	17	19	19.3	20	2.5	3.035	3.08	3.16	185
4	30	42	50	60	4	6	7	9	430
8	60	80	80	90	6	10	11	14	500
16	100	120	130	2000	7.7	13	14.6	18.2	500
32	200	230	242	500	9.5	13	14.6	18.63	700

Magpie TTS Multilingual

# of streams	Latency to first audio (ms)				Latency between audio chunks (ms)				Throughput (RTFX)
	avg	p90	p95	p99	avg	p90	p95	p99
1	96	104	104	105	25	27	27	28	7.58652
2	121	129	130	132	25	26	26	28	12.8315
4	84	96	99	101	16	17	17	19	39.7693
8	104	112	113	115	22	23	23	24	62.6207
16	115	131	132	139	26	27	27	28	105.581
32	152	168	176	188	37	41	42	44	140.663
64	238	258	263	268	61	67	68	70	170.573

Magpie TTS Zeroshot

# of streams	Latency to first audio (ms)				Latency between audio chunks (ms)				Throughput (RTFX)
	avg	p90	p95	p99	avg	p90	p95	p99
1	265	265	295	309	59	65	67	71	1.553
2	2740	5525	5672	6118	80	124	135	149	1.529

Magpie TTS Flow

# of streams	Throughput (RTFX)
1	6.03871
2	6.60842
4	6.72419
6	6.74131

L40

FastPitch HiFi-GAN en-US

# of streams	Latency to first audio (ms)				Latency between audio chunks (ms)				Throughput (RTFX)
	avg	p90	p95	p99	avg	p90	p95	p99
1	21.5	24.3	24.7	25.5	2.4	3.3	3.5	4	162
4	40	55	60	70	5	7	8	10	300
8	60	80	86	100	6.8	10	11	13	440
16	100	122	133	170	9.7	14.4	16.4	21	600
32	300	310	320	2000	12	17	19.4	24	500

Magpie TTS Multilingual

# of streams	Latency to first audio (ms)				Latency between audio chunks (ms)				Throughput (RTFX)
	avg	p90	p95	p99	avg	p90	p95	p99
1	56	64	64	73	6	7	7	7	16
2	57	64	65	69	7	7	7	8	29
4	62	70	76	82	8	10	10	10	49
8	67	75	78	79	11	11	11	12	88
16	89	101	106	110	18	19	20	20	125
32	137	163	166	171	33	37	38	39	145

Magpie TTS Zeroshot

# of streams	Latency to first audio (ms)				Latency between audio chunks (ms)				Throughput (RTFX)
	avg	p90	p95	p99	avg	p90	p95	p99
1	327	329	344	361	77	93	96	102	1.182
2	3967	6917	7213	8187	94	139	161	190	1.2

Magpie TTS Flow

# of streams	Throughput (RTFX)
1	4.8313
2	5.09014
4	5.22121
6	5.28647

On-Prem Hardware Specifications#

A100

GPU
NVIDIA DGX A100 40GB
CPU
Model	AMD EPYC 7742 64-Core Processor
Thread(s) per core	2
Socket(s)	2
Core(s) per socket	64
NUMA node(s)	8
Frequency boost	enabled
CPU max MHz	2250
CPU min MHz	1500
RAM
Model	Micron DDR4 36ASF8G72PZ-3G2B2 3200MHz
Configured Memory Speed	2933 MT/s
RAM Size	32x64GB (2048GB Total)

H100

GPU
NVIDIA H100 80GB HBM3
CPU
Model	Intel(R) Xeon(R) Platinum 8480CL
Thread(s) per core	2
Socket(s)	2
Core(s) per socket	56
NUMA node(s)	2
CPU max MHz	3800
CPU min MHz	800
RAM
Model	Micron DDR5 MTC40F2046S1RC48BA1 4800MHz
Configured Memory Speed	4400 MT/s
RAM Size	32x64GB (2048GB Total)

L40

GPU
NVIDIA L40
CPU
Model	AMD EPYC 7763 64-Core Processor
Thread(s) per core	1
Socket(s)	2
Core(s) per socket	64
NUMA node(s)	8
Frequency boost	enabled
CPU max MHz	3529
CPU min MHz	1500
RAM
Model	Samsung DDR4 M393A4K40DB3-CWE 3200MHz
Configured Memory Speed	3200 MT/s
RAM Size	16x32GB (512GB Total)

Performance Considerations#

When the server is under high load, requests might time out, as the server will not start inference for a new request until a previous request is completely generated so that the inference slot can be freed. This is done to maximize throughput for the TTS service and allow for real-time interaction.

Model Accuracy#

Riva evaluates TTS model accuracy using an automated approach that leverages Automatic Speech Recognition (ASR). The process works as follows:

The TTS model generates synthetic speech from input text
This generated audio is then passed through an ASR system
The ASR transcription is compared with the original input text using Character Error Rate (CER)

The Character Error Rate measures the percentage of characters that differ between the original text and the ASR transcription of the synthesized speech. A lower CER indicates better TTS quality, as it means the synthesized speech was clear enough for ASR to accurately transcribe it back to the original text.

Model	Language	Dataset	CER % ⬇️	ASR model used
Magpie TTS Multilingual	English	subset of LibriTTS dev set	1.0	stt_en_conformer_transducer_large
	Spanish	CML Spanish test set	1.1	whisper-large-v3
	French	CML French test set	3.9	whisper-large-v3
	German	CML German test set	1.26	whisper-large-v3
Magpie TTS Zeroshot	English	subset of LibriTTS dev clean set (unseen)	0.41	stt_en_conformer_transducer_large
Magpie TTS Flow	English	subset of LibriTTS dev clean set (unseen)	1.43	stt_en_conformer_transducer_large

Note

We performed metrics calculations on a subset of the dev-clean split of LibriTTS for English and the CML dataset for French and Spanish. For our analysis, we selected a subset of samples from the total available samples, ensuring that all speakers had at least five utterances of at least five seconds each. The reported metrics are the average values obtained from multiple iterations, ensuring a more efficient and reliable evaluation of the metrics.

# of streams	Latency to first audio (ms)				Latency between audio chunks (ms)				Throughput (RTFX)
	avg	p90	p95	p99	avg	p90	p95	p99
1	56	64	64	73	6	7	7	7	16
2	57	64	65	69	7	7	7	8	29
4	62	70	76	82	8	10	10	10	49
8	67	75	78	79	11	11	11	12	88
16	89	101	106	110	18	19	20	20	125
32	137	163	166	171	33	37	38	39	145

# of streams	Latency to first audio (ms)				Latency between audio chunks (ms)				Throughput (RTFX)
	avg	p90	p95	p99	avg	p90	p95	p99
1	56	64	64	73	6	7	7	7	16
2	57	64	65	69	7	7	7	8	29
4	62	70	76	82	8	10	10	10	49
8	67	75	78	79	11	11	11	12	88
16	89	101	106	110	18	19	20	20	125
32	137	163	166	171	33	37	38	39	145

# of streams	Latency to first audio (ms)				Latency between audio chunks (ms)				Throughput (RTFX)
	avg	p90	p95	p99	avg	p90	p95	p99
1	56	64	64	73	6	7	7	7	16
2	57	64	65	69	7	7	7	8	29
4	62	70	76	82	8	10	10	10	49
8	67	75	78	79	11	11	11	12	88
16	89	101	106	110	18	19	20	20	125
32	137	163	166	171	33	37	38	39	145