
TTS NIM Performance

This page provides first-chunk latency, inter-chunk latency, and throughput benchmarks for the NVIDIA TTS NIM microservice across supported GPUs.

Evaluation Process

Benchmarks measure latency and throughput across varying numbers of parallel streams. Each stream performs 20 iterations over 10 input strings from the LJSpeech dataset, sending a new request only after receiving all audio chunks from the previous one. Three metrics are captured:

First-chunk latency: Time from request submission to receiving the first audio chunk.
Inter-chunk latency: Time between successive audio chunks.
Throughput: Measured in RTFX (duration of audio generated divided by computation time).
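
As a rough illustration of these definitions, the following Python sketch derives the three metrics for a single request from timestamps. It is not the logic of riva_tts_perf_client: the function name, its parameters, and the way computation time is obtained are assumptions made for this example.

```python
def stream_metrics(request_time, chunk_times, audio_seconds, compute_seconds):
    """Derive the three benchmark metrics for one streaming request.

    All argument names are hypothetical and chosen for this sketch:
      request_time    -- time the request was submitted, in seconds
      chunk_times     -- arrival time of each audio chunk, in seconds, ascending
      audio_seconds   -- duration of the generated audio, in seconds
      compute_seconds -- computation time, in seconds (how this is measured
                         is an assumption left to the caller)
    """
    if not chunk_times:
        raise ValueError("at least one audio chunk is required")

    # First-chunk latency: request submission to first audio chunk, in ms.
    first_chunk_ms = (chunk_times[0] - request_time) * 1000.0

    # Inter-chunk latency: gap between each pair of successive chunks, in ms.
    inter_chunk_ms = [
        (later - earlier) * 1000.0
        for earlier, later in zip(chunk_times, chunk_times[1:])
    ]

    # RTFX: duration of audio generated divided by computation time.
    rtfx = audio_seconds / compute_seconds

    return first_chunk_ms, inter_chunk_ms, rtfx
```

An RTFX above 1 means audio is generated faster than it plays back.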

The following diagram shows how these latencies are measured:

[Figure: Schematic diagram of latencies measured by the Riva streaming TTS client]

Benchmarks use the riva_tts_perf_client provided in the Riva image. The source code is available at Riva C++ Clients.

The following command generates the results tables:

riva_tts_perf_client \
    --num_parallel_requests=<num_streams> \
    --num_iterations=<20*num_streams> \
    --online=true \
    --text_file=$test_file \
    --write_output_audio=false

Here, num_streams is the number of parallel streams, and test_file is the path to the ljs_audio_text_test_filelist_small.txt file.
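
To run the same command for several stream counts, the argument list can be generated programmatically. The sketch below is an assumption-laden convenience, not part of the Riva tooling: the helper name perf_command is hypothetical, the flags are only those shown in the command above, and the stream counts are taken from the results tables. It prints the commands rather than executing them.

```python
import shlex


def perf_command(num_streams, text_file, iterations_per_stream=20):
    """Build the riva_tts_perf_client argument list for one stream count.

    perf_command is a hypothetical helper; it only reuses the flags that
    appear in the documented command.
    """
    return [
        "riva_tts_perf_client",
        f"--num_parallel_requests={num_streams}",
        # The documented command uses 20 iterations per stream.
        f"--num_iterations={iterations_per_stream * num_streams}",
        "--online=true",
        f"--text_file={text_file}",
        "--write_output_audio=false",
    ]


# Stream counts as they appear in the results tables.
for n in (1, 2, 4, 8, 16, 32, 64):
    print(shlex.join(perf_command(n, "ljs_audio_text_test_filelist_small.txt")))
```

Passing the resulting list to subprocess.run would execute each command, assuming the client is on the PATH inside the Riva image.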

Results

The following tables report first-chunk latency, inter-chunk latency, and throughput (RTFX).

Note

All values are averages over three trials, rounded to the last significant digit based on standard deviation. If a standard deviation is less than 0.001 of the average, the value is rounded as if the standard deviation equals 0.001 of the average.
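
One possible reading of this rounding rule is sketched below in Python. The interpretation is an assumption: the average is rounded to the decimal position of the leading significant digit of the standard deviation, after raising the standard deviation to at least 0.001 of the average. The function name round_by_std is hypothetical.

```python
import math


def round_by_std(avg, std, min_rel_std=0.001):
    """Round avg to the decimal place of the leading digit of std.

    Assumed interpretation of the note: if std is smaller than
    min_rel_std * |avg|, it is treated as equal to that floor.
    """
    effective_std = max(abs(std), abs(avg) * min_rel_std)
    if effective_std == 0:
        # Both avg and std are zero, so there is nothing to round.
        return avg
    # Decimal exponent of the leading significant digit of the deviation.
    exponent = math.floor(math.log10(effective_std))
    return round(avg, -exponent)
```

Under this reading, an average of 68.2841 with a standard deviation of 0.03 is kept to two decimal places, because the floor of 0.001 of the average (about 0.068) is larger than the measured deviation and its leading digit sits in the second decimal place.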

For the hardware used in these measurements, refer to the Hardware Specifications section.

On-Prem

A100

Magpie TTS Multilingual

# of streams	Latency to first audio (ms)				Latency between audio chunks (ms)				Throughput (RTFX)
	avg	p90	p95	p99	avg	p90	p95	p99
1	68.28	72.77	81.9	83.16	10.97	11.48	11.65	19.12	10.24
2	77.13	82.17	82.44	82.44	12.03	12.51	12.66	23.58	19.2
4	80.79	91.59	92.74	93.39	13.54	14.21	14.41	26.68	32.38
8	92.46	102.23	104.14	104.92	16.95	18.17	18.54	32.87	52.49
16	111.29	125.2	133.54	137.36	23.0	25.01	25.6	45.47	82.33
32	162.66	182.99	198.37	210.97	41.47	46.0	47.27	82.23	113.22
64	302.82	347.56	350.72	359.11	85.4	94.12	104.36	172.19	140.66

Magpie TTS Zeroshot

# of streams	Latency to first audio (ms)				Latency between audio chunks (ms)				Throughput (RTFX)
	avg	p90	p95	p99	avg	p90	p95	p99
1	373	371	406	440	88	105	108	114	1.036
2	4080	7515	8111	8680	110	160	200	218	1.08

Magpie TTS Flow

# of streams	Throughput (RTFX)
1	3.94927
2	4.22968
4	4.32184
6	4.33257

Chatterbox TTS Multilingual

# of streams	Latency to first audio (ms)				Latency between audio chunks (ms)				Throughput (RTFX)
	avg	p90	p95	p99	avg	p90	p95	p99
1	527.47	532.18	533.35	549.47	485.18	521.37	523.98	527.16	1.43
2	594.78	683.04	695.75	695.75	491.99	539.70	607.11	664.94	2.67
4	700.22	819.65	837.37	890.56	514.43	633.89	678.19	722.55	5.00
8	784.56	970.46	1166.24	1465.38	547.36	692.56	757.88	860.11	8.51
16	926.78	1162.61	1339.91	2165.71	597.32	837.57	1046.47	1845.82	16.25
32	2268.96	3693.08	4635.20	6645.20	849.18	1156.59	2265.26	3032.59	20.55
64	3746.55	5616.85	6452.63	8958.53	1302.96	1851.52	2297.68	3971.97	23.90

H100

Magpie TTS Multilingual

# of streams	Latency to first audio (ms)				Latency between audio chunks (ms)				Throughput (RTFX)
	avg	p90	p95	p99	avg	p90	p95	p99
1	70.0	77.91	87.88	88.94	13.22	14.18	14.36	16.89	8.78
2	68.15	79.56	80.8	80.8	10.1	10.47	10.74	20.53	23.14
4	69.96	77.65	78.79	79.43	11.06	11.63	12.12	22.21	42.53
8	76.04	83.34	89.56	92.24	13.31	14.15	14.54	26.5	72.05
16	92.31	104.34	108.72	111.86	18.61	20.38	20.87	36.97	108.38
32	128.6	142.67	149.09	156.31	28.39	32.87	34.37	55.44	150.4
64	231.26	264.24	269.21	284.38	66.08	75.56	82.17	136.94	192.18

Magpie TTS Zeroshot

# of streams	Latency to first audio (ms)				Latency between audio chunks (ms)				Throughput (RTFX)
	avg	p90	p95	p99	avg	p90	p95	p99
1	265	265	295	309	59	65	67	71	1.553
2	2740	5525	5672	6118	80	124	135	149	1.529

Magpie TTS Flow

# of streams	Throughput (RTFX)
1	6.03871
2	6.60842
4	6.72419
6	6.74131

Chatterbox TTS Multilingual

# of streams	Latency to first audio (ms)				Latency between audio chunks (ms)				Throughput (RTFX)
	avg	p90	p95	p99	avg	p90	p95	p99
1	423.94	429.91	438.80	445.88	369.21	400.24	402.70	409.05	1.85
2	474.99	543.06	549.29	549.29	376.28	415.45	468.22	494.15	3.52
4	536.60	658.44	752.02	777.40	390.85	506.86	533.22	615.36	6.55
8	588.99	754.41	780.87	838.02	417.08	543.64	563.76	698.93	11.19
16	835.23	1175.01	1741.68	3234.17	468.67	721.56	942.42	1367.18	20.63
32	1391.40	1767.97	3442.42	3826.35	542.81	727.59	832.81	2368.11	32.87
64	2746.89	4092.95	6124.62	7842.92	968.19	1347.06	1948.58	4272.14	32.93

L40

Magpie TTS Multilingual

# of streams	Latency to first audio (ms)				Latency between audio chunks (ms)				Throughput (RTFX)
	avg	p90	p95	p99	avg	p90	p95	p99
1	62.81	71.92	81.7	82.85	10.51	11.15	11.25	17.7	13.0
2	64.47	74.58	78.15	78.15	9.9	10.42	10.65	19.56	25.46
4	67.78	76.04	76.3	76.45	11.0	11.43	11.68	22.22	44.77
8	74.48	81.22	81.88	90.68	13.43	14.41	14.69	26.75	73.82
16	94.72	109.98	112.41	115.26	19.2	20.88	21.32	38.35	111.73
32	137.4	159.73	163.7	171.62	32.59	36.84	37.59	65.49	136.01

Magpie TTS Zeroshot

# of streams	Latency to first audio (ms)				Latency between audio chunks (ms)				Throughput (RTFX)
	avg	p90	p95	p99	avg	p90	p95	p99
1	327	329	344	361	77	93	96	102	1.182
2	3967	6917	7213	8187	94	139	161	190	1.2

Magpie TTS Flow

# of streams	Throughput (RTFX)
1	4.8313
2	5.09014
4	5.22121
6	5.28647

Chatterbox TTS Multilingual

# of streams	Latency to first audio (ms)				Latency between audio chunks (ms)				Throughput (RTFX)
	avg	p90	p95	p99	avg	p90	p95	p99
1	394.96	396.26	396.54	398.57	336.20	363.12	364.26	367.14	2.02
2	447.74	551.80	573.31	573.31	351.59	386.18	419.66	455.27	3.79
4	496.33	582.65	592.32	659.66	367.37	449.65	476.30	529.05	6.99
8	523.10	640.65	678.32	890.31	373.07	478.13	540.94	596.96	13.74
16	863.12	1555.76	1745.35	1931.19	403.35	696.65	851.44	1263.60	22.64
32	1321.85	1858.44	1901.75	2427.36	504.82	752.50	861.44	1069.70	33.70
64	2311.47	4061.38	4281.49	5069.28	796.91	1123.11	1297.64	2804.21	40.44

B200

Magpie TTS Multilingual

# of streams	Latency to first audio (ms)				Latency between audio chunks (ms)				Throughput (RTFX)
	avg	p90	p95	p99	avg	p90	p95	p99
1	55.1	59.7	60.77	62.7	3.55	3.74	3.81	13.12	11.78
2	54.35	67.42	67.69	67.69	4.37	5.23	5.51	14.35	18.53
4	58.45	67.52	68.78	69.75	6.17	7.24	7.79	12.38	38.73
8	63.77	71.7	74.76	76.55	9.08	11.34	11.78	16.96	76.09
16	85.03	94.03	97.84	104.93	16.96	20.39	21.23	36.6	111.48
32	126.24	145.96	149.21	156.29	31.62	39.62	41.08	72.83	172.14
64	184.15	233.75	246.01	272.86	48.66	74.03	82.55	142.27	180.81

Chatterbox TTS Multilingual

# of streams	Latency to first audio (ms)				Latency between audio chunks (ms)				Throughput (RTFX)
	avg	p90	p95	p99	avg	p90	p95	p99
1	352.94	355.54	356.76	411.93	286.32	308.77	310.50	313.72	2.36
2	384.05	449.78	488.24	488.24	294.96	324.26	344.36	371.69	4.50
4	438.47	509.67	536.94	567.73	323.41	382.54	407.61	489.88	8.02
8	456.86	537.34	539.37	546.89	323.56	409.28	423.95	469.68	15.95
16	588.68	785.38	981.76	1225.87	365.26	488.98	560.08	816.19	27.21
32	1139.84	1528.95	1714.38	2109.88	451.16	627.62	679.19	1367.68	36.69
64	2239.96	4264.38	5004.28	7587.56	775.94	1059.72	1309.91	3254.42	41.70

DGX Spark

Magpie TTS Multilingual

# of streams	Latency to first audio (ms)				Latency between audio chunks (ms)				Throughput (RTFX)
	avg	p90	p95	p99	avg	p90	p95	p99
1	61.31	73.38	75.05	78.39	6.35	7.85	8.36	15.36	7.67
2	64.2	71.11	73.22	73.22	7.25	9.11	10.19	13.19	12.7
8	107.19	134.55	140.96	144.79	19.11	25.91	27.76	35.27	31.29
16	191.47	228.95	237.6	260.4	40.77	54.3	58.62	82.0	39.51
32	351.08	429.41	453.12	477.54	95.67	122.53	132.82	199.9	48.52
64	704.95	838.12	866.62	925.68	210.05	256.88	274.39	436.56	51.74

Chatterbox TTS Multilingual

# of streams	Latency to first audio (ms)				Latency between audio chunks (ms)				Throughput (RTFX)
	avg	p90	p95	p99	avg	p90	p95	p99
1	357.52	427.04	437.72	459.94	240.72	282.86	316.49	351.08	2.74
2	436.87	495.28	606.57	606.57	294.94	350.69	368.72	431.68	4.47
4	616.65	792.50	807.07	894.74	368.98	491.29	522.38	619.14	6.75
8	885.82	1082.45	1150.24	1389.14	477.11	660.66	711.70	871.46	10.17
16	1717.39	2572.25	3080.30	4594.59	769.74	1181.94	1472.03	2152.18	12.07
32	2813.28	3753.18	3990.96	4873.21	1241.74	1816.23	2046.19	2322.51	15.02
64	6670.24	9271.26	11739.70	14130.90	2409.61	3749.86	4152.97	5401.64	14.50

On-Prem Hardware Specifications

A100

GPU
NVIDIA DGX A100 40GB
CPU
Model	AMD EPYC 7742 64-Core Processor
Thread(s) per core	2
Socket(s)	2
Core(s) per socket	64
NUMA node(s)	8
Frequency boost	enabled
CPU max MHz	2250
CPU min MHz	1500
RAM
Model	Micron DDR4 36ASF8G72PZ-3G2B2 3200MHz
Configured Memory Speed	2933 MT/s
RAM Size	32x64GB (2048GB Total)

H100

GPU
NVIDIA H100 80GB HBM3
CPU
Model	Intel(R) Xeon(R) Platinum 8480CL
Thread(s) per core	2
Socket(s)	2
Core(s) per socket	56
NUMA node(s)	2
CPU max MHz	3800
CPU min MHz	800
RAM
Model	Micron DDR5 MTC40F2046S1RC48BA1 4800MHz
Configured Memory Speed	4400 MT/s
RAM Size	32x64GB (2048GB Total)

L40

GPU
NVIDIA L40
CPU
Model	AMD EPYC 7763 64-Core Processor
Thread(s) per core	1
Socket(s)	2
Core(s) per socket	64
NUMA node(s)	8
Frequency boost	enabled
CPU max MHz	3529
CPU min MHz	1500
RAM
Model	Samsung DDR4 M393A4K40DB3-CWE 3200MHz
Configured Memory Speed	3200 MT/s
RAM Size	16x32GB (512GB Total)

Performance Considerations

Under high load, requests can time out: the server finishes the current request and frees its inference slot before it starts a new one, so pending requests must wait. This behavior maximizes throughput and supports real-time interaction.

Model Accuracy

TTS model accuracy is evaluated using an ASR-based round-trip approach:

The TTS model generates synthetic speech from input text.
An ASR system transcribes the generated audio.
The ASR transcription is compared with the original input text using Character Error Rate (CER).

CER measures the percentage of characters that differ between the original text and the ASR transcription. Lower CER indicates better synthesis quality – the speech was clear enough for ASR to accurately recover the original text.
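
The comparison step can be sketched in Python using the common definition of CER: character-level edit distance divided by the length of the reference text. This is an illustration only. The function names are hypothetical, and any text normalization applied in the actual evaluation (casing, punctuation) is not specified here and is not reproduced.

```python
def edit_distance(reference, hypothesis):
    """Character-level Levenshtein distance between two strings."""
    # prev[j] holds the distance between the processed reference prefix
    # and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = prev[j - 1] + (0 if ref_char == hyp_char else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def cer_percent(reference, hypothesis):
    """CER as a percentage of the reference length."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)
```

Here reference is the original input text and hypothesis is the ASR transcription of the synthesized audio. Because insertions count as errors, CER computed this way can exceed 100% for very poor transcriptions.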

Model	Language	Dataset	CER % (lower is better)	ASR model used
Magpie TTS Multilingual	English	subset of LibriTTS dev set	1.0	stt_en_conformer_transducer_large
	Spanish	CML Spanish test set	1.1	whisper-large-v3
	French	CML French test set	3.9	whisper-large-v3
	German	CML German test set	1.26	whisper-large-v3
Magpie TTS Zeroshot	English	subset of LibriTTS dev clean set (unseen)	0.41	stt_en_conformer_transducer_large
Magpie TTS Flow	English	subset of LibriTTS dev clean set (unseen)	1.43	stt_en_conformer_transducer_large

Note

Metrics are calculated on a subset of the LibriTTS dev-clean split for English and on the CML dataset for Spanish, French, and German. The subset includes only speakers with at least five utterances of at least five seconds each. Reported values are averages over multiple iterations.