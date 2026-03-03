

Performance

Evaluation Process

This section presents latency and throughput numbers for the Riva text-to-speech (TTS) service on different GPUs. We measured the performance of the TTS service for different numbers of parallel streams. Each parallel stream performed 20 iterations over 10 input strings from the LJSpeech dataset. Each stream sent a request to the Riva server and waited until all audio chunks had been received before sending the next request. We measured the latency to the first audio chunk, the latency between successive audio chunks, and the overall throughput.

The following diagram shows how the latencies are measured.

[Figure: Schematic diagram of latencies measured by the Riva streaming TTS client]
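The two latency metrics can be illustrated with a minimal sketch. It assumes that the latency to first audio is the time from sending the request to receiving the first chunk, and that the latency between chunks is the gap between consecutive chunk arrivals. The function name and the timestamps are hypothetical, and the actual client may measure these quantities differently.

```python
from typing import List, Tuple


def chunk_latencies(request_sent: float, chunk_arrivals: List[float]) -> Tuple[float, List[float]]:
    """Return (latency to first chunk, gaps between successive chunks) in seconds."""
    if not chunk_arrivals:
        raise ValueError("at least one audio chunk is required")
    # Time from sending the request until the first audio chunk arrives.
    first = chunk_arrivals[0] - request_sent
    # Time between each pair of consecutive chunk arrivals.
    between = [later - earlier for earlier, later in zip(chunk_arrivals, chunk_arrivals[1:])]
    return first, between


# Hypothetical timestamps (seconds) for one request that produced three chunks.
first, between = chunk_latencies(10.0, [10.25, 10.5, 11.0])
print(first, between)  # 0.25 [0.25, 0.5]
```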

We used the Riva TTS performance client (riva_tts_perf_client), which is provided in the Riva image, to measure performance. You can find the client’s source code in the Riva C++ Clients.

We used the following command to generate the tables in the Results section:

riva_tts_perf_client \
    --num_parallel_requests=<num_streams> \
    --num_iterations=<20*num_streams> \
    --online=true \
    --text_file=$test_file \
    --write_output_audio=false

Here, test_file is the path to the ljs_audio_text_test_filelist_small.txt file.
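As a rough illustration of how the placeholders in the command expand for each stream count, the sketch below builds the argument list for several runs. It uses only the flags shown in the command above. The helper name build_perf_command and the stream counts in the loop are assumptions chosen for illustration, and the script only prints the commands rather than running them.

```python
from typing import List


def build_perf_command(num_streams: int, text_file: str) -> List[str]:
    """Build the argument list for one benchmark run with the flags shown above."""
    return [
        "riva_tts_perf_client",
        f"--num_parallel_requests={num_streams}",
        # 20 iterations per stream, as described in the evaluation process.
        f"--num_iterations={20 * num_streams}",
        "--online=true",
        f"--text_file={text_file}",
        "--write_output_audio=false",
    ]


# Hypothetical sweep over stream counts; adjust to the configuration under test.
for streams in (1, 4, 8, 16, 32):
    print(" ".join(build_perf_command(streams, "ljs_audio_text_test_filelist_small.txt")))
```

The resulting list can be passed to subprocess.run on a machine where the client is installed.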

Results

The following tables report the latencies to the first audio chunk, the latencies between audio chunks, and the throughput. We measure throughput in RTFX (duration of audio generated / computation time).
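The RTFX definition above amounts to a single division. The following sketch shows it with hypothetical numbers; the function name is an assumption and is not part of the Riva client.

```python
def rtfx(audio_seconds: float, compute_seconds: float) -> float:
    """Throughput as seconds of audio generated per second of computation."""
    if compute_seconds <= 0:
        raise ValueError("compute_seconds must be positive")
    return audio_seconds / compute_seconds


# Hypothetical run: 300 s of audio generated in 2 s of computation.
print(rtfx(300.0, 2.0))  # 150.0
```

An RTFX of 1 means audio is generated exactly as fast as it plays back; higher values leave headroom for more parallel streams.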

Note

The values in the tables are averages over three trials. Each value is rounded to the last significant digit, as determined by the standard deviation calculated over the three trials. If a standard deviation is less than 0.001 of the average, the corresponding value is rounded as if the standard deviation were 0.001 of the average.
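One possible reading of this rounding rule is sketched below: the average is rounded to the decimal place of the leading digit of the standard deviation, with the standard deviation floored at 0.001 of the average. The use of the sample standard deviation, the function name, and the trial values are assumptions; the source does not state these details.

```python
import math
import statistics
from typing import Sequence


def round_by_std(trials: Sequence[float]) -> float:
    """Round the mean of the trials to the decimal place of the std's leading digit."""
    mean = statistics.fmean(trials)
    std = statistics.stdev(trials)  # sample standard deviation (an assumption)
    # Floor the standard deviation at 0.001 of the average, as the note describes.
    std = max(std, 0.001 * abs(mean))
    if std == 0:
        return mean  # all trials are zero; nothing to round
    # Decimal exponent of the leading significant digit of the std.
    exponent = math.floor(math.log10(std))
    return round(mean, -exponent)


# Hypothetical trials: a std of about 12 rounds the mean to the nearest ten.
print(round_by_std([428.0, 441.0, 452.0]))
# Hypothetical trials: a std of about 0.2 keeps one decimal place.
print(round_by_std([150.6, 150.8, 151.0]))
```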

For information about the hardware on which these measurements were collected, refer to the On-Prem Hardware Specifications section.

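| # of streams | Latency to first audio, avg (ms) | Latency to first audio, p90 (ms) | Latency to first audio, p95 (ms) | Latency to first audio, p99 (ms) | Latency between audio chunks, avg (ms) | Latency between audio chunks, p90 (ms) | Latency between audio chunks, p95 (ms) | Latency between audio chunks, p99 (ms) | Throughput (RTFX) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 22 | 24.2 | 25 | 25.3 | 2.84 | 3.1 | 3.15 | 4.02 | 150.8 |
| 4 | 40 | 50 | 60 | 70 | 5 | 8 | 9 | 12 | 340 |
| 8 | 63 | 84 | 90 | 100 | 8 | 12 | 14 | 18 | 420 |
| 16 | 120 | 143 | 154 | 200 | 14.3 | 17.8 | 19.4 | 23 | 460 |
| 32 | 323 | 340 | 355 | 390 | 14.5 | 17.9 | 19.9 | 23.9 | 440 |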

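| # of streams | Latency to first audio, avg (ms) | Latency to first audio, p90 (ms) | Latency to first audio, p95 (ms) | Latency to first audio, p99 (ms) | Latency between audio chunks, avg (ms) | Latency between audio chunks, p90 (ms) | Latency between audio chunks, p95 (ms) | Latency between audio chunks, p99 (ms) | Throughput (RTFX) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 77 | 82 | 85 | 91 | 15 | 16 | 16 | 17 | 10.4274 |
| 2 | 84 | 96 | 97 | 98 | 16 | 17 | 17 | 18 | 20.429 |
| 4 | 88 | 98 | 103 | 108 | 17 | 18 | 18 | 19 | 35.485 |
| 8 | 106 | 113 | 115 | 118 | 22 | 23 | 24 | 24 | 57.592 |
| 16 | 128 | 149 | 153 | 160 | 31 | 34 | 35 | 36 | 92.004 |
| 32 | 192 | 206 | 210 | 229 | 49 | 53 | 54 | 56 | 123.642 |
| 64 | 305 | 364 | 385 | 418 | 99 | 112 | 115 | 121 | 152.836 |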

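| # of streams | Latency to first audio, avg (ms) | Latency to first audio, p90 (ms) | Latency to first audio, p95 (ms) | Latency to first audio, p99 (ms) | Latency between audio chunks, avg (ms) | Latency between audio chunks, p90 (ms) | Latency between audio chunks, p95 (ms) | Latency between audio chunks, p99 (ms) | Throughput (RTFX) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 373 | 371 | 406 | 440 | 88 | 105 | 108 | 114 | 1.036 |
| 2 | 4080 | 7515 | 8111 | 8680 | 110 | 160 | 200 | 218 | 1.08 |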

| # of streams | Throughput (RTFX) |
|---|---|
| 1 | 3.94927 |
| 2 | 4.22968 |
| 4 | 4.32184 |
| 6 | 4.33257 |

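| # of streams | Latency to first audio, avg (ms) | Latency to first audio, p90 (ms) | Latency to first audio, p95 (ms) | Latency to first audio, p99 (ms) | Latency between audio chunks, avg (ms) | Latency between audio chunks, p90 (ms) | Latency between audio chunks, p95 (ms) | Latency between audio chunks, p99 (ms) | Throughput (RTFX) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 17 | 19 | 19.3 | 20 | 2.5 | 3.035 | 3.08 | 3.16 | 185 |
| 4 | 30 | 42 | 50 | 60 | 4 | 6 | 7 | 9 | 430 |
| 8 | 60 | 80 | 80 | 90 | 6 | 10 | 11 | 14 | 500 |
| 16 | 100 | 120 | 130 | 2000 | 7.7 | 13 | 14.6 | 18.2 | 500 |
| 32 | 200 | 230 | 242 | 500 | 9.5 | 13 | 14.6 | 18.63 | 700 |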

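| # of streams | Latency to first audio, avg (ms) | Latency to first audio, p90 (ms) | Latency to first audio, p95 (ms) | Latency to first audio, p99 (ms) | Latency between audio chunks, avg (ms) | Latency between audio chunks, p90 (ms) | Latency between audio chunks, p95 (ms) | Latency between audio chunks, p99 (ms) | Throughput (RTFX) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 96 | 104 | 104 | 105 | 25 | 27 | 27 | 28 | 7.58652 |
| 2 | 121 | 129 | 130 | 132 | 25 | 26 | 26 | 28 | 12.8315 |
| 4 | 84 | 96 | 99 | 101 | 16 | 17 | 17 | 19 | 39.7693 |
| 8 | 104 | 112 | 113 | 115 | 22 | 23 | 23 | 24 | 62.6207 |
| 16 | 115 | 131 | 132 | 139 | 26 | 27 | 27 | 28 | 105.581 |
| 32 | 152 | 168 | 176 | 188 | 37 | 41 | 42 | 44 | 140.663 |
| 64 | 238 | 258 | 263 | 268 | 61 | 67 | 68 | 70 | 170.573 |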

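| # of streams | Latency to first audio, avg (ms) | Latency to first audio, p90 (ms) | Latency to first audio, p95 (ms) | Latency to first audio, p99 (ms) | Latency between audio chunks, avg (ms) | Latency between audio chunks, p90 (ms) | Latency between audio chunks, p95 (ms) | Latency between audio chunks, p99 (ms) | Throughput (RTFX) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 265 | 265 | 295 | 309 | 59 | 65 | 67 | 71 | 1.553 |
| 2 | 2740 | 5525 | 5672 | 6118 | 80 | 124 | 135 | 149 | 1.529 |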

| # of streams | Throughput (RTFX) |
|---|---|
| 1 | 6.03871 |
| 2 | 6.60842 |
| 4 | 6.72419 |
| 6 | 6.74131 |

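| # of streams | Latency to first audio, avg (ms) | Latency to first audio, p90 (ms) | Latency to first audio, p95 (ms) | Latency to first audio, p99 (ms) | Latency between audio chunks, avg (ms) | Latency between audio chunks, p90 (ms) | Latency between audio chunks, p95 (ms) | Latency between audio chunks, p99 (ms) | Throughput (RTFX) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 21.5 | 24.3 | 24.7 | 25.5 | 2.4 | 3.3 | 3.5 | 4 | 162 |
| 4 | 40 | 55 | 60 | 70 | 5 | 7 | 8 | 10 | 300 |
| 8 | 60 | 80 | 86 | 100 | 6.8 | 10 | 11 | 13 | 440 |
| 16 | 100 | 122 | 133 | 170 | 9.7 | 14.4 | 16.4 | 21 | 600 |
| 32 | 300 | 310 | 320 | 2000 | 12 | 17 | 19.4 | 24 | 500 |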

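| # of streams | Latency to first audio, avg (ms) | Latency to first audio, p90 (ms) | Latency to first audio, p95 (ms) | Latency to first audio, p99 (ms) | Latency between audio chunks, avg (ms) | Latency between audio chunks, p90 (ms) | Latency between audio chunks, p95 (ms) | Latency between audio chunks, p99 (ms) | Throughput (RTFX) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 56 | 64 | 64 | 73 | 6 | 7 | 7 | 7 | 16 |
| 2 | 57 | 64 | 65 | 69 | 7 | 7 | 7 | 8 | 29 |
| 4 | 62 | 70 | 76 | 82 | 8 | 10 | 10 | 10 | 49 |
| 8 | 67 | 75 | 78 | 79 | 11 | 11 | 11 | 12 | 88 |
| 16 | 89 | 101 | 106 | 110 | 18 | 19 | 20 | 20 | 125 |
| 32 | 137 | 163 | 166 | 171 | 33 | 37 | 38 | 39 | 145 |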

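| # of streams | Latency to first audio, avg (ms) | Latency to first audio, p90 (ms) | Latency to first audio, p95 (ms) | Latency to first audio, p99 (ms) | Latency between audio chunks, avg (ms) | Latency between audio chunks, p90 (ms) | Latency between audio chunks, p95 (ms) | Latency between audio chunks, p99 (ms) | Throughput (RTFX) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 327 | 329 | 344 | 361 | 77 | 93 | 96 | 102 | 1.182 |
| 2 | 3967 | 6917 | 7213 | 8187 | 94 | 139 | 161 | 190 | 1.2 |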

| # of streams | Throughput (RTFX) |
|---|---|
| 1 | 4.8313 |
| 2 | 5.09014 |
| 4 | 5.22121 |
| 6 | 5.28647 |

On-Prem Hardware Specifications

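| Component | Property | Value |
|---|---|---|
| GPU | | NVIDIA DGX A100 40GB |
| CPU | Model | AMD EPYC 7742 64-Core Processor |
| CPU | Thread(s) per core | 2 |
| CPU | Socket(s) | 2 |
| CPU | Core(s) per socket | 64 |
| CPU | NUMA node(s) | 8 |
| CPU | Frequency boost | enabled |
| CPU | CPU max MHz | 2250 |
| CPU | CPU min MHz | 1500 |
| RAM | Model | Micron DDR4 36ASF8G72PZ-3G2B2 3200MHz |
| RAM | Configured Memory Speed | 2933 MT/s |
| RAM | RAM Size | 32x64GB (2048GB Total) |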

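| Component | Property | Value |
|---|---|---|
| GPU | | NVIDIA H100 80GB HBM3 |
| CPU | Model | Intel(R) Xeon(R) Platinum 8480CL |
| CPU | Thread(s) per core | 2 |
| CPU | Socket(s) | 2 |
| CPU | Core(s) per socket | 56 |
| CPU | NUMA node(s) | 2 |
| CPU | CPU max MHz | 3800 |
| CPU | CPU min MHz | 800 |
| RAM | Model | Micron DDR5 MTC40F2046S1RC48BA1 4800MHz |
| RAM | Configured Memory Speed | 4400 MT/s |
| RAM | RAM Size | 32x64GB (2048GB Total) |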

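| Component | Property | Value |
|---|---|---|
| GPU | | NVIDIA L40 |
| CPU | Model | AMD EPYC 7763 64-Core Processor |
| CPU | Thread(s) per core | 1 |
| CPU | Socket(s) | 2 |
| CPU | Core(s) per socket | 64 |
| CPU | NUMA node(s) | 8 |
| CPU | Frequency boost | enabled |
| CPU | CPU max MHz | 3529 |
| CPU | CPU min MHz | 1500 |
| RAM | Model | Samsung DDR4 M393A4K40DB3-CWE 3200MHz |
| RAM | Configured Memory Speed | 3200 MT/s |
| RAM | RAM Size | 16x32GB (512GB Total) |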

Performance Considerations

When the server is under high load, requests might time out because the server does not start inference for a new request until a previous request has been fully generated and its inference slot has been freed. This behavior maximizes throughput for the TTS service while still allowing real-time interaction.
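The effect of a fixed number of inference slots on waiting requests can be illustrated with a toy model. This is not Riva's scheduler: the first-come, first-served policy, the slot count, the constant service time, and the client-side timeout are all hypothetical values chosen to show why queued requests can time out under load.

```python
import heapq
from typing import List, Tuple


def simulate_slots(
    arrivals: List[float], service_time: float, slots: int, timeout: float
) -> List[Tuple[float, bool]]:
    """Return (wait, timed_out) per request for FIFO admission into fixed slots."""
    # Min-heap of the times at which each slot next becomes free.
    free_at = [0.0] * slots
    heapq.heapify(free_at)
    results = []
    for arrival in sorted(arrivals):
        # A request can start only when the earliest slot has been freed.
        start = max(arrival, free_at[0])
        wait = start - arrival
        if wait > timeout:
            # The client gives up before a slot is free; no slot is consumed.
            results.append((wait, True))
            continue
        heapq.heapreplace(free_at, start + service_time)
        results.append((wait, False))
    return results


# Hypothetical load: four simultaneous requests, two slots, 1 s per request.
print(simulate_slots([0.0, 0.0, 0.0, 0.0], service_time=1.0, slots=2, timeout=0.5))
```

In this toy run the first two requests start immediately, while the other two would have to wait a full second for a slot and therefore exceed the 0.5 s timeout.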

Model Accuracy

Riva evaluates TTS model accuracy using an automated approach that leverages Automatic Speech Recognition (ASR). The process works as follows:

  1. The TTS model generates synthetic speech from input text

  2. This generated audio is then passed through an ASR system

  3. The ASR transcription is compared with the original input text using Character Error Rate (CER)

The Character Error Rate measures the percentage of characters that differ between the original text and the ASR transcription of the synthesized speech. A lower CER indicates better TTS quality, as it means the synthesized speech was clear enough for ASR to accurately transcribe it back to the original text.
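A minimal sketch of the CER computation follows, assuming the common definition of CER as the character-level edit distance (substitutions, deletions, and insertions) divided by the length of the reference text. The source does not state which text normalization (case, punctuation) is applied before the comparison, so none is applied here. The function names and sample strings are hypothetical.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of character edits that turn reference into hypothesis."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(
                min(
                    previous[j] + 1,         # delete a reference character
                    current[j - 1] + 1,      # insert a hypothesis character
                    previous[j - 1] + cost,  # match or substitute
                )
            )
        previous = current
    return previous[-1]


def cer_percent(reference: str, hypothesis: str) -> float:
    """Character Error Rate of an ASR transcript against the input text, in percent."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return 100.0 * levenshtein(reference, hypothesis) / len(reference)


# Hypothetical example: one substituted character in an 11-character reference.
print(round(cer_percent("hello world", "hallo world"), 2))  # 9.09
```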

| Model | Language | Dataset | CER % (lower is better) | ASR model used |
|---|---|---|---|---|
| Magpie TTS Multilingual | English | subset of LibriTTS dev set | 1.0 | stt_en_conformer_transducer_large |
| Magpie TTS Multilingual | Spanish | CML Spanish test set | 1.1 | whisper-large-v3 |
| Magpie TTS Multilingual | French | CML French test set | 3.9 | whisper-large-v3 |
| Magpie TTS Multilingual | German | CML German test set | 1.26 | whisper-large-v3 |
| Magpie TTS Zeroshot | English | subset of LibriTTS dev clean set (unseen) | 0.41 | stt_en_conformer_transducer_large |
| Magpie TTS Flow | English | subset of LibriTTS dev clean set (unseen) | 1.43 | stt_en_conformer_transducer_large |

Note

We calculated the metrics on a subset of the dev-clean split of LibriTTS for English and on the CML dataset for Spanish, French, and German. We selected a subset of the available samples such that every speaker had at least five utterances of at least five seconds each. The reported metrics are averages over multiple iterations, which makes the evaluation more reliable.