TTS NIM Performance

This page provides first-chunk latency, inter-chunk latency, and throughput benchmarks for the NVIDIA TTS NIM microservice across supported GPUs.

Evaluation Process

Benchmarks measure latency and throughput across varying numbers of parallel streams. Each stream performs 20 iterations over 10 input strings from the LJSpeech dataset, sending a new request only after receiving all audio chunks from the previous one. Three metrics are captured:

  • First-chunk latency: Time from request submission to receiving the first audio chunk.

  • Inter-chunk latency: Time between successive audio chunks.

  • Throughput: Measured in RTFX (duration of audio generated divided by computation time).
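The three metrics above can be sketched in a few lines of Python. This is an illustrative sketch only, not the code used by the benchmark client: the `StreamTiming` record and the function names are hypothetical, and it assumes the client records a timestamp when the request is sent and when each audio chunk arrives.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StreamTiming:
    """Hypothetical per-request record; all times are in seconds."""
    request_sent: float          # when the request was submitted
    chunk_arrivals: List[float]  # arrival time of each audio chunk, in order
    audio_seconds: float         # duration of the audio that was generated


def first_chunk_latency_ms(t: StreamTiming) -> float:
    """Time from request submission to the first audio chunk."""
    return (t.chunk_arrivals[0] - t.request_sent) * 1000.0


def inter_chunk_latencies_ms(t: StreamTiming) -> List[float]:
    """Time between each pair of successive audio chunks."""
    arrivals = t.chunk_arrivals
    return [(later - earlier) * 1000.0
            for earlier, later in zip(arrivals, arrivals[1:])]


def rtfx(total_audio_seconds: float, computation_seconds: float) -> float:
    """Throughput: audio duration divided by computation time."""
    return total_audio_seconds / computation_seconds
```

With these definitions, an RTFX of 4.0 means that 2 seconds of audio were produced in 0.5 seconds of computation.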

The following diagram shows how these latencies are measured:

[Figure: Schematic diagram of latencies measured by the Riva streaming TTS client]

Benchmarks use the riva_tts_perf_client binary provided in the Riva image. The source code is available in the Riva C++ Clients repository.

The following command generates the results tables:

riva_tts_perf_client \
    --num_parallel_requests=<num_streams> \
    --num_iterations=<20*num_streams> \
    --online=true \
    --text_file=$test_file \
    --write_output_audio=false

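The test_file variable holds the path to the ljs_audio_text_test_filelist_small.txt file. Replace <num_streams> with the number of parallel streams under test, and set the iteration count to 20 times that number.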
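To sweep several stream counts, the command line can be generated programmatically. The following sketch only builds the argument list from the flags shown above. The helper name `build_perf_command` is hypothetical, and the stream counts in the loop are taken from the results tables rather than from any documented test script.

```python
import shlex
from typing import List

ITERATIONS_PER_STREAM = 20  # each stream performs 20 iterations


def build_perf_command(num_streams: int, test_file: str) -> List[str]:
    """Build the riva_tts_perf_client argument list for one stream count."""
    if num_streams < 1:
        raise ValueError("num_streams must be at least 1")
    return [
        "riva_tts_perf_client",
        f"--num_parallel_requests={num_streams}",
        f"--num_iterations={ITERATIONS_PER_STREAM * num_streams}",
        "--online=true",
        f"--text_file={test_file}",
        "--write_output_audio=false",
    ]


if __name__ == "__main__":
    # Print one shell-quoted command per stream count.
    for n in (1, 2, 4, 8, 16, 32, 64):
        cmd = build_perf_command(n, "ljs_audio_text_test_filelist_small.txt")
        print(shlex.join(cmd))
```

Passing the resulting list to `subprocess.run` avoids shell quoting issues if the test file path contains spaces.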

Results

The following tables report first-chunk latency, inter-chunk latency, and throughput (RTFX).

Note

All values are averages over three trials, rounded to the last significant digit based on standard deviation. If a standard deviation is less than 0.001 of the average, the value is rounded as if the standard deviation equals 0.001 of the average.
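One possible reading of this rounding rule is sketched below. It is an interpretation, not the script used to produce the tables: the function name `round_by_std` is hypothetical, the sample standard deviation is an assumption, and the average is rounded to the decimal position of the leading significant digit of the standard deviation.

```python
import math
from statistics import mean, stdev
from typing import Sequence


def round_by_std(values: Sequence[float], min_rel_std: float = 0.001) -> float:
    """Round the average of `values` based on their standard deviation."""
    avg = mean(values)
    sd = stdev(values) if len(values) > 1 else 0.0
    # Treat a very small standard deviation as 0.001 of the average.
    sd = max(sd, min_rel_std * abs(avg))
    if sd == 0.0:
        return avg  # all values are zero; nothing to round
    # Decimal position of the leading significant digit of sd,
    # for example sd = 0.068 gives 2 decimals and sd = 30 gives -1.
    decimals = -int(math.floor(math.log10(sd)))
    return round(avg, decimals)
```

For three trials of 68.27, 68.28, and 68.29 ms, the standard deviation falls below 0.001 of the average, so the floor applies and the average is kept to two decimal places.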

For the hardware used in these measurements, refer to the On-Prem Hardware Specifications section.

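| # of streams | First audio avg (ms) | First audio p90 (ms) | First audio p95 (ms) | First audio p99 (ms) | Inter-chunk avg (ms) | Inter-chunk p90 (ms) | Inter-chunk p95 (ms) | Inter-chunk p99 (ms) | Throughput (RTFX) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 68.28 | 72.77 | 81.9 | 83.16 | 10.97 | 11.48 | 11.65 | 19.12 | 10.24 |
| 2 | 77.13 | 82.17 | 82.44 | 82.44 | 12.03 | 12.51 | 12.66 | 23.58 | 19.2 |
| 4 | 80.79 | 91.59 | 92.74 | 93.39 | 13.54 | 14.21 | 14.41 | 26.68 | 32.38 |
| 8 | 92.46 | 102.23 | 104.14 | 104.92 | 16.95 | 18.17 | 18.54 | 32.87 | 52.49 |
| 16 | 111.29 | 125.2 | 133.54 | 137.36 | 23.0 | 25.01 | 25.6 | 45.47 | 82.33 |
| 32 | 162.66 | 182.99 | 198.37 | 210.97 | 41.47 | 46.0 | 47.27 | 82.23 | 113.22 |
| 64 | 302.82 | 347.56 | 350.72 | 359.11 | 85.4 | 94.12 | 104.36 | 172.19 | 140.66 |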

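| # of streams | First audio avg (ms) | First audio p90 (ms) | First audio p95 (ms) | First audio p99 (ms) | Inter-chunk avg (ms) | Inter-chunk p90 (ms) | Inter-chunk p95 (ms) | Inter-chunk p99 (ms) | Throughput (RTFX) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 373 | 371 | 406 | 440 | 88 | 105 | 108 | 114 | 1.036 |
| 2 | 4080 | 7515 | 8111 | 8680 | 110 | 160 | 200 | 218 | 1.08 |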

| # of streams | Throughput (RTFX) |
|---|---|
| 1 | 3.94927 |
| 2 | 4.22968 |
| 4 | 4.32184 |
| 6 | 4.33257 |

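| # of streams | First audio avg (ms) | First audio p90 (ms) | First audio p95 (ms) | First audio p99 (ms) | Inter-chunk avg (ms) | Inter-chunk p90 (ms) | Inter-chunk p95 (ms) | Inter-chunk p99 (ms) | Throughput (RTFX) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 70.0 | 77.91 | 87.88 | 88.94 | 13.22 | 14.18 | 14.36 | 16.89 | 8.78 |
| 2 | 68.15 | 79.56 | 80.8 | 80.8 | 10.1 | 10.47 | 10.74 | 20.53 | 23.14 |
| 4 | 69.96 | 77.65 | 78.79 | 79.43 | 11.06 | 11.63 | 12.12 | 22.21 | 42.53 |
| 8 | 76.04 | 83.34 | 89.56 | 92.24 | 13.31 | 14.15 | 14.54 | 26.5 | 72.05 |
| 16 | 92.31 | 104.34 | 108.72 | 111.86 | 18.61 | 20.38 | 20.87 | 36.97 | 108.38 |
| 32 | 128.6 | 142.67 | 149.09 | 156.31 | 28.39 | 32.87 | 34.37 | 55.44 | 150.4 |
| 64 | 231.26 | 264.24 | 269.21 | 284.38 | 66.08 | 75.56 | 82.17 | 136.94 | 192.18 |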

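| # of streams | First audio avg (ms) | First audio p90 (ms) | First audio p95 (ms) | First audio p99 (ms) | Inter-chunk avg (ms) | Inter-chunk p90 (ms) | Inter-chunk p95 (ms) | Inter-chunk p99 (ms) | Throughput (RTFX) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 265 | 265 | 295 | 309 | 59 | 65 | 67 | 71 | 1.553 |
| 2 | 2740 | 5525 | 5672 | 6118 | 80 | 124 | 135 | 149 | 1.529 |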

| # of streams | Throughput (RTFX) |
|---|---|
| 1 | 6.03871 |
| 2 | 6.60842 |
| 4 | 6.72419 |
| 6 | 6.74131 |

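| # of streams | First audio avg (ms) | First audio p90 (ms) | First audio p95 (ms) | First audio p99 (ms) | Inter-chunk avg (ms) | Inter-chunk p90 (ms) | Inter-chunk p95 (ms) | Inter-chunk p99 (ms) | Throughput (RTFX) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 62.81 | 71.92 | 81.7 | 82.85 | 10.51 | 11.15 | 11.25 | 17.7 | 13.0 |
| 2 | 64.47 | 74.58 | 78.15 | 78.15 | 9.9 | 10.42 | 10.65 | 19.56 | 25.46 |
| 4 | 67.78 | 76.04 | 76.3 | 76.45 | 11.0 | 11.43 | 11.68 | 22.22 | 44.77 |
| 8 | 74.48 | 81.22 | 81.88 | 90.68 | 13.43 | 14.41 | 14.69 | 26.75 | 73.82 |
| 16 | 94.72 | 109.98 | 112.41 | 115.26 | 19.2 | 20.88 | 21.32 | 38.35 | 111.73 |
| 32 | 137.4 | 159.73 | 163.7 | 171.62 | 32.59 | 36.84 | 37.59 | 65.49 | 136.01 |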

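| # of streams | First audio avg (ms) | First audio p90 (ms) | First audio p95 (ms) | First audio p99 (ms) | Inter-chunk avg (ms) | Inter-chunk p90 (ms) | Inter-chunk p95 (ms) | Inter-chunk p99 (ms) | Throughput (RTFX) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 327 | 329 | 344 | 361 | 77 | 93 | 96 | 102 | 1.182 |
| 2 | 3967 | 6917 | 7213 | 8187 | 94 | 139 | 161 | 190 | 1.2 |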

| # of streams | Throughput (RTFX) |
|---|---|
| 1 | 4.8313 |
| 2 | 5.09014 |
| 4 | 5.22121 |
| 6 | 5.28647 |

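| # of streams | First audio avg (ms) | First audio p90 (ms) | First audio p95 (ms) | First audio p99 (ms) | Inter-chunk avg (ms) | Inter-chunk p90 (ms) | Inter-chunk p95 (ms) | Inter-chunk p99 (ms) | Throughput (RTFX) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 55.1 | 59.7 | 60.77 | 62.7 | 3.55 | 3.74 | 3.81 | 13.12 | 11.78 |
| 2 | 54.35 | 67.42 | 67.69 | 67.69 | 4.37 | 5.23 | 5.51 | 14.35 | 18.53 |
| 4 | 58.45 | 67.52 | 68.78 | 69.75 | 6.17 | 7.24 | 7.79 | 12.38 | 38.73 |
| 8 | 63.77 | 71.7 | 74.76 | 76.55 | 9.08 | 11.34 | 11.78 | 16.96 | 76.09 |
| 16 | 85.03 | 94.03 | 97.84 | 104.93 | 16.96 | 20.39 | 21.23 | 36.6 | 111.48 |
| 32 | 126.24 | 145.96 | 149.21 | 156.29 | 31.62 | 39.62 | 41.08 | 72.83 | 172.14 |
| 64 | 184.15 | 233.75 | 246.01 | 272.86 | 48.66 | 74.03 | 82.55 | 142.27 | 180.81 |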

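| # of streams | First audio avg (ms) | First audio p90 (ms) | First audio p95 (ms) | First audio p99 (ms) | Inter-chunk avg (ms) | Inter-chunk p90 (ms) | Inter-chunk p95 (ms) | Inter-chunk p99 (ms) | Throughput (RTFX) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 61.31 | 73.38 | 75.05 | 78.39 | 6.35 | 7.85 | 8.36 | 15.36 | 7.67 |
| 2 | 64.2 | 71.11 | 73.22 | 73.22 | 7.25 | 9.11 | 10.19 | 13.19 | 12.7 |
| 8 | 107.19 | 134.55 | 140.96 | 144.79 | 19.11 | 25.91 | 27.76 | 35.27 | 31.29 |
| 16 | 191.47 | 228.95 | 237.6 | 260.4 | 40.77 | 54.3 | 58.62 | 82.0 | 39.51 |
| 32 | 351.08 | 429.41 | 453.12 | 477.54 | 95.67 | 122.53 | 132.82 | 199.9 | 48.52 |
| 64 | 704.95 | 838.12 | 866.62 | 925.68 | 210.05 | 256.88 | 274.39 | 436.56 | 51.74 |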

On-Prem Hardware Specifications

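| Component | Property | Value |
|---|---|---|
| GPU | Model | NVIDIA DGX A100 40GB |
| CPU | Model | AMD EPYC 7742 64-Core Processor |
| CPU | Thread(s) per core | 2 |
| CPU | Socket(s) | 2 |
| CPU | Core(s) per socket | 64 |
| CPU | NUMA node(s) | 8 |
| CPU | Frequency boost | enabled |
| CPU | CPU max MHz | 2250 |
| CPU | CPU min MHz | 1500 |
| RAM | Model | Micron DDR4 36ASF8G72PZ-3G2B2 3200MHz |
| RAM | Configured Memory Speed | 2933 MT/s |
| RAM | RAM Size | 32x64GB (2048GB Total) |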

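| Component | Property | Value |
|---|---|---|
| GPU | Model | NVIDIA H100 80GB HBM3 |
| CPU | Model | Intel(R) Xeon(R) Platinum 8480CL |
| CPU | Thread(s) per core | 2 |
| CPU | Socket(s) | 2 |
| CPU | Core(s) per socket | 56 |
| CPU | NUMA node(s) | 2 |
| CPU | CPU max MHz | 3800 |
| CPU | CPU min MHz | 800 |
| RAM | Model | Micron DDR5 MTC40F2046S1RC48BA1 4800MHz |
| RAM | Configured Memory Speed | 4400 MT/s |
| RAM | RAM Size | 32x64GB (2048GB Total) |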

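| Component | Property | Value |
|---|---|---|
| GPU | Model | NVIDIA L40 |
| CPU | Model | AMD EPYC 7763 64-Core Processor |
| CPU | Thread(s) per core | 1 |
| CPU | Socket(s) | 2 |
| CPU | Core(s) per socket | 64 |
| CPU | NUMA node(s) | 8 |
| CPU | Frequency boost | enabled |
| CPU | CPU max MHz | 3529 |
| CPU | CPU min MHz | 1500 |
| RAM | Model | Samsung DDR4 M393A4K40DB3-CWE 3200MHz |
| RAM | Configured Memory Speed | 3200 MT/s |
| RAM | RAM Size | 16x32GB (512GB Total) |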

Performance Considerations

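Under high load, requests can time out. The server finishes the request it is currently processing, which frees the inference slot, before it starts a new one, so queued requests may wait longer than the client timeout. This behavior maximizes throughput and supports real-time interaction for the requests that are being served.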
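The effect can be illustrated with a toy queueing model. This is a simplified sketch under stated assumptions, not a description of the server's scheduler: all requests arrive at the same moment, every request takes the same time to serve, requests are served first-in first-out, and the function name and parameters are hypothetical.

```python
from typing import List


def timed_out_requests(num_requests: int,
                       service_time_s: float,
                       timeout_s: float,
                       slots: int = 1) -> List[int]:
    """Return the indices of requests that wait longer than the timeout.

    Toy model: a request starts only after the requests ahead of it in
    its slot have completed, so request i waits (i // slots) service times.
    """
    timed_out = []
    for i in range(num_requests):
        wait_s = (i // slots) * service_time_s
        if wait_s > timeout_s:
            timed_out.append(i)
    return timed_out


if __name__ == "__main__":
    # Ten simultaneous requests, 0.5 s each, 2 s client timeout, one slot.
    print(timed_out_requests(10, 0.5, 2.0))
```

In this model, adding inference slots or raising the client timeout reduces the number of requests that time out, while the per-request service time stays unchanged.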

Model Accuracy

TTS model accuracy is evaluated using an ASR-based round-trip approach:

  1. The TTS model generates synthetic speech from input text.

  2. An ASR system transcribes the generated audio.

  3. The ASR transcription is compared with the original input text using Character Error Rate (CER).

CER measures the percentage of characters that differ between the original text and the ASR transcription. A lower CER indicates better synthesis quality, because the speech was clear enough for the ASR system to recover the original text accurately.
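As an illustration of the final comparison step, the sketch below computes CER using the common definition of character-level edit distance divided by the length of the reference text. The exact text normalization and CER implementation used for the reported metrics are not stated here, so treat this as an assumption. The function names are hypothetical.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of character insertions, deletions, and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate as a percentage of the reference length."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return 100.0 * levenshtein(reference, hypothesis) / len(reference)
```

Here `reference` is the original input text and `hypothesis` is the ASR transcription of the synthesized audio. One substituted character in a four-character reference gives a CER of 25.0.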

Note

Metrics are calculated on a subset of the LibriTTS dev-clean split for English and the CML dataset for French and Spanish. The subset includes only speakers with at least five utterances of at least five seconds each. Reported values are averages over multiple iterations.