Performance

Evaluation Process

This section shows the latency and throughput numbers for streaming and offline configurations of the Riva ASR service on different GPUs.

In streaming mode, the client and the server used audio chunks of the same duration. See the Results section for the chunk size value to use.

The Riva streaming client riva_streaming_asr_client, provided in the Riva image, was run with the --simulate_realtime flag to simulate transcription from a microphone. Each stream performed three iterations over a sample audio file (1272-135031-0000.wav) from the LibriSpeech dev-clean dataset.

You can get the source code for the riva_streaming_asr_client at Riva C++ Clients.

The following command was used to measure performance:

riva_streaming_asr_client \
   --chunk_duration_ms=<chunk_duration> \
   --simulate_realtime=true \
   --automatic_punctuation=true \
   --num_parallel_requests=<num_streams> \
   --word_time_offsets=false \
   --print_transcripts=false \
   --interim_results=false \
   --num_iterations=<3*num_streams> \
   --audio_file=1272-135031-0000.wav \
   --output_filename=/tmp/output.json
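The relationship between the placeholders in the command above can be sketched in Python. This is only an illustrative helper, not part of Riva: the function name build_streaming_command and its parameters are assumptions, while the flags are the ones shown in the command. It sets --num_iterations to three times the number of streams, so that each stream makes three passes over the audio file.

```python
import shlex


def build_streaming_command(num_streams: int, chunk_duration_ms: int,
                            audio_file: str = "1272-135031-0000.wav",
                            output_filename: str = "/tmp/output.json") -> list:
    """Return the benchmark command as an argument list (hypothetical helper)."""
    if num_streams < 1:
        raise ValueError("num_streams must be at least 1")
    return [
        "riva_streaming_asr_client",
        f"--chunk_duration_ms={chunk_duration_ms}",
        "--simulate_realtime=true",
        "--automatic_punctuation=true",
        f"--num_parallel_requests={num_streams}",
        "--word_time_offsets=false",
        "--print_transcripts=false",
        "--interim_results=false",
        # Three iterations per stream, as described in the text.
        f"--num_iterations={3 * num_streams}",
        f"--audio_file={audio_file}",
        f"--output_filename={output_filename}",
    ]


if __name__ == "__main__":
    # Print a copy-pasteable command line for 8 streams and 160 ms chunks.
    print(shlex.join(build_streaming_command(8, 160)))
```

The resulting list can be passed to subprocess.run inside an environment where the client binary is available.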

The riva_streaming_asr_client command returns the following latency measurements:

  • intermediate latency: latency of responses returned with is_final == false

  • final latency: latency of responses returned with is_final == true

  • latency: the overall latency across all returned responses, both intermediate and final. This is the value reported in the tables in the Results section.
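The tables in the Results section report the average and the p50, p90, p95, and p99 percentiles of this latency. As a rough sketch of how such a summary can be derived from a list of per-response latencies, the following Python uses the nearest-rank percentile definition. The percentile method used by the client itself is not stated here, so nearest-rank is an assumption, and the function names are hypothetical.

```python
def percentile(sorted_values, pct: int):
    """Nearest-rank percentile of an ascending list; pct is an integer in 1..100."""
    if not sorted_values:
        raise ValueError("no latency samples")
    n = len(sorted_values)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = max(1, (pct * n + 99) // 100)
    return sorted_values[rank - 1]


def summarize_latencies(latencies_ms):
    """Return avg/p50/p90/p95/p99 for a list of latencies in milliseconds."""
    ordered = sorted(latencies_ms)
    summary = {"avg": sum(ordered) / len(ordered)}
    for pct in (50, 90, 95, 99):
        summary[f"p{pct}"] = percentile(ordered, pct)
    return summary


if __name__ == "__main__":
    # Synthetic samples for illustration only; not measured data.
    print(summarize_latencies([12.0, 13.5, 14.1, 12.7, 40.2, 13.0]))
```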

The following diagrams are a schematic representation of the different latencies measured by the Riva streaming ASR client.

Figure: Schematic Diagram of Latencies Measured by Riva Streaming ASR Client

The following command was used to measure maximum throughput in offline mode:

riva_asr_client \
   --automatic_punctuation=true \
   --num_parallel_requests=32 \
   --word_time_offsets=false \
   --print_transcripts=false \
   --num_iterations=96 \
   --audio_file=1272-135031-0000x5.wav \
   --output_filename=/tmp/output.json

where 1272-135031-0000x5.wav is the 1272-135031-0000.wav audio file concatenated five times. You can get the source code for the riva_asr_client at Riva C++ Clients.
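A file such as 1272-135031-0000x5.wav can be produced in several ways. The tool used for these measurements is not stated, so the following is only one possible approach, using Python's standard wave module to write the frames of the source file five times. The function name concatenate_wav is hypothetical.

```python
import wave


def concatenate_wav(src_path: str, dst_path: str, repeats: int = 5) -> None:
    """Write dst_path containing the audio of src_path repeated `repeats` times."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        frames = src.readframes(src.getnframes())
    with wave.open(dst_path, "wb") as dst:
        # Reuse channel count, sample width and sample rate from the source.
        dst.setparams(params)
        for _ in range(repeats):
            dst.writeframes(frames)
        # The frame count in the header is corrected when the file is closed.


if __name__ == "__main__":
    concatenate_wav("1272-135031-0000.wav", "1272-135031-0000x5.wav", repeats=5)
```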

Results

Latency and throughput measurements for streaming and offline configurations are reported in the following tables. Throughput is measured in RTFX, defined as the duration of audio transcribed divided by the computation time.
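As a small worked example of the RTFX definition, the sketch below divides the total audio duration by the time taken to transcribe it. The function name and the numbers are hypothetical and chosen only to show the arithmetic; they are not measurements.

```python
def rtfx(audio_seconds: float, compute_seconds: float) -> float:
    """Throughput as duration of audio transcribed / computation time."""
    if compute_seconds <= 0:
        raise ValueError("compute_seconds must be positive")
    return audio_seconds / compute_seconds


if __name__ == "__main__":
    # Hypothetical run: 96 requests of 50 s of audio each, finished in 2.5 s.
    total_audio = 96 * 50.0
    print(rtfx(total_audio, 2.5))
```

In the streaming tables, audio is fed at a simulated real-time rate, so the throughput appears to be bounded by the number of streams, which is consistent with values such as 0.999 for 1 stream and 63.7 for 64 streams.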

Note

The values in the tables are averages over three trials. Each value is rounded to its last significant digit according to the standard deviation calculated over the three trials. If a standard deviation is less than 0.001 of the average, the corresponding value is rounded as if the standard deviation were equal to 0.001 of the average.

For specifications of the hardware on which these measurements were collected, see the On-Prem Hardware Specifications section.
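The rounding rule described in the note can be read in more than one way. The following sketch shows one plausible interpretation, assuming that a value is rounded at the decimal position of the leading digit of its standard deviation, with the standard deviation floored at 0.001 of the average. The function name and the interpretation are assumptions, so the output may not reproduce every table entry exactly.

```python
import math


def round_to_std(mean: float, std: float, rel_floor: float = 0.001) -> float:
    """Round mean at the position of the leading digit of its standard deviation."""
    if mean == 0:
        return 0.0
    # Apply the floor from the note: std is treated as at least 0.001 of the mean.
    effective_std = max(std, rel_floor * abs(mean))
    # Decimal exponent of the leading digit of the standard deviation.
    exponent = math.floor(math.log10(effective_std))
    return round(mean, -exponent)


if __name__ == "__main__":
    print(round_to_std(63.72, 0.3))      # rounds to the tenths place
    print(round_to_std(2215.0, 130.0))   # rounds to the hundreds place
```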

Chunk size (ms): 160
Language model: n-gram
Maximum effective # of streams with n-gram language model: 270

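| # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
|---|---|---|---|---|---|---|
| 1 | 14 | 12.7 | 13.4 | 13.7 | 40 | 0.999 |
| 8 | 15.3 | 14.3 | 15.5 | 20 | 45.5 | 7.99 |
| 16 | 21 | 18 | 26 | 28 | 60 | 15.97 |
| 32 | 28 | 28 | 37 | 40 | 90 | 31.9 |
| 48 | 35 | 35 | 46 | 47.4 | 100 | 47.8 |
| 64 | 42 | 40 | 54 | 55 | 130 | 63.7 |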

Chunk size (ms): 960
Language model: n-gram
Maximum effective # of streams with n-gram language model: 1240

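| # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
|---|---|---|---|---|---|---|
| 1 | 20 | 13.7 | 20 | 30 | 100 | 1 |
| 64 | 53 | 50 | 64 | 120 | 150 | 63.7 |
| 128 | 77 | 65 | 96 | 200 | 245 | 127 |
| 256 | 120 | 107 | 156 | 300 | 440 | 252.5 |
| 384 | 162 | 145 | 220 | 440 | 640 | 376 |
| 512 | 200 | 180 | 276 | 530 | 800 | 499 |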

Language model: n-gram

| Speaker Diarization | # of streams | Throughput (RTFX) |
|---|---|---|
| False | 1 | 300 |
| False | 32 | 2200 |
| True | 32 | 170 |

Chunk size (ms): 160
Language model: n-gram
Maximum effective # of streams with n-gram language model: 160

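| # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
|---|---|---|---|---|---|---|
| 1 | 18 | 16.7 | 17.4 | 20 | 40 | 0.999 |
| 8 | 21 | 19.8 | 21 | 30 | 50.4 | 7.99 |
| 16 | 29 | 26 | 40 | 41 | 70 | 15.96 |
| 32 | 41 | 45 | 51 | 55 | 110 | 31.9 |
| 48 | 53 | 58 | 71 | 73 | 160 | 47.75 |
| 64 | 65.5 | 72 | 83 | 85 | 210 | 63.6 |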

Chunk size (ms): 960
Language model: n-gram
Maximum effective # of streams with n-gram language model: 770

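| # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
|---|---|---|---|---|---|---|
| 1 | 20 | 18.4 | 20 | 36 | 100 | 0.999 |
| 64 | 80 | 83 | 95 | 140 | 200 | 63.7 |
| 128 | 119 | 100 | 155 | 230 | 296 | 126.8 |
| 256 | 190 | 170 | 270 | 400 | 530 | 252 |
| 384 | 260 | 245 | 378 | 580 | 800 | 374.5 |
| 512 | 346 | 330 | 496 | 850 | 1200 | 494 |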

Language model: n-gram

| Speaker Diarization | # of streams | Throughput (RTFX) |
|---|---|---|
| False | 1 | 200 |
| False | 32 | 2000 |
| True | 32 | 120 |

Chunk size (ms): 160
Language model: n-gram
Maximum effective # of streams with n-gram language model: 355

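| # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
|---|---|---|---|---|---|---|
| 1 | 10 | 9.9 | 11.3 | 12 | 40 | 1 |
| 8 | 12.6 | 12 | 13.4 | 17 | 31 | 8 |
| 16 | 17 | 15 | 22 | 25 | 40 | 15.98 |
| 32 | 23 | 23 | 31 | 33 | 50 | 31.94 |
| 48 | 29 | 28 | 40 | 41 | 70 | 47.9 |
| 64 | 33.6 | 38 | 45 | 47 | 70 | 63.9 |
| 128 | 49 | 47 | 64 | 67 | 150 | 127.6 |
| 256 | 84 | 75 | 107 | 126 | 391 | 255 |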

Chunk size (ms): 800
Language model: n-gram
Maximum effective # of streams with n-gram language model: 1400

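| # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
|---|---|---|---|---|---|---|
| 1 | 14 | 11 | 20 | 40 | 80 | 1 |
| 64 | 39 | 40 | 55 | 80 | 110 | 63.9 |
| 128 | 58 | 50 | 75 | 150 | 202 | 127.6 |
| 256 | 90 | 80 | 115 | 240 | 380 | 255 |
| 384 | 120 | 107 | 155 | 316 | 530 | 381.4 |
| 512 | 149 | 130 | 196 | 400 | 700 | 508 |
| 768 | 258 | 200 | 630 | 680 | 1280 | 756 |
| 1024 | 420 | 263 | 1280 | 1350 | 1900 | 992 |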

Language model: n-gram

| # of streams | Throughput (RTFX) |
|---|---|
| 32 | 467 |

Language model: n-gram

| Speaker Diarization | # of streams | Throughput (RTFX) |
|---|---|---|
| False | 1 | 90 |
| False | 32 | 370 |

Language model: none

| Speaker Diarization | # of streams | Throughput (RTFX) |
|---|---|---|
| False | 1 | 11 |
| False | 32 | 77 |

Language model: none

| Speaker Diarization | # of streams | Throughput (RTFX) |
|---|---|---|
| False | 1 | 40 |
| False | 32 | 300 |

Chunk size (ms): 160
Language model: n-gram
Maximum effective # of streams with n-gram language model: 179

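| # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
|---|---|---|---|---|---|---|
| 1 | 16 | 15.3 | 16.2 | 16.3 | 40 | 0.999 |
| 8 | 21.6 | 20.4 | 22 | 23 | 59 | 7.99 |
| 16 | 28 | 26.4 | 30 | 39 | 80 | 15.96 |
| 32 | 41.4 | 40 | 53 | 54 | 130 | 31.85 |
| 48 | 49 | 54 | 64 | 66 | 160 | 47.7 |
| 64 | 59 | 67 | 75 | 76 | 216 | 63.6 |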

Chunk size (ms): 960
Language model: n-gram
Maximum effective # of streams with n-gram language model: 810

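| # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
|---|---|---|---|---|---|---|
| 1 | 20 | 19.6 | 30 | 40 | 100 | 0.999 |
| 64 | 90 | 93 | 110 | 200 | 240 | 63.5 |
| 128 | 115 | 100 | 140 | 260 | 350 | 126.6 |
| 256 | 185 | 163 | 248 | 451 | 630 | 251 |
| 384 | 254 | 230 | 350 | 630 | 930 | 373 |
| 512 | 362 | 300 | 730 | 940 | 1550 | 491 |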

Language model: n-gram

| Speaker Diarization | # of streams | Throughput (RTFX) |
|---|---|---|
| False | 1 | 300 |
| False | 32 | 2000 |
| True | 32 | 125 |

Chunk size (ms): 160
Language model: n-gram
Maximum effective # of streams with n-gram language model: 104

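| # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
|---|---|---|---|---|---|---|
| 1 | 24 | 22.7 | 23.7 | 25 | 50 | 0.999 |
| 8 | 32.7 | 31 | 33 | 51 | 72.7 | 7.98 |
| 16 | 44 | 40.8 | 50 | 63 | 110 | 15.94 |
| 32 | 59 | 60 | 73 | 75 | 180 | 31.8 |
| 48 | 79 | 90 | 93 | 100 | 240 | 47.6 |
| 64 | 100 | 109 | 114 | 160 | 310 | 63.4 |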

Chunk size (ms): 960
Language model: n-gram
Maximum effective # of streams with n-gram language model: 490

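| # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
|---|---|---|---|---|---|---|
| 1 | 30 | 29.1 | 40 | 50 | 100 | 0.999 |
| 64 | 123 | 130 | 160 | 240 | 260 | 63.5 |
| 128 | 185 | 165 | 240 | 360 | 430 | 126.4 |
| 256 | 300 | 266 | 430 | 630 | 830 | 249.4 |
| 384 | 460 | 445 | 770 | 1100 | 1560 | 368 |
| 512 | 720 | 650 | 1400 | 1550 | 2150 | 483 |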

Language model: n-gram

| Speaker Diarization | # of streams | Throughput (RTFX) |
|---|---|---|
| False | 1 | 180 |
| False | 32 | 1330 |
| True | 32 | 75 |

Chunk size (ms): 160
Language model: n-gram
Maximum effective # of streams with n-gram language model: 233

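| # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
|---|---|---|---|---|---|---|
| 1 | 13 | 11.8 | 12.8 | 14 | 40 | 1 |
| 8 | 17.6 | 16.8 | 18.5 | 22 | 39 | 8 |
| 16 | 22.5 | 21.3 | 25 | 31 | 60.3 | 15.98 |
| 32 | 32.4 | 35 | 42 | 46 | 70 | 31.93 |
| 48 | 41 | 40 | 58 | 59 | 100 | 47.9 |
| 64 | 46 | 50 | 64 | 66 | 100 | 63.8 |
| 128 | 73 | 66 | 94 | 97 | 220 | 127.5 |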

Chunk size (ms): 800
Language model: n-gram
Maximum effective # of streams with n-gram language model: 980

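| # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
|---|---|---|---|---|---|---|
| 1 | 16 | 13 | 20 | 40 | 80 | 1 |
| 64 | 60 | 60 | 80 | 110 | 180 | 63.8 |
| 128 | 90 | 80 | 110 | 230 | 300 | 127.5 |
| 256 | 133.3 | 120 | 174 | 340 | 530 | 254 |
| 384 | 183 | 166 | 245 | 430 | 800 | 380 |
| 512 | 260 | 223 | 510 | 600 | 1200 | 505 |
| 768 | 535 | 354 | 1500 | 1640 | 2150 | 739 |
| 1024 | 940 | 600 | 2300 | 2570 | 2930 | 960 |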

Language model: n-gram

| # of streams | Throughput (RTFX) |
|---|---|
| 32 | 460 |

Language model: n-gram

| Speaker Diarization | # of streams | Throughput (RTFX) |
|---|---|---|
| False | 1 | 60 |
| False | 32 | 234 |

Language model: none

| Speaker Diarization | # of streams | Throughput (RTFX) |
|---|---|---|
| False | 1 | 5.7 |
| False | 32 | 38.75 |

Language model: none

| Speaker Diarization | # of streams | Throughput (RTFX) |
|---|---|---|
| False | 1 | 24 |
| False | 32 | 168 |

Chunk size (ms): 160
Language model: n-gram
Maximum effective # of streams with n-gram language model: 190

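| # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
|---|---|---|---|---|---|---|
| 1 | 19 | 18.3 | 19.3 | 20 | 43.5 | 0.999 |
| 8 | 24 | 23 | 30 | 30 | 65 | 7.98 |
| 16 | 31.4 | 29 | 38.3 | 42 | 80 | 15.96 |
| 32 | 42 | 42 | 57 | 60 | 100 | 31.9 |
| 48 | 52 | 53 | 69.6 | 75 | 130 | 47.8 |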

Chunk size (ms): 960
Language model: n-gram
Maximum effective # of streams with n-gram language model: 900

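| # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
|---|---|---|---|---|---|---|
| 1 | 25 | 22 | 30 | 50 | 90 | 0.999 |
| 64 | 90 | 90 | 110 | 160 | 200 | 63.6 |
| 128 | 120 | 100 | 150 | 240 | 330 | 126.8 |
| 256 | 180 | 160 | 240 | 400 | 560 | 251.5 |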

Language model: n-gram

| Speaker Diarization | # of streams | Throughput (RTFX) |
|---|---|---|
| False | 1 | 240 |
| False | 32 | 2030 |
| True | 32 | 101.5 |

Chunk size (ms): 160
Language model: n-gram
Maximum effective # of streams with n-gram language model: 110

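| # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
|---|---|---|---|---|---|---|
| 1 | 25 | 23 | 29 | 30 | 50 | 0.999 |
| 8 | 31 | 29 | 35.5 | 46 | 70 | 7.98 |
| 16 | 44 | 40 | 56 | 60 | 100 | 15.95 |
| 32 | 60 | 62 | 76 | 80 | 150 | 31.84 |
| 48 | 80 | 86 | 100 | 112 | 227 | 47.7 |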

Chunk size (ms): 960
Language model: n-gram
Maximum effective # of streams with n-gram language model: 578

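| # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
|---|---|---|---|---|---|---|
| 1 | 30 | 27 | 40 | 50 | 100 | 0.999 |
| 64 | 120 | 130 | 150 | 200 | 240 | 63.5 |
| 128 | 170 | 150 | 220 | 310 | 380 | 126.5 |
| 256 | 270 | 250 | 390 | 540 | 700 | 250.5 |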

Language model: n-gram

| Speaker Diarization | # of streams | Throughput (RTFX) |
|---|---|---|
| False | 1 | 180 |
| False | 32 | 1440 |
| True | 32 | 94 |

Chunk size (ms): 160
Language model: n-gram
Maximum effective # of streams with n-gram language model: 280

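| # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
|---|---|---|---|---|---|---|
| 1 | 11 | 10.3 | 11.2 | 12.4 | 30 | 1 |
| 8 | 20 | 19 | 26 | 30 | 42 | 7.99 |
| 16 | 28 | 26 | 35 | 40 | 56 | 15.97 |
| 32 | 35 | 35 | 48 | 52 | 73 | 31.9 |
| 64 | 50 | 55 | 66 | 70 | 100 | 63.8 |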

Chunk size (ms): 800
Language model: n-gram
Maximum effective # of streams with n-gram language model: 1180

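| # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
|---|---|---|---|---|---|---|
| 1 | 14 | 11.5 | 20 | 30 | 60 | 1 |
| 64 | 70 | 70 | 90 | 100 | 170 | 63.8 |
| 128 | 88 | 84 | 110 | 190 | 250 | 127.4 |
| 256 | 128 | 117 | 164 | 300 | 460 | 254.4 |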

Language model: n-gram

| # of streams | Throughput (RTFX) |
|---|---|
| 32 | 440 |

Language model: n-gram

| Speaker Diarization | # of streams | Throughput (RTFX) |
|---|---|---|
| False | 1 | 70 |
| False | 32 | 193.5 |

Language model: none

| Speaker Diarization | # of streams | Throughput (RTFX) |
|---|---|---|
| False | 1 | 6.2 |
| False | 32 | 43.3 |

Language model: none

| Speaker Diarization | # of streams | Throughput (RTFX) |
|---|---|---|
| False | 1 | 28 |
| False | 32 | 192 |

On-Prem Hardware Specifications

GPU: NVIDIA DGX A100 40GB

CPU

  • Model: AMD EPYC 7742 64-Core Processor

  • Thread(s) per core: 2

  • Socket(s): 2

  • Core(s) per socket: 64

  • NUMA node(s): 8

  • Frequency boost: enabled

  • CPU max MHz: 2250

  • CPU min MHz: 1500

RAM

  • Model: Micron DDR4 36ASF8G72PZ-3G2B2 3200MHz

  • Configured Memory Speed: 2933 MT/s

  • RAM Size: 32x64GB (2048GB Total)

GPU: NVIDIA H100 80GB HBM3

CPU

  • Model: Intel(R) Xeon(R) Platinum 8480CL

  • Thread(s) per core: 2

  • Socket(s): 2

  • Core(s) per socket: 56

  • NUMA node(s): 2

  • CPU max MHz: 3800

  • CPU min MHz: 800

RAM

  • Model: Micron DDR5 MTC40F2046S1RC48BA1 4800MHz

  • Configured Memory Speed: 4400 MT/s

  • RAM Size: 32x64GB (2048GB Total)

GPU: NVIDIA L40

CPU

  • Model: AMD EPYC 7763 64-Core Processor

  • Thread(s) per core: 1

  • Socket(s): 2

  • Core(s) per socket: 64

  • NUMA node(s): 8

  • Frequency boost: enabled

  • CPU max MHz: 3529

  • CPU min MHz: 1500

RAM

  • Model: Samsung DDR4 M393A4K40DB3-CWE 3200MHz

  • Configured Memory Speed: 3200 MT/s

  • RAM Size: 16x32GB (512GB Total)