Performance

Evaluation Process

This section shows the latency and throughput numbers for streaming and offline configurations of the Riva ASR service on different GPUs.

In streaming mode, the client and the server used audio chunks of the same duration. See the Results section for the chunk size value to use.

The Riva streaming client riva_streaming_asr_client, provided in the Riva image, was run with the --simulate_realtime flag to simulate transcription from a microphone. Each stream performed three iterations over a sample audio file (1272-135031-0000.wav) from the LibriSpeech dev-clean dataset.

You can get the source code for the riva_streaming_asr_client at Riva C++ Clients.

The following command was used to measure performance:

riva_streaming_asr_client \
   --chunk_duration_ms=<chunk_duration> \
   --simulate_realtime=true \
   --automatic_punctuation=true \
   --num_parallel_requests=<num_streams> \
   --word_time_offsets=false \
   --print_transcripts=false \
   --interim_results=false \
   --num_iterations=<3*num_streams> \
   --audio_file=1272-135031-0000.wav \
   --output_filename=/tmp/output.json
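
When sweeping over several stream counts, the command above can be assembled programmatically. The following is a minimal sketch, not part of the Riva tooling: the helper name `build_streaming_command` is hypothetical, and it uses only the flags shown above, with `--num_iterations` set to three times the number of streams.

```python
import shlex


def build_streaming_command(num_streams, chunk_duration_ms,
                            audio_file="1272-135031-0000.wav",
                            output_file="/tmp/output.json"):
    """Return the benchmark command as an argument list (hypothetical helper)."""
    return [
        "riva_streaming_asr_client",
        f"--chunk_duration_ms={chunk_duration_ms}",
        "--simulate_realtime=true",
        "--automatic_punctuation=true",
        f"--num_parallel_requests={num_streams}",
        "--word_time_offsets=false",
        "--print_transcripts=false",
        "--interim_results=false",
        # Each stream iterates three times over the sample file.
        f"--num_iterations={3 * num_streams}",
        f"--audio_file={audio_file}",
        f"--output_filename={output_file}",
    ]


if __name__ == "__main__":
    # Print the command line for 64 streams with 160 ms chunks.
    print(shlex.join(build_streaming_command(64, 160)))
```

The list form can be passed to `subprocess.run` directly, which avoids shell quoting issues.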

The riva_streaming_asr_client command returns the following latency measurements:

  • intermediate latency: latency of responses returned with is_final == false

  • final latency: latency of responses returned with is_final == true

  • latency: the overall latency of all returned responses. This is what is tabulated in the following tables.
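
To make the tabulated statistics concrete, the sketch below computes avg, p50, p90, p95, and p99 from a list of per-response latencies. It is illustrative only: the nearest-rank percentile method and the function names are assumptions, and the Riva client may compute its percentiles differently.

```python
def percentile(sorted_values, q):
    """Nearest-rank percentile of an ascending list; q is an integer in 1..100."""
    # Integer form of ceil(q * n / 100), which avoids floating-point rounding.
    rank = (q * len(sorted_values) + 99) // 100
    return sorted_values[max(rank, 1) - 1]


def summarize_latencies(latencies_ms):
    """Return the statistics reported in the latency tables (assumed method)."""
    values = sorted(latencies_ms)
    return {
        "avg": sum(values) / len(values),
        "p50": percentile(values, 50),
        "p90": percentile(values, 90),
        "p95": percentile(values, 95),
        "p99": percentile(values, 99),
    }


if __name__ == "__main__":
    # Hypothetical latencies in milliseconds, not measured data.
    print(summarize_latencies([10, 20, 30, 40, 50, 60, 70, 80, 90, 100]))
```

With few samples, the high percentiles collapse onto the largest value, which is one reason p99 can sit far above p95 in the tables.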

The following diagrams are a schematic representation of the different latencies measured by the Riva streaming ASR client.

Figure: Schematic Diagram of Latencies Measured by Riva Streaming ASR Client

The following command was used to measure maximum throughput in offline mode:

riva_asr_client \
   --automatic_punctuation=true \
   --num_parallel_requests=32 \
   --word_time_offsets=false \
   --print_transcripts=false \
   --num_iterations=96 \
   --audio_file=1272-135031-0000x5.wav \
   --output_filename=/tmp/output.json

where 1272-135031-0000x5.wav is the 1272-135031-0000.wav audio file concatenated five times. You can get the source code for the riva_asr_client at Riva C++ Clients.
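
One way to produce a concatenated file like this is with Python's standard `wave` module. This is a sketch under the assumption that the input is an uncompressed PCM WAV file; the function name `repeat_wav` is hypothetical, and the original benchmark file may have been created with a different tool.

```python
import wave


def repeat_wav(src_path, dst_path, times):
    """Write dst_path containing the audio of src_path repeated `times` times."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        frames = src.readframes(src.getnframes())
    with wave.open(dst_path, "wb") as dst:
        # Reuse channel count, sample width, and sample rate from the source.
        dst.setparams(params)
        # The frame count in the header is corrected when the file is closed.
        dst.writeframes(frames * times)
```

For example, `repeat_wav("1272-135031-0000.wav", "1272-135031-0000x5.wav", 5)` would write a file five times as long as the input.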

Results

Latencies and throughput measurements for streaming and offline configurations are reported in the following tables. Throughput (duration of audio transcribed / computation time) is measured in RTFX.
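
The RTFX definition above amounts to a single division. The sketch below uses made-up durations, not measured data, and the function name `rtfx` is only illustrative.

```python
def rtfx(audio_seconds, compute_seconds):
    """Throughput as duration of audio transcribed divided by computation time."""
    if compute_seconds <= 0:
        raise ValueError("compute_seconds must be positive")
    return audio_seconds / compute_seconds


if __name__ == "__main__":
    # Hypothetical offline run: one hour of audio transcribed in 10 seconds.
    print(rtfx(3600.0, 10.0))
    # Hypothetical real-time-paced streaming run: 64 streams of 30 s audio
    # finishing in 30.13 s of wall-clock time.
    print(round(rtfx(64 * 30.0, 30.13), 2))
```

In streaming runs paced with --simulate_realtime, audio is fed no faster than real time, so throughput is bounded by the number of streams. This matches the streaming tables, where RTFX stays at or just below the stream count until the server saturates.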

Note

The values in the tables are averages over three trials. Each value is rounded to the last significant digit determined by the standard deviation across those three trials. If a standard deviation is less than 0.001 of the average, the value is rounded as if the standard deviation were 0.001 of the average.
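
The rounding rule in this note can be read in more than one way. The sketch below shows one possible interpretation and is not the documented procedure: the function name `round_by_std` is hypothetical, and keeping two significant digits of the standard deviation (`std_sig_digits=2`) is an assumption.

```python
import math


def round_by_std(mean, std, std_sig_digits=2):
    """Round `mean` to the decimal place implied by its standard deviation."""
    # Floor the standard deviation at 0.001 of the average, as the note describes.
    effective_std = max(std, 0.001 * abs(mean))
    if effective_std == 0:
        return mean
    # Decimal exponent of the leading digit of the standard deviation.
    exponent = math.floor(math.log10(effective_std))
    # Assumption: keep `std_sig_digits` significant digits of the deviation.
    decimals = (std_sig_digits - 1) - exponent
    return round(mean, decimals)


if __name__ == "__main__":
    # Hypothetical averages and deviations, not measured data.
    print(round_by_std(12.4394, 0.0))
    print(round_by_std(12.5432, 0.15))
```

Under this reading, a value with a negligible standard deviation keeps about five significant digits, while noisier values keep fewer.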

For specifications of the hardware on which these measurements were collected, see the Hardware Specifications section.

Chunk size (ms): 160
Language model: n-gram
Maximum effective # of streams with n-gram language model: 270

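| # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 12.439 | 10.388 | 11.524 | 12.54 | 30.242 | 0.99949 |
| 8 | 13.006 | 12.508 | 14.673 | 17.203 | 29.539 | 7.9929 |
| 16 | 18.138 | 17.06 | 24.885 | 27.922 | 49.825 | 15.975 |
| 32 | 23.093 | 20.141 | 29.905 | 30.991 | 76.264 | 31.915 |
| 48 | 28.666 | 29.63 | 33.027 | 34.111 | 101.76 | 47.834 |
| 64 | 32.012 | 32.449 | 35.855 | 37.779 | 136.06 | 63.719 |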

Chunk size (ms): 960
Language model: n-gram
Maximum effective # of streams with n-gram language model: 1240

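| # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 14.088 | 12.616 | 17.525 | 20.743 | 51.894 | 0.9994 |
| 64 | 42.41 | 37.123 | 43.153 | 157.71 | 163.47 | 63.68 |
| 128 | 61.41 | 49.455 | 61.86 | 197.73 | 307.18 | 126.82 |
| 256 | 93.439 | 67.938 | 98.617 | 315.2 | 558.71 | 251.39 |
| 384 | 123.79 | 93.576 | 124.35 | 472.36 | 848.59 | 373.93 |
| 512 | 166.85 | 117.95 | 318.19 | 615.9 | 1141.7 | 494.12 |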

Language model: n-gram

| Speaker Diarization | # of streams | Throughput (RTFX) |
| --- | --- | --- |
| False | 1 | 353.62 |
| False | 32 | 3707.4 |
| True | 32 | 170 |

Chunk size (ms): 320
Language model: n-gram
Maximum effective # of streams with n-gram language model: 270

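| # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 16.946 | 14.233 | 17.459 | 18.954 | 21.387 | 0.99936 |
| 8 | 19.451 | 18.632 | 22.736 | 26.336 | 32.004 | 7.9924 |
| 16 | 24.811 | 23.88 | 28.443 | 32.162 | 42.753 | 15.978 |
| 32 | 33.166 | 30.537 | 43.692 | 47.19 | 68.485 | 31.929 |
| 48 | 44.522 | 49.317 | 57.667 | 60.352 | 93.513 | 47.855 |
| 64 | 55.794 | 60.906 | 71.632 | 74.103 | 117.78 | 63.755 |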

Chunk size (ms): 960
Language model: n-gram
Maximum effective # of streams with n-gram language model: 1240

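| # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 19.719 | 19.153 | 22.404 | 25.33 | 53.192 | 0.99929 |
| 64 | 67.47 | 73.823 | 84.609 | 89.695 | 91.372 | 63.803 |
| 128 | 116.36 | 122.78 | 146.72 | 152.17 | 173.31 | 127.31 |
| 256 | 174.21 | 179.51 | 223.23 | 242.4 | 270.84 | 253.54 |
| 384 | 225.73 | 208.54 | 317.68 | 323.05 | 345.72 | 379.42 |
| 512 | 281.13 | 299.92 | 406.08 | 416.05 | 517.49 | 503.1 |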

Language model: n-gram

| Speaker Diarization | # of streams | Throughput (RTFX) |
| --- | --- | --- |
| False | 1 | 335.83 |
| False | 32 | 3890.1 |

Chunk size (ms): 160
Language model: n-gram

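| # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 27.7 | 26.0 | 27.7 | 28.5 | 42.8 | 1.0 |
| 8 | 33.1 | 32.9 | 35.1 | 35.9 | 54.1 | 8.0 |
| 16 | 43.4 | 42.7 | 45.7 | 57.2 | 73.9 | 16.0 |
| 32 | 59.7 | 48.4 | 78.1 | 80.1 | 105.4 | 31.9 |
| 48 | 99.1 | 106.5 | 110.5 | 112.2 | 187.5 | 47.7 |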

Chunk size (ms): 960
Language model: n-gram

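| # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 23.7 | 22.9 | 26.3 | 27.6 | 50.4 | 1.0 |
| 64 | 135.7 | 160.3 | 167.9 | 171.1 | 174.8 | 63.7 |
| 128 | 273.6 | 300.2 | 314.6 | 319.5 | 344.2 | 126.8 |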

Language model: n-gram

| Speaker Diarization | # of streams | Throughput (RTFX) |
| --- | --- | --- |
| False | 1 | 224.8 |
| False | 32 | 1223.7 |

Chunk size (ms): 160
Language model: n-gram

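| Speaker Diarization | # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| False | 1 | 34.3 | 33.8 | 35.5 | 36.1 | 61.2 | 1.0 |
| True | 1 | 38.1 | 34.8 | 48.5 | 50.0 | 93.0 | 1.0 |
| False | 8 | 41.2 | 40.7 | 43.1 | 43.9 | 76.4 | 8.0 |
| True | 8 | 53.6 | 41.4 | 89.8 | 96.9 | 165.4 | 8.0 |
| False | 16 | 52.4 | 51.1 | 53.9 | 70.7 | 104.1 | 15.9 |
| True | 16 | 70.9 | 51.6 | 114.8 | 129.8 | 257.8 | 15.9 |
| False | 32 | 78.4 | 64.4 | 102.5 | 105.8 | 145.8 | 31.8 |
| True | 32 | 115.5 | 102.1 | 201.6 | 217.2 | 394.5 | 31.6 |
| False | 48 | 105.5 | 124.7 | 132.7 | 136.8 | 174.0 | 47.6 |
| True | 48 | 169.7 | 141.4 | 258.0 | 287.1 | 518.0 | 47.3 |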

Chunk size (ms): 960
Language model: n-gram

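| Speaker Diarization | # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| False | 1 | 62.0 | 61.3 | 64.7 | 65.6 | 92.2 | 1.0 |
| True | 1 | 93.9 | 76.6 | 100.4 | 101.0 | 538.6 | 1.0 |
| False | 64 | 230.0 | 269.0 | 275.3 | 278.4 | 280.1 | 63.4 |
| True | 64 | 388.3 | 425.9 | 495.3 | 510.8 | 525.9 | 63.2 |
| False | 128 | 366.7 | 398.8 | 416.5 | 429.6 | 446.6 | 126.2 |
| True | 128 | 600.0 | 621.0 | 644.7 | 900.0 | 957.3 | 124.4 |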

Language model: n-gram

| Speaker Diarization | # of streams | Throughput (RTFX) |
| --- | --- | --- |
| False | 1 | 21.0 |
| False | 32 | 357.2 |
| True | 32 | 270.1 |

Chunk size (ms): 160
Language model: n-gram

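| # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 23.02 | 23.176 | 28.694 | 29.829 | 44.012 | 0.99863 |
| 8 | 34.593 | 33.966 | 40.212 | 45.607 | 94.882 | 7.9646 |
| 16 | 42.333 | 41.683 | 50.64 | 58.927 | 93.028 | 15.953 |
| 32 | 55.452 | 51.111 | 76.248 | 82.525 | 129.15 | 31.828 |
| 48 | 72.80 | 175.236 | 94.222 | 106.1 | 223.45 | 47.592 |
| 64 | 97.943 | 100.13 | 116.06 | 126.09 | 240.68 | 63.512 |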

Chunk size (ms): 960
Language model: n-gram

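| # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 29.309 | 29.149 | 36.643 | 41.075 | 47.067 | 0.99879 |
| 64 | 114.55 | 116.94 | 159.24 | 177.71 | 189.42 | 63.655 |
| 128 | 170.28 | 173.34 | 217 | 220.87 | 306.83 | 126.76 |
| 256 | 265.46 | 262.12 | 374.34 | 445.03 | 610.07 | 251.12 |
| 384 | 322.1 | 300.5 | 478.52 | 627.84 | 962.12 | 374.99 |
| 512 | 437.49 | 385.16 | 733.84 | 1084.8 | 1529.9 | 493.42 |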

Language model: n-gram

| Speaker Diarization | # of streams | Throughput (RTFX) |
| --- | --- | --- |
| False | 1 | 166.71 |
| False | 32 | 505.29 |

Chunk size (ms): 160
Language model: n-gram
Maximum effective # of streams with n-gram language model: 160

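| # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 15.8 | 13.4 | 14.7 | 16.6 | 30.2 | 1.0 |
| 8 | 16.7 | 15.8 | 17.9 | 25.9 | 35.6 | 8.0 |
| 16 | 21.5 | 18.8 | 26.3 | 43.8 | 51.6 | 16.0 |
| 32 | 32.7 | 27.4 | 43.8 | 46.7 | 92.5 | 31.9 |
| 48 | 41.1 | 42.5 | 46.4 | 51.3 | 126.7 | 47.8 |
| 64 | 44.9 | 45.6 | 50.0 | 57.0 | 158.4 | 63.7 |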

Chunk size (ms): 960
Language model: n-gram
Maximum effective # of streams with n-gram language model: 770

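| # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 17.8 | 16.0 | 22.0 | 24.0 | 74.1 | 1.0 |
| 64 | 57.7 | 54.7 | 68.1 | 168.0 | 170.5 | 63.7 |
| 128 | 83.3 | 74.1 | 86.5 | 222.3 | 308.5 | 126.8 |
| 256 | 130.7 | 113.9 | 137.1 | 380.9 | 582.8 | 251.4 |
| 384 | 174.0 | 131.5 | 196.4 | 554.3 | 881.9 | 373.1 |
| 512 | 229.5 | 179.3 | 434.6 | 609.5 | 1222.9 | 494.1 |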

Language model: n-gram

| # of streams | Throughput (RTFX) |
| --- | --- |
| 1 | 309.8 |
| 32 | 3008.8 |

Chunk size (ms): 160
Language model: n-gram
Maximum effective # of streams with n-gram language model: 160

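| Speaker Diarization | # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| False | 1 | 25.1 | 24.2 | 25.2 | 25.5 | 46.1 | 1.0 |
| True | 1 | 28.5 | 25.0 | 39.0 | 39.6 | 80.2 | 1.0 |
| False | 8 | 30.3 | 29.4 | 30.9 | 33.8 | 70.4 | 8.0 |
| True | 8 | 41.6 | 29.9 | 79.5 | 84.7 | 146.5 | 8.0 |
| False | 16 | 36.1 | 34.0 | 36.4 | 59.4 | 96.7 | 16.0 |
| True | 16 | 52.8 | 35.2 | 98.2 | 107.8 | 235.9 | 15.9 |
| False | 32 | 56.2 | 61.4 | 63.7 | 65.0 | 155.9 | 31.8 |
| True | 32 | 70.2 | 62.2 | 143.4 | 159.7 | 324.9 | 31.7 |
| False | 48 | 60.9 | 69.2 | 74.3 | 77.4 | 142.8 | 47.7 |
| True | 48 | 95.6 | 74.2 | 184.8 | 196.4 | 426.6 | 47.3 |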

Chunk size (ms): 960
Language model: n-gram
Maximum effective # of streams with n-gram language model: 770

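| Speaker Diarization | # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| False | 1 | 62.7 | 60.6 | 65.6 | 71.0 | 124.3 | 1.0 |
| True | 1 | 92.4 | 74.9 | 95.4 | 103.6 | 516.6 | 1.0 |
| False | 64 | 167.4 | 183.9 | 191.6 | 281.4 | 293.9 | 63.4 |
| True | 64 | 268.4 | 293.6 | 351.6 | 467.3 | 515.0 | 63.0 |
| False | 128 | 224.3 | 226.0 | 236.8 | 381.3 | 472.8 | 126.2 |
| False | 256 | 349.9 | 342.1 | 378.2 | 618.6 | 855.6 | 249.3 |
| False | 384 | 498.1 | 470.4 | 748.8 | 1036.3 | 1407.7 | 367.8 |
| False | 512 | 694.0 | 579.4 | 1443.1 | 1465.2 | 2261.6 | 482.6 |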

Language model: n-gram

| Speaker Diarization | # of streams | Throughput (RTFX) |
| --- | --- | --- |
| False | 1 | 20.9 |
| True | 1 | 17.4 |
| False | 32 | 424.4 |
| True | 32 | 306.9 |

| # of streams | Throughput (RTFX) | Average Latency (ms) |
| --- | --- | --- |
| 1 | 208.9 | 272.85 |
| 32 | 2210.3 | 745.51 |
| 64 | 2601 | 810.1 |

Chunk size (ms): 320
Language model: n-gram

| # of streams | Throughput (RTFX) | Average Latency (ms) |
| --- | --- | --- |
| 1 | 1.0 | 59.60 |
| 8 | 7.9 | 122.40 |
| 16 | 15.8 | 151.56 |
| 32 | 31.6 | 193.95 |
| 64 | 63.0 | 235.30 |

Chunk size (ms): 1600
Language model: n-gram

| # of streams | Throughput (RTFX) | Average Latency (ms) |
| --- | --- | --- |
| 1 | 1.0 | 64.21 |
| 64 | 63.4 | 277.25 |
| 128 | 126.2 | 343.67 |
| 256 | 250.4 | 503.87 |
| 384 | 371.6 | 640.38 |
| 512 | 490.6 | 805.25 |

Language model: n-gram

| # of streams | Throughput (RTFX) | Average Latency (ms) |
| --- | --- | --- |
| 1 | 129.3 | 428.72 |
| 32 | 1403.7 | 1190.53 |

Chunk size (ms): 160
Language model: n-gram
Maximum effective # of streams with n-gram language model: 355

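| # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 10 | 9.9 | 11.3 | 12 | 40 | 1 |
| 8 | 12.6 | 12 | 13.4 | 17 | 31 | 8 |
| 16 | 17 | 15 | 22 | 25 | 40 | 15.98 |
| 32 | 23 | 23 | 31 | 33 | 50 | 31.94 |
| 48 | 29 | 28 | 40 | 41 | 70 | 47.9 |
| 64 | 33.6 | 38 | 45 | 47 | 70 | 63.9 |
| 128 | 49 | 47 | 64 | 67 | 150 | 127.6 |
| 256 | 84 | 75 | 107 | 126 | 391 | 255 |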

Chunk size (ms): 800
Language model: n-gram
Maximum effective # of streams with n-gram language model: 1400

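| # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 14 | 11 | 20 | 40 | 80 | 1 |
| 64 | 39 | 40 | 55 | 80 | 110 | 63.9 |
| 128 | 58 | 50 | 75 | 150 | 202 | 127.6 |
| 256 | 90 | 80 | 115 | 240 | 380 | 255 |
| 384 | 120 | 107 | 155 | 316 | 530 | 381.4 |
| 512 | 149 | 130 | 196 | 400 | 700 | 508 |
| 768 | 258 | 200 | 630 | 680 | 1280 | 756 |
| 1024 | 420 | 263 | 1280 | 1350 | 1900 | 992 |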

Language model: n-gram

| # of streams | Throughput (RTFX) |
| --- | --- |
| 32 | 467 |

Language model: n-gram

| Speaker Diarization | # of streams | Throughput (RTFX) |
| --- | --- | --- |
| False | 1 | 90 |
| False | 32 | 370 |

Language model: none

| Speaker Diarization | # of streams | Throughput (RTFX) |
| --- | --- | --- |
| False | 1 | 11 |
| False | 32 | 77 |

Language model: none

| Speaker Diarization | # of streams | Throughput (RTFX) |
| --- | --- | --- |
| False | 1 | 40 |
| False | 32 | 300 |

Chunk size (ms): 160
Language model: n-gram
Maximum effective # of streams with n-gram language model: 179

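| # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 17.79 | 15.812 | 22.171 | 22.66 | 24.527 | 0.99925 |
| 8 | 19.619 | 18.702 | 20.283 | 21.283 | 49.858 | 7.9866 |
| 16 | 24.347 | 22.816 | 24.601 | 30.805 | 83.174 | 15.958 |
| 32 | 32.883 | 30.65 | 40.314 | 40.992 | 129.39 | 31.856 |
| 48 | 43.084 | 44.219 | 50.994 | 56.952 | 210.66 | 47.689 |
| 64 | 53.643 | 53.416 | 61.031 | 97.948 | 264.43 | 63.476 |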

Chunk size (ms): 960
Language model: n-gram
Maximum effective # of streams with n-gram language model: 810

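| # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 20.624 | 19.301 | 26.012 | 28.626 | 50.408 | 0.99918 |
| 64 | 73.596 | 71.06 | 84.034 | 234.8 | 251.47 | 63.497 |
| 128 | 123.56 | 110.62 | 139.56 | 300.04 | 449.06 | 126.28 |
| 256 | 188.12 | 162.21 | 200.33 | 538 | 814.35 | 249.61 |
| 384 | 268.43 | 198.76 | 527.84 | 786.32 | 1372.3 | 369.42 |
| 512 | 405.24 | 287.28 | 1347.4 | 1439.1 | 2252.5 | 486.61 |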

Language model: n-gram

| Speaker Diarization | # of streams | Throughput (RTFX) |
| --- | --- | --- |
| False | 1 | 318.26 |
| False | 32 | 2085 |
| True | 32 | 125 |

Chunk size (ms): 320
Language model: n-gram
Maximum effective # of streams with n-gram language model: 179

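| # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 25.742 | 25.178 | 29.273 | 30.679 | 40.96 | 0.99891 |
| 8 | 37.458 | 36.717 | 43.875 | 45.592 | 57.38 | 7.9865 |
| 16 | 46.788 | 45.738 | 51.555 | 60.965 | 75.74 | 15.963 |
| 32 | 64.08 | 57.471 | 84.993 | 89.653 | 128.29 | 31.873 |
| 48 | 85.545 | 96.194 | 111.54 | 117.86 | 176.5 | 47.714 |
| 64 | 93.02 | 104.95 | 116 | 124.89 | 195.03 | 63.61 |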

Chunk size (ms): 960
Language model: n-gram
Maximum effective # of streams with n-gram language model: 810

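| # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 21.451 | 20.791 | 23.836 | 24.61 | 53.358 | 0.99922 |
| 64 | 91.55 | 103.46 | 124.84 | 126.52 | 134.28 | 63.575 |
| 128 | 177.23 | 190.8 | 213.12 | 218.6 | 244.28 | 127.01 |
| 256 | 279.71 | 279.51 | 358.52 | 371.5 | 449.47 | 252.36 |
| 384 | 386.16 | 389.76 | 521.21 | 556.27 | 722.57 | 375.28 |
| 512 | 492.63 | 496.77 | 691.83 | 793.61 | 1101.9 | 494.73 |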

Language model: n-gram

| Speaker Diarization | # of streams | Throughput (RTFX) |
| --- | --- | --- |
| False | 1 | 338.78 |
| False | 32 | 3041.9 |

Chunk size (ms): 160
Language model: n-gram

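| # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 41.4 | 41.0 | 43.1 | 43.9 | 65.2 | 1.0 |
| 8 | 69.5 | 68.9 | 73.8 | 76.8 | 116.3 | 8.0 |
| 16 | 84.7 | 80.4 | 108.1 | 113.5 | 149.9 | 15.9 |
| 32 | 138.2 | 147.3 | 172.9 | 180.1 | 232.4 | 31.7 |
| 48 | 2610.6 | 2456.5 | 4743.8 | 4941.9 | 6120.2 | 41.5 |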

Chunk size (ms): 960
Language model: n-gram

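| # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 32.0 | 32.0 | 33.2 | 34.9 | 35.6 | 1.0 |
| 64 | 263.5 | 310.4 | 324.9 | 326.4 | 342.2 | 63.4 |
| 128 | 562.0 | 591.7 | 646.0 | 829.8 | 835.5 | 124.8 |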

Language model: n-gram

| # of streams | Throughput (RTFX) |
| --- | --- |
| 1 | 151.5 |
| 32 | 599.4 |

Chunk size (ms): 160
Language model: n-gram

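| Speaker Diarization | # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| False | 1 | 51.9 | 51.2 | 53.3 | 54.1 | 91.2 | 1.0 |
| True | 1 | 57.0 | 51.5 | 74.3 | 75.6 | 134.2 | 1.0 |
| False | 8 | 78.6 | 77.6 | 82.8 | 84.0 | 144.3 | 8.0 |
| True | 8 | 92.7 | 80.3 | 127.8 | 131.8 | 246.6 | 7.9 |
| False | 16 | 85.0 | 83.8 | 86.4 | 87.3 | 165.2 | 15.9 |
| True | 16 | 107.9 | 85.5 | 161.3 | 164.5 | 350.9 | 15.8 |
| False | 32 | 147.0 | 149.3 | 176.1 | 184.5 | 295.1 | 31.7 |
| True | 32 | 273.1 | 241.1 | 415.1 | 505.6 | 817.4 | 31.2 |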

Chunk size (ms): 960
Language model: n-gram

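| Speaker Diarization | # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| False | 1 | 74.0 | 73.7 | 75.3 | 77.5 | 94.4 | 1.0 |
| True | 1 | 108.0 | 96.6 | 100.2 | 111.7 | 473.1 | 1.0 |
| False | 64 | 366.4 | 427.7 | 438.1 | 447.2 | 456.1 | 63.1 |
| True | 64 | 541.1 | 604.0 | 658.0 | 803.4 | 833.1 | 62.4 |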

Language model: n-gram

| Speaker Diarization | # of streams | Throughput (RTFX) |
| --- | --- | --- |
| False | 1 | 18.7 |
| False | 32 | 252.4 |
| True | 32 | 148.7 |

Chunk size (ms): 160
Language model: n-gram

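| # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 30.948 | 31.181 | 34.543 | 36.056 | 47.975 | 0.99827 |
| 8 | 47.991 | 48.392 | 53.894 | 55.871 | 87.543 | 7.978 |
| 16 | 61.284 | 61.356 | 68.72 | 76.258 | 118.18 | 15.923 |
| 32 | 75.633 | 74.065 | 95.78 | 102.73 | 155.86 | 31.809 |
| 48 | 91.854 | 99.673 | 111.41 | 113.95 | 255.06 | 47.621 |
| 64 | 114.38 | 126.17 | 135.55 | 139.42 | 321.86 | 63.361 |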

Chunk size (ms): 960
Language model: n-gram

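| # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 26.371 | 26.905 | 29.871 | 34.808 | 35.852 | 0.9987 |
| 64 | 116.95 | 132.38 | 156.75 | 165.34 | 170.12 | 63.681 |
| 128 | 227.25 | 232.36 | 279.57 | 295.28 | 372.04 | 126.56 |
| 256 | 351 | 363.49 | 448.89 | 506.45 | 769.04 | 249.55 |
| 384 | 451.4 | 451.33 | 622.64 | 676.79 | 935.07 | 372.85 |
| 512 | 579.83 | 578.14 | 838.03 | 1041.4 | 1447.4 | 489.7 |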

Language model: n-gram

| # of streams | Throughput (RTFX) |
| --- | --- |
| 1 | 213.96 |
| 32 | 1021 |

Chunk size (ms): 160
Language model: n-gram
Maximum effective # of streams with n-gram language model: 160

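| # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 24.0 | 22.5 | 27.8 | 29.0 | 38.9 | 1.0 |
| 8 | 30.5 | 29.0 | 30.3 | 50.9 | 70.2 | 8.0 |
| 16 | 37.8 | 35.0 | 38.0 | 54.7 | 104.2 | 15.9 |
| 32 | 48.1 | 51.4 | 61.6 | 71.8 | 141.3 | 31.8 |
| 48 | 63.8 | 69.2 | 77.3 | 104.0 | 205.1 | 47.6 |
| 64 | 85.9 | 85.0 | 100.8 | 147.2 | 313.0 | 63.4 |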

Chunk size (ms): 960
Language model: n-gram
Maximum effective # of streams with n-gram language model: 770

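| # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 26.0 | 24.0 | 30.5 | 34.0 | 74.3 | 1.0 |
| 64 | 103.5 | 101.6 | 125.2 | 269.1 | 296.7 | 63.5 |
| 128 | 179.6 | 175.2 | 196.0 | 383.6 | 513.2 | 126.0 |
| 256 | 306.6 | 308.1 | 367.3 | 724.0 | 988.7 | 248.3 |
| 384 | 535.5 | 393.4 | 1469.1 | 1642.0 | 2496.4 | 365.0 |
| 512 | 1126.3 | 551.7 | 3230.1 | 3967.6 | 4614.8 | 476.8 |
| 512 | 1134.3 | 571.6 | 3422.9 | 3841.8 | 4632.6 | 476.7 |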

Language model: n-gram

| # of streams | Throughput (RTFX) |
| --- | --- |
| 1 | 211.3 |
| 32 | 1395.8 |

Chunk size (ms): 160
Language model: n-gram
Maximum effective # of streams with n-gram language model: 160

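| Speaker Diarization | # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| False | 1 | 29.5 | 29.1 | 30.2 | 30.5 | 56.1 | 1.0 |
| True | 1 | 36.6 | 30.6 | 53.6 | 54.4 | 109.9 | 1.0 |
| False | 8 | 39.5 | 38.3 | 40.8 | 42.2 | 96.1 | 8.0 |
| True | 8 | 52.3 | 39.3 | 92.5 | 94.6 | 180.2 | 8.0 |
| False | 16 | 51.8 | 40.7 | 72.4 | 74.5 | 118.5 | 15.9 |
| True | 16 | 67.3 | 47.9 | 114.3 | 116.3 | 301.0 | 15.9 |
| False | 32 | 64.0 | 49.5 | 84.4 | 86.1 | 161.2 | 31.8 |
| True | 32 | 105.6 | 90.6 | 208.2 | 212.1 | 487.5 | 31.5 |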

Chunk size (ms): 960
Language model: n-gram
Maximum effective # of streams with n-gram language model: 770

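| Speaker Diarization | # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| False | 1 | 66.1 | 65.6 | 66.9 | 72.3 | 73.0 | 1.0 |
| True | 1 | 92.3 | 91.5 | 92.7 | 104.1 | 104.9 | 1.0 |
| False | 64 | 207.2 | 227.4 | 242.7 | 387.8 | 401.4 | 63.2 |
| True | 64 | 363.7 | 397.5 | 435.2 | 653.1 | 670.7 | 62.8 |
| False | 128 | 294.3 | 299.9 | 312.1 | 525.3 | 658.5 | 125.5 |
| False | 256 | 518.9 | 504.9 | 724.3 | 1018.2 | 1668.4 | 245.6 |
| False | 384 | 867.2 | 683.3 | 2002.2 | 2262.0 | 3026.8 | 359.8 |
| False | 512 | 2194.6 | 2014.4 | 4142.6 | 4819.2 | 5894.0 | 443.7 |
| False | 512 | 2176.1 | 1993.4 | 4113.3 | 4797.2 | 5879.7 | 443.8 |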

Language model: n-gram

| Speaker Diarization | # of streams | Throughput (RTFX) |
| --- | --- | --- |
| False | 1 | 19.2 |
| True | 1 | 17.7 |
| False | 32 | 341.0 |
| True | 32 | 178.0 |

| # of streams | Throughput (RTFX) | Average Latency (ms) |
| --- | --- | --- |
| 1 | 128.5 | 433.0 |
| 32 | 1326.0 | 1268.77 |

Chunk size (ms): 160
Language model: n-gram
Maximum effective # of streams with n-gram language model: 233

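| # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 13 | 11.8 | 12.8 | 14 | 40 | 1 |
| 8 | 17.6 | 16.8 | 18.5 | 22 | 39 | 8 |
| 16 | 22.5 | 21.3 | 25 | 31 | 60.3 | 15.98 |
| 32 | 32.4 | 35 | 42 | 46 | 70 | 31.93 |
| 48 | 41 | 40 | 58 | 59 | 100 | 47.9 |
| 64 | 46 | 50 | 64 | 66 | 100 | 63.8 |
| 128 | 73 | 66 | 94 | 97 | 220 | 127.5 |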

Chunk size (ms): 800
Language model: n-gram
Maximum effective # of streams with n-gram language model: 980

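| # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 16 | 13 | 20 | 40 | 80 | 1 |
| 64 | 60 | 60 | 80 | 110 | 180 | 63.8 |
| 128 | 90 | 80 | 110 | 230 | 300 | 127.5 |
| 256 | 133.3 | 120 | 174 | 340 | 530 | 254 |
| 384 | 183 | 166 | 245 | 430 | 800 | 380 |
| 512 | 260 | 223 | 510 | 600 | 1200 | 505 |
| 768 | 535 | 354 | 1500 | 1640 | 2150 | 739 |
| 1024 | 940 | 600 | 2300 | 2570 | 2930 | 960 |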

Language model: n-gram

| # of streams | Throughput (RTFX) |
| --- | --- |
| 32 | 460 |

Language model: n-gram

| Speaker Diarization | # of streams | Throughput (RTFX) |
| --- | --- | --- |
| False | 1 | 60 |
| False | 32 | 234 |

Language model: none

| Speaker Diarization | # of streams | Throughput (RTFX) |
| --- | --- | --- |
| False | 1 | 5.7 |
| False | 32 | 38.75 |

Language model: none

| Speaker Diarization | # of streams | Throughput (RTFX) |
| --- | --- | --- |
| False | 1 | 24 |
| False | 32 | 168 |

Chunk size (ms): 320
Language model: n-gram

| # of streams | Throughput (RTFX) | Average Latency (ms) |
| --- | --- | --- |
| 1 | 1.0 | 99.63 |
| 8 | 7.9 | 138.54 |
| 16 | 15.7 | 203.51 |
| 32 | 31.4 | 303.27 |
| 48 | 39.8 | 2991.17 |
| 64 | 50.8 | 3737.57 |

Chunk size (ms): 1600
Language model: n-gram

| # of streams | Throughput (RTFX) | Average Latency (ms) |
| --- | --- | --- |
| 1 | 1.0 | 102.40 |
| 64 | 62.9 | 490.66 |
| 128 | 124.5 | 682.94 |
| 256 | 244.3 | 1008.00 |
| 384 | 313.3 | 3766.07 |
| 512 | 318.4 | 9788.07 |

Language model: n-gram

| # of streams | Throughput (RTFX) | Average Latency (ms) |
| --- | --- | --- |
| 1 | 77.1 | 712.47 |
| 32 | 838.0 | 2027.50 |

Chunk size (ms): 160
Language model: n-gram
Maximum effective # of streams with n-gram language model: 190

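| # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 11.518 | 10.501 | 11.753 | 12.329 | 29.332 | 0.99948 |
| 8 | 13.042 | 12.727 | 14.303 | 16.54 | 27.45 | 7.9934 |
| 16 | 17.579 | 16.357 | 25.071 | 26.493 | 42.529 | 15.974 |
| 32 | 21.415 | 18.903 | 27.705 | 28.62 | 65.338 | 31.924 |
| 48 | 32.285 | 32.166 | 34.611 | 35.804 | 102.55 | 47.839 |
| 64 | 33.933 | 36.076 | 39.682 | 41.26 | 120.46 | 63.75 |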

Chunk size (ms): 960
Language model: n-gram
Maximum effective # of streams with n-gram language model: 900

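| # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 14.345 | 12.899 | 18.496 | 21.621 | 49.489 | 0.99941 |
| 64 | 43.724 | 41.908 | 48.17 | 138.95 | 140.61 | 63.715 |
| 128 | 76.158 | 69.027 | 80.239 | 198.79 | 277.37 | 126.88 |
| 256 | 113.72 | 89.307 | 128.73 | 294.93 | 488.96 | 251.9 |
| 384 | 150.8 | 133.5 | 170.69 | 465.93 | 722.34 | 374.76 |
| 512 | 198.83 | 173.53 | 280.75 | 577.5 | 975.18 | 495.82 |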

Language model: n-gram

| Speaker Diarization | # of streams | Throughput (RTFX) |
| --- | --- | --- |
| False | 1 | 365.61 |
| False | 32 | 3638 |
| True | 32 | 101.5 |

Chunk size (ms): 320
Language model: n-gram
Maximum effective # of streams with n-gram language model: 270

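| # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 13.377 | 12.014 | 14.8 | 15.721 | 17.616 | 0.99944 |
| 8 | 20.362 | 19.784 | 23.525 | 26.147 | 33.041 | 7.9919 |
| 16 | 28.97 | 28.08 | 34.588 | 37.939 | 52.757 | 15.97 |
| 32 | 42.96 | 38.11 | 55.592 | 57.928 | 94.558 | 31.904 |
| 48 | 58.84 | 67.281 | 75.958 | 77.311 | 136.62 | 47.794 |
| 64 | 79.065 | 88.762 | 99.511 | 109.32 | 181.85 | 63.634 |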

Chunk size (ms): 960
Language model: n-gram
Maximum effective # of streams with n-gram language model: 1240

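| # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 17.358 | 16.19 | 19.135 | 23.812 | 52.552 | 0.99925 |
| 64 | 86.236 | 102.67 | 110.26 | 112.32 | 119.81 | 63.754 |
| 128 | 204.03 | 205.92 | 220.27 | 223.14 | 250.96 | 126.93 |
| 256 | 315.08 | 321.68 | 395.18 | 408.04 | 502.56 | 251.93 |
| 384 | 423.9 | 421.51 | 577.25 | 664 | 857.63 | 373.82 |
| 512 | 573 | 563.93 | 874.58 | 1039.3 | 1263.3 | 492.08 |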

Language model: n-gram

| Speaker Diarization | # of streams | Throughput (RTFX) |
| --- | --- | --- |
| False | 1 | 335.32 |
| False | 32 | 2876.4 |

Chunk size (ms): 160
Language model: n-gram

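| # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 21.6 | 21.1 | 22.8 | 23.4 | 32.5 | 1.0 |
| 8 | 31.5 | 31.2 | 33.8 | 35.2 | 51.5 | 8.0 |
| 16 | 45.5 | 45.4 | 47.8 | 53.5 | 79.2 | 16.0 |
| 32 | 67.4 | 61.8 | 89.0 | 90.6 | 119.4 | 31.8 |
| 48 | 98.9 | 116.6 | 127.1 | 134.2 | 182.2 | 47.6 |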

Chunk size (ms): 960
Language model: n-gram

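| # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 21.1 | 20.3 | 23.0 | 24.7 | 49.2 | 1.0 |
| 64 | 161.7 | 197.5 | 204.8 | 208.3 | 212.7 | 63.6 |
| 128 | 369.0 | 396.0 | 432.2 | 450.6 | 455.8 | 126.2 |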

Language model: n-gram

| # of streams | Throughput (RTFX) |
| --- | --- |
| 1 | 264.5 |
| 32 | 882.6 |

Chunk size (ms): 160
Language model: n-gram

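| Speaker Diarization | # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| False | 1 | 51.9 | 51.2 | 53.3 | 54.1 | 91.2 | 1.0 |
| True | 1 | 57.0 | 51.5 | 74.3 | 75.6 | 134.2 | 1.0 |
| False | 8 | 78.6 | 77.6 | 82.8 | 84.0 | 144.3 | 8.0 |
| True | 8 | 92.7 | 80.3 | 127.8 | 131.8 | 246.6 | 7.9 |
| False | 16 | 85.0 | 83.8 | 86.4 | 87.3 | 165.2 | 15.9 |
| True | 16 | 107.9 | 85.5 | 161.3 | 164.5 | 350.9 | 15.8 |
| False | 32 | 147.0 | 149.3 | 176.1 | 184.5 | 295.1 | 31.7 |
| True | 32 | 273.1 | 241.1 | 415.1 | 505.6 | 817.4 | 31.2 |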

Chunk size (ms): 960
Language model: n-gram

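| Speaker Diarization | # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| False | 1 | 59.8 | 59.2 | 61.8 | 63.2 | 92.3 | 1.0 |
| True | 1 | 85.1 | 72.2 | 76.2 | 83.5 | 514.8 | 1.0 |
| False | 64 | 255.4 | 304.4 | 310.3 | 313.2 | 315.0 | 63.4 |
| True | 64 | 372.3 | 422.1 | 463.4 | 469.2 | 471.6 | 63.1 |
| False | 128 | 478.7 | 513.0 | 528.6 | 666.1 | 686.7 | 125.3 |
| True | 128 | 687.7 | 695.9 | 776.2 | 1113.9 | 1620.6 | 123.7 |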

Language model: n-gram

| Speaker Diarization | # of streams | Throughput (RTFX) |
| --- | --- | --- |
| False | 1 | 20.7 |
| False | 32 | 336.4 |
| True | 32 | 260.5 |

Chunk size (ms): 160
Language model: n-gram

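| # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 19.671 | 20.104 | 21.685 | 21.96 | 38.539 | 0.99884 |
| 8 | 31.194 | 31.719 | 35.482 | 36.195 | 66.154 | 7.9835 |
| 16 | 45.007 | 46.339 | 50.019 | 51.456 | 92.321 | 15.953 |
| 32 | 61.018 | 56.473 | 77.184 | 79.764 | 136.4 | 31.801 |
| 48 | 79.726 | 87.697 | 98.868 | 100.64 | 172.36 | 47.647 |
| 64 | 102.58 | 117.05 | 125.58 | 130.69 | 271.66 | 63.453 |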

Chunk size (ms): 960
Language model: n-gram

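| # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 24.776 | 24.027 | 30.257 | 32.001 | 63.361 | 0.99901 |
| 64 | 114.75 | 133.68 | 149.83 | 153.73 | 157.22 | 63.679 |
| 128 | 235.08 | 244.42 | 285.45 | 290.93 | 374.07 | 126.5 |
| 256 | 367.18 | 365.85 | 468.17 | 506.79 | 691.56 | 250.49 |
| 384 | 485.58 | 465.62 | 668.72 | 772.08 | 999.89 | 371.5 |
| 512 | 637.89 | 635.49 | 970.62 | 1132.4 | 1399.5 | 489.51 |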

Language model: n-gram

| # of streams | Throughput (RTFX) |
| --- | --- |
| 1 | 180.37 |
| 32 | 1037.8 |

Chunk size (ms): 160
Language model: n-gram
Maximum effective # of streams with n-gram language model: 160

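| # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 15.3 | 14.0 | 15.3 | 16.8 | 34.7 | 1.0 |
| 8 | 25.1 | 21.8 | 34.5 | 35.2 | 44.5 | 8.0 |
| 16 | 20.2 | 19.0 | 22.5 | 39.6 | 46.3 | 16.0 |
| 32 | 30.6 | 24.2 | 39.2 | 43.4 | 75.1 | 31.9 |
| 48 | 38.1 | 40.8 | 45.1 | 54.8 | 94.4 | 47.8 |
| 64 | 57.1 | 55.5 | 59.0 | 60.5 | 166.6 | 63.6 |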

Chunk size (ms): 960
Language model: n-gram
Maximum effective # of streams with n-gram language model: 770

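| # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 18.0 | 16.2 | 21.2 | 24.7 | 69.8 | 1.0 |
| 64 | 62.4 | 63.0 | 71.4 | 156.6 | 158.2 | 63.7 |
| 128 | 109.1 | 105.9 | 117.6 | 229.6 | 306.7 | 126.8 |
| 256 | 171.7 | 147.5 | 202.2 | 405.7 | 578.8 | 251.3 |
| 384 | 227.2 | 198.5 | 287.6 | 570.6 | 826.4 | 373.8 |
| 512 | 319.9 | 269.6 | 632.4 | 829.8 | 1471.6 | 492.6 |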

Language model: n-gram

| # of streams | Throughput (RTFX) |
| --- | --- |
| 1 | 293.5 |
| 32 | 2602.0 |

Chunk size (ms): 160
Language model: n-gram
Maximum effective # of streams with n-gram language model: 160

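| Speaker Diarization | # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| False | 1 | 21.8 | 21.2 | 22.1 | 22.4 | 39.6 | 1.0 |
| True | 1 | 25.6 | 22.5 | 35.0 | 35.5 | 71.3 | 1.0 |
| False | 8 | 27.5 | 26.9 | 28.2 | 29.7 | 61.9 | 8.0 |
| True | 8 | 34.7 | 28.2 | 48.4 | 50.4 | 120.4 | 8.0 |
| False | 16 | 36.9 | 35.1 | 36.6 | 57.6 | 97.4 | 15.9 |
| True | 16 | 55.2 | 56.0 | 82.7 | 84.6 | 193.7 | 15.9 |
| False | 32 | 51.6 | 39.9 | 65.7 | 68.2 | 131.3 | 31.8 |
| True | 32 | 71.6 | 64.6 | 146.0 | 150.3 | 303.5 | 31.7 |
| False | 48 | 68.0 | 76.7 | 85.8 | 92.2 | 168.5 | 47.7 |
| True | 48 | 101.9 | 83.0 | 178.4 | 189.3 | 479.8 | 47.3 |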

Chunk size (ms): 960
Language model: n-gram
Maximum effective # of streams with n-gram language model: 770

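| Speaker Diarization | # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| False | 1 | 55.8 | 55.5 | 56.3 | 61.4 | 61.8 | 1.0 |
| True | 1 | 70.0 | 69.2 | 70.6 | 79.2 | 80.0 | 1.0 |
| False | 64 | 175.2 | 194.2 | 197.8 | 273.0 | 312.9 | 63.5 |
| True | 64 | 263.9 | 292.5 | 313.6 | 458.7 | 465.9 | 63.1 |
| False | 128 | 252.2 | 262.2 | 273.2 | 397.9 | 471.7 | 126.1 |
| False | 256 | 438.2 | 419.6 | 492.1 | 819.3 | 1027.5 | 248.0 |
| False | 384 | 759.5 | 626.9 | 1600.5 | 1968.1 | 2799.9 | 364.6 |
| False | 512 | 2054.4 | 1823.7 | 3943.4 | 4720.0 | 5667.5 | 456.1 |
| False | 512 | 2015.4 | 1795.2 | 3924.8 | 4561.5 | 5507.2 | 457.1 |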

Language model: n-gram

| Speaker Diarization | # of streams | Throughput (RTFX) |
| --- | --- | --- |
| False | 1 | 21.9 |
| True | 1 | 19.9 |
| False | 32 | 420.4 |
| True | 32 | 308.5 |

| Speaker Diarization | # of streams | Throughput (RTFX) | Average Latency (ms) |
| --- | --- | --- | --- |
| False | 1 | 158.3 | 352.24 |
| False | 32 | 1631.3 | 1018.84 |

Chunk size (ms): 160
Language model: n-gram
Maximum effective # of streams with n-gram language model: 280

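| # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 11 | 10.3 | 11.2 | 12.4 | 30 | 1 |
| 8 | 20 | 19 | 26 | 30 | 42 | 7.99 |
| 16 | 28 | 26 | 35 | 40 | 56 | 15.97 |
| 32 | 35 | 35 | 48 | 52 | 73 | 31.9 |
| 64 | 50 | 55 | 66 | 70 | 100 | 63.8 |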

Chunk size (ms): 800
Language model: n-gram
Maximum effective # of streams with n-gram language model: 1180

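| # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 14 | 11.5 | 20 | 30 | 60 | 1 |
| 64 | 70 | 70 | 90 | 100 | 170 | 63.8 |
| 128 | 88 | 84 | 110 | 190 | 250 | 127.4 |
| 256 | 128 | 117 | 164 | 300 | 460 | 254.4 |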

Language model: n-gram

| # of streams | Throughput (RTFX) |
| --- | --- |
| 32 | 440 |

Language model: n-gram

| Speaker Diarization | # of streams | Throughput (RTFX) |
| --- | --- | --- |
| False | 1 | 70 |
| False | 32 | 193.5 |

Language model: none

| Speaker Diarization | # of streams | Throughput (RTFX) |
| --- | --- | --- |
| False | 1 | 6.2 |
| False | 32 | 43.3 |

Language model: none

| Speaker Diarization | # of streams | Throughput (RTFX) |
| --- | --- | --- |
| False | 1 | 28 |
| False | 32 | 192 |

Chunk size (ms): 320
Language model: n-gram

| # of streams | Throughput (RTFX) | Average Latency (ms) |
| --- | --- | --- |
| 1 | 1.0 | 86.53 |
| 8 | 7.9 | 136.34 |
| 16 | 15.8 | 163.55 |
| 32 | 31.4 | 253.70 |
| 48 | 44.8 | 991.17 |
| 64 | 58.9 | 1180.73 |

Chunk size (ms): 1600
Language model: n-gram

| # of streams | Throughput (RTFX) | Average Latency (ms) |
| --- | --- | --- |
| 1 | 1.0 | 87.29 |
| 64 | 63.0 | 433.03 |
| 128 | 125.0 | 586.62 |
| 256 | 246.3 | 836.96 |
| 384 | 337.1 | 2274.47 |
| 512 | 342.7 | 7912.27 |

Language model: n-gram

| # of streams | Throughput (RTFX) | Average Latency (ms) |
| --- | --- | --- |
| 1 | 85.2 | 642.87 |
| 32 | 1056.6 | 1606.57 |

On-Prem Hardware Specifications

GPU: NVIDIA DGX A100 40GB

CPU:

  • Model: AMD EPYC 7742 64-Core Processor

  • Thread(s) per core: 2

  • Socket(s): 2

  • Core(s) per socket: 64

  • NUMA node(s): 8

  • Frequency boost: enabled

  • CPU max MHz: 2250

  • CPU min MHz: 1500

RAM:

  • Model: Micron DDR4 36ASF8G72PZ-3G2B2 3200MHz

  • Configured Memory Speed: 2933 MT/s

  • RAM Size: 32x64GB (2048GB Total)

GPU: NVIDIA H100 80GB HBM3

CPU:

  • Model: Intel(R) Xeon(R) Platinum 8480CL

  • Thread(s) per core: 2

  • Socket(s): 2

  • Core(s) per socket: 56

  • NUMA node(s): 2

  • CPU max MHz: 3800

  • CPU min MHz: 800

RAM:

  • Model: Micron DDR5 MTC40F2046S1RC48BA1 4800MHz

  • Configured Memory Speed: 4400 MT/s

  • RAM Size: 32x64GB (2048GB Total)

GPU: NVIDIA L40

CPU:

  • Model: AMD EPYC 7763 64-Core Processor

  • Thread(s) per core: 1

  • Socket(s): 2

  • Core(s) per socket: 64

  • NUMA node(s): 8

  • Frequency boost: enabled

  • CPU max MHz: 3529

  • CPU min MHz: 1500

RAM:

  • Model: Samsung DDR4 M393A4K40DB3-CWE 3200MHz

  • Configured Memory Speed: 3200 MT/s

  • RAM Size: 16x32GB (512GB Total)

Model Accuracy

Riva ASR models are evaluated using Word Error Rate (WER) for word-based languages such as English, Spanish, and French, and Character Error Rate (CER) for character-based languages such as Mandarin Chinese and Japanese. For diarization, Concatenated minimum-Permutation Word Error Rate (cpWER) is used.

WER measures the minimum number of word substitutions, insertions, and deletions required to transform the model’s output into the reference transcript, divided by the total number of words in the reference. Similarly, CER calculates the minimum number of character edits needed, divided by the total number of characters in the reference. cpWER is calculated as follows:

  1. Concatenate all utterances of each speaker for both reference and hypothesis files.

  2. Compute the WER between the reference and all possible speaker permutations of the hypothesis.

  3. Pick the lowest WER among them (this is assumed to be the best permutation).

Lower WER/CER values indicate better accuracy, with 0% representing perfect transcription.
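
The metric definitions above can be sketched in a few lines of Python. This is an illustrative implementation, not the evaluation code used to produce the numbers in this section. The function names are hypothetical, text normalization is omitted, and the cpWER variant pads missing speakers with empty transcripts, which is an assumption.

```python
from itertools import permutations


def edit_distance(ref, hyp):
    """Minimum substitutions, insertions, and deletions turning hyp into ref."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits over reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


def concat_by_speaker(utterances):
    """Step 1: join all utterances of each speaker, keeping utterance order."""
    merged = {}
    for speaker, text in utterances:
        merged.setdefault(speaker, []).extend(text.split())
    return list(merged.values())


def cpwer(ref_utterances, hyp_utterances):
    """Steps 2 and 3: lowest WER over all speaker permutations of the hypothesis."""
    refs = concat_by_speaker(ref_utterances)
    hyps = concat_by_speaker(hyp_utterances)
    # Assumption: pad with empty speakers so both sides have the same count.
    n = max(len(refs), len(hyps))
    refs += [[] for _ in range(n - len(refs))]
    hyps += [[] for _ in range(n - len(hyps))]
    total_ref_words = sum(len(r) for r in refs)
    best_errors = min(
        sum(edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best_errors / total_ref_words


if __name__ == "__main__":
    print(wer("the cat sat on mat", "the cat sit on"))
    reference = [("A", "hello there"), ("B", "good morning all")]
    hypothesis = [("spk0", "good morning"), ("spk1", "hello there")]
    print(cpwer(reference, hypothesis))
```

Trying every permutation is only practical for a small number of speakers, because the number of permutations grows factorially.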

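| Model Name | Language | Dataset | Best latency WER (%) ⬇️ | Best throughput WER (%) ⬇️ | Offline WER (%) ⬇️ |
| --- | --- | --- | --- | --- | --- |
| Parakeet 1.1b CTC | en-US | MCV 7.1 test set | 10.45 | 8.80 | 7.96 |
| Parakeet 1.1b CTC | en-US | LibriSpeech test-other | 6.34 | 4.74 | 4.09 |
| Parakeet 1.1b CTC | en-US | CallHome (CH109) | 46.09 | 41.35 | 39.61 |
| Parakeet 1.1b CTC | en-US (Silero VAD) | LibriSpeech test-other | 5.57 | 4.8 | 4.5 |
| Parakeet 1.1b CTC | en-US (Telephony) | LibriSpeech test-other | 7.33 | 5.11 | 4.17 |
| Parakeet 1.1b CTC | en-US (Telephony) | CallHome (CH109) | 30.13 | 27.82 | 28.91 |
| Parakeet 1.1b CTC | en-US (Telephony) + Sortformer Diarizer | CallHome (CH109) | 28.43 (cpWER) | - | - |
| Parakeet 0.6b TDT | en-US | AMI | - | - | 11.46 |
| Parakeet 0.6b TDT | en-US | Earnings22 | - | - | 11.65 |
| Parakeet 0.6b TDT | en-US | Gigaspeech | - | - | 9.15 |
| Parakeet 0.6b TDT | en-US | LibriSpeech test-clean | - | - | 2.01 |
| Parakeet 0.6b TDT | en-US | LibriSpeech test-other | - | - | 3.51 |
| Parakeet 0.6b TDT | en-US | SPGISpeech | - | - | 2.16 |
| Parakeet 0.6b TDT | en-US | Tedlium | - | - | 3.38 |
| Parakeet 0.6b TDT | en-US | Voxpopuli | - | - | 6.6 |
| Parakeet 1.1b RNNT | en-US | MCV 7.1 test set | 10.74 | 10.54 | 9.77 |
| Parakeet 1.1b RNNT | es-US | MLS test set | 7.19 | 5.26 | 3.83 |
| Parakeet 1.1b RNNT | es-ES | Mediaspeech | 16.15 | 14.42 | 11.51 |
| Parakeet 1.1b RNNT | fr-FR | MLS test set | 11.41 | 9.10 | 6.36 |
| Parakeet 1.1b RNNT | de-DE | MLS test set | 11.29 | 9.16 | 7.09 |
| Parakeet 1.1b RNNT | ru-RU | RuLS test set | 21.44 | 19.23 | 17.39 |
| Parakeet 0.6b CTC | en-US | MCV 7.1 test set | 10.57 | 8.87 | 8.45 |
| Parakeet 0.6b CTC | vi-VN | FLEURS Vietnamese test set | 10 | 8.58 | 7.97 |
| Parakeet 0.6b CTC | zh-CN | AISHELL1 & 2 | 5.81 | 5.84 | 6.09 |
| Parakeet 0.6b CTC | es-US | MLS test set | 9.14 | 6.15 | 5.34 |
| Canary 1b | en-US | MCV 7.1 test set | Not supported | Not supported | 6.78 |
| Canary 1b | es-US | MLS test set | Not supported | Not supported | 3.54 |
| Canary 1b | de-DE | MLS test set | Not supported | Not supported | 5.18 |
| Canary 1b | fr-FR | MLS test set | Not supported | Not supported | 4.21 |
| Canary 1b | ru-RU | MCV 7.0 test set | Not supported | Not supported | 10.33 |
| Canary 1b | es-ES | Mediaspeech | Not supported | Not supported | 14.40 |
| Canary 1b | pt-BR | MCV 10.0 test set | Not supported | Not supported | 5.83 |
| Canary 0.6b | en-US | MCV 7.1 test set | Not supported | Not supported | 8.65 |
| Canary 0.6b | es-US | MLS test set | Not supported | Not supported | 3.42 |
| Canary 0.6b | de-DE | MLS test set | Not supported | Not supported | 5.18 |
| Canary 0.6b | fr-FR | MLS test set | Not supported | Not supported | 4.66 |
| Canary 0.6b | ru-RU | MCV 7.0 test set | Not supported | Not supported | 13.39 |
| Canary 0.6b | es-ES | Mediaspeech | Not supported | Not supported | 13.21 |
| Canary 0.6b | pt-BR | MCV 10.0 test set | Not supported | Not supported | 6.38 |
| Conformer 120m CTC | es-US | MCV 7.1 test set | 6.75 | 6.26 | 5.66 |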