Performance

The tables below report measured performance of the Jarvis ASR, NLP, and TTS services on NVIDIA T4, NVIDIA V100 SXM2 16GB, and NVIDIA A100 SXM4 40GB GPUs. CPU specifications for each system can be found here:

ASR

The latency numbers below were measured in streaming recognition mode with the BERT-based punctuation model enabled, a 4-gram language model, a decoder beam width of 128, and timestamps enabled. The acoustic model used was Jasper 15x5. The client and the server used audio chunks of the same duration (100ms, 800ms, or 3200ms, depending on the server configuration). The Jarvis streaming client, jarvis_streaming_asr_client, provided in the Jarvis client image, was run with the --simulate_realtime flag to simulate transcription from a microphone; each stream performed 5 iterations over a sample audio file from the LibriSpeech dataset (1272-135031-0000.wav). The command used was:

jarvis_streaming_asr_client  \
     --chunk_duration_ms=<chunk_duration> --simulate_realtime=true \
     --automatic_punctuation=true --num_parallel_requests=<num_streams> \
     --word_time_offsets=true --print_transcripts=false \
     --interim_results=false --num_iterations=<5*num_streams> \
     --audio_file=1272-135031-0000.wav --output_filename=/tmp/output.json
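
To sweep several stream counts with the same settings, the command line above can be generated programmatically. The following is a minimal sketch in Python; the helper name build_benchmark_command is hypothetical, and it only assembles the flags shown in the command above, with --num_iterations set to 5 times the number of streams as in the template.

```python
import shlex
from typing import List


def build_benchmark_command(chunk_duration_ms: int, num_streams: int,
                            audio_file: str = "1272-135031-0000.wav",
                            output_filename: str = "/tmp/output.json") -> List[str]:
    """Assemble the argv list for one benchmark run (hypothetical helper)."""
    return [
        "jarvis_streaming_asr_client",
        f"--chunk_duration_ms={chunk_duration_ms}",
        "--simulate_realtime=true",
        "--automatic_punctuation=true",
        f"--num_parallel_requests={num_streams}",
        "--word_time_offsets=true",
        "--print_transcripts=false",
        "--interim_results=false",
        # Each stream performs 5 iterations over the sample file.
        f"--num_iterations={5 * num_streams}",
        f"--audio_file={audio_file}",
        f"--output_filename={output_filename}",
    ]


# Print the commands for a few of the stream counts used with 100ms chunks.
for streams in (1, 8, 16):
    print(shlex.join(build_benchmark_command(100, streams)))
```

The returned list can be passed to subprocess.run unchanged; the sketch only prints the commands so that it runs without the Jarvis client installed.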

The jarvis_streaming_asr_client returns latency measured in three different ways after executing the benchmark task:

  • intermediate latency: latency to return an intermediate transcript with is_final == false

  • final latency: latency of messages returned with is_final == true

  • latency: the overall latency of all returned message types
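
The tables in this section summarize latency as an average and as p50, p90, p95, and p99 percentiles. The sketch below shows one way such a summary can be computed from raw per-response latencies. It assumes the nearest-rank percentile definition; the method the client actually uses is not stated here, and the function names and sample values are made up for illustration.

```python
import math
from statistics import mean


def nearest_rank_percentile(ordered, p):
    """Return the p-th percentile (0 < p <= 100) of an ascending list."""
    rank = math.ceil(p * len(ordered) / 100)  # 1-based nearest rank
    return ordered[max(rank, 1) - 1]


def summarize_latencies(latencies_ms):
    """Reduce raw latencies (ms) to the avg/p50/p90/p95/p99 columns."""
    ordered = sorted(latencies_ms)
    summary = {"avg": mean(ordered)}
    for p in (50, 90, 95, 99):
        summary[f"p{p}"] = nearest_rank_percentile(ordered, p)
    return summary


# Made-up sample latencies in milliseconds, for illustration only.
sample = [9.1, 9.3, 9.4, 9.6, 9.8, 10.2, 10.9, 11.5, 12.0, 16.7]
print(summarize_latencies(sample))
```

With few samples the high percentiles collapse onto the largest observed values, which may be why several single-stream rows report nearly identical p95 and p99 figures.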

NVIDIA A100 GPU

100ms chunk

Acoustic model  # of streams  Latency (ms)                                      Throughput (RTFX)
                              avg       p50       p90       p95       p99
quartznet       1             9.6       9.3       10.4      11.4      16.7      1
quartznet       8             15.3      15.1      17.7      19.1      30.8      8
quartznet       16            25.9      25.8      30.1      33.4      48.7      16
quartznet       32            40.8      41.5      47.4      50.1      68.5      32
quartznet       48            54.4      53.8      64.2      67.9      90.2      47.9
quartznet       64            63.3      64.2      80.5      84.8      107.4     63.8
quartznet       96            86.2      93.4      108.5     115.6     160.7     95.7
quartznet       128           132.4     135.9     176       185.5     212.6     127.5
jasper          1             13.4      13.1      14.3      15.2      20.5      1
jasper          8             17.8      17.6      20.5      22.3      34.3      8
jasper          16            26.3      24.3      34.8      36.6      47        16
jasper          32            49.9      49.6      57.4      61.8      81.1      31.9
jasper          48            60.8      61        72.3      75.5      87.6      47.9
jasper          64            72.3      75.9      87.8      90.9      118.1     63.9
jasper          96            114.5     117.7     155.3     173.1     190.4     95.7
jasper          128           258.9     240       338.2     353.2     385       127.4

800ms chunk

Acoustic model  # of streams  Latency (ms)                                      Throughput (RTFX)
                              avg       p50       p90       p95       p99
quartznet       1             14.4      14        18.1      18.5      19.2      1
quartznet       64            82.8      81.4      109       114.8     124.3     63.9
quartznet       128           143.4     148.4     187.6     199.4     211.5     127.5
quartznet       256           228.9     238.4     322.9     339.9     364.8     254.3
quartznet       384           298.4     313       406.2     444       471.3     380.6
quartznet       512           351.2     359.2     482.7     513.5     550.2     506.4
quartznet       768           467.3     472.9     645.6     684.8     732.1     757.2
quartznet       1024          630.8     607.2     961.1     1115.1    1318.1    1005.3
jasper          1             17.6      16.8      21.6      23.8      26.8      1
jasper          64            92.8      92.3      118.3     125.9     145.4     63.8
jasper          128           156.8     160.9     205.7     223.7     243.1     127.5
jasper          256           244.9     254.1     324.8     356.2     378.1     254.1
jasper          384           311.1     315.7     411.7     435.9     474.4     380.7
jasper          512           381       387.2     510.8     537.8     614.4     506.6
jasper          768           512.6     510.3     689.4     734.8     1110.5    757
jasper          1024          749.3     696.7     1228.9    1430.7    1579      1004

3200ms chunk

Acoustic model  # of streams  Latency (ms)                                      Throughput (RTFX)
                              avg       p50       p90       p95       p99
quartznet       1             28.1      28.8      32.4      32.5      32.5      1
quartznet       256           356.7     397.7     478.3     493.1     518.9     253.8
quartznet       512           566.5     591.5     780.1     803.4     841.8     505.2
quartznet       768           729.1     721.9     990.8     1030.3    1074.4    753.4
quartznet       1024          899.3     937.7     1226      1315.2    1514      1000.1
quartznet       1280          1052.1    1037.9    1537.7    1793.6    2100      1244.9
quartznet       1512          1303.8    1301.7    1847.9    2149.6    2464.6    1460.2
jasper          1             31        33.4      35        35.3      35.3      1
jasper          256           422.1     451.1     548.4     568.1     583.5     253.6
jasper          512           667.5     697.5     864.8     890.7     926.3     504.1
jasper          768           865.4     898.6     1106.3    1143.5    1225.6    752.3
jasper          1024          1089      1083.8    1480.4    1617.3    2038.3    997.2
jasper          1280          1382.5    1386.3    2041.7    2380.1    2559.1    1237.2
jasper          1512          1753.8    1735      2629.3    2779.8    2970.5    1448.8

NVIDIA V100 GPU

100ms chunk

Acoustic model  # of streams  Latency (ms)                                      Throughput (RTFX)
                              avg       p50       p90       p95       p99
quartznet       1             8.8       8.3       9.7       11.3      21.7      1
quartznet       8             15        14        17        20.2      43        8
quartznet       16            22.4      21.4      25.8      27.6      57.6      16
quartznet       32            36.1      36.2      41.8      44.4      72.9      31.9
quartznet       48            44.6      44.8      53        55.7      85.4      47.9
quartznet       64            54.9      55.1      67        73.1      102.5     63.8
quartznet       96            81.2      84.3      99.2      111.8     179.2     95.7
quartznet       128           114.7     109.3     157.3     181.5     228.2     127.4
jasper          1             21.5      21        22.2      24        31.2      1
jasper          8             27.6      26.5      29.7      34.7      53.4      8
jasper          16            36.9      34        49        51.3      58.8      16
jasper          32            74.5      72.5      88.1      91.6      126.3     31.9
jasper          48            117.5     101.1     175.4     186.6     224.5     47.9
jasper          64            406.4     365.7     645.5     695.1     806.5     63.6
jasper          96            14378     13737     25542     27829     32182     72.8
jasper          128           28826     28125     53029     56965     63537     66.2

800ms chunk

Acoustic model  # of streams  Latency (ms)                                      Throughput (RTFX)
                              avg       p50       p90       p95       p99
quartznet       1             14.4      13.6      20.4      20.6      20.7      1.0
quartznet       64            79.3      77.2      111.3     120.2     130.1     63.8
quartznet       128           135.1     128.9     195.7     204.9     219.0     127.4
quartznet       256           222.2     218.7     315.2     339.2     362.2     254.3
quartznet       384           310.9     304.9     443.8     479.9     520.5     380.3
quartznet       512           385.2     374.5     569.0     589.6     626.2     505.4
quartznet       768           574.5     527.0     937.3     1226.6    1347.8    751.9
quartznet       1024          1088.1    946.2     1752.3    2116.6    2544.2    981.6
jasper          1             26.8      25.9      32.8      35.3      56.6      1.0
jasper          64            138.3     134.0     170.8     181.5     203.3     63.8
jasper          128           239.4     234.9     294.9     310.2     342.8     127.2
jasper          256           416.0     416.8     509.2     556.0     588.2     253.3
jasper          384           613.6     597.9     766.6     919.4     1271.1    378.0
jasper          512           969.7     858.2     1503.9    1860.3    2297.8    499.7
jasper          768           9170.1    9241.0    15868.0   16618.0   18224.0   591.1
jasper          1024          22837.0   23248.0   37553.0   40249.0   42696.0   579.8

3200ms chunk

Acoustic model  # of streams  Latency (ms)                                      Throughput (RTFX)
                              avg       p50       p90       p95       p99
quartznet       1             32.933    35.423    37.712    38.012    38.012    0.9994
quartznet       256           461.44    488.88    630.67    653.84    684.75    253.1
quartznet       512           784.73    843.69    1069.8    1105.7    1154.2    501.66
quartznet       768           1121.6    1114.7    1601.7    1971.7    2138.5    747.45
quartznet       1024          1551.5    1592.9    2258.9    2463.8    2608.1    985.6
quartznet       1280          1982.2    2080.8    2910.2    3062.1    3279.6    1211.7
quartznet       1512          2305.8    2241.4    3625.4    4190.5    4989.9    1413.3
jasper          1             48.351    49.407    51.954    79.174    79.174    0.99919
jasper          256           734.99    751.2     897.03    916.36    941.26    252.12
jasper          512           1423.3    1384.4    2263.9    2387.1    2477.4    497.69
jasper          768           2190.2    2133.8    3255.7    3393      3482.7    730.15
jasper          1024          3576.3    2847.7    5861.6    6062.2    6748.6    951.97
jasper          1280          13698     12101     28644     32940     35311     1001.1
jasper          1512          19705     16730     40679     43397     46270     1014.6

NVIDIA T4

100ms chunk

Acoustic model  # of streams  Latency (ms)                                      Throughput (RTFX)
                              avg       p50       p90       p95       p99
quartznet       1             19.2      18.4      21.6      23.0      38.4      1.0
quartznet       8             36.0      34.4      41.4      45.9      82.7      8.0
quartznet       16            56.4      54.8      66.0      70.6      113.9     16.0
quartznet       32            70.9      71.0      82.4      93.7      160.0     31.9
quartznet       48            99.0      96.5      128.0     152.7     210.8     47.8
quartznet       64            242.4     224.1     354.0     407.2     479.6     63.7
quartznet       96            24151.0   22486.0   42624.0   47420.0   50429.0   58.7
quartznet       128           43821.0   44736.0   77326.0   81324.0   87343.0   53.7
jasper          1             46.9      46.9      49.6      52.7      65.7      1.0
jasper          8             51.1      51.7      58.6      66.0      95.9      8.0
jasper          16            84.4      81.7      97.3      104.1     187.7     16.0
jasper          32            2328.1    2017.9    4183.5    5180.6    7012.1    31.6
jasper          48            16858.0   14761.0   32993.0   35911.0   38084.0   35.1
jasper          64            25504.0   22164.0   47484.0   51189.0   55003.0   37.0
jasper          96            38857.0   41576.0   59410.0   63763.0   69797.0   38.2
jasper          128           55384.0   57791.0   89744.0   94712.0   98622.0   38.7

800ms chunk

Acoustic model  # of streams  Latency (ms)                                      Throughput (RTFX)
                              avg       p50       p90       p95       p99
quartznet       1             33.183    33.444    44.144    44.813    46.354    0.99914
quartznet       64            162.63    162.72    214.48    226.93    253.69    63.725
quartznet       128           263.6     263.68    334.96    353.4     375.9     127.11
quartznet       256           449.28    447.25    559.87    591.62    644.3     252.7
quartznet       384           732.75    682.62    986.42    1360.7    1539.3    375.95
quartznet       512           2037.5    2001.9    3136.3    3815.6    4684.4    487.93
quartznet       768           15721     15724     27569     28450     29961     493.95
quartznet       1024          29223     29487     49967     51824     53910     494.05
jasper          1             72.377    72.143    82.132    89.374    90.067    0.99848
jasper          64            259.64    262.21    298.47    311.66    331.8     63.62
jasper          128           450.81    452.22    529.64    547.49    584.69    126.62
jasper          256           1200.8    978.29    1809.4    2446.7    3595.1    249.24
jasper          384           11679     11833     19190     20312     22493     279.91
jasper          512           23750     23537     39610     41101     43670     280.41
jasper          768           46165     49046     74417     79363     83407     279.8
jasper          1024          67973     69939     114000    121000    126000    280.61

3200ms chunk

Acoustic model  # of streams  Latency (ms)                                      Throughput (RTFX)
                              avg       p50       p90       p95       p99
quartznet       1             157.62    160.64    168.29    168.31    168.31    0.99726
quartznet       256           906.17    915.19    1098.4    1130.8    1163.2    251.35
quartznet       512           1515.2    1491.2    2244.4    2429.9    2540.8    494.82
quartznet       768           2398.4    2216.6    3447      3586.4    3909.8    722.55
quartznet       1024          4636.2    4727.7    7782.6    8737.9    8969.3    926.66
quartznet       1280          17263     15966     36103     40196     44408     872.88
quartznet       1512          25038     24528     49704     56065     60136     875.68
jasper          1             96.201    100.64    104.75    104.82    104.82    0.99831
jasper          256           1758.4    1668.3    2718.5    2764.3    2811.6    247.1
jasper          512           11593     9623.5    25483     28937     30681     432.78
jasper          768           28073     27499     55288     57262     63169     440.06
jasper          1024          44405     44756     83588     86835     92653     445.39
jasper          1280          61336     65536     114000    117000    126000    446.78
jasper          1512          76306     83556     140000    145000    153000    447.83
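
One way to read the ASR tables above: because the client simulates real-time audio, the aggregate throughput cannot exceed the number of streams, so a throughput well below the stream count suggests the server is no longer keeping up, and latency then climbs from milliseconds into seconds. The sketch below applies that reading to four rows of the T4 100ms quartznet table. It interprets RTFX as audio duration processed per unit of wall-clock time, which this section does not define explicitly; the is_saturated helper and its 5% tolerance are assumptions chosen for illustration.

```python
def is_saturated(num_streams, throughput_rtfx, tolerance=0.05):
    """Flag a row whose throughput falls clearly below the stream count.

    With real-time simulation each stream contributes at most 1.0 RTFX,
    so num_streams is the ceiling. The 5% tolerance is an arbitrary choice.
    """
    return throughput_rtfx < num_streams * (1.0 - tolerance)


# (streams, avg latency in ms, throughput in RTFX), copied from the
# NVIDIA T4, 100ms chunk, quartznet rows above.
T4_QUARTZNET_100MS = [
    (48, 99.0, 47.8),
    (64, 242.4, 63.7),
    (96, 24151.0, 58.7),
    (128, 43821.0, 53.7),
]

for streams, avg_ms, rtfx in T4_QUARTZNET_100MS:
    status = "saturated" if is_saturated(streams, rtfx) else "keeping up"
    print(f"{streams:>4} streams  avg {avg_ms:>8.1f} ms  {rtfx:>6.1f} RTFX  {status}")
```

Under this heuristic the 48- and 64-stream rows are still keeping up, while the 96- and 128-stream rows are flagged, matching the jump in average latency from a few hundred milliseconds to tens of seconds.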

NLP

Performance of the Jarvis named entity recognition (NER) service (using a BERT-base model, sequence length of 128) and the Jarvis question answering (QA) service (using a BERT-large model, sequence length of 384) was measured in Jarvis. Batch size 1 latency and maximum throughput were measured.

NVIDIA A100 GPU

Task            # of streams  Latency (ms)                                      Throughput (seq/s)
                              avg       p50       p90       p95       p99
NER             1             3.19      3.15      3.3       3.44      3.88      311.1
NER             256           95.5      96.1      108       113       118       2548.8
Q&A             1             4.95      4.83      5.25      5.36      5.77      201.2
Q&A             128           279       290       294       308       321       453.1

NVIDIA V100 GPU

Task            # of streams  Latency (ms)                                      Throughput (seq/s)
                              avg       p50       p90       p95       p99
NER             1             4.87      4.84      5.07      5.11      5.29      204.2
NER             256           135       135       154       160       164       1796.8
Q&A             1             7.47      7.44      7.58      7.62      7.78      133.5
Q&A             128           521       541       543       544       626       243.8

NVIDIA T4

Task            # of streams  Latency (ms)                                      Throughput (seq/s)
                              avg       p50       p90       p95       p99
NER             1             9.31      9.19      9.94      10.2      11.1      106.7
NER             256           255       265       282       285       289       960.2
Q&A             1             11.5      11.3      11.4      11.4      11.5      86.9
Q&A             128           571       582       672       684       768       223.1
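
The single-stream rows above can be sanity-checked with simple arithmetic: if one stream issues requests back to back, throughput in sequences per second should be close to 1000 divided by the average latency in milliseconds. The sketch below is an illustrative check under that assumption, not a description of how the benchmark computes throughput; the function name is hypothetical and the rows are copied from the three NLP tables.

```python
def sequential_throughput(avg_latency_ms):
    """Estimate seq/s for one stream sending requests back to back."""
    return 1000.0 / avg_latency_ms


# (GPU, task, avg latency in ms, reported throughput in seq/s) for the
# rows with 1 stream in the NLP tables above.
SINGLE_STREAM_ROWS = [
    ("A100", "NER", 3.19, 311.1),
    ("A100", "Q&A", 4.95, 201.2),
    ("V100", "NER", 4.87, 204.2),
    ("V100", "Q&A", 7.47, 133.5),
    ("T4", "NER", 9.31, 106.7),
    ("T4", "Q&A", 11.5, 86.9),
]

for gpu, task, avg_ms, reported in SINGLE_STREAM_ROWS:
    estimate = sequential_throughput(avg_ms)
    gap_pct = 100.0 * (estimate - reported) / reported
    print(f"{gpu:<5}{task:<5}estimated {estimate:6.1f} seq/s  "
          f"reported {reported:6.1f} seq/s  gap {gap_pct:+.1f}%")
```

The estimates land within about one percent of the reported values. The rows with 128 or 256 streams do not follow this relation, since many requests are in flight at once.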

TTS

Performance of the Jarvis text-to-speech (TTS) service was measured for different numbers of parallel streams. Each parallel stream performed 10 iterations over 10 input strings from the LJSpeech dataset. Latency to first audio chunk, latency between successive audio chunks, and throughput were measured.

NVIDIA A100 GPU

# of streams  Latency to first audio (s)          Latency between audio chunks (s)    Throughput (RTFX)
              avg      p90      p95      p99      avg      p90      p95      p99
1             0.06     0.06     0.06     0.06     0.04     0.04     0.04     0.04     19.5
4             0.48     0.67     0.71     0.78     0.03     0.05     0.06     0.11     37.0
6             0.69     0.89     0.94     1.06     0.03     0.05     0.07     0.10     41.8
8             0.88     1.10     1.15     1.25     0.03     0.06     0.07     0.10     45.8
10            1.06     1.21     1.26     1.43     0.03     0.06     0.08     0.09     48.7

NVIDIA V100 GPU

# of streams  Latency to first audio (s)          Latency between audio chunks (s)    Throughput (RTFX)
              avg      p90      p95      p99      avg      p90      p95      p99
1             0.08     0.08     0.08     0.25     0.05     0.06     0.06     0.06     14.31
4             0.77     0.98     1.07     1.19     0.05     0.07     0.08     0.13     23.3
6             1.11     1.47     1.56     1.71     0.05     0.09     0.11     0.17     25.55
8             1.4      1.81     1.9      2.06     0.06     0.1      0.12     0.17     28.09
10            1.74     2.37     2.52     2.78     0.07     0.12     0.14     0.17     27.75

NVIDIA T4

# of streams  Latency to first audio (s)          Latency between audio chunks (s)    Throughput (RTFX)
              avg      p90      p95      p99      avg      p90      p95      p99
1             0.12     0.12     0.12     0.12     0.07     0.07     0.07     0.07     11.17
4             1.02     1.37     1.43     1.52     0.07     0.11     0.13     0.19     17.14
6             1.59     2.05     2.15     2.32     0.07     0.12     0.15     0.25     18.16
8             2.13     2.59     2.71     2.88     0.08     0.14     0.18     0.26     18.83
10            2.55     3.42     3.65     4.03     0.1      0.2      0.24     0.34     18.37

When the server is under high load, requests might time out, because the server does not start inference for a new request until a previous request has been completely generated and its inference slot has been freed. This is done to maximize throughput for the TTS service and allow for real-time interaction. NVIDIA does not recommend making more than 8-10 simultaneous requests with the models provided in Jarvis 1.0.0 beta.
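
A client can stay within the recommended limit by capping the number of requests in flight. The following minimal sketch does this with a thread pool of 8 workers. The synthesize function is a hypothetical placeholder and not part of the Jarvis client API; replace it with the real TTS request. The limit of 8 is taken from the lower end of the 8-10 range above.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Lower end of the 8-10 simultaneous requests recommended above.
MAX_CONCURRENT_REQUESTS = 8


def synthesize(text):
    """Hypothetical placeholder standing in for a real Jarvis TTS request."""
    time.sleep(0.01)  # stands in for network and inference time
    return f"audio for: {text}"


def synthesize_all(texts, request_fn=synthesize,
                   max_concurrent=MAX_CONCURRENT_REQUESTS):
    """Run request_fn over texts with at most max_concurrent in flight.

    Results are returned in the same order as the input texts.
    """
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        return list(pool.map(request_fn, texts))


sentences = [f"sentence {i}" for i in range(20)]
results = synthesize_all(sentences)
print(len(results), "requests completed")
```

Requests beyond the cap wait in the pool's queue on the client side instead of timing out on the server.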