Performance¶
Below is the measured performance of the Riva ASR, NLP, and TTS services on NVIDIA T4, V100 SXM2 16GB, and A100 SXM4 40GB GPUs. CPU specifications for each system can be found here:
ASR¶
The latency numbers below were measured in streaming recognition mode, with the
BERT-based punctuation model enabled, a 4-gram language model, a decoder beam width of
128, and timestamps enabled. The acoustic model used was Jasper 15x5. The client and the
server used audio chunks of the same duration (100ms, 800ms, or 3200ms, depending on
the server configuration). The Riva streaming client riva_streaming_asr_client,
provided in the Riva client image, was run with the --simulate_realtime flag to
simulate transcription from a microphone, with each stream performing 5 iterations
over a sample audio file from the Librispeech dataset (1272-135031-0000.wav).
The command used was:
riva_streaming_asr_client \
--chunk_duration_ms=<chunk_duration> --simulate_realtime=true \
--automatic_punctuation=true --num_parallel_requests=<num_streams> \
--word_time_offsets=true --print_transcripts=false \
--interim_results=false --num_iterations=<5*num_streams> \
--audio_file=1272-135031-0000.wav --output_filename=/tmp/output.json
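When sweeping over stream counts and chunk durations, the placeholders in the command above have to be filled in consistently, in particular `--num_iterations`, which is 5 times the number of streams. The following is a minimal sketch of how that could be scripted. The helper name `build_benchmark_command` and its parameters are hypothetical; only the flags and values shown in the command above are used.

```python
import shlex

def build_benchmark_command(num_streams, chunk_duration_ms,
                            audio_file="1272-135031-0000.wav",
                            output_filename="/tmp/output.json",
                            iterations_per_stream=5):
    """Return the benchmark command as an argument list (hypothetical helper)."""
    return [
        "riva_streaming_asr_client",
        f"--chunk_duration_ms={chunk_duration_ms}",
        "--simulate_realtime=true",
        "--automatic_punctuation=true",
        f"--num_parallel_requests={num_streams}",
        "--word_time_offsets=true",
        "--print_transcripts=false",
        "--interim_results=false",
        # Each stream performs 5 iterations, so the total scales with the stream count.
        f"--num_iterations={iterations_per_stream * num_streams}",
        f"--audio_file={audio_file}",
        f"--output_filename={output_filename}",
    ]

# Example: 64 parallel streams with 800ms chunks.
command = build_benchmark_command(num_streams=64, chunk_duration_ms=800)
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run` inside an environment where the client binary is available, such as the Riva client image.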
The riva_streaming_asr_client returns latency measured in three different ways
after executing the benchmark task:
intermediate latency: latency to return an intermediate transcript with is_final == false
final latency: latency of messages returned with is_final == true
latency: the overall latency of all returned message types
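The tables that follow report each latency as an average plus p50, p90, p95, and p99 percentiles. As an illustration of how such a summary can be derived from raw per-response latencies, here is a minimal sketch. It assumes the nearest-rank percentile definition and a hypothetical input of `(latency_ms, is_final)` pairs; the method the client actually uses to compute its percentiles is not stated here.

```python
import math
from statistics import mean

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def summarize(latencies_ms):
    """Return avg/p50/p90/p95/p99 for a list of latencies, or None if the list is empty."""
    if not latencies_ms:
        return None
    summary = {"avg": mean(latencies_ms)}
    for p in (50, 90, 95, 99):
        summary[f"p{p}"] = percentile(latencies_ms, p)
    return summary

def summarize_by_type(responses):
    """Group hypothetical (latency_ms, is_final) pairs into the three reported views."""
    intermediate = [lat for lat, is_final in responses if not is_final]
    final = [lat for lat, is_final in responses if is_final]
    overall = [lat for lat, _ in responses]
    return {
        "intermediate latency": summarize(intermediate),
        "final latency": summarize(final),
        "latency": summarize(overall),
    }

# Made-up sample values, for illustration only.
responses = [(12.0, False), (15.0, False), (30.0, True), (14.0, False), (28.0, True)]
print(summarize_by_type(responses))
```

With `--interim_results=false`, as in the command above, no intermediate responses would be expected, which is why the sketch returns `None` for an empty group.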
NVIDIA A100 GPU¶
100ms chunk¶
Acoustic model | # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX)
---|---|---|---|---|---|---|---
quartznet | 1 | 9.6 | 9.3 | 10.4 | 11.4 | 16.7 | 1
quartznet | 8 | 15.3 | 15.1 | 17.7 | 19.1 | 30.8 | 8
quartznet | 16 | 25.9 | 25.8 | 30.1 | 33.4 | 48.7 | 16
quartznet | 32 | 40.8 | 41.5 | 47.4 | 50.1 | 68.5 | 32
quartznet | 48 | 54.4 | 53.8 | 64.2 | 67.9 | 90.2 | 47.9
quartznet | 64 | 63.3 | 64.2 | 80.5 | 84.8 | 107.4 | 63.8
quartznet | 96 | 86.2 | 93.4 | 108.5 | 115.6 | 160.7 | 95.7
quartznet | 128 | 132.4 | 135.9 | 176 | 185.5 | 212.6 | 127.5
jasper | 1 | 13.4 | 13.1 | 14.3 | 15.2 | 20.5 | 1
jasper | 8 | 17.8 | 17.6 | 20.5 | 22.3 | 34.3 | 8
jasper | 16 | 26.3 | 24.3 | 34.8 | 36.6 | 47 | 16
jasper | 32 | 49.9 | 49.6 | 57.4 | 61.8 | 81.1 | 31.9
jasper | 48 | 60.8 | 61 | 72.3 | 75.5 | 87.6 | 47.9
jasper | 64 | 72.3 | 75.9 | 87.8 | 90.9 | 118.1 | 63.9
jasper | 96 | 114.5 | 117.7 | 155.3 | 173.1 | 190.4 | 95.7
jasper | 128 | 258.9 | 240 | 338.2 | 353.2 | 385 | 127.4
800ms chunk¶
Acoustic model | # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX)
---|---|---|---|---|---|---|---
quartznet | 1 | 14.4 | 14 | 18.1 | 18.5 | 19.2 | 1
quartznet | 64 | 82.8 | 81.4 | 109 | 114.8 | 124.3 | 63.9
quartznet | 128 | 143.4 | 148.4 | 187.6 | 199.4 | 211.5 | 127.5
quartznet | 256 | 228.9 | 238.4 | 322.9 | 339.9 | 364.8 | 254.3
quartznet | 384 | 298.4 | 313 | 406.2 | 444 | 471.3 | 380.6
quartznet | 512 | 351.2 | 359.2 | 482.7 | 513.5 | 550.2 | 506.4
quartznet | 768 | 467.3 | 472.9 | 645.6 | 684.8 | 732.1 | 757.2
quartznet | 1024 | 630.8 | 607.2 | 961.1 | 1115.1 | 1318.1 | 1005.3
jasper | 1 | 17.6 | 16.8 | 21.6 | 23.8 | 26.8 | 1
jasper | 64 | 92.8 | 92.3 | 118.3 | 125.9 | 145.4 | 63.8
jasper | 128 | 156.8 | 160.9 | 205.7 | 223.7 | 243.1 | 127.5
jasper | 256 | 244.9 | 254.1 | 324.8 | 356.2 | 378.1 | 254.1
jasper | 384 | 311.1 | 315.7 | 411.7 | 435.9 | 474.4 | 380.7
jasper | 512 | 381 | 387.2 | 510.8 | 537.8 | 614.4 | 506.6
jasper | 768 | 512.6 | 510.3 | 689.4 | 734.8 | 1110.5 | 757
jasper | 1024 | 749.3 | 696.7 | 1228.9 | 1430.7 | 1579 | 1004
3200ms chunk¶
Acoustic model | # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX)
---|---|---|---|---|---|---|---
quartznet | 1 | 28.1 | 28.8 | 32.4 | 32.5 | 32.5 | 1
quartznet | 256 | 356.7 | 397.7 | 478.3 | 493.1 | 518.9 | 253.8
quartznet | 512 | 566.5 | 591.5 | 780.1 | 803.4 | 841.8 | 505.2
quartznet | 768 | 729.1 | 721.9 | 990.8 | 1030.3 | 1074.4 | 753.4
quartznet | 1024 | 899.3 | 937.7 | 1226 | 1315.2 | 1514 | 1000.1
quartznet | 1280 | 1052.1 | 1037.9 | 1537.7 | 1793.6 | 2100 | 1244.9
quartznet | 1512 | 1303.8 | 1301.7 | 1847.9 | 2149.6 | 2464.6 | 1460.2
jasper | 1 | 31 | 33.4 | 35 | 35.3 | 35.3 | 1
jasper | 256 | 422.1 | 451.1 | 548.4 | 568.1 | 583.5 | 253.6
jasper | 512 | 667.5 | 697.5 | 864.8 | 890.7 | 926.3 | 504.1
jasper | 768 | 865.4 | 898.6 | 1106.3 | 1143.5 | 1225.6 | 752.3
jasper | 1024 | 1089 | 1083.8 | 1480.4 | 1617.3 | 2038.3 | 997.2
jasper | 1280 | 1382.5 | 1386.3 | 2041.7 | 2380.1 | 2559.1 | 1237.2
jasper | 1512 | 1753.8 | 1735 | 2629.3 | 2779.8 | 2970.5 | 1448.8
NVIDIA V100 GPU¶
100ms chunk¶
Acoustic model | # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX)
---|---|---|---|---|---|---|---
quartznet | 1 | 8.8 | 8.3 | 9.7 | 11.3 | 21.7 | 1
quartznet | 8 | 15 | 14 | 17 | 20.2 | 43 | 8
quartznet | 16 | 22.4 | 21.4 | 25.8 | 27.6 | 57.6 | 16
quartznet | 32 | 36.1 | 36.2 | 41.8 | 44.4 | 72.9 | 31.9
quartznet | 48 | 44.6 | 44.8 | 53 | 55.7 | 85.4 | 47.9
quartznet | 64 | 54.9 | 55.1 | 67 | 73.1 | 102.5 | 63.8
quartznet | 96 | 81.2 | 84.3 | 99.2 | 111.8 | 179.2 | 95.7
quartznet | 128 | 114.7 | 109.3 | 157.3 | 181.5 | 228.2 | 127.4
jasper | 1 | 21.5 | 21 | 22.2 | 24 | 31.2 | 1
jasper | 8 | 27.6 | 26.5 | 29.7 | 34.7 | 53.4 | 8
jasper | 16 | 36.9 | 34 | 49 | 51.3 | 58.8 | 16
jasper | 32 | 74.5 | 72.5 | 88.1 | 91.6 | 126.3 | 31.9
jasper | 48 | 117.5 | 101.1 | 175.4 | 186.6 | 224.5 | 47.9
jasper | 64 | 406.4 | 365.7 | 645.5 | 695.1 | 806.5 | 63.6
jasper | 96 | 14378 | 13737 | 25542 | 27829 | 32182 | 72.8
jasper | 128 | 28826 | 28125 | 53029 | 56965 | 63537 | 66.2
800ms chunk¶
Acoustic model | # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX)
---|---|---|---|---|---|---|---
quartznet | 1 | 14.4 | 13.6 | 20.4 | 20.6 | 20.7 | 1.0
quartznet | 64 | 79.3 | 77.2 | 111.3 | 120.2 | 130.1 | 63.8
quartznet | 128 | 135.1 | 128.9 | 195.7 | 204.9 | 219.0 | 127.4
quartznet | 256 | 222.2 | 218.7 | 315.2 | 339.2 | 362.2 | 254.3
quartznet | 384 | 310.9 | 304.9 | 443.8 | 479.9 | 520.5 | 380.3
quartznet | 512 | 385.2 | 374.5 | 569.0 | 589.6 | 626.2 | 505.4
quartznet | 768 | 574.5 | 527.0 | 937.3 | 1226.6 | 1347.8 | 751.9
quartznet | 1024 | 1088.1 | 946.2 | 1752.3 | 2116.6 | 2544.2 | 981.6
jasper | 1 | 26.8 | 25.9 | 32.8 | 35.3 | 56.6 | 1.0
jasper | 64 | 138.3 | 134.0 | 170.8 | 181.5 | 203.3 | 63.8
jasper | 128 | 239.4 | 234.9 | 294.9 | 310.2 | 342.8 | 127.2
jasper | 256 | 416.0 | 416.8 | 509.2 | 556.0 | 588.2 | 253.3
jasper | 384 | 613.6 | 597.9 | 766.6 | 919.4 | 1271.1 | 378.0
jasper | 512 | 969.7 | 858.2 | 1503.9 | 1860.3 | 2297.8 | 499.7
jasper | 768 | 9170.1 | 9241.0 | 15868.0 | 16618.0 | 18224.0 | 591.1
jasper | 1024 | 22837.0 | 23248.0 | 37553.0 | 40249.0 | 42696.0 | 579.8
3200ms chunk¶
Acoustic model | # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX)
---|---|---|---|---|---|---|---
quartznet | 1 | 32.933 | 35.423 | 37.712 | 38.012 | 38.012 | 0.9994
quartznet | 256 | 461.44 | 488.88 | 630.67 | 653.84 | 684.75 | 253.1
quartznet | 512 | 784.73 | 843.69 | 1069.8 | 1105.7 | 1154.2 | 501.66
quartznet | 768 | 1121.6 | 1114.7 | 1601.7 | 1971.7 | 2138.5 | 747.45
quartznet | 1024 | 1551.5 | 1592.9 | 2258.9 | 2463.8 | 2608.1 | 985.6
quartznet | 1280 | 1982.2 | 2080.8 | 2910.2 | 3062.1 | 3279.6 | 1211.7
quartznet | 1512 | 2305.8 | 2241.4 | 3625.4 | 4190.5 | 4989.9 | 1413.3
jasper | 1 | 48.351 | 49.407 | 51.954 | 79.174 | 79.174 | 0.99919
jasper | 256 | 734.99 | 751.2 | 897.03 | 916.36 | 941.26 | 252.12
jasper | 512 | 1423.3 | 1384.4 | 2263.9 | 2387.1 | 2477.4 | 497.69
jasper | 768 | 2190.2 | 2133.8 | 3255.7 | 3393 | 3482.7 | 730.15
jasper | 1024 | 3576.3 | 2847.7 | 5861.6 | 6062.2 | 6748.6 | 951.97
jasper | 1280 | 13698 | 12101 | 28644 | 32940 | 35311 | 1001.1
jasper | 1512 | 19705 | 16730 | 40679 | 43397 | 46270 | 1014.6
NVIDIA T4¶
100ms chunk¶
Acoustic model | # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX)
---|---|---|---|---|---|---|---
quartznet | 1 | 19.2 | 18.4 | 21.6 | 23.0 | 38.4 | 1.0
quartznet | 8 | 36.0 | 34.4 | 41.4 | 45.9 | 82.7 | 8.0
quartznet | 16 | 56.4 | 54.8 | 66.0 | 70.6 | 113.9 | 16.0
quartznet | 32 | 70.9 | 71.0 | 82.4 | 93.7 | 160.0 | 31.9
quartznet | 48 | 99.0 | 96.5 | 128.0 | 152.7 | 210.8 | 47.8
quartznet | 64 | 242.4 | 224.1 | 354.0 | 407.2 | 479.6 | 63.7
quartznet | 96 | 24151.0 | 22486.0 | 42624.0 | 47420.0 | 50429.0 | 58.7
quartznet | 128 | 43821.0 | 44736.0 | 77326.0 | 81324.0 | 87343.0 | 53.7
jasper | 1 | 46.9 | 46.9 | 49.6 | 52.7 | 65.7 | 1.0
jasper | 8 | 51.1 | 51.7 | 58.6 | 66.0 | 95.9 | 8.0
jasper | 16 | 84.4 | 81.7 | 97.3 | 104.1 | 187.7 | 16.0
jasper | 32 | 2328.1 | 2017.9 | 4183.5 | 5180.6 | 7012.1 | 31.6
jasper | 48 | 16858.0 | 14761.0 | 32993.0 | 35911.0 | 38084.0 | 35.1
jasper | 64 | 25504.0 | 22164.0 | 47484.0 | 51189.0 | 55003.0 | 37.0
jasper | 96 | 38857.0 | 41576.0 | 59410.0 | 63763.0 | 69797.0 | 38.2
jasper | 128 | 55384.0 | 57791.0 | 89744.0 | 94712.0 | 98622.0 | 38.7
800ms chunk¶
Acoustic model | # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX)
---|---|---|---|---|---|---|---
quartznet | 1 | 33.183 | 33.444 | 44.144 | 44.813 | 46.354 | 0.99914
quartznet | 64 | 162.63 | 162.72 | 214.48 | 226.93 | 253.69 | 63.725
quartznet | 128 | 263.6 | 263.68 | 334.96 | 353.4 | 375.9 | 127.11
quartznet | 256 | 449.28 | 447.25 | 559.87 | 591.62 | 644.3 | 252.7
quartznet | 384 | 732.75 | 682.62 | 986.42 | 1360.7 | 1539.3 | 375.95
quartznet | 512 | 2037.5 | 2001.9 | 3136.3 | 3815.6 | 4684.4 | 487.93
quartznet | 768 | 15721 | 15724 | 27569 | 28450 | 29961 | 493.95
quartznet | 1024 | 29223 | 29487 | 49967 | 51824 | 53910 | 494.05
jasper | 1 | 72.377 | 72.143 | 82.132 | 89.374 | 90.067 | 0.99848
jasper | 64 | 259.64 | 262.21 | 298.47 | 311.66 | 331.8 | 63.62
jasper | 128 | 450.81 | 452.22 | 529.64 | 547.49 | 584.69 | 126.62
jasper | 256 | 1200.8 | 978.29 | 1809.4 | 2446.7 | 3595.1 | 249.24
jasper | 384 | 11679 | 11833 | 19190 | 20312 | 22493 | 279.91
jasper | 512 | 23750 | 23537 | 39610 | 41101 | 43670 | 280.41
jasper | 768 | 46165 | 49046 | 74417 | 79363 | 83407 | 279.8
jasper | 1024 | 67973 | 69939 | 114000 | 121000 | 126000 | 280.61
3200ms chunk¶
Acoustic model | # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX)
---|---|---|---|---|---|---|---
quartznet | 1 | 157.62 | 160.64 | 168.29 | 168.31 | 168.31 | 0.99726
quartznet | 256 | 906.17 | 915.19 | 1098.4 | 1130.8 | 1163.2 | 251.35
quartznet | 512 | 1515.2 | 1491.2 | 2244.4 | 2429.9 | 2540.8 | 494.82
quartznet | 768 | 2398.4 | 2216.6 | 3447 | 3586.4 | 3909.8 | 722.55
quartznet | 1024 | 4636.2 | 4727.7 | 7782.6 | 8737.9 | 8969.3 | 926.66
quartznet | 1280 | 17263 | 15966 | 36103 | 40196 | 44408 | 872.88
quartznet | 1512 | 25038 | 24528 | 49704 | 56065 | 60136 | 875.68
jasper | 1 | 96.201 | 100.64 | 104.75 | 104.82 | 104.82 | 0.99831
jasper | 256 | 1758.4 | 1668.3 | 2718.5 | 2764.3 | 2811.6 | 247.1
jasper | 512 | 11593 | 9623.5 | 25483 | 28937 | 30681 | 432.78
jasper | 768 | 28073 | 27499 | 55288 | 57262 | 63169 | 440.06
jasper | 1024 | 44405 | 44756 | 83588 | 86835 | 92653 | 445.39
jasper | 1280 | 61336 | 65536 | 114000 | 117000 | 126000 | 446.78
jasper | 1512 | 76306 | 83556 | 140000 | 145000 | 153000 | 447.83
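One way to read the ASR results is to look for the largest stream count that stays within a latency budget, and for the point where throughput stops tracking the number of streams. The sketch below does both for the quartznet rows of the NVIDIA T4 100ms chunk results. The function names, the 250 ms budget, and the 0.95 ratio threshold are illustrative assumptions, as is the reading that an RTFX well below the stream count means the server is no longer keeping pace with the simulated real-time audio.

```python
# Rows copied from the NVIDIA T4, 100ms chunk, quartznet results:
# (number of streams, p99 latency in ms, throughput in RTFX).
T4_QUARTZNET_100MS = [
    (1, 38.4, 1.0), (8, 82.7, 8.0), (16, 113.9, 16.0), (32, 160.0, 31.9),
    (48, 210.8, 47.8), (64, 479.6, 63.7), (96, 50429.0, 58.7), (128, 87343.0, 53.7),
]

def max_streams_within_budget(rows, p99_budget_ms):
    """Largest measured stream count whose p99 latency is within the budget, else None."""
    eligible = [streams for streams, p99, _ in rows if p99 <= p99_budget_ms]
    return max(eligible) if eligible else None

def saturated_stream_counts(rows, min_ratio=0.95):
    """Stream counts where RTFX falls below min_ratio times the number of streams."""
    return [streams for streams, _, rtfx in rows if rtfx / streams < min_ratio]

print(max_streams_within_budget(T4_QUARTZNET_100MS, 250.0))  # 48
print(saturated_stream_counts(T4_QUARTZNET_100MS))           # [96, 128]
```

For these rows, 48 streams is the largest measured configuration with p99 latency under 250 ms, and the 96 and 128 stream configurations are the ones where throughput drops well below the stream count while latency grows to tens of seconds.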
NLP¶
Performance was measured for the Riva named entity recognition (NER) service (using a BERT-base model with a sequence length of 128) and the Riva question answering (QA) service (using a BERT-large model with a sequence length of 384). Both batch size 1 latency and maximum throughput are reported.
NVIDIA A100 GPU¶
Task | # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (seq/s)
---|---|---|---|---|---|---|---
NER | 1 | 3.19 | 3.15 | 3.3 | 3.44 | 3.88 | 311.1
NER | 256 | 95.5 | 96.1 | 108 | 113 | 118 | 2548.8
Q&A | 1 | 4.95 | 4.83 | 5.25 | 5.36 | 5.77 | 201.2
Q&A | 128 | 279 | 290 | 294 | 308 | 321 | 453.1
NVIDIA V100 GPU¶
Task | # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (seq/s)
---|---|---|---|---|---|---|---
NER | 1 | 4.87 | 4.84 | 5.07 | 5.11 | 5.29 | 204.2
NER | 256 | 135 | 135 | 154 | 160 | 164 | 1796.8
Q&A | 1 | 7.47 | 7.44 | 7.58 | 7.62 | 7.78 | 133.5
Q&A | 128 | 521 | 541 | 543 | 544 | 626 | 243.8
NVIDIA T4¶
Task | # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (seq/s)
---|---|---|---|---|---|---|---
NER | 1 | 9.31 | 9.19 | 9.94 | 10.2 | 11.1 | 106.7
NER | 256 | 255 | 265 | 282 | 285 | 289 | 960.2
Q&A | 1 | 11.5 | 11.3 | 11.4 | 11.4 | 11.5 | 86.9
Q&A | 128 | 571 | 582 | 672 | 684 | 768 | 223.1
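The single-stream NLP rows can be sanity-checked with simple arithmetic: if one stream issues requests back to back, throughput should be roughly the reciprocal of the average latency. The sketch below applies that estimate to the single-stream figures from the three NLP tables. The back-to-back assumption and the helper name `sequential_throughput` are not taken from the benchmark description, so treat this as a consistency check rather than a statement of how throughput was measured.

```python
# (GPU, task, single-stream avg latency in ms, reported throughput in seq/s),
# copied from the single-stream rows of the NLP tables.
SINGLE_STREAM = [
    ("A100", "NER", 3.19, 311.1), ("A100", "Q&A", 4.95, 201.2),
    ("V100", "NER", 4.87, 204.2), ("V100", "Q&A", 7.47, 133.5),
    ("T4", "NER", 9.31, 106.7), ("T4", "Q&A", 11.5, 86.9),
]

def sequential_throughput(avg_latency_ms):
    """Sequences per second if requests run one after another with no overlap."""
    return 1000.0 / avg_latency_ms

for gpu, task, latency_ms, reported in SINGLE_STREAM:
    estimate = sequential_throughput(latency_ms)
    print(f"{gpu:5} {task:4} estimated {estimate:6.1f} seq/s, reported {reported:6.1f} seq/s")
```

For all six rows the estimate lands within 1% of the reported throughput, which is consistent with the single-stream case being latency-bound. The multi-stream rows do not follow this relation, since requests from many streams are processed concurrently.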
TTS¶
Performance of the Riva text-to-speech (TTS) service was measured for different numbers of parallel streams. Each parallel stream performed 10 iterations over 10 input strings from the LJSpeech dataset. Latency to first audio chunk, latency between successive audio chunks, and throughput were measured.
NVIDIA A100 GPU¶
# of streams | First audio avg (s) | First audio p90 (s) | First audio p95 (s) | First audio p99 (s) | Between chunks avg (s) | Between chunks p90 (s) | Between chunks p95 (s) | Between chunks p99 (s) | Throughput (RTFX)
---|---|---|---|---|---|---|---|---|---
1 | 0.06 | 0.06 | 0.06 | 0.06 | 0.04 | 0.04 | 0.04 | 0.04 | 19.5
4 | 0.48 | 0.67 | 0.71 | 0.78 | 0.03 | 0.05 | 0.06 | 0.11 | 37.0
6 | 0.69 | 0.89 | 0.94 | 1.06 | 0.03 | 0.05 | 0.07 | 0.10 | 41.8
8 | 0.88 | 1.10 | 1.15 | 1.25 | 0.03 | 0.06 | 0.07 | 0.10 | 45.8
10 | 1.06 | 1.21 | 1.26 | 1.43 | 0.03 | 0.06 | 0.08 | 0.09 | 48.7
NVIDIA V100 GPU¶
# of streams | First audio avg (s) | First audio p90 (s) | First audio p95 (s) | First audio p99 (s) | Between chunks avg (s) | Between chunks p90 (s) | Between chunks p95 (s) | Between chunks p99 (s) | Throughput (RTFX)
---|---|---|---|---|---|---|---|---|---
1 | 0.08 | 0.08 | 0.08 | 0.25 | 0.05 | 0.06 | 0.06 | 0.06 | 14.31
4 | 0.77 | 0.98 | 1.07 | 1.19 | 0.05 | 0.07 | 0.08 | 0.13 | 23.3
6 | 1.11 | 1.47 | 1.56 | 1.71 | 0.05 | 0.09 | 0.11 | 0.17 | 25.55
8 | 1.4 | 1.81 | 1.9 | 2.06 | 0.06 | 0.1 | 0.12 | 0.17 | 28.09
10 | 1.74 | 2.37 | 2.52 | 2.78 | 0.07 | 0.12 | 0.14 | 0.17 | 27.75
NVIDIA T4¶
# of streams | First audio avg (s) | First audio p90 (s) | First audio p95 (s) | First audio p99 (s) | Between chunks avg (s) | Between chunks p90 (s) | Between chunks p95 (s) | Between chunks p99 (s) | Throughput (RTFX)
---|---|---|---|---|---|---|---|---|---
1 | 0.12 | 0.12 | 0.12 | 0.12 | 0.07 | 0.07 | 0.07 | 0.07 | 11.17
4 | 1.02 | 1.37 | 1.43 | 1.52 | 0.07 | 0.11 | 0.13 | 0.19 | 17.14
6 | 1.59 | 2.05 | 2.15 | 2.32 | 0.07 | 0.12 | 0.15 | 0.25 | 18.16
8 | 2.13 | 2.59 | 2.71 | 2.88 | 0.08 | 0.14 | 0.18 | 0.26 | 18.83
10 | 2.55 | 3.42 | 3.65 | 4.03 | 0.1 | 0.2 | 0.24 | 0.34 | 18.37
When the server is under high load, requests might time out, because the server does not start inference for a new request until a previous request has been completely generated and its inference slot has been freed. This is done to maximize throughput for the TTS service and allow for real-time interaction. NVIDIA does not recommend making more than 8-10 simultaneous requests with the models provided in Riva 1.0.0 beta.
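A client application can respect this recommendation by capping the number of TTS requests it has in flight at any time. The sketch below shows one way to do that with an `asyncio.Semaphore`. The `fake_synthesize` coroutine is a hypothetical stand-in for a real TTS call and only sleeps briefly; the limit of 8 is taken from the lower end of the recommended range.

```python
import asyncio

MAX_CONCURRENT_TTS_REQUESTS = 8  # lower end of the recommended 8-10 range

async def fake_synthesize(text, tracker):
    """Hypothetical stand-in for a TTS request; records how many calls overlap."""
    tracker["active"] += 1
    tracker["peak"] = max(tracker["peak"], tracker["active"])
    await asyncio.sleep(0.01)  # simulated synthesis time
    tracker["active"] -= 1
    return f"audio for {text!r}"

async def synthesize_all(texts, limit=MAX_CONCURRENT_TTS_REQUESTS):
    """Synthesize all texts while keeping at most `limit` requests in flight."""
    semaphore = asyncio.Semaphore(limit)
    tracker = {"active": 0, "peak": 0}

    async def bounded(text):
        # Wait for a free slot before sending the request.
        async with semaphore:
            return await fake_synthesize(text, tracker)

    results = await asyncio.gather(*(bounded(text) for text in texts))
    return results, tracker["peak"]

results, peak = asyncio.run(synthesize_all([f"sentence {i}" for i in range(20)]))
print(len(results), "requests completed; peak concurrency:", peak)
```

Requests beyond the limit wait on the client side instead of queuing on the server, which avoids the timeouts described above at the cost of a longer wait before each queued request is sent.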