Performance

The tables below report measured performance of the Riva ASR, NLP, and TTS services on NVIDIA T4, NVIDIA V100 SXM2 16GB, and NVIDIA A100 SXM4 40GB GPUs. CPU specifications for each system can be found here:

ASR

The latency numbers below were measured in streaming recognition mode with the BERT-based punctuation model enabled, a 4-gram language model, a decoder beam width of 128, and timestamps enabled. The acoustic model used was Jasper 15x5. The client and the server used audio chunks of the same duration (100ms, 800ms, or 3200ms, depending on the server configuration). The Riva streaming client riva_streaming_asr_client, provided in the Riva client image, was run with the --simulate_realtime flag to simulate transcription from a microphone. Each stream performed 5 iterations over a sample audio file from the LibriSpeech dataset (1272-135031-0000.wav). The command used was:

riva_streaming_asr_client  \
     --chunk_duration_ms=<chunk_duration> --simulate_realtime=true \
     --automatic_punctuation=true --num_parallel_requests=<num_streams> \
     --word_time_offsets=true --print_transcripts=false \
     --interim_results=false --num_iterations=<5*num_streams> \
     --audio_file=1272-135031-0000.wav --output_filename=/tmp/output.json
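When sweeping several stream counts, the placeholders in the command above can be filled in programmatically. The following is a minimal sketch: the helper name `build_asr_benchmark_cmd` is hypothetical, the flags are exactly those shown in the command, and `--num_iterations` is set to five times the stream count, as in the command.

```python
import shlex


def build_asr_benchmark_cmd(chunk_duration_ms, num_streams,
                            audio_file="1272-135031-0000.wav",
                            output_filename="/tmp/output.json"):
    """Return the benchmark command as an argument list (hypothetical helper)."""
    return [
        "riva_streaming_asr_client",
        f"--chunk_duration_ms={chunk_duration_ms}",
        "--simulate_realtime=true",
        "--automatic_punctuation=true",
        f"--num_parallel_requests={num_streams}",
        "--word_time_offsets=true",
        "--print_transcripts=false",
        "--interim_results=false",
        # Each stream runs 5 iterations, so the total is 5 * num_streams.
        f"--num_iterations={5 * num_streams}",
        f"--audio_file={audio_file}",
        f"--output_filename={output_filename}",
    ]


if __name__ == "__main__":
    # Print one command line per stream count for a 100ms chunk sweep.
    for num_streams in (1, 8, 16):
        print(shlex.join(build_asr_benchmark_cmd(100, num_streams)))
```

The argument list can be passed to `subprocess.run` inside the Riva client image; the sketch only prints the command lines so that it runs without the client installed.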

After the benchmark finishes, riva_streaming_asr_client reports latency measured in three ways:

intermediate latency: latency to return an intermediate transcript with is_final == false

final latency: latency of messages returned with is_final == true

latency: the overall latency of all returned message types
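The three groupings above, together with the avg/p50/p90/p95/p99 columns used in the tables, can be sketched as follows. This is an illustration, not the client's implementation: the `(latency_ms, is_final)` input layout and the function names are hypothetical, and the nearest-rank percentile is only one of several common conventions, so results may differ slightly from what the client reports.

```python
import math


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending, non-empty list."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(latencies_ms):
    """Return avg/p50/p90/p95/p99 for a list of latencies, or None if empty."""
    if not latencies_ms:
        return None
    values = sorted(latencies_ms)
    summary = {"avg": sum(values) / len(values)}
    for pct in (50, 90, 95, 99):
        summary[f"p{pct}"] = percentile(values, pct)
    return summary


def summarize_by_type(messages):
    """messages: list of (latency_ms, is_final) pairs (hypothetical layout)."""
    return {
        "intermediate": summarize([lat for lat, final in messages if not final]),
        "final": summarize([lat for lat, final in messages if final]),
        "latency": summarize([lat for lat, _ in messages]),
    }
```

With `--interim_results=false`, as in the command above, no intermediate transcripts are requested, so the sketch returns `None` for the intermediate group when it is given only final messages.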

NVIDIA A100 GPU

100ms chunk

Acoustic model	# of streams	Latency (ms)					Throughput (RTFX)
		avg	p50	p90	p95	p99
quartznet	1	9.6	9.3	10.4	11.4	16.7	1
quartznet	8	15.3	15.1	17.7	19.1	30.8	8
quartznet	16	25.9	25.8	30.1	33.4	48.7	16
quartznet	32	40.8	41.5	47.4	50.1	68.5	32
quartznet	48	54.4	53.8	64.2	67.9	90.2	47.9
quartznet	64	63.3	64.2	80.5	84.8	107.4	63.8
quartznet	96	86.2	93.4	108.5	115.6	160.7	95.7
quartznet	128	132.4	135.9	176	185.5	212.6	127.5
jasper	1	13.4	13.1	14.3	15.2	20.5	1
jasper	8	17.8	17.6	20.5	22.3	34.3	8
jasper	16	26.3	24.3	34.8	36.6	47	16
jasper	32	49.9	49.6	57.4	61.8	81.1	31.9
jasper	48	60.8	61	72.3	75.5	87.6	47.9
jasper	64	72.3	75.9	87.8	90.9	118.1	63.9
jasper	96	114.5	117.7	155.3	173.1	190.4	95.7
jasper	128	258.9	240	338.2	353.2	385	127.4

800ms chunk

Acoustic model	# of streams	Latency (ms)					Throughput (RTFX)
		avg	p50	p90	p95	p99
quartznet	1	14.4	14	18.1	18.5	19.2	1
quartznet	64	82.8	81.4	109	114.8	124.3	63.9
quartznet	128	143.4	148.4	187.6	199.4	211.5	127.5
quartznet	256	228.9	238.4	322.9	339.9	364.8	254.3
quartznet	384	298.4	313	406.2	444	471.3	380.6
quartznet	512	351.2	359.2	482.7	513.5	550.2	506.4
quartznet	768	467.3	472.9	645.6	684.8	732.1	757.2
quartznet	1024	630.8	607.2	961.1	1115.1	1318.1	1005.3
jasper	1	17.6	16.8	21.6	23.8	26.8	1
jasper	64	92.8	92.3	118.3	125.9	145.4	63.8
jasper	128	156.8	160.9	205.7	223.7	243.1	127.5
jasper	256	244.9	254.1	324.8	356.2	378.1	254.1
jasper	384	311.1	315.7	411.7	435.9	474.4	380.7
jasper	512	381	387.2	510.8	537.8	614.4	506.6
jasper	768	512.6	510.3	689.4	734.8	1110.5	757
jasper	1024	749.3	696.7	1228.9	1430.7	1579	1004

3200ms chunk

Acoustic model	# of streams	Latency (ms)					Throughput (RTFX)
		avg	p50	p90	p95	p99
quartznet	1	28.1	28.8	32.4	32.5	32.5	1
quartznet	256	356.7	397.7	478.3	493.1	518.9	253.8
quartznet	512	566.5	591.5	780.1	803.4	841.8	505.2
quartznet	768	729.1	721.9	990.8	1030.3	1074.4	753.4
quartznet	1024	899.3	937.7	1226	1315.2	1514	1000.1
quartznet	1280	1052.1	1037.9	1537.7	1793.6	2100	1244.9
quartznet	1512	1303.8	1301.7	1847.9	2149.6	2464.6	1460.2
jasper	1	31	33.4	35	35.3	35.3	1
jasper	256	422.1	451.1	548.4	568.1	583.5	253.6
jasper	512	667.5	697.5	864.8	890.7	926.3	504.1
jasper	768	865.4	898.6	1106.3	1143.5	1225.6	752.3
jasper	1024	1089	1083.8	1480.4	1617.3	2038.3	997.2
jasper	1280	1382.5	1386.3	2041.7	2380.1	2559.1	1237.2
jasper	1512	1753.8	1735	2629.3	2779.8	2970.5	1448.8

NVIDIA V100 GPU

100ms chunk

Acoustic model	# of streams	Latency (ms)					Throughput (RTFX)
		avg	p50	p90	p95	p99
quartznet	1	8.8	8.3	9.7	11.3	21.7	1
quartznet	8	15	14	17	20.2	43	8
quartznet	16	22.4	21.4	25.8	27.6	57.6	16
quartznet	32	36.1	36.2	41.8	44.4	72.9	31.9
quartznet	48	44.6	44.8	53	55.7	85.4	47.9
quartznet	64	54.9	55.1	67	73.1	102.5	63.8
quartznet	96	81.2	84.3	99.2	111.8	179.2	95.7
quartznet	128	114.7	109.3	157.3	181.5	228.2	127.4
jasper	1	21.5	21	22.2	24	31.2	1
jasper	8	27.6	26.5	29.7	34.7	53.4	8
jasper	16	36.9	34	49	51.3	58.8	16
jasper	32	74.5	72.5	88.1	91.6	126.3	31.9
jasper	48	117.5	101.1	175.4	186.6	224.5	47.9
jasper	64	406.4	365.7	645.5	695.1	806.5	63.6
jasper	96	14378	13737	25542	27829	32182	72.8
jasper	128	28826	28125	53029	56965	63537	66.2

800ms chunk

Acoustic model	# of streams	Latency (ms)					Throughput (RTFX)
		avg	p50	p90	p95	p99
quartznet	1	14.4	13.6	20.4	20.6	20.7	1.0
quartznet	64	79.3	77.2	111.3	120.2	130.1	63.8
quartznet	128	135.1	128.9	195.7	204.9	219.0	127.4
quartznet	256	222.2	218.7	315.2	339.2	362.2	254.3
quartznet	384	310.9	304.9	443.8	479.9	520.5	380.3
quartznet	512	385.2	374.5	569.0	589.6	626.2	505.4
quartznet	768	574.5	527.0	937.3	1226.6	1347.8	751.9
quartznet	1024	1088.1	946.2	1752.3	2116.6	2544.2	981.6
jasper	1	26.8	25.9	32.8	35.3	56.6	1.0
jasper	64	138.3	134.0	170.8	181.5	203.3	63.8
jasper	128	239.4	234.9	294.9	310.2	342.8	127.2
jasper	256	416.0	416.8	509.2	556.0	588.2	253.3
jasper	384	613.6	597.9	766.6	919.4	1271.1	378.0
jasper	512	969.7	858.2	1503.9	1860.3	2297.8	499.7
jasper	768	9170.1	9241.0	15868.0	16618.0	18224.0	591.1
jasper	1024	22837.0	23248.0	37553.0	40249.0	42696.0	579.8

3200ms chunk

Acoustic model	# of streams	Latency (ms)					Throughput (RTFX)
		avg	p50	p90	p95	p99
quartznet	1	32.933	35.423	37.712	38.012	38.012	0.9994
quartznet	256	461.44	488.88	630.67	653.84	684.75	253.1
quartznet	512	784.73	843.69	1069.8	1105.7	1154.2	501.66
quartznet	768	1121.6	1114.7	1601.7	1971.7	2138.5	747.45
quartznet	1024	1551.5	1592.9	2258.9	2463.8	2608.1	985.6
quartznet	1280	1982.2	2080.8	2910.2	3062.1	3279.6	1211.7
quartznet	1512	2305.8	2241.4	3625.4	4190.5	4989.9	1413.3
jasper	1	48.351	49.407	51.954	79.174	79.174	0.99919
jasper	256	734.99	751.2	897.03	916.36	941.26	252.12
jasper	512	1423.3	1384.4	2263.9	2387.1	2477.4	497.69
jasper	768	2190.2	2133.8	3255.7	3393	3482.7	730.15
jasper	1024	3576.3	2847.7	5861.6	6062.2	6748.6	951.97
jasper	1280	13698	12101	28644	32940	35311	1001.1
jasper	1512	19705	16730	40679	43397	46270	1014.6

NVIDIA T4

100ms chunk

Acoustic model	# of streams	Latency (ms)					Throughput (RTFX)
		avg	p50	p90	p95	p99
quartznet	1	19.2	18.4	21.6	23.0	38.4	1.0
quartznet	8	36.0	34.4	41.4	45.9	82.7	8.0
quartznet	16	56.4	54.8	66.0	70.6	113.9	16.0
quartznet	32	70.9	71.0	82.4	93.7	160.0	31.9
quartznet	48	99.0	96.5	128.0	152.7	210.8	47.8
quartznet	64	242.4	224.1	354.0	407.2	479.6	63.7
quartznet	96	24151.0	22486.0	42624.0	47420.0	50429.0	58.7
quartznet	128	43821.0	44736.0	77326.0	81324.0	87343.0	53.7
jasper	1	46.9	46.9	49.6	52.7	65.7	1.0
jasper	8	51.1	51.7	58.6	66.0	95.9	8.0
jasper	16	84.4	81.7	97.3	104.1	187.7	16.0
jasper	32	2328.1	2017.9	4183.5	5180.6	7012.1	31.6
jasper	48	16858.0	14761.0	32993.0	35911.0	38084.0	35.1
jasper	64	25504.0	22164.0	47484.0	51189.0	55003.0	37.0
jasper	96	38857.0	41576.0	59410.0	63763.0	69797.0	38.2
jasper	128	55384.0	57791.0	89744.0	94712.0	98622.0	38.7

800ms chunk

Acoustic model	# of streams	Latency (ms)					Throughput (RTFX)
		avg	p50	p90	p95	p99
quartznet	1	33.183	33.444	44.144	44.813	46.354	0.99914
quartznet	64	162.63	162.72	214.48	226.93	253.69	63.725
quartznet	128	263.6	263.68	334.96	353.4	375.9	127.11
quartznet	256	449.28	447.25	559.87	591.62	644.3	252.7
quartznet	384	732.75	682.62	986.42	1360.7	1539.3	375.95
quartznet	512	2037.5	2001.9	3136.3	3815.6	4684.4	487.93
quartznet	768	15721	15724	27569	28450	29961	493.95
quartznet	1024	29223	29487	49967	51824	53910	494.05
jasper	1	72.377	72.143	82.132	89.374	90.067	0.99848
jasper	64	259.64	262.21	298.47	311.66	331.8	63.62
jasper	128	450.81	452.22	529.64	547.49	584.69	126.62
jasper	256	1200.8	978.29	1809.4	2446.7	3595.1	249.24
jasper	384	11679	11833	19190	20312	22493	279.91
jasper	512	23750	23537	39610	41101	43670	280.41
jasper	768	46165	49046	74417	79363	83407	279.8
jasper	1024	67973	69939	114000	121000	126000	280.61

3200ms chunk

Acoustic model	# of streams	Latency (ms)					Throughput (RTFX)
		avg	p50	p90	p95	p99
quartznet	1	157.62	160.64	168.29	168.31	168.31	0.99726
quartznet	256	906.17	915.19	1098.4	1130.8	1163.2	251.35
quartznet	512	1515.2	1491.2	2244.4	2429.9	2540.8	494.82
quartznet	768	2398.4	2216.6	3447	3586.4	3909.8	722.55
quartznet	1024	4636.2	4727.7	7782.6	8737.9	8969.3	926.66
quartznet	1280	17263	15966	36103	40196	44408	872.88
quartznet	1512	25038	24528	49704	56065	60136	875.68
jasper	1	96.201	100.64	104.75	104.82	104.82	0.99831
jasper	256	1758.4	1668.3	2718.5	2764.3	2811.6	247.1
jasper	512	11593	9623.5	25483	28937	30681	432.78
jasper	768	28073	27499	55288	57262	63169	440.06
jasper	1024	44405	44756	83588	86835	92653	445.39
jasper	1280	61336	65536	114000	117000	126000	446.78
jasper	1512	76306	83556	140000	145000	153000	447.83
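One way to read the ASR tables, offered here as an interpretation rather than something the benchmark states: RTFX is commonly taken to mean seconds of audio processed per second of wall-clock time, and with --simulate_realtime each stream supplies audio at real-time speed, so throughput should stay close to the number of streams for as long as the server keeps up. Once RTFX falls noticeably below the stream count, latency in the tables grows by an order of magnitude or more. The sketch below finds the largest stream count that still meets a chosen ratio. The function name and the 0.95 cutoff are arbitrary assumptions; the data rows are copied from the jasper, 3200ms chunk, NVIDIA T4 table above.

```python
def max_sustained_streams(rows, min_ratio=0.95):
    """Largest stream count whose RTFX is at least min_ratio * streams.

    rows: iterable of (num_streams, throughput_rtfx) pairs.
    min_ratio: illustrative cutoff, not a value defined by Riva.
    Returns None if no row meets the cutoff.
    """
    sustained = [n for n, rtfx in rows if rtfx >= min_ratio * n]
    return max(sustained) if sustained else None


# (streams, RTFX) for jasper, 3200ms chunk, NVIDIA T4.
T4_JASPER_3200MS = [
    (1, 0.99831), (256, 247.1), (512, 432.78), (768, 440.06),
    (1024, 445.39), (1280, 446.78), (1512, 447.83),
]

if __name__ == "__main__":
    print(max_sustained_streams(T4_JASPER_3200MS))
```

With this cutoff the sketch selects 256 streams, which is the last row before the average latency in that table jumps from 1758.4 ms to 11593 ms.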

NLP

The tables below report performance of the Riva named entity recognition (NER) service (using a BERT-base model with a sequence length of 128) and the Riva question answering (QA) service (using a BERT-large model with a sequence length of 384). Both batch size 1 latency and maximum throughput were measured.

NVIDIA A100 GPU

Task	# of streams	Latency (ms)					Throughput (seq/s)
		avg	p50	p90	p95	p99
NER	1	3.19	3.15	3.3	3.44	3.88	311.1
NER	256	95.5	96.1	108	113	118	2548.8
Q&A	1	4.95	4.83	5.25	5.36	5.77	201.2
Q&A	128	279	290	294	308	321	453.1

NVIDIA V100 GPU

Task	# of streams	Latency (ms)					Throughput (seq/s)
		avg	p50	p90	p95	p99
NER	1	4.87	4.84	5.07	5.11	5.29	204.2
NER	256	135	135	154	160	164	1796.8
Q&A	1	7.47	7.44	7.58	7.62	7.78	133.5
Q&A	128	521	541	543	544	626	243.8

NVIDIA T4

Task	# of streams	Latency (ms)					Throughput (seq/s)
		avg	p50	p90	p95	p99
NER	1	9.31	9.19	9.94	10.2	11.1	106.7
NER	256	255	265	282	285	289	960.2
Q&A	1	11.5	11.3	11.4	11.4	11.5	86.9
Q&A	128	571	582	672	684	768	223.1
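As a sanity check on the NLP tables, latency and throughput can be related by a simple closed-loop estimate: if each stream keeps exactly one request in flight, throughput is roughly the number of streams divided by the average latency (Little's law). That each stream behaves this way is an assumption and is not stated by the benchmark, so the estimate is approximate. The sketch below applies it to the NVIDIA T4 rows above; the function names are hypothetical.

```python
def estimated_throughput(num_streams, avg_latency_ms):
    """Closed-loop estimate of sequences per second (Little's law)."""
    return num_streams * 1000.0 / avg_latency_ms


def relative_error(estimate, reported):
    """Absolute difference as a fraction of the reported value."""
    return abs(estimate - reported) / reported


# (task, streams, avg latency in ms, reported seq/s) from the NVIDIA T4 table.
T4_ROWS = [
    ("NER", 1, 9.31, 106.7),
    ("NER", 256, 255, 960.2),
    ("Q&A", 1, 11.5, 86.9),
    ("Q&A", 128, 571, 223.1),
]

if __name__ == "__main__":
    for task, streams, latency_ms, reported in T4_ROWS:
        estimate = estimated_throughput(streams, latency_ms)
        error = relative_error(estimate, reported)
        print(f"{task} x{streams}: estimated {estimate:.1f} seq/s, "
              f"reported {reported} seq/s, error {error:.1%}")
```

For these four rows the estimate stays within 5% of the reported throughput, which suggests the latency and throughput columns are mutually consistent.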

TTS

Performance of the Riva text-to-speech (TTS) service was measured for different numbers of parallel streams. Each parallel stream performed 10 iterations over 10 input strings from the LJSpeech dataset. Latency to the first audio chunk, latency between successive audio chunks, and throughput were measured.
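The three TTS quantities named above can be illustrated with a small sketch that works from chunk arrival timestamps. How the benchmark client computes them is not described here, so the function names, the timestamp-based inputs, and the reading of RTFX as seconds of generated audio per second of wall-clock time are all assumptions.

```python
def chunk_latencies(request_start_s, chunk_arrival_s):
    """Return (latency to first chunk, list of gaps between successive chunks).

    request_start_s: time the request was sent, in seconds.
    chunk_arrival_s: ascending arrival times of the audio chunks, in seconds.
    """
    if not chunk_arrival_s:
        raise ValueError("at least one audio chunk is required")
    first = chunk_arrival_s[0] - request_start_s
    gaps = [later - earlier
            for earlier, later in zip(chunk_arrival_s, chunk_arrival_s[1:])]
    return first, gaps


def rtfx(audio_seconds, wall_seconds):
    """Seconds of audio generated per second of wall-clock time (assumed meaning)."""
    return audio_seconds / wall_seconds
```

For example, a request sent at t = 10.00 s whose chunks arrive at 10.06 s, 10.10 s, and 10.14 s has a latency to first audio of about 0.06 s and two inter-chunk gaps of about 0.04 s each.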

NVIDIA A100 GPU

# of streams	Latency to first audio (s)				Latency between audio chunks (s)				Throughput (RTFX)
	avg	p90	p95	p99	avg	p90	p95	p99
1	0.06	0.06	0.06	0.06	0.04	0.04	0.04	0.04	19.5
4	0.48	0.67	0.71	0.78	0.03	0.05	0.06	0.11	37.0
6	0.69	0.89	0.94	1.06	0.03	0.05	0.07	0.10	41.8
8	0.88	1.10	1.15	1.25	0.03	0.06	0.07	0.10	45.8
10	1.06	1.21	1.26	1.43	0.03	0.06	0.08	0.09	48.7

NVIDIA V100 GPU

# of streams	Latency to first audio (s)				Latency between audio chunks (s)				Throughput (RTFX)
	avg	p90	p95	p99	avg	p90	p95	p99
1	0.08	0.08	0.08	0.25	0.05	0.06	0.06	0.06	14.31
4	0.77	0.98	1.07	1.19	0.05	0.07	0.08	0.13	23.3
6	1.11	1.47	1.56	1.71	0.05	0.09	0.11	0.17	25.55
8	1.4	1.81	1.9	2.06	0.06	0.1	0.12	0.17	28.09
10	1.74	2.37	2.52	2.78	0.07	0.12	0.14	0.17	27.75

NVIDIA T4

# of streams	Latency to first audio (s)				Latency between audio chunks (s)				Throughput (RTFX)
	avg	p90	p95	p99	avg	p90	p95	p99
1	0.12	0.12	0.12	0.12	0.07	0.07	0.07	0.07	11.17
4	1.02	1.37	1.43	1.52	0.07	0.11	0.13	0.19	17.14
6	1.59	2.05	2.15	2.32	0.07	0.12	0.15	0.25	18.16
8	2.13	2.59	2.71	2.88	0.08	0.14	0.18	0.26	18.83
10	2.55	3.42	3.65	4.03	0.1	0.2	0.24	0.34	18.37

When the server is under high load, requests might time out, because the server does not start inference for a new request until a previous request has been completely generated and its inference slot has been freed. This is done to maximize throughput for the TTS service and allow for real-time interaction. NVIDIA does not recommend making more than 8-10 simultaneous requests with the models provided in Riva 1.0.0 beta.
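A client can stay within this recommendation by capping the number of requests it has in flight. The following is a minimal sketch using a bounded thread pool. `synthesize` is a hypothetical callable that wraps whatever TTS request the client actually makes, and the default cap of 8 is simply the lower end of the range above.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_IN_FLIGHT = 8  # lower end of the recommended 8-10 range


def run_capped(texts, synthesize, max_in_flight=MAX_IN_FLIGHT):
    """Call synthesize(text) for every text, never exceeding max_in_flight
    concurrent calls. Results are returned in the order of the inputs.

    synthesize: hypothetical callable that sends one TTS request.
    """
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        return list(pool.map(synthesize, texts))
```

The pool size bounds concurrency on the client side only. If several clients share one server, the cap would need to be divided among them.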