Performance

Evaluation Process

This section shows the latency and throughput numbers for streaming and offline configurations of the Riva ASR service on different GPUs.

In streaming mode, the client and the server used audio chunks of the same duration. The chunk size used for each configuration is listed in the Results section.

The Riva streaming client riva_streaming_asr_client, provided in the Riva image, was run with the --simulate_realtime flag to simulate transcription from a microphone. Each stream ran three iterations over a sample audio file (1272-135031-0000.wav) from the LibriSpeech dev-clean dataset.

You can get the source code for the riva_streaming_asr_client at Riva C++ Clients.

The following command was used to measure performance:

riva_streaming_asr_client \
   --chunk_duration_ms=<chunk_duration> \
   --simulate_realtime=true \
   --automatic_punctuation=true \
   --num_parallel_requests=<num_streams> \
   --word_time_offsets=false \
   --print_transcripts=false \
   --interim_results=false \
   --num_iterations=<3*num_streams> \
   --audio_file=1272-135031-0000.wav \
   --output_filename=/tmp/output.json
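
The relationship between the placeholders in this command can be made explicit with a small helper. The following Python sketch only assembles the argument list shown above for a given number of streams. The function name build_streaming_command is hypothetical, and the flags are copied from the command as documented here rather than from the client's full option list.

```python
import shlex


def build_streaming_command(num_streams, chunk_duration_ms,
                            audio_file="1272-135031-0000.wav",
                            output_filename="/tmp/output.json"):
    """Assemble the streaming benchmark command for num_streams parallel streams."""
    # Each stream iterates over the audio file three times, so the total
    # number of iterations is 3 * num_streams.
    num_iterations = 3 * num_streams
    return [
        "riva_streaming_asr_client",
        f"--chunk_duration_ms={chunk_duration_ms}",
        "--simulate_realtime=true",
        "--automatic_punctuation=true",
        f"--num_parallel_requests={num_streams}",
        "--word_time_offsets=false",
        "--print_transcripts=false",
        "--interim_results=false",
        f"--num_iterations={num_iterations}",
        f"--audio_file={audio_file}",
        f"--output_filename={output_filename}",
    ]


if __name__ == "__main__":
    # Example: 64 streams with a 160 ms chunk size.
    print(shlex.join(build_streaming_command(64, 160)))
```

The returned list can be passed to subprocess.run inside an environment where the client binary is available, for example the Riva image.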

The riva_streaming_asr_client command returns the following latency measurements:

intermediate latency: latency of responses returned with is_final == false
final latency: latency of responses returned with is_final == true
latency: the overall latency of all returned responses. This is the value reported in the tables in the Results section.

The following diagram is a schematic representation of the different latencies measured by the Riva streaming ASR client.

[Figure: Schematic Diagram of Latencies Measured by Riva Streaming ASR Client]
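
The tables in the Results section summarize latency as an average plus the p50, p90, p95, and p99 percentiles. As a minimal sketch of how such a summary could be derived from per-response latencies, the following Python code uses the nearest-rank percentile method. The percentile method actually used by the client is not stated here, so the choice of nearest rank, along with the function names, is an assumption.

```python
def percentile(sorted_values, p):
    """Nearest-rank percentile of an ascending list, for an integer 0 < p <= 100."""
    n = len(sorted_values)
    # Integer ceiling of p * n / 100, clamped so the rank is at least 1.
    rank = max(1, (p * n + 99) // 100)
    return sorted_values[rank - 1]


def summarize_latencies(latencies_ms):
    """Return avg, p50, p90, p95, and p99 for a list of latencies in ms."""
    if not latencies_ms:
        raise ValueError("latencies_ms must not be empty")
    ordered = sorted(latencies_ms)
    summary = {"avg": sum(ordered) / len(ordered)}
    for p in (50, 90, 95, 99):
        summary[f"p{p}"] = percentile(ordered, p)
    return summary


if __name__ == "__main__":
    # Illustrative latencies only, not measurements from this page.
    print(summarize_latencies([12.1, 12.7, 13.0, 13.4, 14.2, 40.0]))
```

A single slow response dominates the high percentiles when there are few samples, which is one reason p99 can sit well above p95 in a table.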

The following command was used to measure maximum throughput in offline mode:

riva_asr_client \
   --automatic_punctuation=true \
   --num_parallel_requests=32 \
   --word_time_offsets=false \
   --print_transcripts=false \
   --num_iterations=96 \
   --audio_file=1272-135031-0000x5.wav \
   --output_filename=/tmp/output.json

Here, 1272-135031-0000x5.wav is the 1272-135031-0000.wav audio file concatenated five times. You can get the source code for the riva_asr_client at Riva C++ Clients.
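
One way to produce such a concatenated file is sketched below with Python's standard wave module. This is not necessarily how the benchmark file was created. The function name repeat_wav is hypothetical, and the sketch assumes the input is an uncompressed PCM WAV file.

```python
import wave


def repeat_wav(src_path, dst_path, repeats=5):
    """Write dst_path as the audio of src_path repeated `repeats` times."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        frames = src.readframes(src.getnframes())
    with wave.open(dst_path, "wb") as dst:
        # Copy channels, sample width, and sample rate from the source.
        # The frame count in the header is corrected as frames are written.
        dst.setparams(params)
        for _ in range(repeats):
            dst.writeframes(frames)


if __name__ == "__main__":
    # Assumes the sample file is present in the working directory.
    repeat_wav("1272-135031-0000.wav", "1272-135031-0000x5.wav", repeats=5)
```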

Results

Latency and throughput measurements for streaming and offline configurations are reported in the following tables. Throughput is measured in RTFX, defined as the duration of audio transcribed divided by the computation time.
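
As a worked illustration of the throughput definition, the following Python sketch computes RTFX from a total audio duration and a processing time. The function name rtfx and the example numbers are illustrative assumptions, not values taken from the tables.

```python
def rtfx(audio_seconds, processing_seconds):
    """Return seconds of audio transcribed per second of computation."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


if __name__ == "__main__":
    # Hypothetical run: 96 iterations over an audio file assumed to be
    # 50 s long, completed in 12 s of wall-clock time.
    total_audio_seconds = 96 * 50.0
    print(f"RTFX = {rtfx(total_audio_seconds, 12.0):.0f}")
```

Because the streaming client is run with --simulate_realtime, each stream delivers audio at roughly real-time speed, which is consistent with the streaming throughput values staying close to the number of streams.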

Note

Each value in the tables is the average over three trials, rounded to the last significant digit indicated by the standard deviation across those trials. If the standard deviation is less than 0.001 of the average, the value is rounded as if the standard deviation were 0.001 of the average.
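
The rounding rule in this note can be read in more than one way. The following Python sketch implements one plausible interpretation: the mean is rounded to the decimal position of the leading significant digit of the standard deviation, with the standard deviation floored at 0.001 of the mean. The function name round_by_std, the use of the sample standard deviation, and this reading of the rule are all assumptions.

```python
import math
import statistics


def round_by_std(values, min_rel_std=0.001):
    """Round the mean of values to the leading digit position of their std."""
    mean = statistics.fmean(values)
    # Sample standard deviation (an assumption), floored at a fraction
    # of the mean as described in the note.
    std = max(statistics.stdev(values), min_rel_std * abs(mean))
    if std == 0:
        return mean
    # Decimal exponent of the leading significant digit of the std,
    # for example -1 for 0.2 and 1 for 55.
    exponent = math.floor(math.log10(std))
    return round(mean, -exponent)


if __name__ == "__main__":
    # Illustrative trial values only, not measurements from this page.
    print(round_by_std([63.5, 63.7, 63.9]))
    print(round_by_std([100, 152, 210]))
```

Under this reading, a spread in the tens rounds the reported value to the nearest ten, which would explain table entries such as 2200 sitting next to entries such as 63.7.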

For specifications of the hardware on which these measurements were collected, see the On-Prem Hardware Specifications section.

On-Prem

H100

Parakeet-CTC-0.6B

English (US)

Streaming, low-latency

Chunk size (ms): 160
Language model: n-gram
Maximum effective # of streams with n-gram language model: 270

# of streams	Latency (ms)					Throughput (RTFX)
	avg	p50	p90	p95	p99
1	14	12.7	13.4	13.7	40	0.999
8	15.3	14.3	15.5	20	45.5	7.99
16	21	18	26	28	60	15.97
32	28	28	37	40	90	31.9
48	35	35	46	47.4	100	47.8
64	42	40	54	55	130	63.7

Streaming, high-throughput

Chunk size (ms): 960
Language model: n-gram
Maximum effective # of streams with n-gram language model: 1240

# of streams	Latency (ms)					Throughput (RTFX)
	avg	p50	p90	p95	p99
1	20	13.7	20	30	100	1
64	53	50	64	120	150	63.7
128	77	65	96	200	245	127
256	120	107	156	300	440	252.5
384	162	145	220	440	640	376
512	200	180	276	530	800	499

Offline

Language model: n-gram

Speaker Diarization	# of streams	Throughput (RTFX)
False	1	300
False	32	2200
True	32	170

Parakeet-CTC-1.1B

English (US)

Streaming, low-latency

Chunk size (ms): 160
Language model: n-gram
Maximum effective # of streams with n-gram language model: 160

# of streams	Latency (ms)					Throughput (RTFX)
	avg	p50	p90	p95	p99
1	18	16.7	17.4	20	40	0.999
8	21	19.8	21	30	50.4	7.99
16	29	26	40	41	70	15.96
32	41	45	51	55	110	31.9
48	53	58	71	73	160	47.75
64	65.5	72	83	85	210	63.6

Streaming, high-throughput

Chunk size (ms): 960
Language model: n-gram
Maximum effective # of streams with n-gram language model: 770

# of streams	Latency (ms)					Throughput (RTFX)
	avg	p50	p90	p95	p99
1	20	18.4	20	36	100	0.999
64	80	83	95	140	200	63.7
128	119	100	155	230	296	126.8
256	190	170	270	400	530	252
384	260	245	378	580	800	374.5
512	346	330	496	850	1200	494

Offline

Language model: n-gram

Speaker Diarization	# of streams	Throughput (RTFX)
False	1	200
False	32	2000
True	32	120

Parakeet-RNNT-1.1B

Multilingual

Streaming, low-latency

Chunk size (ms): 320
Language model: n-gram
Maximum effective # of streams with n-gram language model: 5

# of streams	Latency (ms)					Throughput (RTFX)
	avg	p50	p90	p95	p99
1	100	95	103.5	105	200	0.994
8	7600	7000	16000	17000	19000	5.25

Streaming, high-throughput

Chunk size (ms): 1600
Language model: n-gram
Maximum effective # of streams with n-gram language model: 25

# of streams	Latency (ms)					Throughput (RTFX)
	avg	p50	p90	p95	p99
1	100	98	111	112	300	0.997
64	27000	24000	50000	50000	54000	25

Offline

Language model: n-gram

Speaker Diarization	# of streams	Throughput (RTFX)
False	1	80
False	32	130

Conformer-CTC

Spanish (US)

Streaming, low-latency

Chunk size (ms): 160
Language model: n-gram
Maximum effective # of streams with n-gram language model: 355

# of streams	Latency (ms)					Throughput (RTFX)
	avg	p50	p90	p95	p99
1	10	9.9	11.3	12	40	1
8	12.6	12	13.4	17	31	8
16	17	15	22	25	40	15.98
32	23	23	31	33	50	31.94
48	29	28	40	41	70	47.9
64	33.6	38	45	47	70	63.9
128	49	47	64	67	150	127.6
256	84	75	107	126	391	255

Streaming, high-throughput

Chunk size (ms): 800
Language model: n-gram
Maximum effective # of streams with n-gram language model: 1400

# of streams	Latency (ms)					Throughput (RTFX)
	avg	p50	p90	p95	p99
1	14	11	20	40	80	1
64	39	40	55	80	110	63.9
128	58	50	75	150	202	127.6
256	90	80	115	240	380	255
384	120	107	155	316	530	381.4
512	149	130	196	400	700	508
768	258	200	630	680	1280	756
1024	420	263	1280	1350	1900	992

Offline

Language model: n-gram

# of streams	Throughput (RTFX)
32	467

Whisper-Large

Multilingual

Offline

Language model: n-gram

Speaker Diarization	# of streams	Throughput (RTFX)
False	1	90
False	32	370

Canary-1B

Multilingual

Offline

Language model: none

Speaker Diarization	# of streams	Throughput (RTFX)
False	1	11
False	32	77

Canary-0.6B-Turbo

Multilingual

Offline

Language model: none

Speaker Diarization	# of streams	Throughput (RTFX)
False	1	40
False	32	300

A100

Parakeet-CTC-0.6B

English (US)

Streaming, low-latency

Chunk size (ms): 160
Language model: n-gram
Maximum effective # of streams with n-gram language model: 179

# of streams	Latency (ms)					Throughput (RTFX)
	avg	p50	p90	p95	p99
1	16	15.3	16.2	16.3	40	0.999
8	21.6	20.4	22	23	59	7.99
16	28	26.4	30	39	80	15.96
32	41.4	40	53	54	130	31.85
48	49	54	64	66	160	47.7
64	59	67	75	76	216	63.6

Streaming, high-throughput

Chunk size (ms): 960
Language model: n-gram
Maximum effective # of streams with n-gram language model: 810

# of streams	Latency (ms)					Throughput (RTFX)
	avg	p50	p90	p95	p99
1	20	19.6	30	40	100	0.999
64	90	93	110	200	240	63.5
128	115	100	140	260	350	126.6
256	185	163	248	451	630	251
384	254	230	350	630	930	373
512	362	300	730	940	1550	491

Offline

Language model: n-gram

Speaker Diarization	# of streams	Throughput (RTFX)
False	1	300
False	32	2000
True	32	125

Parakeet-CTC-1.1B

English (US)

Streaming, low-latency

Chunk size (ms): 160
Language model: n-gram
Maximum effective # of streams with n-gram language model: 104

# of streams	Latency (ms)					Throughput (RTFX)
	avg	p50	p90	p95	p99
1	24	22.7	23.7	25	50	0.999
8	32.7	31	33	51	72.7	7.98
16	44	40.8	50	63	110	15.94
32	59	60	73	75	180	31.8
48	79	90	93	100	240	47.6
64	100	109	114	160	310	63.4

Streaming, high-throughput

Chunk size (ms): 960
Language model: n-gram
Maximum effective # of streams with n-gram language model: 490

# of streams	Latency (ms)					Throughput (RTFX)
	avg	p50	p90	p95	p99
1	30	29.1	40	50	100	0.999
64	123	130	160	240	260	63.5
128	185	165	240	360	430	126.4
256	300	266	430	630	830	249.4
384	460	445	770	1100	1560	368
512	720	650	1400	1550	2150	483

Offline

Language model: n-gram

Speaker Diarization	# of streams	Throughput (RTFX)
False	1	180
False	32	1330
True	32	75

Conformer-CTC

Spanish (US)

Streaming, low-latency

Chunk size (ms): 160
Language model: n-gram
Maximum effective # of streams with n-gram language model: 233

# of streams	Latency (ms)					Throughput (RTFX)
	avg	p50	p90	p95	p99
1	13	11.8	12.8	14	40	1
8	17.6	16.8	18.5	22	39	8
16	22.5	21.3	25	31	60.3	15.98
32	32.4	35	42	46	70	31.93
48	41	40	58	59	100	47.9
64	46	50	64	66	100	63.8
128	73	66	94	97	220	127.5

Streaming, high-throughput

Chunk size (ms): 800
Language model: n-gram
Maximum effective # of streams with n-gram language model: 980

# of streams	Latency (ms)					Throughput (RTFX)
	avg	p50	p90	p95	p99
1	16	13	20	40	80	1
64	60	60	80	110	180	63.8
128	90	80	110	230	300	127.5
256	133.3	120	174	340	530	254
384	183	166	245	430	800	380
512	260	223	510	600	1200	505
768	535	354	1500	1640	2150	739
1024	940	600	2300	2570	2930	960

Offline

Language model: n-gram

# of streams	Throughput (RTFX)
32	460

Whisper-Large

Multilingual

Offline

Language model: n-gram

Speaker Diarization	# of streams	Throughput (RTFX)
False	1	60
False	32	234

Canary-1B

Multilingual

Offline

Language model: none

Speaker Diarization	# of streams	Throughput (RTFX)
False	1	5.7
False	32	38.75

Canary-0.6B-Turbo

Multilingual

Offline

Language model: none

Speaker Diarization	# of streams	Throughput (RTFX)
False	1	24
False	32	168

L40

Parakeet-CTC-0.6B

English (US)

Streaming, low-latency

Chunk size (ms): 160
Language model: n-gram
Maximum effective # of streams with n-gram language model: 190

# of streams	Latency (ms)					Throughput (RTFX)
	avg	p50	p90	p95	p99
1	19	18.3	19.3	20	43.5	0.999
8	24	23	30	30	65	7.98
16	31.4	29	38.3	42	80	15.96
32	42	42	57	60	100	31.9
48	52	53	69.6	75	130	47.8

Streaming, high-throughput

Chunk size (ms): 960
Language model: n-gram
Maximum effective # of streams with n-gram language model: 900

# of streams	Latency (ms)					Throughput (RTFX)
	avg	p50	p90	p95	p99
1	25	22	30	50	90	0.999
64	90	90	110	160	200	63.6
128	120	100	150	240	330	126.8
256	180	160	240	400	560	251.5

Offline

Language model: n-gram

Speaker Diarization	# of streams	Throughput (RTFX)
False	1	240
False	32	2030
True	32	101.5

Parakeet-CTC-1.1B

English (US)

Streaming, low-latency

Chunk size (ms): 160
Language model: n-gram
Maximum effective # of streams with n-gram language model: 110

# of streams	Latency (ms)					Throughput (RTFX)
	avg	p50	p90	p95	p99
1	25	23	29	30	50	0.999
8	31	29	35.5	46	70	7.98
16	44	40	56	60	100	15.95
32	60	62	76	80	150	31.84
48	80	86	100	112	227	47.7

Streaming, high-throughput

Chunk size (ms): 960
Language model: n-gram
Maximum effective # of streams with n-gram language model: 578

# of streams	Latency (ms)					Throughput (RTFX)
	avg	p50	p90	p95	p99
1	30	27	40	50	100	0.999
64	120	130	150	200	240	63.5
128	170	150	220	310	380	126.5
256	270	250	390	540	700	250.5

Offline

Language model: n-gram

Speaker Diarization	# of streams	Throughput (RTFX)
False	1	180
False	32	1440
True	32	94

Conformer-CTC

Spanish (US)

Streaming, low-latency

Chunk size (ms): 160
Language model: n-gram
Maximum effective # of streams with n-gram language model: 280

# of streams	Latency (ms)					Throughput (RTFX)
	avg	p50	p90	p95	p99
1	11	10.3	11.2	12.4	30	1
8	20	19	26	30	42	7.99
16	28	26	35	40	56	15.97
32	35	35	48	52	73	31.9
64	50	55	66	70	100	63.8

Streaming, high-throughput

Chunk size (ms): 800
Language model: n-gram
Maximum effective # of streams with n-gram language model: 1180

# of streams	Latency (ms)					Throughput (RTFX)
	avg	p50	p90	p95	p99
1	14	11.5	20	30	60	1
64	70	70	90	100	170	63.8
128	88	84	110	190	250	127.4
256	128	117	164	300	460	254.4

Offline

Language model: n-gram

# of streams	Throughput (RTFX)
32	440

Whisper-Large

Multilingual

Offline

Language model: n-gram

Speaker Diarization	# of streams	Throughput (RTFX)
False	1	70
False	32	193.5

Canary-1B

Multilingual

Offline

Language model: none

Speaker Diarization	# of streams	Throughput (RTFX)
False	1	6.2
False	32	43.3

Canary-0.6B-Turbo

Multilingual

Offline

Language model: none

Speaker Diarization	# of streams	Throughput (RTFX)
False	1	28
False	32	192

On-Prem Hardware Specifications

A100

GPU
NVIDIA DGX A100 40GB
CPU
Model	AMD EPYC 7742 64-Core Processor
Thread(s) per core	2
Socket(s)	2
Core(s) per socket	64
NUMA node(s)	8
Frequency boost	enabled
CPU max MHz	2250
CPU min MHz	1500
RAM
Model	Micron DDR4 36ASF8G72PZ-3G2B2 3200MHz
Configured Memory Speed	2933 MT/s
RAM Size	32x64GB (2048GB Total)

H100

GPU
NVIDIA H100 80GB HBM3
CPU
Model	Intel(R) Xeon(R) Platinum 8480CL
Thread(s) per core	2
Socket(s)	2
Core(s) per socket	56
NUMA node(s)	2
CPU max MHz	3800
CPU min MHz	800
RAM
Model	Micron DDR5 MTC40F2046S1RC48BA1 4800MHz
Configured Memory Speed	4400 MT/s
RAM Size	32x64GB (2048GB Total)

L40

GPU
NVIDIA L40
CPU
Model	AMD EPYC 7763 64-Core Processor
Thread(s) per core	1
Socket(s)	2
Core(s) per socket	64
NUMA node(s)	8
Frequency boost	enabled
CPU max MHz	3529
CPU min MHz	1500
RAM
Model	Samsung DDR4 M393A4K40DB3-CWE 3200MHz
Configured Memory Speed	3200 MT/s
RAM Size	16x32GB (512GB Total)

Model Accuracy

Riva ASR models are evaluated using Word Error Rate (WER) for word-based languages such as English, Spanish, and French, and Character Error Rate (CER) for character-based languages such as Mandarin Chinese and Japanese.

WER measures the minimum number of word substitutions, insertions, and deletions required to transform the model’s output into the reference transcript, divided by the total number of words in the reference. Similarly, CER calculates the minimum number of character edits needed, divided by the total number of characters in the reference.

Lower WER/CER values indicate better accuracy, with 0% representing perfect transcription.
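
The definitions above can be sketched in Python as an edit distance over words (WER) or characters (CER). This is a minimal illustration, not the evaluation code used to produce the table. It applies no text normalization, it counts every character of the reference (including spaces) for CER, and the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Minimum substitutions, insertions, and deletions to turn hyp into ref."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the"): 2 / 6 words.
    print(f"WER = {100 * wer('the cat sat on the mat', 'the cat sit on mat'):.1f}%")
```

Because insertions count as errors while the denominator is the reference length, WER and CER can exceed 100% when the hypothesis is much longer than the reference.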

Model Name	Language	Dataset	Best latency WER (%) ⬇️	Best throughput WER (%) ⬇️	Offline WER (%) ⬇️
Parakeet-RNNT-1.1b	en-US	MCV 7.1 test set	10.74	10.54	9.77
	es-US	MLS test set	7.1909	5.2679	3.8335
	es-ES	Mediaspeech	16.156	14.4264	11.51
	fr-FR	MLS test set	11.4124	9.1087	6.36
	de-DE	MLS test set	11.2974	9.1616	7.09
	ru-RU	RuLS test set	21.4456	19.2387	17.39
Parakeet-CTC-1.1b	en-US	MCV 7.1 test set	10.45	8.80	7.96
	en-US	LibriSpeech test-other	6.34	4.74	4.09
	en-US	CallHome (CH109)	46.09	41.35	39.61
	en-US (Silero VAD)	LibriSpeech test-other	5.57	4.8	4.5
	en-US (Telephony)	LibriSpeech test-other	7.33	5.11	4.17
	en-US (Telephony)	CallHome (CH109)	30.13	27.82	28.91
Parakeet-CTC-0.6b	en-US	MCV 7.1 test set	10.57	8.87	8.45
Canary-1B	en-US	MCV 7.1 test set	Not supported	Not supported	6.78
	es-US	MLS test set	Not supported	Not supported	3.54
	de-DE	MLS test set	Not supported	Not supported	5.18
	fr-FR	MLS test set	Not supported	Not supported	4.21
	ru-RU	MCV 7.0 test set	Not supported	Not supported	10.33
	es-ES	Mediaspeech	Not supported	Not supported	14.40
	pt-BR	MCV 10.0 test set	Not supported	Not supported	5.83
Canary-0.6B	en-US	MCV 7.1 test set	Not supported	Not supported	8.65
	es-US	MLS test set	Not supported	Not supported	3.42
	de-DE	MLS test set	Not supported	Not supported	5.18
	fr-FR	MLS test set	Not supported	Not supported	4.66
	ru-RU	MCV 7.0 test set	Not supported	Not supported	13.39
	es-ES	Mediaspeech	Not supported	Not supported	13.21
	pt-BR	MCV 10.0 test set	Not supported	Not supported	6.38
Conformer-CTC-120M	es-US	MCV 7.1 test set	6.75	6.26	5.66