Performance

Evaluation Process

This section reports latency and throughput for streaming and offline configurations of the Riva ASR service on different GPUs. The numbers were captured after deploying the pre-configured ASR pipelines from our Quick Start scripts. The Conformer and Parakeet acoustic models were tested.

In streaming mode, the client and the server used audio chunks of the same duration (100 ms, 160 ms, or 800 ms, depending on the server configuration). Refer to the Results section for the chunk size value to use.

The Riva streaming client riva_streaming_asr_client, provided in the Riva image, was run with the --simulate_realtime flag to simulate transcription from a microphone. Each stream performed three iterations over a sample audio file (1272-135031-0000.wav) from the LibriSpeech dev-clean dataset. The LibriSpeech datasets can be obtained from https://www.openslr.org/12.

The source code for the riva_streaming_asr_client can be obtained from https://github.com/nvidia-riva/cpp-clients.

The command used to measure performance was:

riva_streaming_asr_client \
   --chunk_duration_ms=<chunk_duration> \
   --simulate_realtime=true \
   --automatic_punctuation=true \
   --num_parallel_requests=<num_streams> \
   --word_time_offsets=false \
   --print_transcripts=false \
   --interim_results=false \
   --num_iterations=<3*num_streams> \
   --audio_file=1272-135031-0000.wav \
   --output_filename=/tmp/output.json
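The placeholders in the command above can be filled in programmatically when sweeping over stream counts. The following is a minimal Python sketch, not part of the Riva tooling: the helper name build_streaming_command is hypothetical, and it only assembles the flags shown above, with num_iterations set to three times the number of streams.

```python
def build_streaming_command(chunk_duration_ms, num_streams,
                            audio_file="1272-135031-0000.wav",
                            output_filename="/tmp/output.json"):
    """Return the argument list for one streaming benchmark run.

    Hypothetical helper: it reproduces only the flags listed in this section.
    """
    return [
        "riva_streaming_asr_client",
        f"--chunk_duration_ms={chunk_duration_ms}",
        "--simulate_realtime=true",
        "--automatic_punctuation=true",
        f"--num_parallel_requests={num_streams}",
        "--word_time_offsets=false",
        "--print_transcripts=false",
        "--interim_results=false",
        # Three iterations per stream, matching <3*num_streams> above.
        f"--num_iterations={3 * num_streams}",
        f"--audio_file={audio_file}",
        f"--output_filename={output_filename}",
    ]


# Print one command per stream count for a 160 ms chunk configuration.
for n in (1, 8, 16, 32, 48, 64, 128, 256):
    print(" ".join(build_streaming_command(160, n)))
```

The returned list can be passed directly to subprocess.run on a machine where the client binary is on the PATH.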

The riva_streaming_asr_client returns the following latency measurements:

  • intermediate latency: latency of responses returned with is_final == false

  • final latency: latency of responses returned with is_final == true

  • latency: the overall latency of all returned responses. This is what is tabulated in the following tables.

Refer to the following diagram for a schematic representation of the different latencies measured by the Riva streaming ASR client.

Schematic Diagram of Latencies Measured by Riva Streaming ASR Client
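The results tables summarize latency as an average plus the p50, p90, p95, and p99 percentiles. As an illustration of what those columns mean, the sketch below summarizes a list of per-response latencies. It assumes the nearest-rank percentile definition, which may differ from the method the client uses internally, and the sample values are hypothetical.

```python
import math


def percentile(ordered, pct):
    """Nearest-rank percentile of an ascending list (pct between 1 and 100)."""
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize_latencies(latencies_ms):
    """Return the average and the p50/p90/p95/p99 latencies in milliseconds."""
    ordered = sorted(latencies_ms)
    summary = {"avg": sum(ordered) / len(ordered)}
    for pct in (50, 90, 95, 99):
        summary[f"p{pct}"] = percentile(ordered, pct)
    return summary


# Hypothetical per-response latencies in milliseconds (not measured data).
sample = [11.2, 11.9, 12.1, 12.8, 13.0, 13.4, 40.0]
print(summarize_latencies(sample))
```

A single slow response, such as the 40.0 ms value in the sample, raises the upper percentiles well above the median, which is why the tables report p99 separately from p50.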

In offline mode, the command used to measure maximum throughput was:

riva_asr_client \
   --automatic_punctuation=true \
   --num_parallel_requests=32 \
   --word_time_offsets=false \
   --print_transcripts=false \
   --num_iterations=96 \
   --audio_file=1272-135031-0000x5.wav \
   --output_filename=/tmp/output.json

where 1272-135031-0000x5.wav is the 1272-135031-0000.wav audio file concatenated five times. The source code for the riva_asr_client can be obtained from https://github.com/nvidia-riva/cpp-clients.
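One way to produce such a concatenated file is with Python's standard wave module, as sketched below. This is an assumption about how the file could be created, not a description of how the benchmark file was actually built; the function name concatenate_wav is hypothetical.

```python
import wave


def concatenate_wav(src_path, dst_path, repeats=5):
    """Write the audio in src_path back to back `repeats` times into dst_path."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        frames = src.readframes(src.getnframes())
    with wave.open(dst_path, "wb") as dst:
        # Reuse the channel count, sample width, and sample rate of the source.
        dst.setparams(params)
        # The frame count in the header is corrected when the file is closed.
        dst.writeframes(frames * repeats)
```

For example, concatenate_wav("1272-135031-0000.wav", "1272-135031-0000x5.wav") would write a file five times as long as the original.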

Results

Latency and throughput measurements for streaming and offline configurations are reported in the following tables. Throughput is measured in RTFX (duration of audio transcribed / computation time).
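The RTFX definition is a simple ratio. The sketch below shows the arithmetic with hypothetical numbers; it assumes that "computation time" is the measured time for the whole run, which the text above does not spell out.

```python
def rtfx(audio_seconds, compute_seconds):
    """Real-time factor: seconds of audio transcribed per second of computation."""
    if compute_seconds <= 0:
        raise ValueError("compute_seconds must be positive")
    return audio_seconds / compute_seconds


# Hypothetical run: 8 parallel streams, each transcribing 30 s of audio,
# finishing in 30.04 s overall.
total_audio_seconds = 8 * 30.0
print(round(rtfx(total_audio_seconds, 30.04), 2))
```

With real-time simulation each stream receives audio no faster than real time, so the streaming RTFX would be expected to stay at or just below the number of streams, which is consistent with the values in the table at low stream counts.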

Note

Each audio file was iterated once for Xavier AGX, Xavier NX, and Orin AGX, and three times for all other experiments.

Note

If the language model is none, inference was performed with a greedy decoder. If the language model is n-gram, a beam decoder was used.

Note

The values in the tables are averages over three trials. Each value is rounded to the last significant digit as determined by the standard deviation across those three trials. If the standard deviation is less than 0.001 of the average, the value is rounded as if the standard deviation were 0.001 of the average.
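The rounding rule can be read in more than one way, so the following sketch states its assumptions explicitly: it uses the sample standard deviation, and it rounds the average at the decimal position of the leading significant digit of that standard deviation. The function name round_by_std and the trial values are hypothetical.

```python
import math
import statistics


def round_by_std(trials, min_rel_std=0.001):
    """Round the mean of `trials` at the leading digit of their standard deviation.

    Assumptions: sample standard deviation, with a floor of
    min_rel_std * |mean| as described in the note above.
    """
    mean = statistics.mean(trials)
    std = max(statistics.stdev(trials), min_rel_std * abs(mean))
    if std == 0:
        return mean  # all trials are zero, nothing to round
    # Decimal exponent of the leading significant digit of the deviation.
    exponent = math.floor(math.log10(std))
    return round(mean, -exponent)


# Hypothetical trial values (not measured data).
print(round_by_std([12.8, 13.0, 13.2]))   # deviation of about 0.2
print(round_by_std([1650, 1700, 1750]))   # deviation of 50
print(round_by_std([15.96, 15.96, 15.96]))  # deviation floored at 0.001 of the mean
```

Under this reading, a noisy measurement keeps fewer digits than a stable one, which would explain why some table entries appear as 1700 while others appear as 15.96.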

For specifications of the hardware on which these measurements were collected, refer to the Hardware Specifications section. Note that:

  • results on AWS and GCP were computed using Riva 2.4.0

  • on-premises results were computed using Riva 2.15.0.

Cloud instance descriptions for AWS and GCP.

Chunk size (ms): 160
Maximum effective # of streams with n-gram language model: 218
Maximum effective # of streams without language model (greedy generation): 223

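| Language model | # of streams | Latency avg (ms) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) | Latency p99 (ms) | Throughput (RTFX) |
|---|---|---|---|---|---|---|---|
| n-gram | 1 | 13 | 11.9 | 12.8 | 13 | 40 | 0.999 |
| n-gram | 8 | 18.8 | 17.4 | 19 | 20 | 57 | 7.99 |
| n-gram | 16 | 24.8 | 22 | 30 | 32 | 80 | 15.96 |
| n-gram | 32 | 34 | 30 | 43 | 46 | 110 | 31.86 |
| n-gram | 48 | 44 | 41 | 60 | 66 | 160 | 47.7 |
| n-gram | 64 | 50 | 50 | 67 | 75 | 200 | 63.6 |
| n-gram | 128 | 86 | 67 | 100 | 220 | 360 | 126.5 |
| n-gram | 256 | 1700 | 1600 | 3200 | 3200 | 3700 | 229 |
| none | 1 | 12 | 11.3 | 12 | 12.5 | 30 | 1 |
| none | 8 | 17 | 15.8 | 16.6 | 20 | 49.6 | 7.99 |
| none | 16 | 22.1 | 19.9 | 26 | 29.5 | 70 | 15.96 |
| none | 32 | 32 | 30 | 39.7 | 44 | 100 | 31.9 |
| none | 48 | 40 | 40 | 56 | 57 | 160 | 47.7 |
| none | 64 | 46 | 45 | 60 | 65 | 170 | 63.6 |
| none | 128 | 80 | 60 | 97 | 200 | 330 | 126.5 |
| none | 256 | 1400 | 1350 | 2600 | 2700 | 3000 | 233 |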