Performance

Evaluation Process

This section shows the latency and throughput numbers for streaming and offline configurations of the Riva ASR service on different GPUs. These numbers were captured after deploying the pre-configured ASR pipelines from our Quick Start scripts. The Jasper, QuartzNet, Conformer, and Citrinet-1024 acoustic models were tested.

In streaming mode, the client and the server used audio chunks of the same duration (100 ms, 160 ms, or 800 ms, depending on the server configuration). The Riva streaming client riva_streaming_asr_client, provided in the Riva client image, was used with the --simulate_realtime flag to simulate transcription from a microphone, with each stream performing 3 iterations over a sample audio file from the LibriSpeech dataset (1272-135031-0000.wav). The command used to measure performance was:

riva_streaming_asr_client \
   --chunk_duration_ms=<chunk_duration> \
   --simulate_realtime=true \
   --automatic_punctuation=true \
   --num_parallel_requests=<num_streams> \
   --word_time_offsets=true \
   --print_transcripts=false \
   --interim_results=false \
   --num_iterations=<3*num_streams> \
   --audio_file=1272-135031-0000.wav \
   --output_filename=/tmp/output.json
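
For example, to benchmark 32 parallel streams with this command, <num_streams> is replaced with 32 and <3*num_streams> with 96, so that each stream performs 3 iterations over the audio file.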

The riva_streaming_asr_client returns the following latency measurements:

  • intermediate latency: latency of responses returned with is_final == false

  • final latency: latency of responses returned with is_final == true

  • latency: the overall latency across all returned responses. This is the value tabulated in the tables below; a short aggregation sketch follows this list.
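
The percentile columns in the results below can be reproduced from the per-response latency values that the client reports. The following is a minimal sketch, assuming the individual latencies (in milliseconds) have already been collected into a Python list; parsing of the client output is not shown and all names are illustrative:

import numpy as np

def summarize_latencies(latencies_ms):
    """Aggregate per-response latencies (in ms) into the statistics
    reported in the tables below: average, p50, p90, p95, and p99."""
    values = np.asarray(latencies_ms, dtype=float)
    return {
        "avg": float(values.mean()),
        "p50": float(np.percentile(values, 50)),
        "p90": float(np.percentile(values, 90)),
        "p95": float(np.percentile(values, 95)),
        "p99": float(np.percentile(values, 99)),
    }

# Example with made-up latency samples (ms), not measured values:
print(summarize_latencies([9.8, 10.1, 10.4, 11.2, 15.7]))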

In offline mode, the command used to measure maximum throughput was:

riva_asr_client \
   --automatic_punctuation=true \
   --num_parallel_requests=32 \
   --word_time_offsets=true \
   --print_transcripts=false \
   --num_iterations=96 \
   --audio_file=5x_1272-135031-0000.wav \
   --output_filename=/tmp/output.json

Results

Latency and throughput measurements for streaming and offline configurations are reported in the following tables. Throughput is measured in RTFX (duration of audio processed / computation time).
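
As a quick illustration of how this metric works, the sketch below computes RTFX from the total duration of audio processed and the wall-clock computation time; the numbers are placeholders rather than measurements from this benchmark:

def rtfx(audio_seconds, compute_seconds):
    """Real-time factor: seconds of audio processed per second of computation."""
    return audio_seconds / compute_seconds

# Placeholder example: processing 640 s of audio in 10 s of compute gives RTFX = 64,
# roughly what is needed to keep up with 64 simulated real-time streams.
print(rtfx(640.0, 10.0))  # -> 64.0

This also explains why, in the streaming tables, throughput closely tracks the number of simulated real-time streams.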

Note

If the language model is none, inference was performed with a greedy decoder. If the language model is n-gram, a beam search decoder was used.

For specifications of the hardware on which these measurements were collected, refer to the Hardware Specifications section.

Chunk size (ms): 160
Maximum effective # of streams with n-gram language model: 229
Maximum effective # of streams without language model: 273

| Language model | # of streams | Latency avg (ms) | p50 (ms) | p90 (ms) | p95 (ms) | p99 (ms) | Throughput (RTFX) |
|----------------|--------------|------------------|----------|----------|----------|----------|-------------------|
| n-gram         | 1            | 10.68            | 10.34    | 10.77    | 12.4     | 17.4     | 0.999430          |
| n-gram         | 8            | 14.70            | 14.28    | 14.96    | 16.4     | 28.8     | 7.9921            |
| n-gram         | 16           | 27.1             | 25.5     | 30.3     | 32.1     | 59.3     | 15.9657           |
| n-gram         | 32           | 42.0             | 41.44    | 44.0     | 45.5     | 87.1     | 31.9023           |
| n-gram         | 48           | 50.8             | 50.6     | 55.5     | 57.6     | 108      | 47.814            |
| n-gram         | 64           | 57.3             | 56.5     | 64.7     | 67.7     | 120      | 63.725            |
| none           | 1            | 9.97             | 9.74     | 9.97     | 11.6     | 14.70    | 0.999513          |
| none           | 8            | 14.43            | 14.1     | 14.93    | 15.34    | 27.1     | 7.99203           |
| none           | 16           | 27.0             | 26.8     | 28.3     | 29.7     | 55.3     | 15.967            |
| none           | 32           | 37.4             | 36.9     | 38.7     | 40.2     | 76.7     | 31.914            |
| none           | 48           | 45.5             | 45.0     | 51.6     | 53.51    | 94.0     | 47.843            |
| none           | 64           | 48.7             | 49.7     | 58.7     | 61.1     | 103.1    | 63.766            |

Hardware Specifications

| Component               | Specification                          |
|-------------------------|----------------------------------------|
| GPU                     | NVIDIA DGX A100 40 GB                  |
| CPU model               | AMD EPYC 7742 64-Core Processor        |
| Thread(s) per core      | 2                                      |
| Socket(s)               | 2                                      |
| Core(s) per socket      | 64                                     |
| NUMA node(s)            | 8                                      |
| Frequency boost         | enabled                                |
| CPU max MHz             | 2250                                   |
| CPU min MHz             | 1500                                   |
| RAM model               | Micron DDR4 36ASF8G72PZ-3G2B2 3200MHz  |
| Configured memory speed | 2933 MT/s                              |
| RAM size                | 32x64 GB (2048 GB total)               |