Performance#

Evaluation Process#

This section shows the latency and throughput numbers for the Riva NMT service on different GPUs.

These numbers were captured after the preconfigured NMT pipelines were deployed from our Quick Start scripts.

The command used to measure performance was:

riva_nmt_t2t_client
  --riva_uri=0.0.0.0:50051
  --model_name=<model name>
  --batch_size=<batch size>
  --target_language_code=<target language code>
  --source_language_code=<source language code>
  --text_file=<wmt_filename>
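
For example, a concrete run against the megatronnmt_any_any_1b model could look like the following; the language codes and input file name here are illustrative placeholders, not the exact values used for the measurements below:

riva_nmt_t2t_client
  --riva_uri=0.0.0.0:50051
  --model_name=megatronnmt_any_any_1b
  --batch_size=8
  --source_language_code=en
  --target_language_code=de
  --text_file=wmt20.en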

The riva_nmt_t2t_client returns the following latency measurement:

  • latency: the overall latency of the returned responses. The p90, p95, and p99 percentiles of this measurement are tabulated in the table below.

Results#

Latency and throughput measurements with the megatronnmt_any_any_1b model are reported in the following table. Throughput is measured in tokens (words) translated per second.

For specifications of the hardware on which these measurements were collected, refer to the Hardware Specifications section.

| batch size | tokens/second | p90 | p95 | p99 |
|---|---|---|---|---|
| 1 | 26.2867 | 2.03927 | 2.50127 | 3.38444 |
| 2 | 37.9656 | 2.66554 | 3.13718 | 4.35343 |
| 4 | 52.68 | 3.90231 | 4.27135 | 6.70392 |
| 8 | 60.0354 | 6.66561 | 7.94382 | 12.7633 |
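
For context, the following sketch shows how per-request timings could be aggregated into the kind of numbers reported above, that is, tokens per second and p90/p95/p99 latency percentiles. It is a minimal illustration with made-up sample data, not the actual riva_nmt_t2t_client implementation.

```python
import numpy as np

# Hypothetical per-request measurements: (tokens translated, latency in seconds).
measurements = [(120, 1.9), (118, 2.1), (131, 2.4), (125, 3.1)]

tokens = np.array([t for t, _ in measurements], dtype=float)
latencies = np.array([lat for _, lat in measurements], dtype=float)

# Throughput: total tokens translated divided by total wall-clock time spent.
throughput = tokens.sum() / latencies.sum()

# Tail-latency percentiles, as reported in the table above.
p90, p95, p99 = np.percentile(latencies, [90, 95, 99])

print(f"tokens/second: {throughput:.2f}")
print(f"p90={p90:.3f}  p95={p95:.3f}  p99={p99:.3f}")
```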

Hardware Specifications#

| Component | Specification |
|---|---|
| GPU | NVIDIA DGX A100 40 GB |
| CPU model | AMD EPYC 7742 64-Core Processor |
| Thread(s) per core | 2 |
| Socket(s) | 2 |
| Core(s) per socket | 64 |
| NUMA node(s) | 8 |
| Frequency boost | enabled |
| CPU max MHz | 2250 |
| CPU min MHz | 1500 |
| RAM model | Micron DDR4 36ASF8G72PZ-3G2B2 3200MHz |
| Configured memory speed | 2933 MT/s |
| RAM size | 32x64GB (2048GB total) |

Model Accuracy#

Riva NMT models are evaluated using the BLEU (Bilingual Evaluation Understudy) score, which is an industry-standard metric for evaluating machine translation quality.

BLEU scores range from 0 to 100, where higher scores indicate better translation quality. The score measures how similar the machine translation output is to one or more reference human translations by:

  1. Comparing n-gram matches between the machine translation and reference translations

  2. Applying a brevity penalty to translations that are shorter than the references (overly long translations are already penalized through lower n-gram precision)

  3. Combining these components into a final score
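
As an illustration of how such a score can be computed, the snippet below uses the sacrebleu Python package, a widely used BLEU implementation. The sentences are made up, and this is not necessarily the exact evaluation setup behind the scores reported below.

```python
import sacrebleu

# Hypothetical system outputs and one set of reference translations,
# aligned sentence by sentence.
hypotheses = ["the cat sat on the mat", "he reads a book"]
references = [["the cat sat on the mat", "he is reading a book"]]

# corpus_bleu combines modified n-gram precisions (up to 4-grams)
# with a brevity penalty, producing a score in the 0-100 range.
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}")
```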

The table below shows BLEU scores of the Riva NMT Megatron 1.6B any2any model for translation between each pair of languages in the supported set (de, es-ES, es-US, fr, ja, ru, zh-CN), evaluated on the Flores-101 dataset. Each row represents the source language and each column represents the target language; higher scores indicate better translation quality.

| Source/Target | de | es-ES | es-US | fr | ja | ru | zh-CN |
|---|---|---|---|---|---|---|---|
| de | - | 24.5 | 24.1 | 39.3 | 27.3 | 26.1 | 33.3 |
| es-ES | 22.1 | - | - | 30.3 | 23.5 | 20.2 | 29.8 |
| es-US | 22.1 | - | - | 30.3 | 23.5 | 20.2 | 29.8 |
| fr | 25.0 | 24.8 | 30.4 | - | 26.6 | 25.5 | 32.7 |
| ja | 16.9 | 16.4 | 18.1 | 23.7 | - | 15.2 | 28.9 |
| ru | 22.4 | 21.9 | 26.4 | 33.4 | 25.4 | - | 30.9 |
| zh-CN | 17.5 | 17.3 | 19.1 | 25.6 | 16.8 | 23.7 | - |

The table below shows BLEU scores of the Riva NMT Megatron 1.6B any2any model for translation between English and various other languages, evaluated on the Flores-101 dataset.

| Language | English to Target ⬆️ | Target to English ⬆️ |
|---|---|---|
| Arabic | 28.0 | 40.6 |
| Brazilian Portuguese | 49.8 | 50.5 |
| Bulgarian | 41.8 | 42.1 |
| Croatian | 27.9 | 37.8 |
| Czech | 32.9 | 41.1 |
| Danish | 46.2 | 49.6 |
| Dutch | 26.7 | 32.6 |
| Estonian | 27.3 | 38.9 |
| European Portuguese | 48.1 | 50.5 |
| European Spanish | 27.6 | 30.7 |
| Finnish | 22.7 | 35.0 |
| French | 50.5 | 46.5 |
| German | 38.2 | 45.2 |
| Greek | 27.5 | 36.5 |
| Hindi | 33.5 | 39.9 |
| Hungarian | 26.7 | 36.9 |
| Indonesian | 47.2 | 44.9 |
| Italian | 29.9 | 34.5 |
| Japanese | 32.5 | 26.7 |
| Korean | 28.0 | 29.5 |
| Latin American Spanish | 26.8 | 30.7 |
| Latvian | 31.0 | 37.0 |
| Lithuanian | 27.5 | 35.1 |
| Norwegian | 34.0 | 44.8 |
| Polish | 20.8 | 30.3 |
| Romanian | 40.7 | 45.0 |
| Russian | 31.3 | 36.1 |
| Simplified Chinese | 39.5 | 28.5 |
| Slovak | 35.0 | 40.6 |
| Slovenian | 30.7 | 36.2 |
| Swedish | 45.0 | 49.6 |
| Thai | 30.9 | 28.1 |
| Traditional Chinese | 30.8 | 26.8 |
| Turkish | 29.5 | 38.8 |
| Ukrainian | 30.7 | 40.2 |
| Vietnamese | 41.8 | 36.9 |