mT5 Results

Training Accuracy Results

Training accuracy: NVIDIA DGX SuperPOD

  • 4 × 8 × A100 80GB for 170M mT5 model

  • 8 × 8 × A100 80GB for 390M mT5 model

  • 20 × 8 × A100 80GB for 3B mT5 model

NVIDIA evaluated the mT5 models on the XQuAD task. The results are shown in the table below. You can fine-tune any trained .nemo checkpoint file on an XQuAD task as described in the mT5 Fine-Tuning section.

Task-Language  Metric       170M  390M
XQuAD-de       Exact Match  43.0  54.7
XQuAD-en       Exact Match  63.8  68.8
XQuAD-es       Exact Match  47.0  55.3
XQuAD-hi       Exact Match  34.5  47.1
XQuAD-zh       Exact Match  46.8  56.1
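Exact Match in the table above is the standard SQuAD-style answer-matching metric: a prediction scores only if it equals a reference answer after light normalization. The snippet below is a simplified, English-centric sketch of that metric, not the official multilingual XQuAD evaluation script (whose normalization rules differ by language):

    import re
    import string

    PUNCTUATION = set(string.punctuation)


    def normalize_text(s: str) -> str:
        """Lowercase, drop punctuation and English articles, collapse whitespace."""
        s = "".join(ch for ch in s.lower() if ch not in PUNCTUATION)
        s = re.sub(r"\b(a|an|the)\b", " ", s)
        return " ".join(s.split())


    def exact_match(prediction: str, reference: str) -> bool:
        """Score 1 only if the normalized prediction equals the normalized reference."""
        return normalize_text(prediction) == normalize_text(reference)


    # Toy usage: score a small batch of predictions against single references.
    preds = ["the Eiffel Tower", "1887"]
    refs = ["Eiffel Tower", "1889"]
    em = 100.0 * sum(exact_match(p, r) for p, r in zip(preds, refs)) / len(preds)
    print(f"Exact Match: {em:.1f}")  # 50.0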

You can also use prompt learning on top of any trained .nemo checkpoint file on a SQuAD task, as described in the T5 and mT5 Prompt Learning section. The results are shown in the table below.

Task   Metric       390M   3B
SQuAD  Exact Match  76.86  81.55
SQuAD  F1           84.67  89.34
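The F1 row is the usual token-overlap F1 for SQuAD-style evaluation, which gives partial credit when a predicted answer overlaps a reference without matching it exactly. The sketch below is illustrative only, reusing the same simplified normalization as the exact-match example above rather than the official evaluation script:

    import re
    import string
    from collections import Counter

    PUNCTUATION = set(string.punctuation)


    def _normalize(s: str) -> str:
        # Same simplified normalization as in the exact-match sketch above.
        s = "".join(ch for ch in s.lower() if ch not in PUNCTUATION)
        s = re.sub(r"\b(a|an|the)\b", " ", s)
        return " ".join(s.split())


    def token_f1(prediction: str, reference: str) -> float:
        """Token-overlap F1 between a predicted answer and a reference answer."""
        pred_tokens = _normalize(prediction).split()
        ref_tokens = _normalize(reference).split()
        if not pred_tokens or not ref_tokens:
            # Both empty counts as a match; otherwise no overlap is possible.
            return float(pred_tokens == ref_tokens)
        common = Counter(pred_tokens) & Counter(ref_tokens)
        overlap = sum(common.values())
        if overlap == 0:
            return 0.0
        precision = overlap / len(pred_tokens)
        recall = overlap / len(ref_tokens)
        return 2 * precision * recall / (precision + recall)


    print(f"F1: {100.0 * token_f1('in the year 1889', '1889'):.2f}")  # partial credit: 50.00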

Training the 170M mT5 model to convergence takes 4 days. The figure below shows the loss curve.

Figure: 170M mT5 Training Loss (../../_images/170M_mT5_loss_final.svg)

The table below shows the converged training loss, the throughput, and the total time to train for the 170M mT5 model, using a given number of GPUs and a given Global Batch Size (GBS).

Number of GPUs  GBS   Seq Length  Number of Tokens  Loss   Throughput (Tokens/sec)  Time to Train (days)
32              2048  512         1T                1.980  4,112,062                4

Training the 390M mT5 model to convergence takes 4 days. The figure below shows the loss curve.

Figure: 390M mT5 Training Loss (../../_images/390M_mT5_loss_final.svg)

The table below shows the converged training loss, the throughput, and the total time to train for the 390M mT5 model, using a given number of GPUs and a given Global Batch Size (GBS).

Number of GPUs  GBS   Seq Length  Number of Tokens  Loss   Throughput (Tokens/sec)  Time to Train (days)
64              2048  512         1T                1.584  3,744,914                4

Training the 3B mT5 model to convergence takes 14 days. The figure below shows the loss curve of a fully trained model:

Figure: 3B mT5 Training Loss (../../_images/3B_mT5_loss_final.svg)

The table below shows the converged training loss, the throughput, and the total time to train for the 3B mT5 model, using a given number of GPUs and a given Global Batch Size (GBS).

Training Performance Results

Training performance: NVIDIA DGX SuperPOD (20 × 8 × A100 80GB for 3B mT5 model)

NVIDIA measured the throughput of training a 3B parameter mT5 model on NVIDIA DGX SuperPOD using different numbers of nodes. Scaling from 1 node to 20 nodes yielded a 14.3× speed-up.

NVIDIA is actively working on improving the scaling performance for mT5 models. The table and chart below show the performance results.

Nodes                            1        2        4        5        10       20
Tokens per Second (3B mT5)       91166    179583   346263   429088   798570   1303767
Perfect Linear Scaling (Tokens)  91166    182331   364663   455829   911657   1823314
Speed-up                         1x       1.97x    3.8x     4.71x    8.76x    14.3x

Figure: 3B mT5 NeMo Framework Throughput (../../_images/3B_mT5_throughput_2208.svg)
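The Perfect Linear Scaling and Speed-up rows follow directly from the measured tokens-per-second values. The short sketch below reproduces them from the single-node baseline; small deviations from the table (for example in the 20-node linear-scaling value) come from rounding of that baseline:

    # Measured 3B mT5 throughput (tokens/sec) from the table above.
    nodes = [1, 2, 4, 5, 10, 20]
    tokens_per_sec = [91166, 179583, 346263, 429088, 798570, 1303767]

    baseline = tokens_per_sec[0]  # single-node throughput
    for n, tps in zip(nodes, tokens_per_sec):
        linear = baseline * n      # perfect linear scaling from 1 node
        speedup = tps / baseline   # measured speed-up over 1 node
        efficiency = tps / linear  # scaling efficiency
        print(f"{n:>2} nodes: {tps:>9,} tok/s  "
              f"linear {linear:>9,}  speed-up {speedup:5.2f}x  eff {efficiency:6.1%}")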

Inference Performance

Inference performance was measured on NVIDIA DGX SuperPOD (1 × 8 × A100 80GB).

Inference configurations (also used in the timing sketch below):

  • Batch size: 1

  • Input tokens length: 60

  • Output tokens length: 20
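A minimal sketch of how a latency measurement with this request shape can be structured is shown below. The generate_fn callable is a hypothetical stand-in for whatever inference call is being timed, and the warm-up and iteration counts are arbitrary choices, not the settings NVIDIA used:

    import statistics
    import time


    def measure_latency(generate_fn, prompt_tokens, output_len=20, warmup=5, iters=50):
        """Average end-to-end latency (ms) of generate_fn for a fixed request shape.

        generate_fn is a hypothetical callable: it takes a list of input token IDs
        and the number of tokens to generate, and blocks until generation finishes.
        """
        for _ in range(warmup):  # warm-up runs are not timed
            generate_fn(prompt_tokens, output_len)

        samples = []
        for _ in range(iters):
            start = time.perf_counter()
            generate_fn(prompt_tokens, output_len)
            samples.append((time.perf_counter() - start) * 1000.0)
        return statistics.mean(samples)


    # Matches the configuration above: batch size 1, 60 input tokens, 20 output tokens.
    dummy_prompt = list(range(60))
    latency_ms = measure_latency(lambda tokens, n: time.sleep(0.001), dummy_prompt)
    print(f"average latency: {latency_ms:.1f} ms")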

Figure: Average Latency vs mT5 Model Size (../../_images/infer_model_size_mt5.svg)

mT5 Model Size  Average Latency [ms]  TP  PP  GPUs
380M            35                    1   1   1
3B              102                   2   1   2
11B             134                   4   1   4
23B             230                   4   1   4

TP = tensor parallel size; PP = pipeline parallel size.