mT5 Results

Training accuracy: NVIDIA DGX SuperPOD

  • 4 × 8 × A100 80GB for 170M mT5 model

  • 8 × 8 × A100 80GB for 390M mT5 model

  • 20 × 8 × A100 80GB for 3B mT5 model

NVIDIA evaluated the mT5 models on the XQuAD benchmark; the results are shown in the table below. You can fine-tune any trained .nemo checkpoint on XQuAD as described in the mT5 Fine-Tuning section.

Task-Language   Metric        170M   390M
XQuAD-de        Exact Match   43.0   54.7
XQuAD-en        Exact Match   63.8   68.8
XQuAD-es        Exact Match   47.0   55.3
XQuAD-hi        Exact Match   34.5   47.1
XQuAD-zh        Exact Match   46.8   56.1

You can also run prompt learning on top of any trained .nemo checkpoint on the SQuAD task, as described in the T5 and mT5 Prompt Learning section.

The results are shown in the table below.

Task    Metric        390M    3B
SQuAD   Exact Match   76.86   81.55
SQuAD   F1            84.67   89.34
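
Exact Match and F1 above are the standard SQuAD-style metrics: predictions and references are normalized (lowercased, punctuation and English articles removed, whitespace collapsed) before comparison. The snippet below is a minimal sketch of that scoring recipe; it is not the exact evaluation code used to produce the numbers above.

```python
# Minimal sketch of SQuAD-style Exact Match and token-level F1.
# Illustrative only; not the scoring code behind the reported results.
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, reference: str) -> float:
    return float(normalize(prediction) == normalize(reference))

def f1(prediction: str, reference: str) -> float:
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower", "Eiffel Tower"))  # 1.0
print(f1("in Paris, France", "Paris"))                  # 0.5
```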

Training the 170M mT5 model to convergence takes 4 days. The figure below shows the loss curve.

Figure: 170M mT5 Training Loss

The table below shows the converged training loss, the throughput, and the total time to train for the 170M mT5 model, using a given number of GPUs and a given Global Batch Size (GBS).

#GPUs   GBS    Seq Length   #Tokens   Loss    Throughput (Tokens/sec)   Time to Train (days)
32      2048   512          1T        1.980   4,112,062                 4

Training the 390M mT5 model to convergence takes 4 days. The figure below shows the loss curve.

Figure: 390M mT5 Training Loss

The table below shows the converged training loss, the throughput, and the total time to train for the 390M mT5 model, using a given number of GPUs and a given Global Batch Size (GBS).

#GPUs   GBS    Seq Length   #Tokens   Loss    Throughput (Tokens/sec)   Time to Train (days)
64      2048   512          1T        1.584   3,744,914                 4
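
To compare the two configurations above on a common footing, the reported aggregate throughput can be normalized by GPU count. The snippet below is a small illustrative calculation using the figures from the two tables, assuming the throughput column is the aggregate tokens per second across all GPUs.

```python
# Per-GPU training throughput, derived from the 170M and 390M tables above.
# Assumes the reported throughput is aggregate tokens/sec over all GPUs.
runs = {
    "170M mT5": {"gpus": 32, "tokens_per_sec": 4_112_062},
    "390M mT5": {"gpus": 64, "tokens_per_sec": 3_744_914},
}

for name, run in runs.items():
    per_gpu = run["tokens_per_sec"] / run["gpus"]
    print(f"{name}: {per_gpu:,.0f} tokens/sec per GPU")

# Prints roughly:
#   170M mT5: 128,502 tokens/sec per GPU
#   390M mT5: 58,514 tokens/sec per GPU
```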

Training the 3B mT5 model to convergence takes 14 days. The figure below shows the loss curve of a fully trained model:

Figure: 3B mT5 Training Loss

The table below shows the converged training loss, the throughput, and the total time to train for the 3B mT5 model, using a given number of GPUs and a given Global Batch Size (GBS).

Training performance: NVIDIA DGX SuperPOD (20 × 8 × A100 80GB for 3B mT5 model)

NVIDIA measured the throughput of training a 3B parameter mT5 model on NVIDIA DGX SuperPOD using different numbers of nodes. Scaling from 1 node to 20 nodes yielded a 14.3× speed-up.

NVIDIA is actively working on improving the scaling performance for mT5 models. The table and chart below show the performance results.

Nodes                                1        2         4         5         10        20
Tokens per Second                    91,166   179,583   346,263   429,088   798,570   1,303,767
3B Perfect Linear Scaling (Tokens)   91,166   182,331   364,663   455,829   911,657   1,823,314
Speed-up                             1x       1.97x     3.8x      4.71x     8.76x     14.3x

Figure: 3B mT5 NeMo Framework Throughput
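
The speed-up row in the table above follows directly from the measured tokens-per-second values. The sketch below reproduces that arithmetic and adds a scaling-efficiency figure (measured throughput divided by perfect linear scaling); the linear-scaling baseline is recomputed from the rounded single-node number, so the last digits may differ slightly from the table.

```python
# Reproduce the speed-up numbers from the 3B mT5 throughput table above,
# and derive scaling efficiency relative to perfect linear scaling.
nodes = [1, 2, 4, 5, 10, 20]
tokens_per_sec = [91_166, 179_583, 346_263, 429_088, 798_570, 1_303_767]

base = tokens_per_sec[0]  # single-node throughput
for n, tps in zip(nodes, tokens_per_sec):
    speed_up = tps / base        # e.g. ~14.3x at 20 nodes
    linear = base * n            # perfect linear scaling baseline
    efficiency = tps / linear    # 1.0 = perfect scaling
    print(f"{n:>2} nodes: {speed_up:5.2f}x speed-up, "
          f"{efficiency:5.1%} scaling efficiency")
```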

Inference performance was measured on an NVIDIA DGX SuperPOD (1 × 8 × A100 80GB).

Inference configurations:

  • Batch size: 1

  • Input tokens length: 60

  • Output tokens length: 20

Figure: Average Latency vs mT5 Model Size

mT5 Model Size   Average Latency [ms]   TP   PP   GPUs
390M             35                     1    1    1
3B               102                    2    1    2
11B              134                    4    1    4
23B              230                    4    1    4

TP: tensor-parallel size; PP: pipeline-parallel size.
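
Because the latency figures above are for a fixed workload (batch size 1, 60 input tokens, 20 output tokens), they can be converted into a rough generated-tokens-per-second number. The snippet below does that conversion; it ignores any fixed per-request overhead, so treat it as a ballpark estimate only.

```python
# Rough generation throughput implied by the latency table above
# (batch size 1, 60 input tokens, 20 output tokens per request).
output_tokens = 20
latency_ms = {"390M": 35, "3B": 102, "11B": 134, "23B": 230}

for model, ms in latency_ms.items():
    tokens_per_sec = output_tokens / (ms / 1000)
    print(f"{model}: ~{tokens_per_sec:.0f} output tokens/sec")

# Roughly 571, 196, 149, and 87 output tokens/sec respectively.
```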