Training accuracy: NVIDIA DGX SuperPOD (4 × 8 × A100 80GB for 170M mT5 model; 8 × 8 × A100 80GB for 390M mT5 model; 20 × 8 × A100 80GB for 3B mT5 model)
NVIDIA evaluated the mT5 models on the XQuAD task. The results are shown in the table below. You can fine-tune on top of any .nemo-trained checkpoint file on the XQuAD task as described in the mT5 Fine-Tuning section.
| Task-Language | Metric | 170M | 390M |
|---|---|---|---|
| XQuAD-de | Exact Match | 43.0 | 54.7 |
| XQuAD-en | Exact Match | 63.8 | 68.8 |
| XQuAD-es | Exact Match | 47.0 | 55.3 |
| XQuAD-hi | Exact Match | 34.5 | 47.1 |
| XQuAD-zh | Exact Match | 46.8 | 56.1 |
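Exact Match here is the standard SQuAD-style metric: a prediction scores only if it equals a reference answer after light normalization. A minimal sketch in Python; the normalization shown is the English-centric SQuAD recipe, and multilingual XQuAD evaluation adapts details such as article stripping per language:

```python
import re
import string

def normalize(text: str) -> str:
    """SQuAD-style answer normalization: lowercase, strip punctuation,
    drop English articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)  # English-only article stripping
    return " ".join(text.split())

def exact_match(prediction: str, reference: str) -> bool:
    """True if prediction and reference agree after normalization."""
    return normalize(prediction) == normalize(reference)
```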
You can also prompt-learn on top of any .nemo-trained checkpoint file on the SQuAD task as described in the T5 and mT5 Prompt Learning section. The results are shown in the table below.
| Task | Metric | 390M | 3B |
|---|---|---|---|
| SQuAD | Exact Match | 76.86 | 81.55 |
| SQuAD | F1 | 84.67 | 89.34 |
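The F1 row is token-overlap F1 between prediction and reference, as in standard SQuAD evaluation. A minimal sketch; for brevity it only lowercases, where the full SQuAD recipe also strips punctuation and articles as in the Exact Match sketch above:

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1, as used in SQuAD evaluation."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)  # per-token overlap counts
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```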
Training the 170M mT5 model to convergence takes 4 days. The figure below shows the loss curve.
170M mT5 Training Loss
The table below shows the converged training loss, the throughput, and the total time to train for the 170M mT5 model, using a given number of GPUs and a given Global Batch Size (GBS).
| GPUs | GBS | Seq Length | Tokens | Loss | Throughput (Tokens/sec) | Time to Train (days) |
|---|---|---|---|---|---|---|
| 32 | 2048 | 512 | 1T | 1.980 | 4,112,062 | 4 |
Training the 390M mT5 model to convergence takes 4 days. The figure below shows the loss curve.
390M mT5 Training Loss
The table below shows the converged training loss, the throughput, and the total time to train for the 390M mT5 model, using a given number of GPUs and a given Global Batch Size (GBS).
| GPUs | GBS | Seq Length | Tokens | Loss | Throughput (Tokens/sec) | Time to Train (days) |
|---|---|---|---|---|---|---|
| 64 | 2048 | 512 | 1T | 1.584 | 3,744,914 | 4 |
Training the 3B mT5 model to convergence takes 14 days. The figure below shows the loss curve of a fully trained model:
3B mT5 Training Loss
The table below shows the converged training loss, the throughput, and the total time to train for the 3B mT5 model, using a given number of GPUs and a given Global Batch Size (GBS).
| GPUs | GBS | Seq Length | Tokens | Loss | Throughput (Tokens/sec) | Time to Train (days) |
|---|---|---|---|---|---|---|
| 160 | 1920 | 512 | 1T | 1.134 | 911,065 | 14 |
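A quick sanity check relates the throughput and time-to-train columns: training time is bounded below by total tokens divided by tokens per second. A small Python sketch over the three tables above; the reported wall-clock times are somewhat higher, presumably covering overhead such as evaluation, checkpointing, and restarts:

```python
SECONDS_PER_DAY = 86_400
TOKENS = 1e12  # 1T training tokens, as in the tables above

# model size -> measured throughput (tokens/sec) from the tables above
runs = {"170M": 4_112_062, "390M": 3_744_914, "3B": 911_065}

for model, tokens_per_sec in runs.items():
    days = TOKENS / tokens_per_sec / SECONDS_PER_DAY
    print(f"{model}: ~{days:.1f} days of pure training compute")
```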
Training performance: NVIDIA DGX SuperPOD (20 × 8 × A100 80GB for 3B mT5 model)
NVIDIA measured the throughput of training a 3B parameter mT5 model on NVIDIA DGX SuperPOD using different numbers of nodes. Scaling from 1 node to 20 nodes yielded a 14.3× speed-up.
NVIDIA is actively working on improving the scaling performance for mT5 models. The table and chart below show the performance results.
| Nodes | 1 | 2 | 4 | 5 | 10 | 20 |
|---|---|---|---|---|---|---|
| Tokens per Second | 91,166 | 179,583 | 346,263 | 429,088 | 798,570 | 1,303,767 |
| Perfect Linear Scaling (Tokens) | 91,166 | 182,331 | 364,663 | 455,829 | 911,657 | 1,823,314 |
| Speed-up | 1x | 1.97x | 3.8x | 4.71x | 8.76x | 14.3x |
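The Speed-up and Perfect Linear Scaling rows follow directly from the measured tokens per second; a short Python sketch reproduces them and adds scaling efficiency (measured speed-up divided by node count):

```python
nodes = [1, 2, 4, 5, 10, 20]
tokens_per_sec = [91_166, 179_583, 346_263, 429_088, 798_570, 1_303_767]

base = tokens_per_sec[0]  # single-node throughput
for n, tps in zip(nodes, tokens_per_sec):
    speedup = tps / base          # matches the Speed-up row
    linear = base * n             # matches the Perfect Linear Scaling row
    efficiency = speedup / n      # fraction of perfect linear scaling achieved
    print(f"{n:>2} nodes: {speedup:5.2f}x speed-up "
          f"(linear would be {linear:,} tokens/sec, {efficiency:5.1%} efficiency)")
```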
3B mT5 NeMo Framework Throughput
Inference performance was measured for NVIDIA DGX SuperPOD (1 × 8 × A100 80GB).
Inference configurations:
Batch size: 1
Input tokens length: 60
Output tokens length: 20
Average Latency vs mT5 Model Size
| mT5 Model Size | Average Latency [ms] | TP | PP | GPUs |
|---|---|---|---|---|
| 390M | 35 | 1 | 1 | 1 |
| 3B | 102 | 2 | 1 | 2 |
| 11B | 134 | 4 | 1 | 4 |
| 23B | 230 | 4 | 1 | 4 |
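Because the batch size is 1 and the output length is fixed at 20 tokens, single-request generation throughput follows directly from the latencies above (TP and PP are the tensor- and pipeline-parallel degrees). A small Python sketch:

```python
OUTPUT_TOKENS = 20  # output token length from the inference configuration above

# (model size, average latency in ms, GPUs) from the table above
rows = [("390M", 35, 1), ("3B", 102, 2), ("11B", 134, 4), ("23B", 230, 4)]

for size, latency_ms, gpus in rows:
    tokens_per_sec = OUTPUT_TOKENS / (latency_ms / 1000)  # single-request throughput
    print(f"{size}: {tokens_per_sec:7.1f} generated tokens/sec on {gpus} GPU(s)")
```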