Important
NeMo 2.0 is an experimental feature and currently released in the dev container only: nvcr.io/nvidia/nemo:dev. Please refer to NeMo 2.0 overview for information on getting started.
mT5 Results
Training Accuracy Results
Training accuracy: NVIDIA DGX SuperPOD (4 × 8 × A100 80GB for 170M mT5 model; 8 × 8 × A100 80GB for 390M mT5 model; 20 × 8 × A100 80GB for 3B mT5 model)
NVIDIA evaluated the mT5 models on the XQuAD task. The results are shown in the table below. You can fine-tune any trained .nemo checkpoint file on the XQuAD task as described in the mT5 Fine-Tuning section.
Task-Language | Metric | 170M | 390M |
---|---|---|---|
XQuAD-de | Exact Match | 43.0 | 54.7 |
XQuAD-en | Exact Match | 63.8 | 68.8 |
XQuAD-es | Exact Match | 47.0 | 55.3 |
XQuAD-hi | Exact Match | 34.5 | 47.1 |
XQuAD-zh | Exact Match | 46.8 | 56.1 |
You can also prompt-learn on top of any trained .nemo checkpoint file on the SQuAD task as described in the T5 and mT5 Prompt Learning section. The results are shown in the table below.
Task | Metric | 390M | 3B |
---|---|---|---|
SQuAD | Exact Match | 76.86 | 81.55 |
SQuAD | F1 | 84.67 | 89.34 |
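The Exact Match scores above follow the usual SQuAD convention: a prediction scores 1 only if, after light normalization, it matches the reference answer exactly. A minimal sketch of that metric (our own illustration, not NeMo's evaluation code):

```python
import re
import string

def normalize_answer(s):
    """Lowercase, strip punctuation and articles, collapse whitespace (SQuAD convention)."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction, ground_truth):
    """1.0 if the normalized prediction equals the normalized reference, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(ground_truth))
```

The F1 row uses the same normalization but scores token-level overlap instead of an all-or-nothing match.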
Training the 170M mT5 model to convergence takes 4 days. The figure below shows the loss curve.
The table below shows the converged training loss, the throughput, and the total time to train for the 170M mT5 model, using a given number of GPUs and a given Global Batch Size (GBS).
Number of GPUs | GBS | Seq Length | Number of Tokens | Loss | Throughput (Tokens/sec) | Time to Train (days) |
---|---|---|---|---|---|---|
32 | 2048 | 512 | 1T | 1.980 | 4,112,062 | 4 |
Training the 390M mT5 model to convergence takes 4 days. The figure below shows the loss curve.
The table below shows the converged training loss, the throughput, and the total time to train for the 390M mT5 model, using a given number of GPUs and a given Global Batch Size (GBS).
Number of GPUs | GBS | Seq Length | Number of Tokens | Loss | Throughput (Tokens/sec) | Time to Train (days) |
---|---|---|---|---|---|---|
64 | 2048 | 512 | 1T | 1.584 | 3,744,914 | 4 |
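As a rough sanity check on the two tables above, dividing the 1T-token budget by the sustained throughput gives a lower bound on wall-clock training time (about 2.8 days for the 170M model and 3.1 days for the 390M model). We assume the reported 4-day figures additionally cover evaluation, checkpointing, and job-restart overhead; the tables do not state this.

```python
# Lower-bound training time implied by the tables: tokens / aggregate throughput.
SECONDS_PER_DAY = 86_400

def ideal_days(total_tokens, tokens_per_sec):
    """Days needed at the given sustained throughput, ignoring all overhead."""
    return total_tokens / tokens_per_sec / SECONDS_PER_DAY

print(f"170M: {ideal_days(1e12, 4_112_062):.2f} days")
print(f"390M: {ideal_days(1e12, 3_744_914):.2f} days")
```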
Training the 3B mT5 model to convergence takes 14 days. The figure below shows the loss curve of a fully trained model.
The table below shows the converged training loss, the throughput, and the total time to train for the 3B mT5 model, using a given number of GPUs and a given Global Batch Size (GBS).
Training Performance Results
Training performance: NVIDIA DGX SuperPOD (20 × 8 × A100 80GB for 3B mT5 model)
NVIDIA measured the throughput of training a 3B parameter mT5 model on NVIDIA DGX SuperPOD using different numbers of nodes. Scaling from 1 node to 20 nodes yielded a 14.3× speed-up.
NVIDIA is actively working on improving the scaling performance for mT5 models. The table below shows the performance results.
3B Model | 1 Node | 2 Nodes | 4 Nodes | 5 Nodes | 10 Nodes | 20 Nodes |
---|---|---|---|---|---|---|
Tokens per Second | 91,166 | 179,583 | 346,263 | 429,088 | 798,570 | 1,303,767 |
Perfect Linear Scaling (Tokens) | 91,166 | 182,331 | 364,663 | 455,829 | 911,657 | 1,823,314 |
Speed-up | 1x | 1.97x | 3.8x | 4.71x | 8.76x | 14.3x |
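The speed-up and perfect-linear-scaling rows can be reproduced from the measured throughput row. A short sketch, with the numbers copied from the table above:

```python
# Measured aggregate training throughput (tokens/s) by node count,
# taken from the 3B mT5 scaling table.
measured = {1: 91_166, 2: 179_583, 4: 346_263, 5: 429_088, 10: 798_570, 20: 1_303_767}
base = measured[1]

for nodes, tps in measured.items():
    ideal = base * nodes        # perfect linear scaling from the 1-node rate
    speedup = tps / base        # measured speed-up vs. 1 node
    efficiency = tps / ideal    # fraction of perfect scaling achieved
    print(f"{nodes:>2} nodes: {speedup:5.2f}x speed-up, {efficiency:5.1%} efficiency")
```

At 20 nodes the measured rate is roughly 71% of perfect linear scaling, which is the gap the prose above refers to.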
Inference Performance
Inference performance was measured for NVIDIA DGX SuperPOD (1 × 8 × A100 80GB).
Inference configurations:
Batch size: 1
Input tokens length: 60
Output tokens length: 20
mT5 Model Size | Average Latency [ms] | TP | PP | GPUs |
---|---|---|---|---|
380M | 35 | 1 | 1 | 1 |
3B | 102 | 2 | 1 | 2 |
11B | 134 | 4 | 1 | 4 |
23B | 230 | 4 | 1 | 4 |
TP and PP denote the tensor parallel and pipeline parallel sizes, respectively.
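Assuming the average latency covers one full request from the configuration above (60 input tokens, 20 generated tokens), which the table does not state explicitly, the implied per-request generation rate can be estimated as:

```python
# Generation rate implied by the inference table, under the assumption
# that each reported latency spans one complete 20-output-token request.
OUTPUT_TOKENS = 20  # from the inference configuration above

def tokens_per_second(latency_ms, output_tokens=OUTPUT_TOKENS):
    """Generated tokens per second implied by an end-to-end latency."""
    return output_tokens / (latency_ms / 1000.0)

for size, latency_ms in [("380M", 35), ("3B", 102), ("11B", 134), ("23B", 230)]:
    print(f"{size}: {tokens_per_second(latency_ms):.0f} generated tokens/s")
```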