Important
NeMo 2.0 is an experimental feature and is currently released in the dev container only: nvcr.io/nvidia/nemo:dev. Please refer to the NeMo 2.0 overview for information on getting started.
T5 Results
Training Accuracy Results
You can also run prompt learning on top of any trained .nemo checkpoint file for the SQuAD task, as described in the T5 and mT5 Prompt Learning section. The results are shown in the table below.
| Task | Metric | 220M | 3B |
|---|---|---|---|
| SQuAD | Exact Match | 74.20 | 78.52 |
| SQuAD | F1 | 84.54 | 87.17 |
Training the 220M T5 model to convergence takes 4 days, and the loss curve is shown in the figure below:
The table below shows the converged training loss, the throughput, and the total time to train for the 220M T5 model, using a given number of GPUs and a given Global Batch Size (GBS).
| Number of GPUs | GBS | Seq Length | Number of Tokens | Loss | Throughput (Tokens/sec) | Time to Train (days) |
|---|---|---|---|---|---|---|
| 32 | 2048 | 512 | 1T | 1.501 | 3,273,728 | 4 |
Training the 3B T5 model to convergence takes 11 days, and the loss curve of a fully trained model can be seen in the figure below:
The table below shows the converged training loss, the throughput, and the total time to train for the 3B T5 model, using a given number of GPUs and a given Global Batch Size (GBS).
| Number of GPUs | GBS | Seq Length | Number of Tokens | Loss | Throughput (Tokens/sec) | Time to Train (days) |
|---|---|---|---|---|---|---|
| 160 | 2160 | 512 | 1T | 1.147 | 1,395,131 | 11 |
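As a rough sanity check, the time-to-train column can be approximated from the token count and measured throughput alone. The sketch below is an illustrative estimate (not part of NeMo) that computes this idealized lower bound; the reported wall-clock times also include overheads such as checkpointing and evaluation, so they are somewhat higher.

```python
# Illustrative estimate only: idealized time-to-train from the tables above.
# Assumes constant measured throughput and ignores checkpointing/evaluation
# overhead, so real wall-clock times are longer.

SECONDS_PER_DAY = 86_400

def estimated_days(total_tokens: float, tokens_per_sec: float) -> float:
    """Lower-bound training time in days at a constant token throughput."""
    return total_tokens / (tokens_per_sec * SECONDS_PER_DAY)

one_trillion = 1e12
print(f"220M T5: {estimated_days(one_trillion, 3_273_728):.1f} days (reported: 4)")
print(f"3B   T5: {estimated_days(one_trillion, 1_395_131):.1f} days (reported: 11)")
```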
Training Performance Results
Training performance: NVIDIA DGX SuperPOD (20 × 8 × A100 80GB for 3B T5 Model)
NVIDIA measured the throughput of training a 3B parameter T5 model on NVIDIA DGX SuperPOD using different numbers of nodes. Scaling from 1 node to 20 nodes yielded a 14.68× speed-up, consistent with the table below.
NVIDIA is actively working on improving the scaling performance for T5 models. The table and chart below show the performance results.
| 3B T5 | 1 node | 2 nodes | 4 nodes | 5 nodes | 10 nodes | 20 nodes |
|---|---|---|---|---|---|---|
| Tokens per Second | 110,769 | 215,579 | 417,644 | 515,100 | 957,506 | 1,626,353 |
| Perfect Linear Scaling (Tokens) | 110,769 | 221,538 | 443,077 | 553,846 | 1,107,692 | 2,215,385 |
| Speed-up | 1x | 1.95x | 3.77x | 4.65x | 8.64x | 14.68x |
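The derived rows in the table can be reproduced from the measured throughput alone: the perfect-linear-scaling row multiplies the single-node throughput by the node count, and the speed-up row divides each measured throughput by the single-node baseline. The short sketch below illustrates the arithmetic (small rounding differences from the table are expected).

```python
# Reproduces the derived rows of the scaling table from the measured values.
nodes = [1, 2, 4, 5, 10, 20]
measured_tps = [110_769, 215_579, 417_644, 515_100, 957_506, 1_626_353]

baseline = measured_tps[0]
for n, tps in zip(nodes, measured_tps):
    linear = baseline * n          # perfect linear scaling
    speedup = tps / baseline       # measured speed-up vs. 1 node
    efficiency = tps / linear      # fraction of ideal scaling achieved
    print(f"{n:>2} nodes: {tps:>9,} tok/s | linear {linear:>9,} | "
          f"speed-up {speedup:5.2f}x | efficiency {efficiency:6.1%}")
```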
Inference Performance
Inference performance was measured for NVIDIA DGX SuperPOD (1 × 8 × A100 80GB). The results are shown in the table below.
Inference configurations:

- Batch size: 1
- Input tokens length: 60
- Output tokens length: 20
| T5 Model size | Average latency [ms] | TP (Tensor Parallelism) | PP (Pipeline Parallelism) | GPUs |
|---|---|---|---|---|
| 3B | 94 | 2 | 1 | 2 |
| 11B | 123 | 4 | 1 | 4 |
| 23B | 213 | 4 | 1 | 4 |
| 41B | 332 | 8 | 1 | 8 |
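For a rough per-token view of these numbers, the average latency can be divided by the 20 generated output tokens. This is only an approximation (it attributes the full latency to output generation), but it makes the configurations easier to compare:

```python
# Approximate per-token generation metrics from the latency table above.
# Assumes the reported average latency is dominated by the 20 output tokens.
configs = {
    # model size: (average latency in ms, GPUs used)
    "3B": (94, 2),
    "11B": (123, 4),
    "23B": (213, 4),
    "41B": (332, 8),
}
OUTPUT_TOKENS = 20

for model, (latency_ms, gpus) in configs.items():
    per_token_ms = latency_ms / OUTPUT_TOKENS
    tokens_per_sec = OUTPUT_TOKENS / (latency_ms / 1000)
    print(f"{model:>3}: {per_token_ms:4.1f} ms/token, "
          f"{tokens_per_sec:5.1f} tokens/s on {gpus} GPUs")
```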