Important
NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Please refer to the NeMo 2.0 overview for information on getting started.
T5 Results
Training Accuracy Results
You can also run prompt learning on the SQuAD task on top of any trained .nemo checkpoint file, as described in the section T5 and mT5 Prompt Learning. The results are shown in the table below.
| Task | Metric | 220M | 3B | 
|---|---|---|---|
| SQuAD | Exact Match | 74.20 | 78.52 | 
| SQuAD | F1 | 84.54 | 87.17 | 
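The Exact Match and F1 numbers above follow the standard SQuAD evaluation convention. As an illustration only (this is a minimal sketch, not the exact NeMo evaluation script), the two metrics are typically computed like this:

```python
import re
import string
from collections import Counter

def normalize(text):
    """Standard SQuAD normalization: lowercase, drop punctuation,
    remove articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))

def f1(prediction, reference):
    """Token-level F1 between prediction and reference."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Example: one prediction scored against one reference answer.
print(exact_match("the Eiffel Tower", "Eiffel Tower"))               # 1.0 after normalization
print(round(f1("the Eiffel Tower in Paris", "Eiffel Tower"), 2))     # 0.67
```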
Training the 220M T5 model to convergence takes 4 days, and the loss curve is shown in the figure below:
220M T5 Training Loss
The table below shows the converged training loss, the throughput, and the total time to train for the 220M T5 model, using a given number of GPUs and a given Global Batch Size (GBS).
| Number of GPUs | GBS | Seq Length | Number of Tokens | Loss | Throughput (Tokens/sec) | Time to Train (days) | 
|---|---|---|---|---|---|---|
| 32 | 2048 | 512 | 1T | 1.501 | 3,273,728 | 4 | 
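The token budget and sustained throughput determine the expected training time directly. A minimal sketch of that arithmetic, using the 220M values from the table above (the result is roughly consistent with the 4 days reported):

```python
def days_to_train(num_tokens, tokens_per_sec):
    """Estimate wall-clock training time from the total token budget
    and the sustained training throughput."""
    return num_tokens / tokens_per_sec / 86_400  # seconds per day

# 220M T5: 1T tokens at ~3.27M tokens/sec (values from the table above).
print(round(days_to_train(1e12, 3_273_728), 2))  # ~3.54 days, reported as 4 days
```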
Training the 3B T5 model to convergence takes 11 days, and the loss curve of a fully trained model can be seen in the figure below:
3B T5 Training Loss
The table below shows the converged training loss, the throughput, and the total time to train for the 3B T5 model, using a given number of GPUs and a given Global Batch Size (GBS).
| Number of GPUs | GBS | Seq Length | Number of Tokens | Loss | Throughput (Tokens/sec) | Time to Train (days) | 
|---|---|---|---|---|---|---|
| 160 | 2160 | 512 | 1T | 1.147 | 1,395,131 | 11 | 
Training Performance Results
Training performance: NVIDIA DGX SuperPOD (20 × 8 × A100 80GB for 3B T5 Model)
NVIDIA measured the throughput of training a 3B parameter T5 model on NVIDIA DGX SuperPOD using different numbers of nodes. Scaling from 1 node to 20 nodes yielded a 14.68× speed-up.
NVIDIA is actively working on improving the scaling performance for T5 models. The table and chart below show the performance results.
| 3B T5 | 1 node | 2 nodes | 4 nodes | 5 nodes | 10 nodes | 20 nodes |
|---|---|---|---|---|---|---|
| Tokens per Second | 110,769 | 215,579 | 417,644 | 515,100 | 957,506 | 1,626,353 |
| Perfect Linear Scaling (Tokens) | 110,769 | 221,538 | 443,077 | 553,846 | 1,107,692 | 2,215,385 |
| Speed-up | 1x | 1.95x | 3.77x | 4.65x | 8.64x | 14.68x |
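The speed-up and perfect-linear-scaling rows above can be derived from the measured throughput alone. A short sketch of that calculation, which also reports scaling efficiency (measured throughput as a fraction of the ideal linear value):

```python
# Measured training throughput in tokens/sec at each node count (from the table above).
measured = {1: 110_769, 2: 215_579, 4: 417_644, 5: 515_100, 10: 957_506, 20: 1_626_353}
baseline = measured[1]

for nodes, tps in measured.items():
    speedup = tps / baseline          # relative to a single node
    ideal = nodes * baseline          # perfect linear scaling
    efficiency = tps / ideal * 100    # percent of ideal
    print(f"{nodes:>2} nodes: {speedup:5.2f}x speed-up, {efficiency:5.1f}% scaling efficiency")
```

At 20 nodes this reproduces the 14.68× speed-up shown in the table, i.e. about 73% of perfect linear scaling.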
3B T5 NeMo Framework Throughput
Inference Performance
Inference performance was measured for NVIDIA DGX SuperPOD (1 × 8 × A100 80GB). The results are shown in the table below.
Inference configurations:
- Batch size: 1 
- Input tokens length: 60 
- Output tokens length: 20 
Average Latency vs T5 Model Size
| T5 Model size | Average latency [ms] | TP | PP | GPUs | 
|---|---|---|---|---|
| 3B | 94 | 2 | 1 | 2 | 
| 11B | 123 | 4 | 1 | 4 | 
| 23B | 213 | 4 | 1 | 4 | 
| 41B | 332 | 8 | 1 | 8 |
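For a rough per-token view of these latencies, the end-to-end number can be divided by the 20 generated tokens. This is only an approximation, since it folds the one-time encoder pass over the 60 input tokens into the per-token cost:

```python
# Average end-to-end latency in ms from the inference table above
# (batch size 1, 60 input tokens, 20 output tokens).
latency_ms = {"3B": 94, "11B": 123, "23B": 213, "41B": 332}
output_tokens = 20

for model, ms in latency_ms.items():
    per_token_ms = ms / output_tokens            # approximate time per generated token
    tokens_per_sec = output_tokens / (ms / 1000) # approximate generation rate
    print(f"{model}: {per_token_ms:.1f} ms/token, ~{tokens_per_sec:.0f} generated tokens/sec")
```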