T5 Results
Training Accuracy Results
You can also run prompt learning on top of any trained .nemo checkpoint file on the SQuAD task, as described in the section T5 and mT5 Prompt Learning. The results are shown in the table below.
| Task  | Metric      | 220M  | 3B    |
|-------|-------------|-------|-------|
| SQuAD | Exact Match | 74.20 | 78.52 |
| SQuAD | F1          | 84.54 | 87.17 |
Training the 220M T5 model to convergence takes 4 days, and the loss curve is shown in the figure below:
The table below shows the converged training loss, the throughput, and the total time to train for the 220M T5 model, using a given number of GPUs and a given Global Batch Size (GBS).
| Number of GPUs | GBS  | Seq Length | Number of Tokens | Loss  | Throughput (Tokens/sec) | Time to Train (days) |
|----------------|------|------------|------------------|-------|-------------------------|----------------------|
| 32             | 2048 | 512        | 1T               | 1.501 | 3,273,728               | 4                    |
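The "Time to Train" column follows directly from the token budget and throughput: days ≈ total tokens / (tokens per second × 86,400). As a sanity check, this small sketch (the function name is ours, not part of NeMo) reproduces the roughly 4-day figure for the 220M model:

```python
# Sanity-check "Time to Train" from the table above:
# days = total tokens / throughput, converted from seconds to days.

def days_to_train(total_tokens: float, tokens_per_sec: float) -> float:
    """Wall-clock days needed to consume `total_tokens` at a steady throughput."""
    return total_tokens / tokens_per_sec / 86_400  # 86,400 seconds per day

# 1T tokens at 3,273,728 tokens/sec -> about 3.5 days, reported as 4 in the table.
print(round(days_to_train(1e12, 3_273_728), 2))
```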
Training the 3B T5 model to convergence takes 11 days, and the loss curve of a fully trained model can be seen in the figure below:
The table below shows the converged training loss, the throughput, and the total time to train for the 3B T5 model, using a given number of GPUs and a given Global Batch Size (GBS).
| Number of GPUs | GBS  | Seq Length | Number of Tokens | Loss  | Throughput (Tokens/sec) | Time to Train (days) |
|----------------|------|------------|------------------|-------|-------------------------|----------------------|
| 160            | 2160 | 512        | 1T               | 1.147 | 1,395,131               | 11                   |
Training Performance Results
Training performance: NVIDIA DGX SuperPOD (20 × 8 × A100 80GB for 3B T5 Model)
NVIDIA measured the throughput of training a 3B parameter T5 model on NVIDIA DGX SuperPOD using different numbers of nodes. Scaling from 1 node to 20 nodes yielded a 14.68× speedup.
NVIDIA is actively working on improving the scaling performance for T5 models. The table and chart below show the performance results.
Throughput and speedup for the 3B model:

| Nodes                           | 1      | 2      | 4      | 5      | 10      | 20      |
|---------------------------------|--------|--------|--------|--------|---------|---------|
| Tokens per Second               | 110769 | 215579 | 417644 | 515100 | 957506  | 1626353 |
| Perfect Linear Scaling (Tokens) | 110769 | 221538 | 443077 | 553846 | 1107692 | 2215385 |
| Speedup                         | 1x     | 1.95x  | 3.77x  | 4.65x  | 8.64x   | 14.68x  |
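The speedup and perfect-linear-scaling rows above are derived from the measured throughput: speedup is throughput relative to 1 node, and scaling efficiency is measured throughput divided by the perfect-linear value. A minimal sketch that recomputes both from the table's numbers:

```python
# Recompute the speedup column and the scaling efficiency from the
# measured throughput numbers in the table above (3B T5 model).

nodes = [1, 2, 4, 5, 10, 20]
tokens_per_sec = [110769, 215579, 417644, 515100, 957506, 1626353]

base = tokens_per_sec[0]
for n, tps in zip(nodes, tokens_per_sec):
    speedup = tps / base        # measured speedup vs. a single node
    perfect = base * n          # perfect linear scaling (tokens/sec)
    efficiency = tps / perfect  # fraction of linear scaling achieved
    print(f"{n:>2} nodes: {speedup:5.2f}x speedup, {efficiency:6.1%} efficiency")
```

At 20 nodes this gives the 14.68× speedup from the table, i.e. about 73% of perfect linear scaling.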
Inference Performance
Inference performance was measured on an NVIDIA DGX SuperPOD (1 × 8 × A100 80GB). The results are shown in the table below.
Inference configurations:
Batch size: 1
Input tokens length: 60
Output tokens length: 20
| T5 Model size | Average latency [ms] | TP | PP | GPUs |
|---------------|----------------------|----|----|------|
| 3B            | 94                   | 2  | 1  | 2    |
| 11B           | 123                  | 4  | 1  | 4    |
| 23B           | 213                  | 4  | 1  | 4    |
| 41B           | 332                  | 8  | 1  | 8    |

TP and PP denote the tensor-parallel and pipeline-parallel sizes, respectively.
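Since every configuration generates 20 output tokens, the table also implies a rough per-generated-token latency. The sketch below simply divides end-to-end latency by the output length, so it folds the encoder/prefill cost for the 60 input tokens into the per-token figure; it is an illustration of the table's numbers, not a NeMo measurement:

```python
# Rough per-output-token latency implied by the inference table above
# (batch size 1, 60 input tokens, 20 output tokens).

configs = {"3B": 94, "11B": 123, "23B": 213, "41B": 332}  # avg latency in ms
OUTPUT_TOKENS = 20

for size, latency_ms in configs.items():
    print(f"{size}: {latency_ms / OUTPUT_TOKENS:.2f} ms per generated token")
```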