Training Accuracy Results
Training accuracy: NVIDIA DGX SuperPOD (8 × 8 × A100 80GB for the 126M GPT model; 16 × 8 × A100 80GB for the 5B GPT model)
NVIDIA evaluated the 126M parameter and 5B parameter models on eight different language tasks. The results are shown in the table below. All of the tasks are provided as part of the evaluation harness, so you can evaluate any .nemo checkpoint file on all of them.
| Task | Metric | 126M | 5B |
| --- | --- | --- | --- |
| Lambada | Accuracy | 38.70% | 68.93% |
| Lambada | PPL | 25.8 | 4.22 |
| Boolq | Accuracy | 56.94% | 65.29% |
| Race | Accuracy | 28.71% | 38.66% |
| Race | Accuracy Norm | 34.74% | 41.62% |
| Piqa | Accuracy | 61.21% | 73.88% |
| Piqa | Accuracy Norm | 61.97% | 75.40% |
| Hellaswag | Accuracy | 28.48% | 46.45% |
| Hellaswag | Accuracy Norm | 29.54% | 60.85% |
| Winogrande | Accuracy | 50.43% | 60.77% |
| Wikitext2 | Word PPL | 31.35 | 12.36 |
| Wikitext2 | Byte PPL | 1.9 | 1.6 |
| Wikitext2 | Bits per Byte PPL | 0.64 | 0.47 |
| Wikitext103 | Word PPL | 31.35 | 12.36 |
| Wikitext103 | Byte PPL | 1.9 | 1.6 |
| Wikitext103 | Bits per Byte PPL | 0.64 | 0.47 |
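For reference, the three Wikitext perplexity metrics are different normalizations of the same measured log-likelihood, following the usual evaluation-harness definitions. Below is a minimal sketch of the conversions; the corpus statistics are made-up placeholders, and the independently rounded values in the table will not reproduce each other exactly.

```python
import math

# Hypothetical corpus statistics; an evaluation harness reports the total
# negative log-likelihood (in nats) together with word and byte counts.
nll_nats = 850_000.0
num_words = 245_000
num_bytes = 1_290_000

word_ppl = math.exp(nll_nats / num_words)             # perplexity per word
byte_ppl = math.exp(nll_nats / num_bytes)             # perplexity per byte
bits_per_byte = nll_nats / (num_bytes * math.log(2))  # equals log2(byte_ppl)

assert abs(bits_per_byte - math.log2(byte_ppl)) < 1e-9
print(f"Word PPL {word_ppl:.2f}, Byte PPL {byte_ppl:.3f}, "
      f"Bits per Byte {bits_per_byte:.3f}")
```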
Training the 5B GPT model to convergence takes 6.5 days, and the loss curve is shown in the figure below.
5B GPT Training Loss
The table below shows the converged training loss, the throughput, and the total time to train for the 5B GPT model, using a given number of GPUs and a given Global Batch Size (GBS).
| #GPUs | GBS | Seq Length | #Tokens | Loss | Throughput (Tokens/sec) | Time to Train (days) |
| --- | --- | --- | --- | --- | --- | --- |
| 160 | 1440 | 2048 | 300B | 1.685 | 726,384 | 4.8 |
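Two quick sanity checks on this row: the time-to-train column follows from the token budget divided by the measured throughput, and the converged loss (in nats per token) corresponds to a token-level training perplexity of e^loss. A minimal sketch:

```python
import math

tokens = 300e9         # token budget from the table (300B)
throughput = 726_384   # measured tokens/sec from the table
loss = 1.685           # converged training loss, nats per token

days = tokens / throughput / 86_400
print(f"time to train ≈ {days:.1f} days")          # ≈ 4.8, matching the table
print(f"token perplexity ≈ {math.exp(loss):.2f}")  # e^1.685 ≈ 5.39
```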
Training Performance Results
Training performance: NVIDIA DGX SuperPOD (16 × 8 × A100 80GB for the 5B GPT model; 128 × 8 × A100 80GB for the 175B GPT model)
NVIDIA measured the throughput of training 5B and 175B parameter GPT models on different numbers of DGX nodes and achieved near-linear scaling. For example, scaling from 1 node to 32 nodes with the 5B model yielded a 28.73× speed-up, and scaling from 8 nodes to 128 nodes (16× more) with the 175B model yielded a 14.62× speed-up. The tables and charts below show the performance results.
| 5B GPT Model | 1 Node | 2 Nodes | 4 Nodes | 8 Nodes | 16 Nodes | 32 Nodes |
| --- | --- | --- | --- | --- | --- | --- |
| Tokens per Second | 40345 | 79815 | 161754 | 312774 | 659481 | 1159288 |
| Perfect Linear Scaling (Tokens) | 40345 | 80690 | 161380 | 322760 | 645520 | 1291040 |
| Speed-up | 1x | 1.98x | 4.01x | 7.75x | 16.35x | 28.73x |
5B GPT NeMo Framework Throughput
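The Perfect Linear Scaling and Speed-up rows in these throughput tables are derived from the measured tokens per second; a minimal sketch that reproduces them for the 5B run (the 175B table below works the same way, from its 8-node baseline):

```python
nodes = [1, 2, 4, 8, 16, 32]
tps = [40345, 79815, 161754, 312774, 659481, 1159288]  # measured tokens/sec, 5B GPT

base = tps[0]
for n, t in zip(nodes, tps):
    speedup = t / base       # measured speed-up over the baseline
    linear = base * n        # perfect linear scaling from the baseline
    efficiency = t / linear  # fraction of perfect scaling retained
    print(f"{n:>3} nodes: {speedup:5.2f}x speed-up, {efficiency:6.1%} efficiency")
# 32 nodes: 28.73x speed-up, ~89.8% of perfect linear scaling
```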
| 175B GPT Model | 8 Nodes | 16 Nodes | 32 Nodes | 64 Nodes | 128 Nodes |
| --- | --- | --- | --- | --- | --- |
| Tokens per Second | 7500 | 14950 | 29537 | 58211 | 109684 |
| Perfect Linear Scaling (Tokens) | 7500 | 15000 | 30000 | 60000 | 120000 |
| Speed-up | 1x | 1.99x | 3.94x | 7.76x | 14.62x |
175B GPT NeMo Framework Throughput
Inference Performance
Inference performance was measured for NVIDIA DGX SuperPOD (1 × 8 × A100 80GB).
Inference configurations:
Batch size: 1
Input tokens length: 60
Output tokens length: 20
Average Latency vs GPT Model Size
| GPT Model size | Average latency [ms] | TP | PP | GPUs |
| --- | --- | --- | --- | --- |
| 5B | 87 | 8 | 4 | 32 |
| 20B | 202 | 8 | 4 | 32 |
| 175B | 893 | 8 | 4 | 32 |
| 530B | 977 | 32 | 1 | 32 |
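Two observations help when reading this table: the GPU count in each row is the product of the tensor-parallel (TP) and pipeline-parallel (PP) degrees, and, assuming the reported latency covers the full 20-token generation, it implies a per-output-token cost. A minimal sketch over the table's rows:

```python
# (model size, avg latency ms, TP, PP, GPUs) copied from the table above
rows = [
    ("5B", 87, 8, 4, 32),
    ("20B", 202, 8, 4, 32),
    ("175B", 893, 8, 4, 32),
    ("530B", 977, 32, 1, 32),
]
output_tokens = 20  # generation length used in the benchmark

for name, latency_ms, tp, pp, gpus in rows:
    assert tp * pp == gpus  # GPUs = TP x PP in these configurations
    # assumes the reported latency is end-to-end for all 20 output tokens
    print(f"{name}: ~{latency_ms / output_tokens:.1f} ms per output token")
```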
T5 Results
Training Accuracy Results
You can also prompt-learn on top of any .nemo-trained checkpoint file on the SQuAD task mentioned in the section T5 and mT5 Prompt Learning. The results are shown in the table below.
| Task | Metric | 220M | 3B |
| --- | --- | --- | --- |
| SQuAD | Exact Match | 74.20 | 78.52 |
| SQuAD | F1 | 84.54 | 87.17 |
Training the 220M T5 model to convergence takes 4 days, and the loss curve is shown in the figure below:
220M T5 Training Loss
The table below shows the converged training loss, the throughput, and the total time to train for the 220M T5 model, using a given number of GPUs and a given Global Batch Size (GBS).
| #GPUs | GBS | Seq Length | #Tokens | Loss | Throughput (Tokens/sec) | Time to Train (days) |
| --- | --- | --- | --- | --- | --- | --- |
| 32 | 2048 | 512 | 1T | 1.501 | 3,273,728 | 4 |
Training the 3B T5 model to convergence takes 11 days, and the loss curve of a fully trained model can be seen in the figure below:
3B T5 Training Loss
The table below shows the converged training loss, the throughput, and the total time to train for the 3B T5 model, using a given number of GPUs and a given Global Batch Size (GBS).
| #GPUs | GBS | Seq Length | #Tokens | Loss | Throughput (Tokens/sec) | Time to Train (days) |
| --- | --- | --- | --- | --- | --- | --- |
| 160 | 2160 | 512 | 1T | 1.147 | 1,395,131 | 11 |
Training Performance Results
Training performance: NVIDIA DGX SuperPOD (20 × 8 × A100 80GB for 3B T5 Model)
NVIDIA measured the throughput of training a 3B parameter T5 model on NVIDIA DGX SuperPOD using different numbers of nodes. Scaling from 1 node to 20 nodes yielded a 14.68× speed-up.
NVIDIA is actively working on improving the scaling performance for T5 models. The table and chart below show the performance results.
| 3B T5 Model | 1 Node | 2 Nodes | 4 Nodes | 5 Nodes | 10 Nodes | 20 Nodes |
| --- | --- | --- | --- | --- | --- | --- |
| Tokens per Second | 110769 | 215579 | 417644 | 515100 | 957506 | 1626353 |
| Perfect Linear Scaling (Tokens) | 110769 | 221538 | 443077 | 553846 | 1107692 | 2215385 |
| Speed-up | 1x | 1.95x | 3.77x | 4.65x | 8.64x | 14.68x |
3B T5 NeMo Framework Throughput
Inference Performance
Inference performance was measured for NVIDIA DGX SuperPOD (1 × 8 × A100 80GB). The results are shown in the table below.
Inference configurations:
Batch size: 1
Input tokens length: 60
Output tokens length: 20
Average Latency vs T5 Model Size
| T5 Model size | Average latency [ms] | TP | PP | GPUs |
| --- | --- | --- | --- | --- |
| 3B | 94 | 2 | 1 | 2 |
| 11B | 123 | 4 | 1 | 4 |
| 23B | 213 | 4 | 1 | 4 |
| 41B | 332 | 8 | 1 | 8 |
mT5 Results
Training Accuracy Results
Training accuracy: NVIDIA DGX SuperPOD (4 × 8 × A100 80GB for the 170M mT5 model; 8 × 8 × A100 80GB for the 390M mT5 model; 20 × 8 × A100 80GB for the 3B mT5 model)
NVIDIA evaluated the mT5 models on the XQuAD task. The results are shown in the table below. You can fine-tune on top of any .nemo-trained checkpoint file on the XQuAD task mentioned in the section mT5 Fine-Tuning.
| Task-Language | Metric | 170M | 390M |
| --- | --- | --- | --- |
| XQuAD-de | Exact Match | 43.0 | 54.7 |
| XQuAD-en | Exact Match | 63.8 | 68.8 |
| XQuAD-es | Exact Match | 47.0 | 55.3 |
| XQuAD-hi | Exact Match | 34.5 | 47.1 |
| XQuAD-zh | Exact Match | 46.8 | 56.1 |
You can also prompt-learn on top of any .nemo-trained checkpoint file on the SQuAD task mentioned in the section T5 and mT5 Prompt Learning. The results are shown in the table below.
| Task | Metric | 390M | 3B |
| --- | --- | --- | --- |
| SQuAD | Exact Match | 76.86 | 81.55 |
| SQuAD | F1 | 84.67 | 89.34 |
Training the 170M mT5 model to convergence takes 4 days. The figure below shows the loss curve.
170M mT5 Training Loss
The table below shows the converged training loss, the throughput, and the total time to train for the 170M mT5 model, using a given number of GPUs and a given Global Batch Size (GBS).
| #GPUs | GBS | Seq Length | #Tokens | Loss | Throughput (Tokens/sec) | Time to Train (days) |
| --- | --- | --- | --- | --- | --- | --- |
| 32 | 2048 | 512 | 1T | 1.980 | 4,112,062 | 4 |
Training the 390M mT5 model to convergence takes 4 days. The figure below shows the loss curve.
390M mT5 Training Loss
The table below shows the converged training loss, the throughput, and the total time to train for the 390M mT5 model, using a given number of GPUs and a given Global Batch Size (GBS).
| #GPUs | GBS | Seq Length | #Tokens | Loss | Throughput (Tokens/sec) | Time to Train (days) |
| --- | --- | --- | --- | --- | --- | --- |
| 64 | 2048 | 512 | 1T | 1.584 | 3,744,914 | 4 |
Training the 3B mT5 model to convergence takes 14 days. The figure below shows the loss curve of a fully trained model:
3B mT5 Training Loss
The table below shows the converged training loss, the throughput, and the total time to train for the 3B mT5 model, using a given number of GPUs and a given Global Batch Size (GBS).
| #GPUs | GBS | Seq Length | #Tokens | Loss | Throughput (Tokens/sec) | Time to Train (days) |
| --- | --- | --- | --- | --- | --- | --- |
| 169 | 1920 | 512 | 1T | 1.134 | 911,065 | 14 |
Training Performance Results
Training performance: NVIDIA DGX SuperPOD (20 × 8 × A100 80GB for 3B mT5 model)
NVIDIA measured the throughput of training a 3B parameter mT5 model on NVIDIA DGX SuperPOD using different numbers of nodes. Scaling from 1 node to 20 nodes yielded a 14.3× speed-up.
NVIDIA is actively working on improving the scaling performance for mT5 models. The table and chart below show the performance results.
| 3B mT5 Model | 1 Node | 2 Nodes | 4 Nodes | 5 Nodes | 10 Nodes | 20 Nodes |
| --- | --- | --- | --- | --- | --- | --- |
| Tokens per Second | 91166 | 179583 | 346263 | 429088 | 798570 | 1303767 |
| Perfect Linear Scaling (Tokens) | 91166 | 182331 | 364663 | 455829 | 911657 | 1823314 |
| Speed-up | 1x | 1.97x | 3.8x | 4.71x | 8.76x | 14.3x |
3B mT5 NeMo Framework Throughput
Inference Performance
Inference performance was measured for NVIDIA DGX SuperPOD (1 × 8 × A100 80GB).
Inference configurations:
Batch size: 1
Input tokens length: 60
Output tokens length: 20
Average Latency vs mT5 Model Size
| mT5 Model size | Average latency [ms] | TP | PP | GPUs |
| --- | --- | --- | --- | --- |
| 380M | 35 | 1 | 1 | 1 |
| 3B | 102 | 2 | 1 | 2 |
| 11B | 134 | 4 | 1 | 4 |
| 23B | 230 | 4 | 1 | 4 |
BERT Results
Training Accuracy Results
Training accuracy: NVIDIA DGX SuperPOD (16 × 8 × A100 80GB for the 4B BERT model)
Training the 4B BERT model for 95 billion tokens takes 1.5 days. The figure below shows the loss curve.

4B BERT Training Loss (220B Tokens)
The table below shows the converged training loss, the throughput, and the total time to train for the 4B BERT model, using a given number of GPUs and a given Global Batch Size (GBS).
Training Performance Results
Training performance: NVIDIA DGX SuperPOD (20 × 8 × A100 80GB for 4B BERT model)
NVIDIA measured the throughput of training a 4B parameter BERT model on NVIDIA DGX SuperPOD using different numbers of nodes. Scaling from 1 node to 16 nodes yielded a 12.71× speed-up. The table and chart below show the performance results.
| 4B BERT Model | 1 Node | 2 Nodes | 4 Nodes | 8 Nodes | 16 Nodes |
| --- | --- | --- | --- | --- | --- |
| Tokens per Second | 57287 | 108695 | 215358 | 393167 | 728178 |
| Perfect Linear Scaling (Tokens) | 57287 | 114574 | 229148 | 458296 | 916592 |
| Speed-up | 1x | 1.89x | 3.75x | 6.86x | 12.71x |

4B BERT NeMo Framework Throughput