Important
NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Please refer to the NeMo 2.0 overview for information on getting started.
GPT Results
Training Accuracy Results
Training accuracy, NVIDIA DGX SuperPOD™:
- 8 × 8 × A100 80GB for the 126M GPT model
- 16 × 8 × A100 80GB for the 5B GPT model
NVIDIA evaluated the 126M-parameter and 5B-parameter models on eight different language tasks. The results are shown in the table below. All of the tasks are provided as part of the evaluation harness, so you can evaluate any .nemo checkpoint file on all of them. (A short sketch of how the perplexity metrics relate follows the table.)
| Task | Metric | 126M | 5B |
|---|---|---|---|
| Lambada | Accuracy | 38.70% | 68.93% |
| | PPL | 25.8 | 4.22 |
| Boolq | Accuracy | 56.94% | 65.29% |
| Race | Accuracy | 28.71% | 38.66% |
| | Accuracy Norm | 34.74% | 41.62% |
| Piqa | Accuracy | 61.21% | 73.88% |
| | Accuracy Norm | 61.97% | 75.40% |
| Hellaswag | Accuracy | 28.48% | 46.45% |
| | Accuracy Norm | 29.54% | 60.85% |
| Winogrande | Accuracy | 50.43% | 60.77% |
| Wikitext2 | Word PPL | 31.35 | 12.36 |
| | Byte PPL | 1.9 | 1.6 |
| | Bits per Byte PPL | 0.64 | 0.47 |
| Wikitext103 | Word PPL | 31.35 | 12.36 |
| | Byte PPL | 1.9 | 1.6 |
| | Bits per Byte PPL | 0.64 | 0.47 |
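All of the perplexity variants above are normalizations of the same corpus-level negative log-likelihood. As a point of reference, the Wikitext rows are internally consistent under a natural-log (nats-per-byte) reading of the "Bits per Byte PPL" column (ln 1.9 ≈ 0.64, ln 1.6 ≈ 0.47). A minimal Python sketch of how the metrics relate, using illustrative counts rather than values from the harness:

```python
import math

# Minimal sketch of how the Wikitext-style metrics relate, given one
# corpus-level negative log-likelihood (NLL, in nats) plus word and byte
# counts. The counts below are illustrative assumptions, not harness values.

def wikitext_metrics(total_nll_nats: float, n_words: int, n_bytes: int):
    word_ppl = math.exp(total_nll_nats / n_words)   # "Word PPL"
    byte_ppl = math.exp(total_nll_nats / n_bytes)   # "Byte PPL"
    nats_per_byte = total_nll_nats / n_bytes        # consistent with the
    return word_ppl, byte_ppl, nats_per_byte        # "Bits per Byte PPL" column

# Consistency check against the 5B Wikitext2 row (Word PPL 12.36, BPB 0.47):
n_words = 1_000_000                   # hypothetical corpus size
nll = math.log(12.36) * n_words       # corpus NLL implied by the Word PPL
n_bytes = round(nll / 0.47)           # byte count implied by the BPB value
print(wikitext_metrics(nll, n_words, n_bytes))
# -> (12.36, ~1.6, ~0.47)
```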
Training the 5B GPT model to convergence takes 6.5 days, and the loss curve is shown in the figure below.
The table below shows the converged training loss, the throughput, and the total time to train for the 5B GPT model, using a given number of GPUs and a given Global Batch Size (GBS).
| Number of GPUs | GBS | Seq Length | Number of Tokens | Loss | Throughput (Tokens/sec) | Time to Train (days) |
|---|---|---|---|---|---|---|
| 160 | 1440 | 2048 | 300B | 1.685 | 726,384 | 4.8 |
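The "Time to Train" column follows directly from the token budget and the measured throughput; a quick arithmetic check using the values in the row above:

```python
# Sanity-check "Time to Train": token budget / throughput, converted to days.
tokens = 300e9              # "Number of Tokens" (300B)
tokens_per_sec = 726_384    # measured throughput on 160 GPUs
seconds_per_day = 86_400

print(f"{tokens / tokens_per_sec / seconds_per_day:.1f} days")  # -> 4.8 days
```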
Training Performance Results
Training performance:
- NVIDIA DGX SuperPOD (16 × 8 × A100 80GB) for the 5B GPT model
- NVIDIA DGX SuperPOD (128 × 8 × A100 80GB) for the 175B GPT model
NVIDIA measured the throughput of training the 5B and 175B parameter GPT models on different numbers of DGX nodes and achieved near-linear scaling. For example, scaling the 5B model from 1 node to 32 nodes yielded a 28.73x speed-up, and scaling the 175B model from 8 nodes to 128 nodes (16x more) yielded a 14.62x speed-up. The tables below show the measured results; the short script after them reproduces the speed-up and perfect-scaling arithmetic.
5B GPT model:

| Nodes | 1 | 2 | 4 | 8 | 16 | 32 |
|---|---|---|---|---|---|---|
| Tokens per Second | 40345 | 79815 | 161754 | 312774 | 659481 | 1159288 |
| Perfect Linear Scaling (Tokens) | 40345 | 80690 | 161380 | 322760 | 645520 | 1291040 |
| Speed-up | 1x | 1.98x | 4.01x | 7.75x | 16.35x | 28.73x |
175B GPT model:

| Nodes | 8 | 16 | 32 | 64 | 128 |
|---|---|---|---|---|---|
| Tokens per Second | 7500 | 14950 | 29537 | 58211 | 109684 |
| Perfect Linear Scaling (Tokens) | 7500 | 15000 | 30000 | 60000 | 120000 |
| Speed-up | 1x | 1.99x | 3.94x | 7.76x | 14.62x |
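As referenced above, the speed-up and perfect-scaling rows can be reproduced from the measured throughputs alone: speed-up is throughput relative to the smallest node count, and perfect linear scaling multiplies the base throughput by the node ratio. A short check in Python, with the measured values copied from the tables:

```python
# Reproduce the scaling rows above: speed-up is measured throughput relative
# to the smallest configuration; "perfect linear scaling" multiplies the base
# throughput by the node ratio.
measured = {
    "5B":   {1: 40345, 2: 79815, 4: 161754, 8: 312774, 16: 659481, 32: 1159288},
    "175B": {8: 7500, 16: 14950, 32: 29537, 64: 58211, 128: 109684},
}

for model, tps in measured.items():
    base_nodes = min(tps)
    base_tps = tps[base_nodes]
    for nodes, tokens_per_sec in tps.items():
        perfect = base_tps * nodes // base_nodes
        speedup = tokens_per_sec / base_tps
        print(f"{model} {nodes:>3} nodes: perfect={perfect}, speed-up={speedup:.2f}x")
# e.g. 5B at 32 nodes  -> perfect=1291040, speed-up=28.73x
# e.g. 175B at 128 nodes -> perfect=120000, speed-up=14.62x
```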
Inference Performance
Inference performance was measured on:
- 1-8 × A100 80GB SXM4
- 1-8 × H100 80GB HBM3
Configuration 1: Chatbot Conversation use case
- batch size: 1-8
- input token length: 128
- output token length: 20

In the tables below, TP and PP denote the tensor-parallel and pipeline-parallel degrees, and GPUs is the total number of GPUs used (TP × PP).
| Model Size | Batch Size | Avg Latency [ms], A100 | Avg Latency [ms], H100 | Avg Throughput [sent/s], A100 | Avg Throughput [sent/s], H100 | TP | PP | GPUs |
|---|---|---|---|---|---|---|---|---|
| 8B | 1 | 238.6 | 151.9 | 4.2 | 6.6 | 1 | 1 | 1 |
| 8B | 2 | 247.9 | 156.9 | 8.1 | 12.7 | 1 | 1 | 1 |
| 8B | 4 | 273.4 | 165.5 | 14.6 | 24.2 | 1 | 1 | 1 |
| 8B | 8 | 321.9 | 188.2 | 24.9 | 42.5 | 1 | 1 | 1 |
| 8B | 1 | 170.6 | 117.4 | 5.9 | 8.5 | 2 | 1 | 2 |
| 8B | 2 | 176.0 | 120.1 | 11.4 | 16.6 | 2 | 1 | 2 |
| 8B | 4 | 191.0 | 126.1 | 20.9 | 31.7 | 2 | 1 | 2 |
| 8B | 8 | 226.6 | 141.1 | 35.3 | 56.7 | 2 | 1 | 2 |
| 8B | 1 | 131.8 | 97.3 | 7.6 | 10.3 | 4 | 1 | 4 |
| 8B | 2 | 136.3 | 102.0 | 14.7 | 19.6 | 4 | 1 | 4 |
| 8B | 4 | 147.7 | 107.2 | 27.1 | 37.3 | 4 | 1 | 4 |
| 8B | 8 | 171.5 | 119.2 | 46.7 | 67.1 | 4 | 1 | 4 |
| 8B | 1 | 121.0 | 88.7 | 8.3 | 11.3 | 8 | 1 | 8 |
| 8B | 2 | 127.7 | 95.7 | 15.7 | 20.9 | 8 | 1 | 8 |
| 8B | 4 | 140.3 | 102.0 | 28.5 | 39.2 | 8 | 1 | 8 |
| 8B | 8 | 160.4 | 112.8 | 49.9 | 70.9 | 8 | 1 | 8 |
| 43B | 1 | 631.2 | 395.1 | 1.6 | 2.5 | 2 | 1 | 2 |
| 43B | 2 | 668.4 | 402.3 | 3.0 | 5.0 | 2 | 1 | 2 |
| 43B | 4 | 735.2 | 424.6 | 5.4 | 9.4 | 2 | 1 | 2 |
| 43B | 8 | 854.5 | 477.1 | 9.4 | 16.8 | 2 | 1 | 2 |
| 43B | 1 | 394.9 | 258.2 | 2.5 | 3.9 | 4 | 1 | 4 |
| 43B | 2 | 412.3 | 261.0 | 4.9 | 7.7 | 4 | 1 | 4 |
| 43B | 4 | 448.2 | 275.9 | 8.9 | 14.5 | 4 | 1 | 4 |
| 43B | 8 | 523.7 | 308.7 | 15.3 | 25.9 | 4 | 1 | 4 |
| 43B | 1 | 301.0 | 210.9 | 3.3 | 4.7 | 8 | 1 | 8 |
| 43B | 2 | 314.7 | 213.4 | 6.4 | 9.4 | 8 | 1 | 8 |
| 43B | 4 | 343.1 | 223.4 | 11.7 | 17.9 | 8 | 1 | 8 |
| 43B | 8 | 384.7 | 247.4 | 20.8 | 32.3 | 8 | 1 | 8 |
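The latency and throughput columns are mutually consistent: sentence throughput is, to rounding, batch size divided by average latency. A spot check against a few H100 rows copied from the table above:

```python
# Throughput [sentences/s] ~= batch_size / average_latency[s]; spot-check a
# few rows from the table above (8B model at TP=1 and 43B model at TP=8, H100).
rows = [
    # (batch_size, avg_latency_ms, reported_sentences_per_sec)
    (1, 151.9, 6.6),    # 8B, TP=1, H100
    (8, 188.2, 42.5),   # 8B, TP=1, H100
    (8, 247.4, 32.3),   # 43B, TP=8, H100
]
for batch, latency_ms, reported in rows:
    derived = batch / (latency_ms / 1000.0)
    print(f"derived={derived:.1f} vs reported={reported}")
# -> derived=6.6 vs reported=6.6, derived=42.5 vs reported=42.5, ...
```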
Configuration 2: Translation / Style Transfer use case
- batch size: 1-8
- input token length: 200
- output token length: 200
| Model Size | Batch Size | Avg Latency [ms], A100 | Avg Latency [ms], H100 | Avg Throughput [sent/s], A100 | Avg Throughput [sent/s], H100 | TP | PP | GPUs |
|---|---|---|---|---|---|---|---|---|
| 8B | 1 | 2,290.6 | 1,435.7 | 0.4 | 0.7 | 1 | 1 | 1 |
| 8B | 2 | 2,325.4 | 1,468.8 | 0.9 | 1.4 | 1 | 1 | 1 |
| 8B | 4 | 2,478.7 | 1,506.3 | 1.6 | 2.7 | 1 | 1 | 1 |
| 8B | 8 | 2,693.7 | 1,644.4 | 3.0 | 4.9 | 1 | 1 | 1 |
| 8B | 1 | 1,558.9 | 1,047.4 | 0.6 | 1.0 | 2 | 1 | 2 |
| 8B | 2 | 1,597.4 | 1,066.8 | 1.9 | 1.9 | 2 | 1 | 2 |
| 8B | 4 | 1,653.7 | 1,095.3 | 2.4 | 3.7 | 2 | 1 | 2 |
| 8B | 8 | 1,823.3 | 1,155.6 | 4.4 | 6.9 | 2 | 1 | 2 |
| 8B | 1 | 1,167.3 | 849.8 | 0.9 | 1.2 | 4 | 1 | 4 |
| 8B | 2 | 1,202.9 | 892.0 | 1.7 | 2.2 | 4 | 1 | 4 |
| 8B | 4 | 1,260.3 | 915.3 | 3.2 | 4.4 | 4 | 1 | 4 |
| 8B | 8 | 1,329.1 | 968.7 | 6.0 | 8.3 | 4 | 1 | 4 |
| 8B | 1 | 1,057.8 | 747.6 | 0.9 | 1.3 | 8 | 1 | 8 |
| 8B | 2 | 1,110.5 | 819.4 | 1.8 | 2.4 | 8 | 1 | 8 |
| 8B | 4 | 1,187.1 | 855.9 | 3.4 | 4.7 | 8 | 1 | 8 |
| 8B | 8 | 1,268.1 | 900.2 | 6.3 | 8.9 | 8 | 1 | 8 |
| 43B | 1 | 6,117.2 | 3,817.2 | 0.2 | 0.3 | 2 | 1 | 2 |
| 43B | 2 | 6,375.8 | 3,856.8 | 0.3 | 0.5 | 2 | 1 | 2 |
| 43B | 4 | 6,616.7 | 3,919.8 | 0.6 | 1.0 | 2 | 1 | 2 |
| 43B | 8 | 7,026.5 | 4,141.1 | 1.1 | 1.9 | 2 | 1 | 2 |
| 43B | 1 | 3,754.8 | 2,437.0 | 0.3 | 0.4 | 4 | 1 | 4 |
| 43B | 2 | 3,877.3 | 2,442.7 | 0.5 | 0.8 | 4 | 1 | 4 |
| 43B | 4 | 3,974.5 | 2,503.3 | 1.0 | 1.6 | 4 | 1 | 4 |
| 43B | 8 | 4,275.2 | 2,593.0 | 1.9 | 3.1 | 4 | 1 | 4 |
| 43B | 1 | 2,810.5 | 1,953.9 | 0.4 | 0.5 | 8 | 1 | 8 |
| 43B | 2 | 2,902.4 | 1,961.9 | 0.7 | 1.0 | 8 | 1 | 8 |
| 43B | 4 | 3,024.5 | 2,000.7 | 1.3 | 2.0 | 8 | 1 | 8 |
| 43B | 8 | 3,126.1 | 2,082.8 | 2.6 | 3.8 | 8 | 1 | 8 |
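Since each configuration generates a fixed number of output tokens, sentence throughput converts directly into generated tokens per second, which makes the two use cases easier to compare. A small sketch of that conversion, using the output lengths defined above and two reported rows (8B model, TP=1, batch size 8, on H100):

```python
# Convert sentence throughput to generated tokens/sec: with a fixed output
# length, tokens/sec = output_tokens * sentences/sec. Example rows from the
# two configurations above (8B model, TP=1, batch size 8, H100).
cases = [
    # (use_case, output_tokens, reported_sentences_per_sec)
    ("chatbot (config 1)", 20, 42.5),
    ("translation (config 2)", 200, 4.9),
]
for name, out_tokens, sent_per_sec in cases:
    print(f"{name}: ~{out_tokens * sent_per_sec:.0f} generated tokens/sec")
# -> chatbot (config 1): ~850 generated tokens/sec
# -> translation (config 2): ~980 generated tokens/sec
```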