GPT Results


Training accuracy, NVIDIA DGX SuperPOD™:

  • 8 × 8 × A100 80GB for 126M GPT Model

  • 16 × 8 × A100 80GB for 5B GPT Model

NVIDIA evaluated the 126M parameter and 5B parameter models on eight different language tasks. The results are shown in the table below. All of the tasks are provided as part of the evaluation harness, so you can evaluate any .nemo checkpoint file on all of them.
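The metrics in the table fall into two families: classification accuracy and perplexity. As a minimal sketch of both, assuming per-token natural-log probabilities (illustrative helper names, not NeMo APIs):

```python
import math

def accuracy(predictions, targets):
    # Fraction of examples where the model's chosen answer matches the label.
    return sum(p == t for p, t in zip(predictions, targets)) / len(targets)

def perplexity(token_log_probs):
    # exp of the mean negative log-likelihood per token; lower is better.
    return math.exp(-sum(token_log_probs) / len(token_log_probs))
```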

Task         Metric             126M     5B
Lambada      Accuracy           38.70%   68.93%
             PPL                25.8     4.22
Boolq        Accuracy           56.94%   65.29%
Race         Accuracy           28.71%   38.66%
             Accuracy Norm      34.74%   41.62%
Piqa         Accuracy           61.21%   73.88%
             Accuracy Norm      61.97%   75.40%
Hellaswag    Accuracy           28.48%   46.45%
             Accuracy Norm      29.54%   60.85%
Winogrande   Accuracy           50.43%   60.77%
Wikitext2    Word PPL           31.35    12.36
             Byte PPL           1.9      1.6
             Bits per Byte PPL  0.64     0.47
Wikitext103  Word PPL           31.35    12.36
             Byte PPL           1.9      1.6
             Bits per Byte PPL  0.64     0.47
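Several tasks report both Accuracy and Accuracy Norm. In the evaluation harness, the two differ only in whether each candidate answer's log-likelihood is normalized by its length before taking the argmax, which removes the scoring bias toward short answers. A sketch with hypothetical inputs:

```python
def pick(loglikelihoods, lengths, normalize=False):
    # Accuracy: argmax of raw summed log-likelihood per candidate.
    # Accuracy Norm: argmax after dividing by candidate length,
    # so longer answers are not penalized merely for having more tokens.
    scores = [ll / ln if normalize else ll
              for ll, ln in zip(loglikelihoods, lengths)]
    return max(range(len(scores)), key=scores.__getitem__)
```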

Training the 5B GPT model to convergence takes 6.5 days, and the loss curve is shown in the figure below.

[Figure: 5B GPT Training Loss]

The table below shows the converged training loss, the throughput, and the total time to train for the 5B GPT model, using a given number of GPUs and a given Global Batch Size (GBS).

#GPUs  GBS   Seq Length  #Tokens  Loss   Throughput (Tokens/sec)  Time to Train (days)
160    1440  2048        300B     1.685  726,384                  4.8
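As a sanity check, the time to train follows directly from the token budget and the measured throughput:

```python
tokens = 300e9          # 300B-token training budget
throughput = 726_384    # tokens/second at 160 GPUs, GBS 1440
days = tokens / throughput / 86_400
print(round(days, 1))   # ≈ 4.8 days
```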

Training performance:

  • NVIDIA DGX SuperPOD (16 × 8 × A100 80GB for 5B GPT model)

  • NVIDIA DGX SuperPODs (128 × 8 × A100 80GB for 175B GPT model)

NVIDIA measured the throughput of training 5B and 175B parameter GPT models on different numbers of DGX nodes and achieved near-linear scaling. For example, scaling a 5B model from 1 node to 32 nodes yielded a 28.73x speed-up, and scaling a 175B model from 8 nodes to 128 nodes (16x more) yielded a 14.62x speed-up. The tables and charts below show the performance results.

     Nodes                    1      2      3–      8       16      32
5B   Tokens per Second        40345  79815  161754  312774  659481  1159288
     Perfect Linear Scaling   40345  80690  161380  322760  645520  1291040
     Speed-up                 1x     1.98x  4.01x   7.75x   16.35x  28.73x
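The Speed-up row is simply throughput relative to a single node; dividing it by the node count gives scaling efficiency versus perfect linear scaling. A quick check on the 5B numbers:

```python
nodes = [1, 2, 4, 8, 16, 32]
tokens_per_sec = [40345, 79815, 161754, 312774, 659481, 1159288]

# Speed-up relative to one node, and fraction of perfect linear scaling.
speedups = [tps / tokens_per_sec[0] for tps in tokens_per_sec]
efficiency = [s / n for s, n in zip(speedups, nodes)]

for n, s, e in zip(nodes, speedups, efficiency):
    print(f"{n:3d} nodes: {s:6.2f}x speed-up, {e:5.1%} of linear")
```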
[Figure: 5B GPT NeMo Framework Throughput]

     Nodes                    8     16     32     64     128
175B Tokens per Second        7500  14950  29537  58211  109684
     Perfect Linear Scaling   7500  15000  30000  60000  120000
     Speed-up                 1x    1.99x  3.94x  7.76x  14.62x
[Figure: 175B GPT NeMo Framework Throughput]

Inference performance was measured on:

  • 1 – 8 × A100 80GB SXM4

  • 1 – 8 × H100 80GB HBM3

Configuration 1: Chatbot Conversation use case

  • batch size: 1 - 8

  • input tokens length: 128

  • output tokens length: 20

[Figures: inference performance scaling with model size, GPU count, and batch size (128 input / 20 output tokens)]

Average Latency, Average Throughput, and Model Size

Columns, left to right: Model size | Batch Size | Average Latency [ms] on A100 80GB SXM4 | Average Latency [ms] on H100 80GB HBM3 | Average Throughput [sentences/s] on A100 80GB SXM4 | Average Throughput [sentences/s] on H100 80GB HBM3 | TP | PP | GPUs

8B 1 238.6 151.9 4.2 6.6 1 1 1
8B 2 247.9 156.9 8.1 12.7 1 1 1
8B 4 273.4 165.5 14.6 24.2 1 1 1
8B 8 321.9 188.2 24.9 42.5 1 1 1
8B 1 170.6 117.4 5.9 8.5 2 1 2
8B 2 176.0 120.1 11.4 16.6 2 1 2
8B 4 191.0 126.1 20.9 31.7 2 1 2
8B 8 226.6 141.1 35.3 56.7 2 1 2
8B 1 131.8 97.3 7.6 10.3 4 1 4
8B 2 136.3 102.0 14.7 19.6 4 1 4
8B 4 147.7 107.2 27.1 37.3 4 1 4
8B 8 171.5 119.2 46.7 67.1 4 1 4
8B 1 121.0 88.7 8.3 11.3 8 1 8
8B 2 127.7 95.7 15.7 20.9 8 1 8
8B 4 140.3 102.0 28.5 39.2 8 1 8
8B 8 160.4 112.8 49.9 70.9 8 1 8
43B 1 631.2 395.1 1.6 2.5 2 1 2
43B 2 668.4 402.3 3.0 5.0 2 1 2
43B 4 735.2 424.6 5.4 9.4 2 1 2
43B 8 854.5 477.1 9.4 16.8 2 1 2
43B 1 394.9 258.2 2.5 3.9 4 1 4
43B 2 412.3 261.0 4.9 7.7 4 1 4
43B 4 448.2 275.9 8.9 14.5 4 1 4
43B 8 523.7 308.7 15.3 25.9 4 1 4
43B 1 301.0 210.9 3.3 4.7 8 1 8
43B 2 314.7 213.4 6.4 9.4 8 1 8
43B 4 343.1 223.4 11.7 17.9 8 1 8
43B 8 384.7 247.4 20.8 32.3 8 1 8
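With one request in flight at a time, throughput is essentially batch size divided by per-batch latency, which the table's numbers bear out. For example, the 8B model at batch size 8 on a single A100:

```python
batch_size = 8
latency_s = 321.9 / 1000           # 8B model, TP=1, one A100 80GB (table above)
throughput = batch_size / latency_s
print(round(throughput, 1))        # table reports 24.9 sentences/s
```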

Configuration 2: Translation / Style Transfer use case

  • batch size: 1 - 8

  • input tokens length: 200

  • output tokens length: 200

[Figures: inference performance scaling with model size, GPU count, and batch size (200 input / 200 output tokens)]

Average Latency, Average Throughput, and Model Size

Columns, left to right: Model size | Batch Size | Average Latency [ms] on A100 80GB SXM4 | Average Latency [ms] on H100 80GB HBM3 | Average Throughput [sentences/s] on A100 80GB SXM4 | Average Throughput [sentences/s] on H100 80GB HBM3 | TP | PP | GPUs

8B 1 2,290.6 1,435.7 0.4 0.7 1 1 1
8B 2 2,325.4 1,468.8 0.9 1.4 1 1 1
8B 4 2,478.7 1,506.3 1.6 2.7 1 1 1
8B 8 2,693.7 1,644.4 3.0 4.9 1 1 1
8B 1 1,558.9 1,047.4 0.6 1.0 2 1 2
8B 2 1,597.4 1,066.8 1.9 1.9 2 1 2
8B 4 1,653.7 1,095.3 2.4 3.7 2 1 2
8B 8 1,823.3 1,155.6 4.4 6.9 2 1 2
8B 1 1,167.3 849.8 0.9 1.2 4 1 4
8B 2 1,202.9 892.0 1.7 2.2 4 1 4
8B 4 1,260.3 915.3 3.2 4.4 4 1 4
8B 8 1,329.1 968.7 6.0 8.3 4 1 4
8B 1 1,057.8 747.6 0.9 1.3 8 1 8
8B 2 1,110.5 819.4 1.8 2.4 8 1 8
8B 4 1,187.1 855.9 3.4 4.7 8 1 8
8B 8 1,268.1 900.2 6.3 8.9 8 1 8
43B 1 6,117.2 3,817.2 0.2 0.3 2 1 2
43B 2 6,375.8 3,856.8 0.3 0.5 2 1 2
43B 4 6,616.7 3,919.8 0.6 1.0 2 1 2
43B 8 7,026.5 4,141.1 1.1 1.9 2 1 2
43B 1 3,754.8 2,437.0 0.3 0.4 4 1 4
43B 2 3,877.3 2,442.7 0.5 0.8 4 1 4
43B 4 3,974.5 2,503.3 1.0 1.6 4 1 4
43B 8 4,275.2 2,593.0 1.9 3.1 4 1 4
43B 1 2,810.5 1,953.9 0.4 0.5 8 1 8
43B 2 2,902.4 1,961.9 0.7 1.0 8 1 8
43B 4 3,024.5 2,000.7 1.3 2.0 8 1 8
43B 8 3,126.1 2,082.8 2.6 3.8 8 1 8
Last updated on May 30, 2024.