GPT Results

Training Accuracy Results

Training accuracy was measured on an NVIDIA DGX SuperPOD:

  • 8 × 8 × A100 80GB for the 126M GPT model

  • 16 × 8 × A100 80GB for the 5B GPT model

NVIDIA evaluated the 126M-parameter and 5B-parameter models on eight different language tasks. The results are shown in the table below. All of the tasks are included in the evaluation harness, so any .nemo checkpoint file can be evaluated on all of them.

Task | Metric | 126M | 5B
-----|--------|------|-----
Lambada | Accuracy | 38.70% | 68.93%
Lambada | PPL | 25.8 | 4.22
Boolq | Accuracy | 56.94% | 65.29%
Race | Accuracy | 28.71% | 38.66%
Race | Accuracy Norm | 34.74% | 41.62%
Piqa | Accuracy | 61.21% | 73.88%
Piqa | Accuracy Norm | 61.97% | 75.40%
Hellaswag | Accuracy | 28.48% | 46.45%
Hellaswag | Accuracy Norm | 29.54% | 60.85%
Winogrande | Accuracy | 50.43% | 60.77%
Wikitext2 | Word PPL | 31.35 | 12.36
Wikitext2 | Byte PPL | 1.9 | 1.6
Wikitext2 | Bits per Byte PPL | 0.64 | 0.47
Wikitext103 | Word PPL | 31.35 | 12.36
Wikitext103 | Byte PPL | 1.9 | 1.6
Wikitext103 | Bits per Byte PPL | 0.64 | 0.47
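
The "Accuracy" and "Accuracy Norm" rows above follow the usual evaluation-harness convention: plain accuracy selects the answer choice with the highest total log-likelihood, while the normalized variant divides each log-likelihood by the byte length of the completion before comparing. The Python sketch below illustrates only this scoring rule; the function and data are placeholders, not the harness's actual API.

    # Illustrative sketch (not the harness's actual API): how "Accuracy" vs.
    # "Accuracy Norm" pick an answer from per-choice log-likelihoods.
    def pick_answer(loglikelihoods, choices, normalize=False):
        """Return the index of the predicted answer choice."""
        if normalize:
            # "Accuracy Norm": length-normalize by completion size in bytes.
            scores = [ll / len(c.encode("utf-8")) for ll, c in zip(loglikelihoods, choices)]
        else:
            # Plain "Accuracy": compare raw summed log-likelihoods.
            scores = list(loglikelihoods)
        return max(range(len(scores)), key=scores.__getitem__)

    # Toy example with made-up numbers: two candidate completions for one question.
    choices = ["the cat sat on the mat", "no"]
    lls = [-12.3, -4.1]  # hypothetical summed log-likelihoods from the model
    print(pick_answer(lls, choices, normalize=False))  # 1: raw score favors the short answer
    print(pick_answer(lls, choices, normalize=True))   # 0: normalization removes the length bias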

Training the 5B GPT model to convergence takes 6.5 days, and the loss curve is shown in the figure below.

Figure: 5B GPT Training Loss

The table below shows the converged training loss, the throughput, and the total time to train for the 5B GPT model, using a given number of GPUs and a given Global Batch Size (GBS).

Number of GPUs | GBS | Seq Length | Number of Tokens | Loss | Throughput (Tokens/sec) | Time to Train (days)
---------------|-----|------------|------------------|------|-------------------------|---------------------
160 | 1440 | 2048 | 300B | 1.685 | 726,384 | 4.8
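
As a quick consistency check, the time to train in the table follows directly from the token budget and the measured aggregate throughput. A minimal sketch of that arithmetic in Python, using the values above:

    # Time to train = total training tokens / aggregate throughput,
    # using the values from the table above (5B GPT, 160 GPUs, GBS 1440).
    total_tokens = 300e9       # 300B tokens
    tokens_per_sec = 726_384   # aggregate throughput across all GPUs
    days = total_tokens / tokens_per_sec / 86_400
    print(f"{days:.1f} days")  # ~4.8 days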

Training Performance Results

Training performance was measured on:

  • NVIDIA DGX SuperPOD (16 × 8 × A100 80GB for 5B GPT model)

  • NVIDIA DGX SuperPODs (128 × 8 × A100 80GB for 175B GPT model)

NVIDIA measured the throughput of training the 5B and 175B parameter GPT models on different numbers of DGX nodes and achieved near-linear scaling. For example, scaling from 1 node to 32 nodes with the 5B model yielded a 28.73x speed-up, and scaling from 8 nodes to 128 nodes (16× more) with the 175B model yielded a 14.62x speed-up. The tables and charts below show the performance results.

5B GPT:

Nodes | 1 | 2 | 4 | 8 | 16 | 32
------|---|---|---|---|----|----
Tokens per Second | 40345 | 79815 | 161754 | 312774 | 659481 | 1159288
Perfect Linear Scaling (Tokens) | 40345 | 80690 | 161380 | 322760 | 645520 | 1291040
Speed-up | 1x | 1.98x | 4.01x | 7.75x | 16.35x | 28.73x
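
The speed-up row is derived from the measured throughput, with the single-node run as the baseline; dividing the speed-up by the node count gives the scaling efficiency. A short Python sketch of that calculation using the 5B numbers above:

    # Speed-up and scaling efficiency for the 5B model, computed from the
    # measured tokens/sec above (baseline = 1 node).
    nodes = [1, 2, 4, 8, 16, 32]
    tokens_per_sec = [40345, 79815, 161754, 312774, 659481, 1159288]

    baseline = tokens_per_sec[0]
    for n, tps in zip(nodes, tokens_per_sec):
        speedup = tps / baseline     # e.g. 28.73x at 32 nodes
        efficiency = speedup / n     # 1.0 would be perfect linear scaling
        print(f"{n:3d} nodes: {speedup:5.2f}x speed-up, {efficiency:.0%} efficiency")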

Figure: 5B GPT NeMo Framework Throughput

175B GPT:

Nodes | 8 | 16 | 32 | 64 | 128
------|---|----|----|----|-----
Tokens per Second | 7500 | 14950 | 29537 | 58211 | 109684
Perfect Linear Scaling (Tokens) | 7500 | 15000 | 30000 | 60000 | 120000
Speed-up | 1x | 1.99x | 3.94x | 7.76x | 14.62x

Figure: 175B GPT NeMo Framework Throughput

Inference Performance

Inference performance was measured on:

  • 1 - 8 × A100 80GB SXM4

  • 1 - 8 × H100 80GB HBM3

Configuration 1: Chatbot Conversation use case

  • batch size: 1 - 8

  • input tokens length: 128

  • output tokens length: 20

Figures: inference scaling with model size, GPU count, and batch size (GPT, 128 input / 20 output tokens)

Average Latency, Average Throughput, and Model Size

A100 = A100 80GB SXM4; H100 = H100 80GB HBM3.

Model Size | Batch Size | Avg Latency A100 [ms] | Avg Latency H100 [ms] | Avg Throughput A100 [sentences/s] | Avg Throughput H100 [sentences/s] | TP | PP | GPUs
-----------|------------|-----------------------|-----------------------|-----------------------------------|-----------------------------------|----|----|-----
8B | 1 | 238.6 | 151.9 | 4.2 | 6.6 | 1 | 1 | 1
8B | 2 | 247.9 | 156.9 | 8.1 | 12.7 | 1 | 1 | 1
8B | 4 | 273.4 | 165.5 | 14.6 | 24.2 | 1 | 1 | 1
8B | 8 | 321.9 | 188.2 | 24.9 | 42.5 | 1 | 1 | 1
8B | 1 | 170.6 | 117.4 | 5.9 | 8.5 | 2 | 1 | 2
8B | 2 | 176.0 | 120.1 | 11.4 | 16.6 | 2 | 1 | 2
8B | 4 | 191.0 | 126.1 | 20.9 | 31.7 | 2 | 1 | 2
8B | 8 | 226.6 | 141.1 | 35.3 | 56.7 | 2 | 1 | 2
8B | 1 | 131.8 | 97.3 | 7.6 | 10.3 | 4 | 1 | 4
8B | 2 | 136.3 | 102.0 | 14.7 | 19.6 | 4 | 1 | 4
8B | 4 | 147.7 | 107.2 | 27.1 | 37.3 | 4 | 1 | 4
8B | 8 | 171.5 | 119.2 | 46.7 | 67.1 | 4 | 1 | 4
8B | 1 | 121.0 | 88.7 | 8.3 | 11.3 | 8 | 1 | 8
8B | 2 | 127.7 | 95.7 | 15.7 | 20.9 | 8 | 1 | 8
8B | 4 | 140.3 | 102.0 | 28.5 | 39.2 | 8 | 1 | 8
8B | 8 | 160.4 | 112.8 | 49.9 | 70.9 | 8 | 1 | 8
43B | 1 | 631.2 | 395.1 | 1.6 | 2.5 | 2 | 1 | 2
43B | 2 | 668.4 | 402.3 | 3.0 | 5.0 | 2 | 1 | 2
43B | 4 | 735.2 | 424.6 | 5.4 | 9.4 | 2 | 1 | 2
43B | 8 | 854.5 | 477.1 | 9.4 | 16.8 | 2 | 1 | 2
43B | 1 | 394.9 | 258.2 | 2.5 | 3.9 | 4 | 1 | 4
43B | 2 | 412.3 | 261.0 | 4.9 | 7.7 | 4 | 1 | 4
43B | 4 | 448.2 | 275.9 | 8.9 | 14.5 | 4 | 1 | 4
43B | 8 | 523.7 | 308.7 | 15.3 | 25.9 | 4 | 1 | 4
43B | 1 | 301.0 | 210.9 | 3.3 | 4.7 | 8 | 1 | 8
43B | 2 | 314.7 | 213.4 | 6.4 | 9.4 | 8 | 1 | 8
43B | 4 | 343.1 | 223.4 | 11.7 | 17.9 | 8 | 1 | 8
43B | 8 | 384.7 | 247.4 | 20.8 | 32.3 | 8 | 1 | 8
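
The latency and throughput columns are consistent with one another: with one batch in flight, throughput in sentences/s is approximately the batch size divided by the average latency. A Python sketch of that check, using the 8B, TP=1 rows on A100 as an example:

    # Consistency check: throughput [sentences/s] ~= batch size / latency [s].
    # Rows: 8B model, TP=1, PP=1, A100 80GB SXM4, chatbot (128-in / 20-out) table above.
    rows = [  # (batch_size, avg_latency_ms, reported_throughput)
        (1, 238.6,  4.2),
        (2, 247.9,  8.1),
        (4, 273.4, 14.6),
        (8, 321.9, 24.9),
    ]
    for bs, latency_ms, reported in rows:
        implied = bs / (latency_ms / 1000.0)
        print(f"BS {bs}: implied {implied:.1f} vs reported {reported} sentences/s")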

Configuration 2: Translation / Style Transfer use case

  • batch size: 1 - 8

  • input tokens length: 200

  • output tokens length: 200

Figures: inference scaling with model size, GPU count, and batch size (GPT, 200 input / 200 output tokens)

Average Latency, Average Throughput, and Model Size

A100 = A100 80GB SXM4; H100 = H100 80GB HBM3.

Model Size | Batch Size | Avg Latency A100 [ms] | Avg Latency H100 [ms] | Avg Throughput A100 [sentences/s] | Avg Throughput H100 [sentences/s] | TP | PP | GPUs
-----------|------------|-----------------------|-----------------------|-----------------------------------|-----------------------------------|----|----|-----
8B | 1 | 2,290.6 | 1,435.7 | 0.4 | 0.7 | 1 | 1 | 1
8B | 2 | 2,325.4 | 1,468.8 | 0.9 | 1.4 | 1 | 1 | 1
8B | 4 | 2,478.7 | 1,506.3 | 1.6 | 2.7 | 1 | 1 | 1
8B | 8 | 2,693.7 | 1,644.4 | 3.0 | 4.9 | 1 | 1 | 1
8B | 1 | 1,558.9 | 1,047.4 | 0.6 | 1.0 | 2 | 1 | 2
8B | 2 | 1,597.4 | 1,066.8 | 1.9 | 1.9 | 2 | 1 | 2
8B | 4 | 1,653.7 | 1,095.3 | 2.4 | 3.7 | 2 | 1 | 2
8B | 8 | 1,823.3 | 1,155.6 | 4.4 | 6.9 | 2 | 1 | 2
8B | 1 | 1,167.3 | 849.8 | 0.9 | 1.2 | 4 | 1 | 4
8B | 2 | 1,202.9 | 892.0 | 1.7 | 2.2 | 4 | 1 | 4
8B | 4 | 1,260.3 | 915.3 | 3.2 | 4.4 | 4 | 1 | 4
8B | 8 | 1,329.1 | 968.7 | 6.0 | 8.3 | 4 | 1 | 4
8B | 1 | 1,057.8 | 747.6 | 0.9 | 1.3 | 8 | 1 | 8
8B | 2 | 1,110.5 | 819.4 | 1.8 | 2.4 | 8 | 1 | 8
8B | 4 | 1,187.1 | 855.9 | 3.4 | 4.7 | 8 | 1 | 8
8B | 8 | 1,268.1 | 900.2 | 6.3 | 8.9 | 8 | 1 | 8
43B | 1 | 6,117.2 | 3,817.2 | 0.2 | 0.3 | 2 | 1 | 2
43B | 2 | 6,375.8 | 3,856.8 | 0.3 | 0.5 | 2 | 1 | 2
43B | 4 | 6,616.7 | 3,919.8 | 0.6 | 1.0 | 2 | 1 | 2
43B | 8 | 7,026.5 | 4,141.1 | 1.1 | 1.9 | 2 | 1 | 2
43B | 1 | 3,754.8 | 2,437.0 | 0.3 | 0.4 | 4 | 1 | 4
43B | 2 | 3,877.3 | 2,442.7 | 0.5 | 0.8 | 4 | 1 | 4
43B | 4 | 3,974.5 | 2,503.3 | 1.0 | 1.6 | 4 | 1 | 4
43B | 8 | 4,275.2 | 2,593.0 | 1.9 | 3.1 | 4 | 1 | 4
43B | 1 | 2,810.5 | 1,953.9 | 0.4 | 0.5 | 8 | 1 | 8
43B | 2 | 2,902.4 | 1,961.9 | 0.7 | 1.0 | 8 | 1 | 8
43B | 4 | 3,024.5 | 2,000.7 | 1.3 | 2.0 | 8 | 1 | 8
43B | 8 | 3,126.1 | 2,082.8 | 2.6 | 3.8 | 8 | 1 | 8
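
Since every request in this configuration generates 200 output tokens, sentence throughput converts directly to generated tokens per second, which can be easier to compare across configurations. For illustration, using the last row of the table above (43B, batch size 8, TP=8) in a short Python sketch:

    # Convert sentence throughput to generated-token throughput for the
    # translation / style-transfer configuration (200 output tokens per request).
    output_tokens_per_request = 200
    sentences_per_sec_a100 = 2.6  # 43B, batch size 8, TP=8, A100 (table above)
    sentences_per_sec_h100 = 3.8  # same row, H100
    print(sentences_per_sec_a100 * output_tokens_per_request)  # ~520 generated tokens/s
    print(sentences_per_sec_h100 * output_tokens_per_request)  # ~760 generated tokens/s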