Performance

GPT Results

Training Accuracy Results

Training accuracy: NVIDIA DGX SuperPOD™

  • 8 × 8 × A100 80GB for 126M GPT model

  • 16 × 8 × A100 80GB for 5B GPT model

NVIDIA evaluated the 126M parameter and 5B parameter models on eight different language tasks. The results are shown in the table below. All of the tasks are provided as part of the evaluation harness, so you can evaluate any .nemo checkpoint file on all of them.

| Task        | Metric            | 126M   | 5B     |
|-------------|-------------------|--------|--------|
| Lambada     | Accuracy          | 38.70% | 68.93% |
| Lambada     | PPL               | 25.8   | 4.22   |
| Boolq       | Accuracy          | 56.94% | 65.29% |
| Race        | Accuracy          | 28.71% | 38.66% |
| Race        | Accuracy Norm     | 34.74% | 41.62% |
| Piqa        | Accuracy          | 61.21% | 73.88% |
| Piqa        | Accuracy Norm     | 61.97% | 75.40% |
| Hellaswag   | Accuracy          | 28.48% | 46.45% |
| Hellaswag   | Accuracy Norm     | 29.54% | 60.85% |
| Winogrande  | Accuracy          | 50.43% | 60.77% |
| Wikitext2   | Word PPL          | 31.35  | 12.36  |
| Wikitext2   | Byte PPL          | 1.9    | 1.6    |
| Wikitext2   | Bits per Byte PPL | 0.64   | 0.47   |
| Wikitext103 | Word PPL          | 31.35  | 12.36  |
| Wikitext103 | Byte PPL          | 1.9    | 1.6    |
| Wikitext103 | Bits per Byte PPL | 0.64   | 0.47   |

Training the 5B GPT model to convergence takes 6.5 days, and the loss curve is shown in the figure below.

Figure: 5B GPT Training Loss (5B_GPT_3_loss_final.svg)

The table below shows the converged training loss, the throughput, and the total time to train for the 5B GPT model, using a given number of GPUs and a given Global Batch Size (GBS).

| #GPUs | GBS  | Seq Length | Total Tokens | Loss  | Throughput (tokens/s) | Time to Train (days) |
|-------|------|------------|--------------|-------|-----------------------|----------------------|
| 160   | 1440 | 2048       | 300B         | 1.685 | 726,384               | 4.8                  |
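
The time-to-train column follows, to a first approximation, from the other columns: at a fixed measured throughput, training time is the token budget divided by tokens per second. Below is a minimal sketch of that arithmetic for the 5B GPT row (the helper is purely illustrative, not a NeMo Framework API); the T5 and mT5 summary tables later in this section use the same column layout.

    # Rough cross-check of the "Time to Train" column in the table above:
    # days ≈ total training tokens / (throughput in tokens/s × 86,400 s per day).
    # Illustrative only; not a NeMo Framework API.

    SECONDS_PER_DAY = 86_400

    def estimated_days(total_tokens: float, tokens_per_second: float) -> float:
        """Wall-clock training time in days, assuming a constant measured throughput."""
        return total_tokens / tokens_per_second / SECONDS_PER_DAY

    # 300B tokens at 726,384 tokens/s -> about 4.8 days, matching the table above.
    print(f"{estimated_days(300e9, 726_384):.1f} days")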

Training Performance Results

Training performance:

  • NVIDIA DGX SuperPOD (16 × 8 × A100 80GB for 5B GPT model)

  • NVIDIA DGX SuperPODs (128 × 8 × A100 80GB for 175B GPT model)

NVIDIA measured the throughput of training 5B and 175B parameter GPT models on different numbers of DGX nodes and achieved near-linear scaling. For example, scaling from 1 node to 32 nodes with the 5B model yielded a 28.73× speed-up, and scaling from 8 nodes to 128 nodes (16× more) with the 175B model yielded a 14.62× speed-up. The tables and charts below show the performance results.

|    | Nodes                           | 1     | 2     | 4      | 8      | 16     | 32      |
|----|---------------------------------|-------|-------|--------|--------|--------|---------|
| 5B | Tokens per Second               | 40345 | 79815 | 161754 | 312774 | 659481 | 1159288 |
|    | Perfect Linear Scaling (Tokens) | 40345 | 80690 | 161380 | 322760 | 645520 | 1291040 |
|    | Speed-up                        | 1x    | 1.98x | 4.01x  | 7.75x  | 16.35x | 28.73x  |

Figure: 5B GPT NeMo Framework Throughput (5B_GPT_3_throughput.svg)

|      | Nodes                           | 8    | 16    | 32    | 64    | 128    |
|------|---------------------------------|------|-------|-------|-------|--------|
| 175B | Tokens per Second               | 7500 | 14950 | 29537 | 58211 | 109684 |
|      | Perfect Linear Scaling (Tokens) | 7500 | 15000 | 30000 | 60000 | 120000 |
|      | Speed-up                        | 1x   | 1.99x | 3.94x | 7.76x | 14.62x |

Figure: 175B GPT NeMo Framework Throughput (175B_GPT_3_throughput.svg)
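
The speed-up rows in the two tables above are simply each measured throughput divided by the baseline (smallest node count) throughput, and scaling efficiency is that speed-up divided by the ideal linear factor. A minimal sketch of that arithmetic, with values copied from the tables (illustrative only, not a NeMo Framework API):

    # Reproduce the "Speed-up" rows and derive scaling efficiency from the throughput tables above.
    # speed-up = throughput(N nodes) / throughput(baseline); efficiency = speed-up / (N / baseline nodes).

    def scaling_report(nodes: list[int], tokens_per_sec: list[int]) -> None:
        base_nodes, base_tps = nodes[0], tokens_per_sec[0]
        for n, tps in zip(nodes, tokens_per_sec):
            speedup = tps / base_tps
            efficiency = speedup / (n / base_nodes)
            print(f"{n:4d} nodes: speed-up {speedup:5.2f}x, efficiency {efficiency:6.1%}")

    # 5B GPT (1 to 32 nodes) and 175B GPT (8 to 128 nodes), values from the tables above.
    scaling_report([1, 2, 4, 8, 16, 32], [40345, 79815, 161754, 312774, 659481, 1159288])
    scaling_report([8, 16, 32, 64, 128], [7500, 14950, 29537, 58211, 109684])

Relative to each baseline, this works out to roughly 90% scaling efficiency at 32 nodes for the 5B model and roughly 91% at 128 nodes for the 175B model.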

Inference Performance

Inference performance was measured for NVIDIA DGX SuperPOD (1 × 8 × A100 80GB).

Inference configurations:

  • Batch size: 1

  • Input tokens length: 60

  • Output tokens length: 20

Figure: Average Latency vs GPT Model Size (infer_model_size_gpt3.svg)

| GPT Model Size | Average Latency [ms] | TP | PP | GPUs |
|----------------|----------------------|----|----|------|
| 5B             | 87                   | 8  | 4  | 32   |
| 20B            | 202                  | 8  | 4  | 32   |
| 175B           | 893                  | 8  | 4  | 32   |
| 530B           | 977                  | 32 | 1  | 32   |
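
In this table (and in the T5 and mT5 inference tables later in this section), TP and PP are the tensor-parallel and pipeline-parallel sizes, so the GPU count in each row is TP × PP. A minimal consistency check, with the rows copied from the GPT table above (illustrative only):

    # Each inference configuration shards the model over TP * PP GPUs:
    # TP-way tensor parallelism within each layer, PP-way pipeline parallelism across layers.
    gpt_rows = [
        # (model size, average latency in ms, TP, PP, GPUs)
        ("5B",    87,  8, 4, 32),
        ("20B",  202,  8, 4, 32),
        ("175B", 893,  8, 4, 32),
        ("530B", 977, 32, 1, 32),
    ]

    for name, latency_ms, tp, pp, gpus in gpt_rows:
        assert tp * pp == gpus, f"{name}: expected GPUs == TP * PP"
        print(f"{name:>4}: {latency_ms:4d} ms on {tp} x {pp} = {gpus} GPUs")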

T5 Results

Training Accuracy Results

You can also prompt-learn on top of any trained .nemo checkpoint file on the SQuAD task described in the section T5 and mT5 Prompt Learning. The results are shown in the table below.

| Task  | Metric      | 220M  | 3B    |
|-------|-------------|-------|-------|
| SQuAD | Exact Match | 74.20 | 78.52 |
| SQuAD | F1          | 84.54 | 87.17 |

Training the 220M T5 model to convergence takes 4 days, and the loss curve is shown in the figure below:

Figure: 220M T5 Training Loss (220M_T5_loss_final.svg)

The table below shows the converged training loss, the throughput, and the total time to train for the 220M T5 model, using a given number of GPUs and a given Global Batch Size (GBS).

| #GPUs | GBS  | Seq Length | Total Tokens | Loss  | Throughput (tokens/s) | Time to Train (days) |
|-------|------|------------|--------------|-------|-----------------------|----------------------|
| 32    | 2048 | 512        | 1T           | 1.501 | 3,273,728             | 4                    |

Training the 3B T5 model to convergence takes 11 days, and the loss curve of a fully trained model can be seen in the figure below:

Figure: 3B T5 Training Loss (3B_T5_loss_100percent.svg)

The table below shows the converged training loss, the throughput, and the total time to train for the 3B T5 model, using a given number of GPUs and a given Global Batch Size (GBS).

| #GPUs | GBS  | Seq Length | Total Tokens | Loss  | Throughput (tokens/s) | Time to Train (days) |
|-------|------|------------|--------------|-------|-----------------------|----------------------|
| 160   | 2160 | 512        | 1T           | 1.147 | 1,395,131             | 11                   |

Training Performance Results

Training performance: NVIDIA DGX SuperPOD (20 × 8 × A100 80GB for 3B T5 Model)

NVIDIA measured the throughput of training a 3B parameter T5 model on NVIDIA DGX SuperPOD using different numbers of nodes. Scaling from 1 node to 20 nodes yielded a 14.68× speed-up.

NVIDIA is actively working on improving the scaling performance for T5 models. The table and chart below show the performance results.

|    | Nodes                           | 1      | 2      | 4      | 5      | 10      | 20      |
|----|---------------------------------|--------|--------|--------|--------|---------|---------|
| 3B | Tokens per Second               | 110769 | 215579 | 417644 | 515100 | 957506  | 1626353 |
|    | Perfect Linear Scaling (Tokens) | 110769 | 221538 | 443077 | 553846 | 1107692 | 2215385 |
|    | Speed-up                        | 1x     | 1.95x  | 3.77x  | 4.65x  | 8.64x   | 14.68x  |

Figure: 3B T5 NeMo Framework Throughput (3B_T5_throughput_2208.svg)

Inference Performance

Inference performance was measured for NVIDIA DGX SuperPOD (1 × 8 × A100 80GB). The results are shown in the table below.

Inference configurations:

  • Batch size: 1

  • Input tokens length: 60

  • Output tokens length: 20

Figure: Average Latency vs T5 Model Size (infer_model_size_t5.svg)

| T5 Model Size | Average Latency [ms] | TP | PP | GPUs |
|---------------|----------------------|----|----|------|
| 3B            | 94                   | 2  | 1  | 2    |
| 11B           | 123                  | 4  | 1  | 4    |
| 23B           | 213                  | 4  | 1  | 4    |
| 41B           | 332                  | 8  | 1  | 8    |

mT5 Results

Training Accuracy Results

Training accuracy: NVIDIA DGX SuperPOD

  • 4 × 8 × A100 80GB for 170M mT5 model

  • 8 × 8 × A100 80GB for 390M mT5 model

  • 20 × 8 × A100 80GB for 3B mT5 model

NVIDIA evaluated the mT5 models on the XQuAD task; the results are shown in the table below. You can fine-tune any trained .nemo checkpoint file on the XQuAD task as described in the section mT5 Fine-Tuning.

| Task-Language | Metric      | 170M | 390M |
|---------------|-------------|------|------|
| XQuAD-de      | Exact Match | 43.0 | 54.7 |
| XQuAD-en      | Exact Match | 63.8 | 68.8 |
| XQuAD-es      | Exact Match | 47.0 | 55.3 |
| XQuAD-hi      | Exact Match | 34.5 | 47.1 |
| XQuAD-zh      | Exact Match | 46.8 | 56.1 |

You can also prompt-learn on top of any trained .nemo checkpoint file on the SQuAD task described in the section T5 and mT5 Prompt Learning. The results are shown in the table below.

| Task  | Metric      | 390M  | 3B    |
|-------|-------------|-------|-------|
| SQuAD | Exact Match | 76.86 | 81.55 |
| SQuAD | F1          | 84.67 | 89.34 |

Training the 170M mT5 model to convergence takes 4 days. The figure below shows the loss curve.

Figure: 170M mT5 Training Loss (170M_mT5_loss_final.svg)

The table below shows the converged training loss, the throughput, and the total time to train for the 170M mT5 model, using a given number of GPUs and a given Global Batch Size (GBS).

| #GPUs | GBS  | Seq Length | Total Tokens | Loss  | Throughput (tokens/s) | Time to Train (days) |
|-------|------|------------|--------------|-------|-----------------------|----------------------|
| 32    | 2048 | 512        | 1T           | 1.980 | 4,112,062             | 4                    |

Training the 390M mT5 model to convergence takes 4 days. The figure below shows the loss curve.

Figure: 390M mT5 Training Loss (390M_mT5_loss_final.svg)

The table below shows the converged training loss, the throughput, and the total time to train for the 390M mT5 model, using a given number of GPUs and a given Global Batch Size (GBS).

| #GPUs | GBS  | Seq Length | Total Tokens | Loss  | Throughput (tokens/s) | Time to Train (days) |
|-------|------|------------|--------------|-------|-----------------------|----------------------|
| 64    | 2048 | 512        | 1T           | 1.584 | 3,744,914             | 4                    |

Training the 3B mT5 model to convergence takes 14 days. The figure below shows the loss curve of a fully trained model:

Figure: 3B mT5 Training Loss (3B_mT5_loss_final.svg)

The table below shows the converged training loss, the throughput, and the total time to train for the 3B mT5 model, using a given number of GPUs and a given Global Batch Size (GBS).

| #GPUs | GBS  | Seq Length | Total Tokens | Loss  | Throughput (tokens/s) | Time to Train (days) |
|-------|------|------------|--------------|-------|-----------------------|----------------------|
| 169   | 1920 | 512        | 1T           | 1.134 | 911,065               | 14                   |

Training Performance Results

Training performance: NVIDIA DGX SuperPOD (20 × 8 × A100 80GB for 3B mT5 model)

NVIDIA measured the throughput of training a 3B parameter mT5 model on NVIDIA DGX SuperPOD using different numbers of nodes. Scaling from 1 node to 20 nodes yielded a 14.3× speed-up.

NVIDIA is actively working on improving the scaling performance for mT5 models. The table and chart below show the performance results.

|    | Nodes                           | 1     | 2      | 4      | 5      | 10     | 20      |
|----|---------------------------------|-------|--------|--------|--------|--------|---------|
| 3B | Tokens per Second               | 91166 | 179583 | 346263 | 429088 | 798570 | 1303767 |
|    | Perfect Linear Scaling (Tokens) | 91166 | 182331 | 364663 | 455829 | 911657 | 1823314 |
|    | Speed-up                        | 1x    | 1.97x  | 3.8x   | 4.71x  | 8.76x  | 14.3x   |

Figure: 3B mT5 NeMo Framework Throughput (3B_mT5_throughput_2208.svg)

Inference Performance

Inference performance was measured for NVIDIA DGX SuperPOD (1 × 8 × A100 80GB).

Inference configurations:

  • Batch size: 1

  • Input tokens length: 60

  • Output tokens length: 20

Figure: Average Latency vs mT5 Model Size (infer_model_size_mt5.svg)

| mT5 Model Size | Average Latency [ms] | TP | PP | GPUs |
|----------------|----------------------|----|----|------|
| 380M           | 35                   | 1  | 1  | 1    |
| 3B             | 102                  | 2  | 1  | 2    |
| 11B            | 134                  | 4  | 1  | 4    |
| 23B            | 230                  | 4  | 1  | 4    |

BERT Results

Training Accuracy Results

Training accuracy: NVIDIA DGX SuperPOD (16 × 8 × A100 80GB for 4B BERT model)

Training the 4B BERT model for 95 billion tokens takes 1.5 days. The figure below shows the loss curve.

Figure: 4B BERT Training Loss, 220B Tokens (4b_bert_loss_final.png)

The table below shows the converged training loss, the throughput, and the total time to train for the 4B BERT model, using a given number of GPUs and a given Global Batch Size (GBS).

Training Performance Results

Training performance: NVIDIA DGX SuperPOD (20 × 8 × A100 80GB for 4B BERT model)

NVIDIA measured the throughput of training a 4B parameter BERT model on NVIDIA DGX SuperPOD using different numbers of nodes. Scaling from 1 node to 16 nodes yielded a 12.71× speed-up. The table and chart below show the performance results.

|    | Nodes                           | 1     | 2      | 4      | 8      | 16     |
|----|---------------------------------|-------|--------|--------|--------|--------|
| 4B | Tokens per Second               | 57287 | 108695 | 215358 | 393167 | 728178 |
|    | Perfect Linear Scaling (Tokens) | 57287 | 114574 | 229148 | 458296 | 916592 |
|    | Speed-up                        | 1x    | 1.89x  | 3.75x  | 6.86x  | 12.71x |

Figure: 4B BERT NeMo Framework Throughput (4B_bert_throughput_2211.png)
