GPT Results

Training accuracy was measured on an NVIDIA DGX SuperPOD™:

  • 8 × 8 × A100 80GB for the 126M GPT model

  • 16 × 8 × A100 80GB for the 5B GPT model

NVIDIA evaluated the 126M parameter and 5B parameter models on eight different language tasks. The results are shown in the table below. All of the tasks are provided as part of the evaluation harness, so you can evaluate any .nemo checkpoint file on all of them.
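As a rough illustration, the sketch below shows how such an evaluation run might be assembled with the NeMo Framework Launcher. The stage name, the evaluation config name, and the checkpoint-path override used here are assumptions, not the documented interface; check them against the launcher's conf/ directory for your release.

```python
# Hypothetical sketch only: the override names below (stages list, evaluation config,
# checkpoint-path key) are assumptions and may differ in your launcher version.
import subprocess

checkpoint = "/results/gpt_5b/checkpoints/megatron_gpt.nemo"  # hypothetical path

subprocess.run(
    [
        "python3", "main.py",                         # launcher entry point
        "stages=[evaluation]",                        # run only the evaluation stage
        "evaluation=gpt3/evaluate_all",               # assumed config covering the tasks above
        f"evaluation.model.nemo_model={checkpoint}",  # assumed key pointing at the .nemo file
    ],
    check=True,
)
```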

Task          Metric              126M     5B
Lambada       Accuracy            38.70%   68.93%
              PPL                 25.8     4.22
Boolq         Accuracy            56.94%   65.29%
Race          Accuracy            28.71%   38.66%
              Accuracy Norm       34.74%   41.62%
Piqa          Accuracy            61.21%   73.88%
              Accuracy Norm       61.97%   75.40%
Hellaswag     Accuracy            28.48%   46.45%
              Accuracy Norm       29.54%   60.85%
Winogrande    Accuracy            50.43%   60.77%
Wikitext2     Word PPL            31.35    12.36
              Byte PPL            1.9      1.6
              Bits per Byte PPL   0.64     0.47
Wikitext103   Word PPL            31.35    12.36
              Byte PPL            1.9      1.6
              Bits per Byte PPL   0.64     0.47

Training the 5B GPT model to convergence takes 6.5 days, and the loss curve is shown in the figure below.

Figure: 5B GPT Training Loss (5B_GPT_3_loss_final.svg)

The table below shows the converged training loss, the throughput, and the total time to train for the 5B GPT model, using a given number of GPUs and a given Global Batch Size (GBS).

#GPUs  GBS   Seq Length  #Tokens  Loss   Throughput [Tok/s]  Time to Train [days]
160    1440  2048        300B     1.685  726,384             4.8
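As a sanity check, the 4.8-day figure follows directly from the token count and throughput in the table above; the short sketch below reproduces the arithmetic.

```python
# Reproduce the time-to-train figure from the table above:
# time = total training tokens / sustained throughput.
total_tokens = 300e9          # 300B tokens
tokens_per_second = 726_384   # measured throughput on 160 GPUs

seconds = total_tokens / tokens_per_second
days = seconds / 86_400
print(f"{days:.2f} days")     # ~4.78 days, reported as 4.8
```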

Training performance was measured on:

  • NVIDIA DGX SuperPOD (16 × 8 × A100 80GB for the 5B GPT model)

  • NVIDIA DGX SuperPODs (128 × 8 × A100 80GB for the 175B GPT model)

NVIDIA measured the throughput of training 5B and 175B parameter GPT models on different numbers of DGX nodes and achieved near-linear scaling. For example, scaling a 5B model from 1 node to 32 nodes yielded a 28.73× speed-up, and scaling a 175B model from 8 nodes to 128 nodes (16× more) yielded a 14.62× speed-up. The tables and charts below show the performance results.

5B GPT model

Nodes                            1        2        4        8        16       32
Tokens per Second                40345    79815    161754   312774   659481   1159288
Perfect Linear Scaling (Tokens)  40345    80690    161380   322760   645520   1291040
Speed-up                         1x       1.98x    4.01x    7.75x    16.35x   28.73x
Figure: 5B GPT NeMo Framework Throughput (5B_GPT_3_throughput.svg)
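A quick way to read the table above is to compare measured throughput against perfect linear scaling. The sketch below derives the speed-up column from the measured tokens-per-second numbers and adds a scaling-efficiency figure; the same calculation applies to the 175B table that follows.

```python
# Speed-up and scaling efficiency for the 5B GPT model, from the table above.
nodes             = [1,     2,     4,      8,      16,     32]
tokens_per_second = [40345, 79815, 161754, 312774, 659481, 1159288]

base = tokens_per_second[0]
for n, tps in zip(nodes, tokens_per_second):
    speedup    = tps / base          # measured speed-up vs. 1 node
    efficiency = speedup / n         # fraction of perfect linear scaling
    print(f"{n:3d} nodes: {speedup:5.2f}x speed-up, {efficiency:5.1%} efficiency")
```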

175B GPT model

Nodes                            8       16      32      64      128
Tokens per Second                7500    14950   29537   58211   109684
Perfect Linear Scaling (Tokens)  7500    15000   30000   60000   120000
Speed-up                         1x      1.99x   3.94x   7.76x   14.62x
Figure: 175B GPT NeMo Framework Throughput (175B_GPT_3_throughput.svg)

Inference performance was measured on:

  • 1-8 × A100 80GB SXM4

  • 1-8 × H100 80GB HBM3

Configuration 1: Chatbot Conversation use case

  • batch size: 1 - 8

  • input tokens length: 128

  • output tokens length: 20

Figures: model size scaling (infer_modelsize_scaling_gpt_128_20.svg), GPU scaling (infer_gpu_scaling_gpt_8b_128_20.svg), and batch size scaling (infer_bs_scaling_gpt_8b_128_20.svg)

Average Latency, Average Throughput, and Model Size

Model Size  Batch Size  Latency A100 [ms]  Latency H100 [ms]  Throughput A100 [sentences/s]  Throughput H100 [sentences/s]  TP  PP  GPUs
(A100 = A100 80GB SXM4, H100 = H100 80GB HBM3)

8B 1 238.6 151.9 4.2 6.6 1 1 1
8B 2 247.9 156.9 8.1 12.7 1 1 1
8B 4 273.4 165.5 14.6 24.2 1 1 1
8B 8 321.9 188.2 24.9 42.5 1 1 1
8B 1 170.6 117.4 5.9 8.5 2 1 2
8B 2 176.0 120.1 11.4 16.6 2 1 2
8B 4 191.0 126.1 20.9 31.7 2 1 2
8B 8 226.6 141.1 35.3 56.7 2 1 2
8B 1 131.8 97.3 7.6 10.3 4 1 4
8B 2 136.3 102.0 14.7 19.6 4 1 4
8B 4 147.7 107.2 27.1 37.3 4 1 4
8B 8 171.5 119.2 46.7 67.1 4 1 4
8B 1 121.0 88.7 8.3 11.3 8 1 8
8B 2 127.7 95.7 15.7 20.9 8 1 8
8B 4 140.3 102.0 28.5 39.2 8 1 8
8B 8 160.4 112.8 49.9 70.9 8 1 8
43B 1 631.2 395.1 1.6 2.5 2 1 2
43B 2 668.4 402.3 3.0 5.0 2 1 2
43B 4 735.2 424.6 5.4 9.4 2 1 2
43B 8 854.5 477.1 9.4 16.8 2 1 2
43B 1 394.9 258.2 2.5 3.9 4 1 4
43B 2 412.3 261.0 4.9 7.7 4 1 4
43B 4 448.2 275.9 8.9 14.5 4 1 4
43B 8 523.7 308.7 15.3 25.9 4 1 4
43B 1 301.0 210.9 3.3 4.7 8 1 8
43B 2 314.7 213.4 6.4 9.4 8 1 8
43B 4 343.1 223.4 11.7 17.9 8 1 8
43B 8 384.7 247.4 20.8 32.3 8 1 8
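Throughput and latency in these tables are tied together in the obvious way for a single-stream run: sentences per second is the batch size divided by the average latency. The sketch below is a minimal check of that relationship against a few rows of the Configuration 1 table.

```python
# Throughput [sentences/s] ~= batch_size / (average latency in seconds),
# checked against rows of the Configuration 1 (128 in / 20 out) table, A100 column.
rows = [
    # (model, batch, latency_ms_a100, reported_throughput_a100)
    ("8B",  1, 238.6,  4.2),
    ("8B",  8, 321.9, 24.9),
    ("43B", 8, 854.5,  9.4),
]
for model, batch, latency_ms, reported in rows:
    derived = batch / (latency_ms / 1000.0)
    print(f"{model} bs={batch}: derived {derived:.1f} vs reported {reported} sentences/s")
```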

Configuration 2: Translation / Style Transfer use case

  • batch size: 1 - 8

  • input tokens length: 200

  • output tokens length: 200

Figures: model size scaling (infer_modelsize_scaling_gpt_200_200.svg), GPU scaling (infer_gpu_scaling_gpt_8b_200_200.svg), and batch size scaling (infer_bs_scaling_gpt_8b_200_200.svg)

Average Latency, Average Throughput, and Model Size

Model Size  Batch Size  Latency A100 [ms]  Latency H100 [ms]  Throughput A100 [sentences/s]  Throughput H100 [sentences/s]  TP  PP  GPUs
(A100 = A100 80GB SXM4, H100 = H100 80GB HBM3)

8B 1 2,290.6 1,435.7 0.4 0.7 1 1 1
8B 2 2,325.4 1,468.8 0.9 1.4 1 1 1
8B 4 2,478.7 1,506.3 1.6 2.7 1 1 1
8B 8 2,693.7 1,644.4 3.0 4.9 1 1 1
8B 1 1,558.9 1,047.4 0.6 1.0 2 1 2
8B 2 1,597.4 1,066.8 1.9 1.9 2 1 2
8B 4 1,653.7 1,095.3 2.4 3.7 2 1 2
8B 8 1,823.3 1,155.6 4.4 6.9 2 1 2
8B 1 1,167.3 849.8 0.9 1.2 4 1 4
8B 2 1,202.9 892.0 1.7 2.2 4 1 4
8B 4 1,260.3 915.3 3.2 4.4 4 1 4
8B 8 1,329.1 968.7 6.0 8.3 4 1 4
8B 1 1,057.8 747.6 0.9 1.3 8 1 8
8B 2 1,110.5 819.4 1.8 2.4 8 1 8
8B 4 1,187.1 855.9 3.4 4.7 8 1 8
8B 8 1,268.1 900.2 6.3 8.9 8 1 8
43B 1 6,117.2 3,817.2 0.2 0.3 2 1 2
43B 2 6,375.8 3,856.8 0.3 0.5 2 1 2
43B 4 6,616.7 3,919.8 0.6 1.0 2 1 2
43B 8 7,026.5 4,141.1 1.1 1.9 2 1 2
43B 1 3,754.8 2,437.0 0.3 0.4 4 1 4
43B 2 3,877.3 2,442.7 0.5 0.8 4 1 4
43B 4 3,974.5 2,503.3 1.0 1.6 4 1 4
43B 8 4,275.2 2,593.0 1.9 3.1 4 1 4
43B 1 2,810.5 1,953.9 0.4 0.5 8 1 8
43B 2 2,902.4 1,961.9 0.7 1.0 8 1 8
43B 4 3,024.5 2,000.7 1.3 2.0 8 1 8
43B 8 3,126.1 2,082.8 2.6 3.8 8 1 8
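One practical way to read these inference tables is as an A100-to-H100 comparison at matched configuration; the sketch below derives the latency speed-up for a few Configuration 2 rows.

```python
# H100 vs. A100 latency speed-up at identical model size, batch size, and parallelism,
# using rows from the Configuration 2 (200 in / 200 out) table.
rows = [
    # (model, batch, tp, latency_ms_a100, latency_ms_h100)
    ("8B",  1, 1, 2290.6, 1435.7),
    ("8B",  8, 8, 1268.1,  900.2),
    ("43B", 8, 2, 7026.5, 4141.1),
]
for model, batch, tp, a100, h100 in rows:
    print(f"{model} bs={batch} TP={tp}: {a100 / h100:.2f}x lower latency on H100")
```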