Performance Benchmarks

Large Language Models

Pretraining

  • The table below shows pre-training performance for a range of models at FP8 precision. A short sketch after the table shows how the estimated training time follows from the measured throughput.

  • GBS and MBS are the global and micro batch sizes; TP, PP, CP, and VP are the tensor-, pipeline-, context-, and virtual-pipeline-parallel sizes.

| Model | #-GPUs | GBS | MBS | Sequence Length | TP | PP | CP | VP | Tokens / sec / GPU | Model TFLOP / sec / GPU | Est. time to train in days (10T tokens, 1K GPUs) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT3-5B | 64 | 2048 | 4 | 2048 | 1 | 1 | 1 | 1 | 23406 | 765 | 5 |
| GPT3-20B | 64 | 256 | 2 | 2048 | 2 | 1 | 1 | 1 | 5851 | 750 | 19 |
| GPT3-175B | 128 | 256 | 1 | 2048 | 4 | 8 | 1 | 6 | 716 | 771 | 158 |
| GPT3-175B | 512 | 2048 | 2 | 2048 | 4 | 8 | 1 | 6 | 825 | 888 | 137 |
| LLAMA2-7B | 8 | 128 | 1 | 4096 | 1 | 1 | 1 | 1 | 16934 | 780 | 7 |
| LLAMA2-13B | 16 | 128 | 1 | 4096 | 1 | 4 | 1 | 10 | 8715 | 760 | 13 |
| LLAMA2-70B | 64 | 128 | 1 | 4096 | 4 | 4 | 1 | 20 | 1728 | 768 | 65 |
| Nemotron-8B | 64 | 256 | 4 | 4096 | 2 | 1 | 1 | 1 | 12507 | 643 | 9 |
| Nemotron-22B | 64 | 256 | 2 | 4096 | 2 | 4 | 1 | 10 | 4312 | 562 | 26 |
| Nemotron-340B | 128 | 32 | 1 | 4096 | 8 | 8 | 1 | 12 | 326 | 686 | 347 |
| LLAMA3-8B | 8 | 128 | 1 | 8192 | 1 | 1 | 2 | 1 | 12273 | 711 | 9 |
| LLAMA3-70B | 64 | 128 | 1 | 8192 | 4 | 4 | 2 | 5 | 1524 | 734 | 74 |
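The "Est. time to train" column is, to a close approximation, the 10T-token budget divided by the aggregate throughput of 1,000 GPUs. The sketch below shows that arithmetic; `est_days_to_train` is an illustrative helper (not part of NeMo), and the published figures may fold in additional overheads, so small deviations from the table are expected.

```python
def est_days_to_train(tokens_per_sec_per_gpu: float,
                      num_gpus: int = 1000,
                      total_tokens: float = 10e12) -> float:
    """Rough wall-clock estimate: total tokens divided by aggregate throughput."""
    seconds = total_tokens / (tokens_per_sec_per_gpu * num_gpus)
    return seconds / 86400  # seconds per day

# Example: GPT3-5B at 23,406 tokens/sec/GPU on 1,000 GPUs over 10T tokens
print(f"{est_days_to_train(23406):.1f} days")  # ~4.9 days, reported as 5 in the table
```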

Finetuning

  • The table below shows fine-tuning performance of LLAMA2 models with supervised fine-tuning (SFT) and low-rank adaptation (LoRA) at FP8 precision. A short sketch after the table shows how the estimated fine-tuning time follows from the measured throughput.

  • For fine-tuning, we use the SQuAD-v1.1 dataset (https://rajpurkar.github.io/SQuAD-explorer/), and the inputs are packed to 4096 tokens.

| Model | Task | #-GPUs | GBS | MBS | Packed Sequence Length | TP | PP | Tokens / sec / GPU | Model TFLOP / sec / GPU | Est. time to finetune in mins (10M tokens) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLAMA2-7B | SFT | 8 | 32 | 1 | 4096 | 1 | 1 | 16891 | 673 | 1.2 |
| LLAMA2-13B | SFT | 8 | 32 | 1 | 4096 | 1 | 4 | 10176 | 787 | 2.0 |
| LLAMA2-70B | SFT | 16 | 32 | 1 | 4096 | 4 | 4 | 1816 | 749 | 5.7 |
| LLAMA2-7B | LoRA | 8 | 32 | 1 | 4096 | 1 | 1 | 24824 | 663 | 0.8 |
| LLAMA2-13B | LoRA | 8 | 32 | 1 | 4096 | 1 | 1 | 14629 | 757 | 1.4 |
| LLAMA2-70B | LoRA | 8 | 32 | 1 | 4096 | 2 | 4 | 2621 | 722 | 7.9 |
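Likewise, the "Est. time to finetune" column is consistent with dividing the 10M-token budget by the aggregate measured throughput of the listed GPUs. The sketch below reproduces the table values; `est_minutes_to_finetune` is an illustrative helper, not a NeMo API.

```python
def est_minutes_to_finetune(tokens_per_sec_per_gpu: float,
                            num_gpus: int,
                            total_tokens: float = 10e6) -> float:
    """Rough wall-clock estimate in minutes: total tokens over aggregate throughput."""
    return total_tokens / (tokens_per_sec_per_gpu * num_gpus) / 60

# Example: LLAMA2-7B with LoRA at 24,824 tokens/sec/GPU on 8 GPUs
print(f"{est_minutes_to_finetune(24824, 8):.1f} min")  # ~0.8 min, matching the table
```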