# Performance
The NVIDIA NeMo Framework accelerates the entire AI workflow end to end, from data preparation to model training to inference. For training, it achieves high throughput on advanced generative AI models by incorporating the latest techniques, such as model parallelism and optimized attention mechanisms. For inference, it provides an export path to TensorRT-LLM, a specialized library for accelerating and optimizing LLM inference on NVIDIA GPUs.
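As a quick orientation, the sketch below shows what that inference path can look like in code. It is a minimal, hedged example: the `TensorRTLLM` exporter class and its methods follow the NeMo export documentation, but the paths and exact argument names are assumptions and may differ between container versions.

```python
# Minimal sketch of exporting a .nemo checkpoint to a TensorRT-LLM engine and
# running generation on it. Paths and some argument names are assumptions.
from nemo.export.tensorrt_llm import TensorRTLLM

exporter = TensorRTLLM(model_dir="/tmp/trtllm_engine")    # directory where the engine is built
exporter.export(
    nemo_checkpoint_path="/models/llama3-8b.nemo",        # hypothetical checkpoint path
    model_type="llama",
)
output = exporter.forward(["Summarize the NeMo performance summary in one sentence."])
print(output)
```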
Below, you can see performance benchmarks for various large language models.
## Performance Summary for Large Language Models

### Pre-Training Performance
The table below shows pre-training performance for various models at FP8 precision with NeMo 2.0. GBS and MBS are the global and micro batch sizes; TP, PP, CP, and VP are the tensor-, pipeline-, context-, and virtual-pipeline-parallel sizes.
Container: NeMo 24.12
System: DGX-H100
| Model | #-GPUs | GBS | MBS | Sequence Length | TP | PP | CP | VP | Tokens / sec / GPU | Model TFLOP / sec / GPU | Est. time to train in days (10T tokens, 1K GPUs) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT3-175B | 128 | 256 | 1 | 2048 | 4 | 8 | 1 | 6 | 794 | 854 (dropout > 0) | 142 |
| GPT3-175B | 512 | 2048 | 2 | 2048 | 4 | 8 | 1 | 6 | 850 | | 133 |
| LLAMA3-8B | 8 | 128 | 1 | 8192 | 1 | 1 | 2 | 1 | 14064 | 814 | 8 |
| LLAMA3-70B | 64 | 128 | 1 | 8192 | 4 | 4 | 2 | 5 | 1633 | 786 | 69 |
| LLAMA3-405B | 576 | 252 | 1 | 8192 | 8 | 9 | 2 | 7 | 312 | 827 | 362 |
| Nemotron-8B | 64 | 256 | 4 | 4096 | 2 | 1 | 1 | 1 | 13003 | 668 | 9 |
| Nemotron-15B | 64 | 256 | 4 | 4096 | 4 | 1 | 1 | 1 | 7550 | 710 | 15 |
| Nemotron-22B | 64 | 256 | 2 | 4096 | 2 | 4 | 1 | 10 | 5831 | 759 | 19 |
| Nemotron-340B | 128 | 32 | 1 | 4096 | 8 | 8 | 1 | 12 | 367 | 773 | 308 |
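To give the parallelism columns concrete meaning, the sketch below maps the LLAMA3-8B row onto a NeMo 2.0 pre-training recipe. It assumes the recipe API from the NeMo 2.0 quickstart (`nemo.collections.llm` together with NeMo-Run); the exact attribute paths on the recipe object are assumptions and may vary between releases.

```python
# Sketch: configure a NeMo 2.0 pre-training recipe to match the LLAMA3-8B row
# above (GBS=128, MBS=1, sequence length 8192, TP=1, PP=1, CP=2, VP disabled).
# Attribute paths are assumptions based on the NeMo 2.0 recipe API.
import nemo_run as run
from nemo.collections import llm

recipe = llm.llama3_8b.pretrain_recipe(
    name="llama3_8b_pretrain",
    dir="/checkpoints/llama3_8b",   # hypothetical output directory
    num_nodes=1,
    num_gpus_per_node=8,
)

recipe.data.global_batch_size = 128                                   # GBS
recipe.data.micro_batch_size = 1                                      # MBS
recipe.data.seq_length = 8192                                         # Sequence Length
recipe.trainer.strategy.tensor_model_parallel_size = 1                # TP
recipe.trainer.strategy.pipeline_model_parallel_size = 1              # PP
recipe.trainer.strategy.context_parallel_size = 2                     # CP
recipe.trainer.strategy.virtual_pipeline_model_parallel_size = None   # VP=1 (interleaving disabled)

run.run(recipe, executor=run.LocalExecutor(ntasks_per_node=8, launcher="torchrun"))
```

For reference, the last column is consistent with `10T tokens / (1024 GPUs × Tokens/sec/GPU × 86,400 s/day)`; for example, 794 tokens/sec/GPU works out to roughly 142 days.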
### Fine-Tuning Performance
The table below highlights the fine-tuning performance for Llama3 models using Supervised Fine-Tuning (SFT) and Low-Rank Adaptation (LoRA) at FP8 precision with NeMo 2.0.
Container: NeMo 24.12
System: DGX-H100
For fine-tuning, we use the SQuAD v1.1 dataset with inputs packed to the sequence length shown in the Packed Sequence Length column.
| Model | Task | #-GPUs | GBS | MBS | Packed Sequence Length | TP | PP | VP | Tokens / sec / GPU | Model TFLOP / sec / GPU | Est. time to complete in mins (10M tokens) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LLAMA3-8B | SFT | 8 | 32 | 1 | 4096 | 1 | 1 | 1 | 16891 | 763 | 1.23 |
| LLAMA3-70B | SFT | 32 | 32 | 1 | 4096 | 4 | 4 | 5 | 1672 | 697 | 3.12 |
| LLAMA3-8B | LoRA | 8 | 32 | 1 | 4096 | 1 | 1 | 1 | 23406 | 707 | 0.89 |
| LLAMA3-70B | LoRA | 8 | 32 | 1 | 4096 | 2 | 4 | 20 | 2758 | 768 | 7.55 |
| LLAMA3-405B | LoRA | 24 | 24 | 1 | 2048 | 4 | 6 | 7 | 509 | 827 | 13.63 |
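The same recipe API exposes fine-tuning entry points. The sketch below approximates the LLAMA3-8B LoRA row; the `peft_scheme` and `packed_sequence` arguments follow the NeMo 2.0 documentation, but the exact names and defaults are assumptions for your installed version.

```python
# Sketch: LoRA fine-tuning roughly matching the LLAMA3-8B LoRA row above
# (8 GPUs, GBS=32, MBS=1, packed sequence length 4096, TP=PP=VP=1).
# Argument names are assumptions based on the NeMo 2.0 fine-tuning recipes.
import nemo_run as run
from nemo.collections import llm

recipe = llm.llama3_8b.finetune_recipe(
    name="llama3_8b_lora",
    dir="/checkpoints/llama3_8b_lora",  # hypothetical output directory
    num_nodes=1,
    num_gpus_per_node=8,
    peft_scheme="lora",                 # use "none" for full-parameter SFT
    packed_sequence=True,               # pack short SQuAD examples to the full sequence length
)

recipe.data.global_batch_size = 32      # GBS
recipe.data.micro_batch_size = 1        # MBS

run.run(recipe, executor=run.LocalExecutor(ntasks_per_node=8, launcher="torchrun"))
```

The last column is consistent with `10M tokens / (#-GPUs × Tokens/sec/GPU × 60 s/min)`; for example, 16,891 tokens/sec/GPU on 8 GPUs works out to about 1.23 minutes.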