
# Performance

The NVIDIA NeMo Framework accelerates the AI workflow end to end, from data preparation through model training to inference. It achieves high training throughput for advanced generative AI models by incorporating recent training techniques such as model parallelism and optimized attention mechanisms. For inference, the NeMo Framework provides a path through TensorRT-LLM, a specialized library for accelerating and optimizing LLM inference on NVIDIA GPUs.
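
As an illustration of that inference path, the sketch below exports a NeMo checkpoint to a TensorRT-LLM engine and runs a prompt through it. It is a minimal, hedged example: the checkpoint path, engine directory, and `model_type` value are placeholders, and the exporter's keyword arguments can vary between NeMo releases.

```python
# Minimal sketch of the TensorRT-LLM inference path.
# Paths and model_type are placeholders; argument names may differ between NeMo releases.
from nemo.export.tensorrt_llm import TensorRTLLM

exporter = TensorRTLLM(model_dir="/workspace/trtllm_engine")   # directory where the built engine is stored
exporter.export(
    nemo_checkpoint_path="/workspace/llama3_8b.nemo",          # placeholder NeMo checkpoint
    model_type="llama",                                        # architecture family of the checkpoint
)

# Run a quick generation against the freshly built engine.
print(exporter.forward(["What is the capital of France?"]))
```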

Below, you can see performance benchmarks for various large language models.

## Performance Summary for Large Language Models

### Pretraining

The tables below show the pre-training performance of various models at FP8 precision (using NeMo 2.0) on DGX-B200 and DGX-H100 systems. A sketch of how the batch and parallelism settings in these tables map onto a NeMo 2.0 recipe follows the tables.

- System: DGX-B200

| Model | #-GPUs | GBS | MBS | Sequence Length | TP | PP | CP | VP | EP | Tokens / sec / GPU | Model TFLOP / sec / GPU | Est. time to train in days (10T tokens, 1K GPUs) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT3-175B | 512 | 2048 | 2 | 2048 | 4 | 4 | 1 | 6 | 1 | 1600 | 1722 | 71 |
| LLAMA3-8B | 8 | 128 | 2 | 8192 | 1 | 1 | 1 | 1 | 1 | 26006 | 1506 | 4 |
| LLAMA3-70B | 64 | 128 | 1 | 8192 | 2 | 4 | 2 | 5 | 1 | 3062 | 1474 | 37 |
| LLAMA3-405B | 128 | 64 | 1 | 8192 | 4 | 8 | 2 | 8 | 1 | 625 | 1658 | 181 |
| Nemotron-15B | 64 | 256 | 2 | 4096 | 1 | 1 | 1 | 1 | 1 | 14760 | 1387 | 8 |
| Nemotron-340B | 128 | 32 | 1 | 4096 | 8 | 4 | 1 | 12 | 1 | 602 | 1268 | 188 |
| Mixtral-8x7B | 64 | 256 | 2 | 4096 | 1 | 1 | 1 | 1 | 8 | 15457 | 1282 | 7 |
| Mixtral-8x22B | 256 | 64 | 1 | 65536 | 2 | 4 | 8 | 14 | 8 | 2232 | 824 | 51 |
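
The last column is not an independent measurement; it is consistent with dividing the 10T-token budget by the aggregate throughput of "1K" GPUs. The small sanity check below uses the LLAMA3-8B row above; the 1024-GPU and 86,400-seconds-per-day factors are assumptions that reproduce the published numbers.

```python
# Sanity check of the "Est. time to train" column: 10T tokens on 1K (assumed 1024) GPUs.
tokens_per_sec_per_gpu = 26_006   # LLAMA3-8B row, DGX-B200 table
total_tokens = 10e12              # 10T-token training budget
num_gpus = 1024                   # "1K GPUs" in the column header
seconds_per_day = 86_400

days = total_tokens / (tokens_per_sec_per_gpu * num_gpus * seconds_per_day)
print(round(days, 1))             # ~4.3 days, matching the "4" in the table
```

The fine-tuning tables further below follow the same arithmetic with a 10M-token budget and the listed GPU count, reported in minutes.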

- System: DGX-H100

| Model | #-GPUs | GBS | MBS | Sequence Length | TP | PP | CP | VP | EP | Tokens / sec / GPU | Model TFLOP / sec / GPU | Est. time to train in days (10T tokens, 1K GPUs) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT3-175B | 512 | 2048 | 2 | 2048 | 4 | 8 | 1 | 6 | 1 | 866 | 932 | 131 |
| LLAMA3-8B | 8 | 128 | 1 | 8192 | 1 | 1 | 2 | 1 | 1 | 14201 | 822 | 8 |
| LLAMA3-70B | 64 | 128 | 1 | 8192 | 4 | 8 | 1 | 5 | 1 | 1662 | 800 | 68 |
| LLAMA3-405B | 1024 | 512 | 1 | 8192 | 8 | 8 | 2 | 8 | 1 | 327 | 866 | 346 |
| Nemotron-15B | 64 | 256 | 2 | 4096 | 2 | 1 | 1 | 1 | 1 | 8233 | 774 | 14 |
| Nemotron-340B | 256 | 64 | 1 | 4096 | 8 | 8 | 1 | 12 | 1 | 346 | 728 | 327 |
| Mixtral-8x7B | 64 | 256 | 1 | 4096 | 1 | 4 | 1 | 8 | 8 | 8233 | 683 | 14 |
| Mixtral-8x22B | 256 | 256 | 1 | 65536 | 4 | 4 | 8 | 14 | 8 | 1278 | 471 | 88 |
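
The GBS/MBS, sequence-length, and TP/PP/CP/VP/EP columns map directly onto a NeMo 2.0 pretraining recipe. The sketch below reproduces the LLAMA3-8B row of the DGX-B200 table under stated assumptions: recipe and attribute names follow the `nemo.collections.llm` recipes and `nemo_run` in recent releases and may differ in yours, and the output directory is a placeholder.

```python
# Sketch: map one table row (LLAMA3-8B, 8 GPUs, GBS 128, MBS 2, seq 8192, TP/PP/CP/VP/EP = 1)
# onto a NeMo 2.0 recipe. Attribute names may vary between NeMo releases; paths are placeholders.
import nemo_run as run
from nemo.collections import llm

recipe = llm.llama3_8b.pretrain_recipe(
    name="llama3_8b_perf_run",
    dir="/results",                 # placeholder checkpoint/log directory
    num_nodes=1,
    num_gpus_per_node=8,
)

# Batch and sequence settings (GBS / MBS / Sequence Length columns)
recipe.data.global_batch_size = 128
recipe.data.micro_batch_size = 2
recipe.data.seq_length = 8192

# Parallelism settings (TP / PP / CP / VP / EP columns)
recipe.trainer.strategy.tensor_model_parallel_size = 1
recipe.trainer.strategy.pipeline_model_parallel_size = 1
recipe.trainer.strategy.context_parallel_size = 1
recipe.trainer.strategy.virtual_pipeline_model_parallel_size = None  # VP = 1 disables interleaving
recipe.trainer.strategy.expert_model_parallel_size = 1

run.run(recipe, executor=run.LocalExecutor(ntasks_per_node=8, launcher="torchrun"))
```

For the larger multi-node rows, the same recipe fields apply; only `num_nodes`, `num_gpus_per_node`, and the executor change.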

### Fine-Tuning

The tables below present the fine-tuning performance of LLaMA3 models using Supervised Fine-Tuning (SFT) and Low-Rank Adaptation (LoRA) at FP8 precision (using NeMo 2.0).

For fine-tuning, we use the SQuAD-v1.1 dataset, with inputs packed to the sequence lengths shown in the tables. A LoRA configuration sketch follows the tables.

- System: DGX-B200

| Model | Task | #-GPUs | GBS | MBS | Packed Sequence Length | TP | PP | VP | Tokens / sec / GPU | Model TFLOP / sec / GPU | Est. time to complete in mins (10M tokens) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LLAMA3-8B | SFT | 8 | 8 | 1 | 16384 | 1 | 1 | 1 | 31508 | 1428 | 0.66 |
| LLAMA3-70B | SFT | 32 | 32 | 1 | 4096 | 2 | 4 | 5 | 3357 | 1400 | 1.55 |
| LLAMA3-8B | LoRA | 8 | 8 | 1 | 16384 | 1 | 1 | 1 | 43116 | 1307 | 0.48 |
| LLAMA3-70B | LoRA | 8 | 32 | 1 | 4096 | 1 | 4 | 20 | 5669 | 1579 | 3.67 |
| LLAMA3-405B | LoRA | 32 | 32 | 1 | 2048 | 4 | 4 | 4 | 759 | 1231 | 6.87 |

- System: DGX-H100

| Model | Task | #-GPUs | GBS | MBS | Packed Sequence Length | TP | PP | VP | Tokens / sec / GPU | Model TFLOP / sec / GPU | Est. time to complete in mins (10M tokens) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LLAMA3-8B | SFT | 8 | 32 | 1 | 4096 | 1 | 1 | 1 | 17246 | 779 | 1.21 |
| LLAMA3-70B | SFT | 32 | 32 | 1 | 4096 | 4 | 4 | 5 | 1789 | 746 | 2.91 |
| LLAMA3-8B | LoRA | 8 | 32 | 1 | 4096 | 1 | 1 | 1 | 23406 | 707 | 0.89 |
| LLAMA3-70B | LoRA | 8 | 32 | 1 | 4096 | 2 | 4 | 20 | 2768 | 771 | 7.53 |
| LLAMA3-405B | LoRA | 32 | 32 | 1 | 2048 | 4 | 8 | 8 | 521 | 846 | 9.99 |
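
As referenced above, the sketch below shows how a LoRA run like the LLAMA3-8B row (8 GPUs, GBS 32, MBS 1 on the DGX-H100 table) could be configured with a NeMo 2.0 fine-tuning recipe, which defaults to SQuAD data. It is a hedged example: the `peft_scheme` and `packed_sequence` arguments and the attribute names reflect recent `nemo.collections.llm` recipes and may differ across releases; the output directory is a placeholder.

```python
# Sketch: LoRA fine-tuning of LLAMA3-8B on 8 GPUs with packed sequences.
# Argument and attribute names may vary between NeMo releases; paths are placeholders.
import nemo_run as run
from nemo.collections import llm

recipe = llm.llama3_8b.finetune_recipe(
    name="llama3_8b_lora",
    dir="/results",               # placeholder output directory
    num_nodes=1,
    num_gpus_per_node=8,
    peft_scheme="lora",           # "none" would run full SFT instead
    packed_sequence=True,         # pack short SQuAD examples up to the packed sequence length
)

# Batch settings from the table row (GBS 32, MBS 1); the packed sequence length itself
# is configured on the recipe's data module.
recipe.data.global_batch_size = 32
recipe.data.micro_batch_size = 1

run.run(recipe, executor=run.LocalExecutor(ntasks_per_node=8, launcher="torchrun"))
```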

## Performance Numbers for previous NeMo Container Releases

- 24.12 NeMo container