Important

You are viewing the NeMo 2.0 documentation. This release introduces significant changes to the API and a new library, NeMo Run. We are currently porting all features from NeMo 1.0 to 2.0. For documentation on previous versions or features not yet available in 2.0, please refer to the NeMo 24.07 documentation.

Performance

The NVIDIA NeMo Framework accelerates the AI workflow end-to-end, from data preparation to model training to inference. For training advanced generative AI models, it achieves high throughput by incorporating recent techniques such as model parallelism and optimized attention mechanisms. For inference, the NeMo Framework provides a path that leverages TensorRT-LLM, a specialized library for accelerating and optimizing LLM inference on NVIDIA GPUs.

The sections below report performance benchmarks for several large language models.

Performance Summary for Large Language Models

Pretraining

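The table below shows the pretraining performance of various models at FP8 precision (using NeMo 2.0). In the column headers, GBS and MBS are the global and micro batch sizes, and TP, PP, CP, and VP are the tensor, pipeline, context, and virtual pipeline parallelism sizes.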

Container: NeMo 24.09
System: DGX-H100

Model	#-GPUs	GBS	MBS	Sequence Length	TP	PP	CP	VP	Tokens / sec / GPU	Model TFLOP / sec / GPU	Est. time to train in days (10T tokens, 1K GPUs)
GPT3-175B	128	256	1	2048	4	8	1	6	747	805 (dropout > 0)	151
GPT3-175B	512	2048	2	2048	4	8	1	6	845	910	134
LLAMA3-8B	8	128	1	8192	1	1	2	1	13443	779	8
LLAMA3-70B	64	128	1	8192	4	4	2	5	1557	750	73
LLAMA3-405B	576	252	1	8192	8	9	2	7	314	833	360
Nemotron-8B	64	256	4	4096	2	1	1	1	12701	653	9
Nemotron-22B	64	256	2	4096	2	4	1	10	4980	649	23
Nemotron-340B	128	32	1	4096	8	8	1	12	346	728	327
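
The last column of the table can be approximately reproduced from the Tokens / sec / GPU column. The Python sketch below is illustrative only: the helper name `days_to_train` is hypothetical (it is not part of NeMo), and it assumes that "1K GPUs" means 1,024 GPUs, that "10T tokens" means 10^13 tokens, and that throughput scales linearly with the number of GPUs. Under those assumptions, it matches the rounded values in the table.

```python
SECONDS_PER_DAY = 86_400


def days_to_train(tokens_per_sec_per_gpu: float,
                  total_tokens: float = 10e12,  # assumption: "10T tokens"
                  num_gpus: int = 1024) -> float:  # assumption: "1K GPUs" = 1,024
    """Estimate wall-clock training days, assuming linear scaling across GPUs."""
    cluster_tokens_per_sec = tokens_per_sec_per_gpu * num_gpus
    return total_tokens / cluster_tokens_per_sec / SECONDS_PER_DAY


# Tokens / sec / GPU values copied from the pretraining table.
measured = {
    "GPT3-175B (512 GPUs)": 845,
    "LLAMA3-8B": 13443,
    "LLAMA3-405B": 314,
}

for model, tokens_per_sec in measured.items():
    print(f"{model}: {days_to_train(tokens_per_sec):.0f} days")
```

Because the estimate is inversely proportional to per-GPU throughput, doubling Tokens / sec / GPU halves the projected training time for a fixed token budget and cluster size.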

Fine-Tuning

The table below presents the fine-tuning performance of Llama 3 models using Supervised Fine-Tuning (SFT) and Low-Rank Adaptation (LoRA) at FP8 precision (using NeMo 2.0).

Container: NeMo 24.09
System: DGX-H100

For fine-tuning, we use the SQuAD-v1.1 dataset, with inputs packed to the sequence length listed in the table (4096 tokens, or 2048 tokens for LLAMA3-405B).

Model	Task	#-GPUs	GBS	MBS	Packed Sequence Length	TP	PP	VP	Tokens / sec / GPU	Model TFLOP / sec / GPU	Est. time to complete in mins (10M tokens)
LLAMA3-8B	SFT	8	32	1	4096	1	1	1	16063	726	1.30
LLAMA3-70B	SFT	32	32	1	4096	4	4	5	1645	686	3.17
LLAMA3-8B	LoRA	8	32	1	4096	1	1	1	21845	660	0.95
LLAMA3-70B	LoRA	8	32	1	4096	2	4	20	2744	764	7.59
LLAMA3-405B	LoRA	24	24	1	2048	4	6	7	513	833	13.53
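
The completion-time column appears to follow from the per-GPU throughput and the number of GPUs in each row. The Python sketch below shows that arithmetic; the helper name `minutes_to_finetune` is hypothetical (not part of NeMo), and it assumes that "10M tokens" means 10^7 tokens processed across all GPUs in the row. Results can differ from the table in the last digit because the published throughput values are rounded.

```python
def minutes_to_finetune(tokens_per_sec_per_gpu: float,
                        num_gpus: int,
                        total_tokens: float = 10e6) -> float:  # assumption: "10M tokens"
    """Estimate minutes to process total_tokens on num_gpus GPUs."""
    cluster_tokens_per_sec = tokens_per_sec_per_gpu * num_gpus
    return total_tokens / cluster_tokens_per_sec / 60


# (model, task, #-GPUs, Tokens / sec / GPU) copied from the fine-tuning table.
rows = [
    ("LLAMA3-8B", "SFT", 8, 16063),
    ("LLAMA3-70B", "SFT", 32, 1645),
    ("LLAMA3-70B", "LoRA", 8, 2744),
]

for model, task, gpus, tokens_per_sec in rows:
    minutes = minutes_to_finetune(tokens_per_sec, gpus)
    print(f"{model} {task}: {minutes:.2f} min")
```

Unlike the pretraining estimate, this one uses each row's own GPU count, so rows with different #-GPUs values are not directly comparable on time alone.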