Performance

The NVIDIA NeMo Framework accelerates the AI workflow end to end, from data preparation through model training to inference. For training advanced generative AI models, it incorporates recent techniques such as model parallelization and optimized attention mechanisms to achieve high training throughput. For inference, it provides a path through TensorRT-LLM, a specialized library for accelerating and optimizing LLM inference on NVIDIA GPUs.

Performance Summary for Large Language Models

Below are performance benchmarks for various large language models. These results were obtained using a version of the performance recipes available here.

Abbreviations:
- GBS: Global Batch Size
- MBS: Micro Batch Size
- FSDP: Fully Sharded Data Parallel
  - FSDP = 1: use FSDP
  - FSDP = 0: use DDP (Distributed Data Parallel)
- TP: Tensor Parallel Size
- PP: Pipeline Parallel Size
- CP: Context Parallel Size
- VP: Virtual Pipeline Parallel Size
- EP: Expert Parallel Size
- GA: Number of Gradient Accumulations
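These abbreviations are tied together by simple arithmetic. Under the usual Megatron-style convention, the data-parallel size is the GPU count divided by TP × PP × CP, and GA is GBS divided by MBS × data-parallel size. The sketch below assumes that convention; the function names are illustrative and are not part of the NeMo API. For FSDP = 1 rows, where TP, PP, and CP are all 1, the data-parallel size is simply the GPU count.

```python
def data_parallel_size(num_gpus, tp=1, pp=1, cp=1):
    """GPUs left for data parallelism after model parallelism is applied.

    Assumes the common convention DP = num_gpus / (TP * PP * CP).
    """
    model_parallel = tp * pp * cp
    if num_gpus % model_parallel != 0:
        raise ValueError("num_gpus must be divisible by TP * PP * CP")
    return num_gpus // model_parallel


def gradient_accumulation_steps(gbs, mbs, dp):
    """Number of micro-batches each data-parallel rank accumulates per step."""
    if gbs % (mbs * dp) != 0:
        raise ValueError("GBS must be divisible by MBS * DP")
    return gbs // (mbs * dp)


# GPT3-175B pre-training row for DGX-GB200: 128 GPUs, GBS 256, MBS 2, TP 4, PP 4, CP 1
dp = data_parallel_size(128, tp=4, pp=4, cp=1)
ga = gradient_accumulation_steps(gbs=256, mbs=2, dp=dp)
print(dp, ga)
```

For that row the computed GA matches the value of 16 reported in the table.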

Pre-Training Performance

The tables below summarize the pre-training performance of various models at FP8 precision. We apply per-tensor FP8 quantization with scaling factors computed in the current step; the same scheme is used for both pre-training and fine-tuning.
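To make "per-tensor quantization with scaling factors computed in the current step" concrete, here is a minimal pure-Python sketch. It is an illustration only, not the NeMo or Transformer Engine implementation. It assumes the E4M3 FP8 format (largest finite magnitude 448, 3 mantissa bits), which the text above does not specify, and all function names are hypothetical.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite magnitude in E4M3 (assumed format)


def current_scale(values, fp8_max=FP8_E4M3_MAX):
    """One scale for the whole tensor, from this step's absolute maximum."""
    amax = max(abs(v) for v in values)
    return fp8_max / amax if amax > 0 else 1.0


def round_to_e4m3(v):
    """Crude simulation of E4M3 rounding: 3 mantissa bits, saturating."""
    if v == 0.0:
        return 0.0
    v = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v))
    # -6 is the smallest normal exponent; below it the spacing stays fixed.
    exponent = max(math.floor(math.log2(abs(v))), -6)
    quantum = 2.0 ** (exponent - 3)  # spacing between representable values
    return round(v / quantum) * quantum


def quantize_dequantize(values):
    """Scale into FP8 range, round, then map back to the original range."""
    scale = current_scale(values)
    quantized = [round_to_e4m3(v * scale) for v in values]
    return [q / scale for q in quantized], scale


restored, scale = quantize_dequantize([0.5, -2.0, 0.03])
print(scale, restored)
```

Because the scale is derived from the tensor being quantized, the largest element always maps to the top of the FP8 range. Schemes that reuse a history of earlier maxima trade this property for not having to compute the maximum before the cast.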

Container: NeMo 25.04
System: DGX-GB200

Model	#-GPUs	GBS	MBS	Sequence Length	FSDP	TP	PP	CP	VP	EP	GA	Tokens / sec / GPU	Model TFLOP / sec / GPU
GPT3-175B	128	256	2	2048	0	4	4	1	12	1	16	1804	1942
LLAMA3-8B	8	128	2	8192	0	1	1	1	1	1	8	33099	1917
LLAMA3-70B	64	128	1	8192	1	1	1	1	1	1	2	4035	1943
LLAMA3-405B	128	64	1	8192	0	4	8	2	8	1	32	665	1763
Nemotron-15B	64	256	2	4096	0	1	1	1	1	1	2	17809	1674
Nemotron-340B	128	32	1	4096	0	8	4	1	12	1	8	737	1551
Mixtral-8x7B	64	256	2	4096	0	1	1	1	1	8	2	19275	1598
Mixtral-8x22B	256	64	1	65536	0	2	4	8	14	8	16	2731	1008

System: DGX-B200

Model	#-GPUs	GBS	MBS	Sequence Length	TP	PP	CP	VP	EP	GA	Tokens / sec / GPU	Model TFLOP / sec / GPU
GPT3-175B	512	2048	2	2048	4	4	1	6	1	32	1523	1639
LLAMA3-8B	8	128	2	8192	1	1	1	1	1	8	30411	1761
LLAMA3-70B	64	128	1	8192	2	4	2	5	1	32	3562	1715
LLAMA3-405B	128	64	1	8192	4	8	2	8	1	32	651	1726
Nemotron-15B	64	256	2	4096	1	1	1	1	1	2	16222	1525
Nemotron-340B	128	32	1	4096	8	4	1	12	1	8	632	1330
Mixtral-8x7B	64	256	2	4096	1	1	1	1	8	2	17617	1461
Mixtral-8x22B	256	64	1	65536	2	4	8	14	8	16	2399	885

System: DGX-H100

Model	#-GPUs	GBS	MBS	Sequence Length	TP	PP	CP	VP	EP	GA	Tokens / sec / GPU	Model TFLOP / sec / GPU
GPT3-175B	512	2048	2	2048	4	8	1	6	1	64	824	887
LLAMA3-8B	8	128	1	8192	1	1	2	1	1	32	13812	800
LLAMA3-70B	64	128	1	8192	4	8	1	5	1	64	1621	780
LLAMA3-405B	1024	512	1	8192	8	8	2	8	1	64	315	834
Nemotron-15B	64	256	2	4096	2	1	1	1	1	4	7915	744
Nemotron-340B	256	64	1	4096	8	8	1	12	1	16	335	704
Mixtral-8x7B	64	256	1	4096	1	4	1	8	8	16	7992	663
Mixtral-8x22B	256	256	1	65536	4	4	8	14	8	32	1277	471
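The two throughput columns can be turned into more tangible quantities. The sketch below derives the wall-clock time of one training step from GBS, sequence length, and Tokens / sec / GPU, and the implied model FLOPs per token from the ratio of the two throughput columns. The helper names are illustrative, and the calculation assumes every sequence in the batch is filled to the stated sequence length.

```python
def step_time_seconds(gbs, seq_len, tokens_per_sec_per_gpu, num_gpus):
    """Seconds per optimizer step: tokens in one global batch / cluster throughput."""
    tokens_per_step = gbs * seq_len
    cluster_tokens_per_sec = tokens_per_sec_per_gpu * num_gpus
    return tokens_per_step / cluster_tokens_per_sec


def model_flops_per_token(model_tflops_per_gpu, tokens_per_sec_per_gpu):
    """Model FLOPs spent per trained token, implied by the two throughput columns."""
    return model_tflops_per_gpu * 1e12 / tokens_per_sec_per_gpu


# GPT3-175B on DGX-GB200: 128 GPUs, GBS 256, sequence length 2048,
# 1804 tokens/sec/GPU, 1942 model TFLOP/sec/GPU
step = step_time_seconds(256, 2048, 1804, 128)
flops = model_flops_per_token(1942, 1804)
print(f"{step:.2f} s per step, {flops:.3e} FLOPs per token")
```

For this row the result is about 2.27 seconds per step and about 1.08e12 FLOPs per token. That is close to the common rule of thumb of 6 × parameter count (1.05e12 for 175B parameters), with the remainder plausibly attributable to attention and other terms the rule of thumb ignores.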

Fine-Tuning Performance

The tables below show the fine-tuning performance of Llama3 models using Supervised Fine-Tuning (SFT) and Low-Rank Adaptation (LoRA) at FP8 precision with NeMo 2.0.

Container: NeMo 25.04

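For fine-tuning, we use the SQuAD-v1.1 dataset with inputs packed to the length listed in the Packed Sequence Length column of each table (2048, 4096, or 16384 tokens, depending on the configuration).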

System: DGX-GB200

Model	Task	#-GPUs	GBS	MBS	Packed Sequence Length	TP	PP	VP	GA	Tokens / sec / GPU	Model TFLOP / sec / GPU
LLAMA3-8B	SFT	8	8	1	16384	1	1	1	1	33437	1515
LLAMA3-70B	SFT	32	32	1	4096	2	4	5	8	3977	1658
LLAMA3-8B	LoRA	8	8	1	16384	1	1	1	1	44281	1342
LLAMA3-70B	LoRA	8	32	1	4096	1	4	20	16	3238	1804
LLAMA3-405B	LoRA	32	32	1	2048	4	4	4	16	966	1568

System: DGX-B200

Model	Task	#-GPUs	GBS	MBS	Packed Sequence Length	TP	PP	VP	GA	Tokens / sec / GPU	Model TFLOP / sec / GPU
LLAMA3-8B	SFT	8	8	1	16384	1	1	1	1	30913	1401
LLAMA3-70B	SFT	32	32	1	4096	2	4	5	8	3657	1525
LLAMA3-8B	LoRA	8	8	1	16384	1	1	1	1	42010	1274
LLAMA3-70B	LoRA	8	32	1	4096	1	4	20	16	5611	1563
LLAMA3-405B	LoRA	32	32	1	2048	4	4	4	16	678	1101

System: DGX-H100

Model	Task	#-GPUs	GBS	MBS	Packed Sequence Length	TP	PP	VP	GA	Tokens / sec / GPU	Model TFLOP / sec / GPU
LLAMA3-8B	SFT	8	32	1	4096	1	1	1	4	16222	733
LLAMA3-70B	SFT	32	32	1	4096	4	4	5	16	1619	675
LLAMA3-8B	LoRA	8	32	1	4096	1	1	1	4	22141	669
LLAMA3-70B	LoRA	8	32	1	4096	2	4	20	32	2621	730
LLAMA3-405B	LoRA	32	32	1	2048	4	8	8	32	496	805

Performance Numbers for Previous NeMo Container Releases

Performance Summary Archive