Performance#

As part of the NVIDIA NeMo Framework, Megatron Bridge provides optimal performance for training advanced generative AI models by incorporating the latest training techniques, such as model parallelism and optimized attention mechanisms, to achieve high training throughput.

This page provides performance benchmarks for large language models using Megatron-Bridge across different GPU systems and configurations.

Nomenclature#

  • GBS: Global Batch Size

  • MBS: Micro Batch Size

  • FSDP: Fully Sharded Data Parallel

    • FSDP > 0: use FSDP with sharding group size = #GPUs / (TP × PP)

    • FSDP = 0: use DDP (Distributed Data Parallel)

  • TP: Tensor Parallel Size

  • PP: Pipeline Parallel Size

  • CP: Context Parallel Size

  • VP: Virtual Pipeline Parallel Size

  • EP: Expert Parallel Size

  • GA: Number of Gradient Accumulations
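The quantities above are interrelated; the following is a minimal sketch of that bookkeeping (illustrative Python, not part of the Megatron Bridge API), using the DGX-H100 LLAMA3_70B row from the tables below as an example:

```python
# Illustrative relationships between the configuration quantities above
# (these helper functions are not Megatron Bridge APIs).

def data_parallel_size(num_gpus: int, tp: int, pp: int, cp: int) -> int:
    """DP = #GPUs / (TP * PP * CP); the model-parallel sizes must divide the GPU count."""
    assert num_gpus % (tp * pp * cp) == 0
    return num_gpus // (tp * pp * cp)

def gradient_accumulations(gbs: int, mbs: int, dp: int) -> int:
    """GA = GBS / (MBS * DP); the global batch must split evenly across data-parallel replicas."""
    assert gbs % (mbs * dp) == 0
    return gbs // (mbs * dp)

# Example: DGX-H100 LLAMA3_70B row (64 GPUs, TP=4, PP=8, CP=1, GBS=256, MBS=1)
dp = data_parallel_size(64, tp=4, pp=8, cp=1)   # -> 2
ga = gradient_accumulations(256, mbs=1, dp=dp)  # -> 128
```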

Performance Metrics#

Performance is measured using:

  • Tokens/sec/GPU: Throughput per GPU

  • Model TFLOP/sec/GPU: Model floating-point operations (in teraFLOPs) per second per GPU
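The two metrics are linked through the model's FLOPs per token. A hedged sketch of the conversion (the function name and the 450 GFLOP/token figure are illustrative assumptions, not values from the tables):

```python
# Sketch (not a Megatron Bridge utility): converting a measured per-GPU token
# throughput into Model TFLOP/sec/GPU, given the model's FLOPs per token.

def model_tflops_per_sec_per_gpu(tokens_per_sec_per_gpu: float,
                                 flops_per_token: float) -> float:
    """Model TFLOP/sec/GPU = tokens/sec/GPU * model FLOPs per token / 1e12."""
    return tokens_per_sec_per_gpu * flops_per_token / 1e12

# Example with a hypothetical 450 GFLOP/token model at 5000 tokens/sec/GPU:
print(model_tflops_per_sec_per_gpu(5000, 450e9))  # 2250.0
```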

Performance Summary for Large Language Models#

Below are performance benchmarks for various large language models. These results were obtained using the performance recipes available here.

The performance data includes:

  • Pre-training Performance: Throughput metrics for various model sizes and architectures

  • System Configurations: Results across different GPU systems (DGX-GB300, DGX-GB200, DGX-B300, DGX-B200, DGX-H100)

  • Precision Options: Performance comparisons between different precision modes (BF16, FP8, MXFP8)


26.02.01 NeMo Container#

Pre-Training Performance#

Model: LLAMA3_70B#

| System | #-GPUs | Precision | GBS | MBS | Sequence Length | FSDP | TP | PP | CP | VP | EP | Tokens/sec/GPU | Model TFLOP/sec/GPU |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DGX-GB300 | 64 | NVFP4 | 256 | 2 | 8192 | 0 | 1 | 1 | 1 | n/a | n/a | 7002 | 3147 |
| DGX-GB200 | 64 | NVFP4 | 256 | 1 | 8192 | 0 | 2 | 4 | 1 | 5 | n/a | 4557 | 2047 |
| DGX-GB300 | 64 | MXFP8 | 256 | 2 | 8192 | 0 | 1 | 4 | 1 | n/a | n/a | 4798 | 2157 |
| DGX-GB200 | 64 | MXFP8 | 256 | 1 | 8192 | 0 | 2 | 4 | 1 | 5 | n/a | 3837 | 1724 |
| DGX-GB300 | 64 | FP8 | 256 | 2 | 8192 | 64 | 1 | 1 | 1 | n/a | n/a | 5243 | 2353 |
| DGX-GB200 | 64 | FP8 | 256 | 2 | 8192 | 64 | 1 | 1 | 1 | n/a | n/a | 4357 | 1956 |
| DGX-H100 | 64 | FP8 | 256 | 1 | 8192 | 0 | 4 | 8 | 1 | 5 | n/a | 1639 | 736 |

Model: LLAMA3.1_405B#

| System | #-GPUs | Precision | GBS | MBS | Sequence Length | FSDP | TP | PP | CP | VP | EP | Tokens/sec/GPU | Model TFLOP/sec/GPU |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DGX-GB300 | 256 | NVFP4 | 1536 | 1 | 8192 | 0 | 4 | 8 | 1 | 4 | n/a | 1358 | 3428 |
| DGX-GB200 | 256 | NVFP4 | 1536 | 1 | 8192 | 0 | 4 | 16 | 1 | 4 | n/a | 1083 | 2734 |
| DGX-GB300 | 256 | MXFP8 | 1536 | 1 | 8192 | 0 | 2 | 8 | 2 | 4 | n/a | 949 | 2394 |
| DGX-GB200 | 256 | MXFP8 | 1536 | 1 | 8192 | 0 | 4 | 16 | 1 | 8 | n/a | 775 | 1957 |
| DGX-GB300 | 256 | FP8 | 1536 | 1 | 8192 | 0 | 2 | 8 | 2 | 4 | n/a | 1024 | 2585 |
| DGX-GB200 | 256 | FP8 | 1536 | 1 | 8192 | 0 | 4 | 16 | 1 | 4 | n/a | 818 | 2063 |

Model: DeepSeekV3#

| System | #-GPUs | Precision | GBS | MBS | Sequence Length | FSDP | TP | PP | CP | VP | EP | Tokens/sec/GPU | Model TFLOP/sec/GPU |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DGX-GB300 | 256 | MXFP8 | 4096 | 2 | 4096 | 0 | 1 | 2 | 1 | 8 | 32 | 4691 | 1219 |
| DGX-GB200 | 256 | MXFP8 | 4096 | 1 | 4096 | 0 | 1 | 4 | 1 | 4 | 64 | 4021 | 1046 |
| DGX-B300 | 256 | MXFP8 | 4096 | 1 | 4096 | 0 | 1 | 16 | 1 | n/a | 8 | 3099 | 806 |
| DGX-B200 | 256 | MXFP8 | 4096 | 1 | 4096 | 0 | 1 | 16 | 1 | n/a | 8 | 2790 | 725 |

Model: GPT OSS 120B#

| System | #-GPUs | Precision | GBS | MBS | Sequence Length | FSDP | TP | PP | CP | VP | EP | Tokens/sec/GPU | Model TFLOP/sec/GPU |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DGX-GB300 | 64 | BF16 | 1280 | 4 | 4096 | 0 | 1 | 1 | 1 | n/a | 64 | 19366 | 526 |
| DGX-GB200 | 64 | BF16 | 1280 | 4 | 4096 | 0 | 1 | 1 | 1 | n/a | 64 | 15754 | 428 |
| DGX-B300 | 64 | BF16 | 1280 | 4 | 4096 | 0 | 1 | 1 | 1 | n/a | 8 | 15031 | 412 |
| DGX-B200 | 64 | BF16 | 1280 | 4 | 4096 | 0 | 1 | 1 | 1 | n/a | 8 | 13722 | 373 |
| DGX-H100 | 64 | BF16 | 1280 | 1 | 4096 | 0 | 1 | 4 | 1 | n/a | 8 | 5984 | 163 |

Model: Qwen3_30B_a3B#

| System | #-GPUs | Precision | GBS | MBS | Sequence Length | FSDP | TP | PP | CP | VP | EP | Tokens/sec/GPU | Model TFLOP/sec/GPU |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DGX-GB300 | 8 | MXFP8 | 512 | 8 | 4096 | 0 | 1 | 1 | 1 | n/a | 8 | 30411 | 700 |
| DGX-GB200 | 8 | MXFP8 | 512 | 4 | 4096 | 0 | 1 | 1 | 1 | n/a | 8 | 26373 | 607 |
| DGX-B300 | 8 | MXFP8 | 512 | 8 | 4096 | 0 | 1 | 1 | 1 | n/a | 8 | 29454 | 678 |
| DGX-B200 | 8 | MXFP8 | 512 | 4 | 4096 | 0 | 1 | 1 | 1 | n/a | 8 | 26695 | 614 |
| DGX-H100 | 16 | FP8 | 1024 | 1 | 4096 | 0 | 1 | 2 | 1 | 12 | 8 | 9058 | 208 |

Model: Qwen3_235B_a22B#

| System | #-GPUs | Precision | GBS | MBS | Sequence Length | FSDP | TP | PP | CP | VP | EP | Tokens/sec/GPU | Model TFLOP/sec/GPU |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DGX-GB300 | 256 | MXFP8 | 8192 | 2 | 4096 | 0 | 1 | 4 | 1 | n/a | 32 | 6583 | 974 |
| DGX-GB200 | 256 | MXFP8 | 8192 | 1 | 4096 | 0 | 1 | 8 | 1 | n/a | 32 | 5530 | 819 |
| DGX-B300 | 256 | MXFP8 | 8192 | 1 | 4096 | 0 | 1 | 8 | 1 | 4 | 8 | 2644 | 391 |
| DGX-H100 | 256 | FP8 | 8192 | 1 | 4096 | 0 | 2 | 8 | 1 | 4 | 32 | 1611 | 238 |

Model: Nemotron_3_Nano#

| System | #-GPUs | Precision | GBS | MBS | Sequence Length | FSDP | TP | PP | CP | VP | EP | Tokens/sec/GPU | Model TFLOP/sec/GPU |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DGX-GB300 | 8 | MXFP8 | 512 | 4 | 8192 | 0 | 1 | 1 | 1 | n/a | 8 | 37664 | 839 |
| DGX-GB200 | 8 | MXFP8 | 512 | 2 | 8192 | 0 | 1 | 1 | 1 | n/a | 8 | 33934 | 756 |
| DGX-B300 | 8 | MXFP8 | 512 | 4 | 8192 | 0 | 1 | 1 | 1 | n/a | 8 | 35861 | 798 |
| DGX-H100 | 16 | FP8 | 1024 | 1 | 8192 | 0 | 1 | 1 | 1 | n/a | 8 | 14890 | 331 |

Model: Kimi_K2#

| System | #-GPUs | Precision | GBS | MBS | Sequence Length | FSDP | TP | PP | CP | VP | EP | Tokens/sec/GPU | Model TFLOP/sec/GPU |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DGX-GB300 | 256 | MXFP8 | 4096 | 2 | 4096 | 0 | 1 | 4 | 1 | 4 | 64 | 5072 | 1037 |

  • Muon optimizer was used for pre-training Kimi-K2.

  • In MoE training benchmarks, we force-balance the token distribution among experts, and all benchmarks are token-dropless.
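For the MoE configurations above, one quick sanity check is that the expert-parallel size fits the available data parallelism. This sketch assumes the common Megatron-style convention that expert-parallel groups are formed within the data-parallel dimension; `ep_fits_in_dp` is an illustrative helper, not a Megatron Bridge function:

```python
# Illustrative check (not a Megatron Bridge API): under the assumption that
# expert-parallel groups are carved out of the data-parallel dimension,
# EP should divide DP = #GPUs / (TP * PP * CP).

def ep_fits_in_dp(num_gpus: int, tp: int, pp: int, cp: int, ep: int) -> bool:
    dp = num_gpus // (tp * pp * cp)
    return dp % ep == 0

# DeepSeekV3 DGX-GB300 row: 256 GPUs, TP=1, PP=2, CP=1, EP=32 -> DP=128
print(ep_fits_in_dp(256, tp=1, pp=2, cp=1, ep=32))  # True
```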

Archive#

Performance summary for past releases can be found in the archive.