Performance#
As part of the NVIDIA NeMo Framework, Megatron Bridge achieves high training throughput for advanced generative AI models by incorporating recent training techniques such as model parallelism and optimized attention mechanisms.
This page provides performance benchmarks for large language models using Megatron-Bridge across different GPU systems and configurations.
Nomenclature#
GBS: Global Batch Size
MBS: Micro Batch Size
FSDP: Fully Sharded Data Parallel
FSDP > 0: use FSDP with sharding group size = #GPUs / (TP × PP)
FSDP = 0: use DDP (Distributed Data Parallel)
TP: Tensor Parallel Size
PP: Pipeline Parallel Size
CP: Context Parallel Size
VP: Virtual Pipeline Parallel Size
EP: Expert Parallel Size
GA: Number of Gradient Accumulations
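The terms above are related by simple arithmetic, which the sketch below works through for one benchmark row. It assumes the common Megatron-style conventions that the data-parallel size is #GPUs / (TP × PP × CP) and that GA = GBS / (MBS × data-parallel size); neither formula is stated on this page, and the function name `derive_parallel_layout` is hypothetical. The FSDP sharding group size follows the definition given above.

```python
def derive_parallel_layout(num_gpus, gbs, mbs, tp=1, pp=1, cp=1):
    """Derive data-parallel size, gradient accumulation steps, and the
    FSDP sharding group size from the settings listed in the tables.

    Assumed conventions (not stated on this page):
      data-parallel size = num_gpus / (tp * pp * cp)
      GA                 = gbs / (mbs * data-parallel size)
    """
    model_parallel = tp * pp * cp
    if num_gpus % model_parallel != 0:
        raise ValueError("num_gpus must be divisible by TP * PP * CP")
    dp = num_gpus // model_parallel

    samples_per_pass = mbs * dp
    if gbs % samples_per_pass != 0:
        raise ValueError("GBS must be divisible by MBS * data-parallel size")
    ga = gbs // samples_per_pass

    # Sharding group size as defined in the nomenclature: #GPUs / (TP x PP).
    fsdp_group = num_gpus // (tp * pp)
    return {"dp": dp, "ga": ga, "fsdp_group": fsdp_group}


# Example: a 64-GPU run with GBS=256, MBS=1, TP=2, PP=4, CP=1.
layout = derive_parallel_layout(num_gpus=64, gbs=256, mbs=1, tp=2, pp=4, cp=1)
print(layout)
```

Under these assumptions the example yields a data-parallel size of 8 and 32 gradient accumulation steps per optimizer step.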
Performance Metrics#
Performance is measured using:
Tokens/sec/GPU: Throughput per GPU
Model TFLOP/sec/GPU: Model floating-point operations per second per GPU
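The two metrics can be related to more familiar quantities with a short calculation. The sketch below assumes that Tokens/sec/GPU is defined as GBS × sequence length / (step time × #GPUs), and that dividing Model TFLOP/sec/GPU by Tokens/sec/GPU gives the model FLOPs attributed to each token. Both are the usual conventions rather than definitions taken from this page, and the function names are hypothetical.

```python
def step_time_seconds(gbs, seq_len, tokens_per_sec_per_gpu, num_gpus):
    """Estimate wall-clock time of one training step (one global batch).

    Assumes tokens/sec/GPU = gbs * seq_len / (step_time * num_gpus).
    """
    tokens_per_step = gbs * seq_len
    cluster_tokens_per_sec = tokens_per_sec_per_gpu * num_gpus
    return tokens_per_step / cluster_tokens_per_sec


def implied_flops_per_token(model_tflops_per_gpu, tokens_per_sec_per_gpu):
    """Model FLOPs attributed to each token, implied by the two metrics."""
    return model_tflops_per_gpu * 1e12 / tokens_per_sec_per_gpu


# Example inputs: 64 GPUs, GBS=256, sequence length 8192,
# 7002 tokens/sec/GPU, 3147 model TFLOP/sec/GPU.
t = step_time_seconds(gbs=256, seq_len=8192,
                      tokens_per_sec_per_gpu=7002, num_gpus=64)
f = implied_flops_per_token(model_tflops_per_gpu=3147,
                            tokens_per_sec_per_gpu=7002)
print(f"step time: {t:.2f} s, FLOPs per token: {f:.3e}")
```

The implied FLOPs per token is a useful sanity check: for a dense model it should be of the same order as six times the parameter count, plus the attention cost at the given sequence length.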
Performance Summary for Large Language Models#
Below are performance benchmarks for various large language models. These results were obtained using performance recipes available here.
The performance data includes:
Pre-training Performance: Throughput metrics for various model sizes and architectures
System Configurations: Results across different GPU systems (DGX-GB300, DGX-GB200, DGX-B300, DGX-B200, DGX-H100)
Precision Options: Performance comparisons between different precision modes (BF16, FP8, MXFP8, NVFP4)
26.02.01 NeMo Container#
Pre-Training Performance#
Model: LLAMA3_70B#
| System | #-GPUs | Precision | GBS | MBS | Sequence Length | FSDP | TP | PP | CP | VP | EP | Tokens / sec / GPU | Model TFLOP / sec / GPU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DGX-GB300 | 64 | NVFP4 | 256 | 2 | 8192 | 0 | 1 | 1 | 1 | n/a | n/a | 7002 | 3147 |
| DGX-GB200 | 64 | NVFP4 | 256 | 1 | 8192 | 0 | 2 | 4 | 1 | 5 | n/a | 4557 | 2047 |
| DGX-GB300 | 64 | MXFP8 | 256 | 2 | 8192 | 0 | 1 | 4 | 1 | n/a | n/a | 4798 | 2157 |
| DGX-GB200 | 64 | MXFP8 | 256 | 1 | 8192 | 0 | 2 | 4 | 1 | 5 | n/a | 3837 | 1724 |
| DGX-GB300 | 64 | FP8 | 256 | 2 | 8192 | 64 | 1 | 1 | 1 | n/a | n/a | 5243 | 2353 |
| DGX-GB200 | 64 | FP8 | 256 | 2 | 8192 | 64 | 1 | 1 | 1 | n/a | n/a | 4357 | 1956 |
| DGX-H100 | 64 | FP8 | 256 | 1 | 8192 | 0 | 4 | 8 | 1 | 5 | n/a | 1639 | 736 |
Model: LLAMA3.1_405B#
| System | #-GPUs | Precision | GBS | MBS | Sequence Length | FSDP | TP | PP | CP | VP | EP | Tokens / sec / GPU | Model TFLOP / sec / GPU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DGX-GB300 | 256 | NVFP4 | 1536 | 1 | 8192 | 0 | 4 | 8 | 1 | 4 | n/a | 1358 | 3428 |
| DGX-GB200 | 256 | NVFP4 | 1536 | 1 | 8192 | 0 | 4 | 16 | 1 | 4 | n/a | 1083 | 2734 |
| DGX-GB300 | 256 | MXFP8 | 1536 | 1 | 8192 | 0 | 2 | 8 | 2 | 4 | n/a | 949 | 2394 |
| DGX-GB200 | 256 | MXFP8 | 1536 | 1 | 8192 | 0 | 4 | 16 | 1 | 8 | n/a | 775 | 1957 |
| DGX-GB300 | 256 | FP8 | 1536 | 1 | 8192 | 0 | 2 | 8 | 2 | 4 | n/a | 1024 | 2585 |
| DGX-GB200 | 256 | FP8 | 1536 | 1 | 8192 | 0 | 4 | 16 | 1 | 4 | n/a | 818 | 2063 |
Model: DeepSeekV3#
| System | #-GPUs | Precision | GBS | MBS | Sequence Length | FSDP | TP | PP | CP | VP | EP | Tokens / sec / GPU | Model TFLOP / sec / GPU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DGX-GB300 | 256 | MXFP8 | 4096 | 2 | 4096 | 0 | 1 | 2 | 1 | 8 | 32 | 4691 | 1219 |
| DGX-GB200 | 256 | MXFP8 | 4096 | 1 | 4096 | 0 | 1 | 4 | 1 | 4 | 64 | 4021 | 1046 |
| DGX-B300 | 256 | MXFP8 | 4096 | 1 | 4096 | 0 | 1 | 16 | 1 | n/a | 8 | 3099 | 806 |
| DGX-B200 | 256 | MXFP8 | 4096 | 1 | 4096 | 0 | 1 | 16 | 1 | n/a | 8 | 2790 | 725 |
Model: GPT OSS 120B#
| System | #-GPUs | Precision | GBS | MBS | Sequence Length | FSDP | TP | PP | CP | VP | EP | Tokens / sec / GPU | Model TFLOP / sec / GPU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DGX-GB300 | 64 | BF16 | 1280 | 4 | 4096 | 0 | 1 | 1 | 1 | n/a | 64 | 19366 | 526 |
| DGX-GB200 | 64 | BF16 | 1280 | 4 | 4096 | 0 | 1 | 1 | 1 | n/a | 64 | 15754 | 428 |
| DGX-B300 | 64 | BF16 | 1280 | 4 | 4096 | 0 | 1 | 1 | 1 | n/a | 8 | 15031 | 412 |
| DGX-B200 | 64 | BF16 | 1280 | 4 | 4096 | 0 | 1 | 1 | 1 | n/a | 8 | 13722 | 373 |
| DGX-H100 | 64 | BF16 | 1280 | 1 | 4096 | 0 | 1 | 4 | 1 | n/a | 8 | 5984 | 163 |
Model: Qwen3_30B_a3B#
| System | #-GPUs | Precision | GBS | MBS | Sequence Length | FSDP | TP | PP | CP | VP | EP | Tokens / sec / GPU | Model TFLOP / sec / GPU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DGX-GB300 | 8 | MXFP8 | 512 | 8 | 4096 | 0 | 1 | 1 | 1 | n/a | 8 | 30411 | 700 |
| DGX-GB200 | 8 | MXFP8 | 512 | 4 | 4096 | 0 | 1 | 1 | 1 | n/a | 8 | 26373 | 607 |
| DGX-B300 | 8 | MXFP8 | 512 | 8 | 4096 | 0 | 1 | 1 | 1 | n/a | 8 | 29454 | 678 |
| DGX-B200 | 8 | MXFP8 | 512 | 4 | 4096 | 0 | 1 | 1 | 1 | n/a | 8 | 26695 | 614 |
| DGX-H100 | 16 | FP8 | 1024 | 1 | 4096 | 0 | 1 | 2 | 1 | 12 | 8 | 9058 | 208 |
Model: Qwen3_235B_a22B#
| System | #-GPUs | Precision | GBS | MBS | Sequence Length | FSDP | TP | PP | CP | VP | EP | Tokens / sec / GPU | Model TFLOP / sec / GPU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DGX-GB300 | 256 | MXFP8 | 8192 | 2 | 4096 | 0 | 1 | 4 | 1 | n/a | 32 | 6583 | 974 |
| DGX-GB200 | 256 | MXFP8 | 8192 | 1 | 4096 | 0 | 1 | 8 | 1 | n/a | 32 | 5530 | 819 |
| DGX-B300 | 256 | MXFP8 | 8192 | 1 | 4096 | 0 | 1 | 8 | 1 | 4 | 8 | 2644 | 391 |
| DGX-H100 | 256 | FP8 | 8192 | 1 | 4096 | 0 | 2 | 8 | 1 | 4 | 32 | 1611 | 238 |
Model: Nemotron_3_Nano#
| System | #-GPUs | Precision | GBS | MBS | Sequence Length | FSDP | TP | PP | CP | VP | EP | Tokens / sec / GPU | Model TFLOP / sec / GPU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DGX-GB300 | 8 | MXFP8 | 512 | 4 | 8192 | 0 | 1 | 1 | 1 | n/a | 8 | 37664 | 839 |
| DGX-GB200 | 8 | MXFP8 | 512 | 2 | 8192 | 0 | 1 | 1 | 1 | n/a | 8 | 33934 | 756 |
| DGX-B300 | 8 | MXFP8 | 512 | 4 | 8192 | 0 | 1 | 1 | 1 | n/a | 8 | 35861 | 798 |
| DGX-H100 | 16 | FP8 | 1024 | 1 | 8192 | 0 | 1 | 1 | 1 | n/a | 8 | 14890 | 331 |
Model: Kimi_K2#
| System | #-GPUs | Precision | GBS | MBS | Sequence Length | FSDP | TP | PP | CP | VP | EP | Tokens / sec / GPU | Model TFLOP / sec / GPU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DGX-GB300 | 256 | MXFP8 | 4096 | 2 | 4096 | 0 | 1 | 4 | 1 | 4 | 64 | 5072 | 1037 |
The Muon optimizer was used for pre-training Kimi-K2.
In MoE training benchmarks, we force-balance the token distribution among experts, and all benchmarks are token-dropless.
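To illustrate what a force-balanced, token-dropless assignment means, the sketch below replaces a learned router with a round-robin rule so that every expert receives the same number of token slots and no token is discarded. This is a toy illustration only, not the mechanism used by Megatron Bridge or the benchmark recipes; the function names `force_balanced_assignment` and `expert_load` are hypothetical.

```python
from collections import Counter


def force_balanced_assignment(num_tokens, num_experts, top_k):
    """Assign each token to top_k distinct experts in round-robin order.

    Every token keeps all top_k slots (nothing is dropped), and the load
    per expert differs by at most one slot across experts.
    """
    if top_k > num_experts:
        raise ValueError("top_k cannot exceed num_experts")
    assignment = []
    slot = 0  # running index over all (token, k) slots
    for _ in range(num_tokens):
        # Consecutive slots map to consecutive experts, wrapping around.
        experts = [(slot + k) % num_experts for k in range(top_k)]
        assignment.append(experts)
        slot += top_k
    return assignment


def expert_load(assignment):
    """Count how many token slots each expert receives."""
    return Counter(e for experts in assignment for e in experts)


# Toy example: 16 tokens, 8 experts, 2 experts per token.
assignment = force_balanced_assignment(num_tokens=16, num_experts=8, top_k=2)
print(expert_load(assignment))
```

With 16 tokens, 8 experts, and 2 experts per token, each expert receives exactly 4 of the 32 slots. Balancing the load this way removes routing skew as a source of run-to-run variation in throughput measurements.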
Archive#
Performance summary for past releases can be found in the archive.