# Performance
The NVIDIA NeMo Framework accelerates the AI workflow end to end, from data preparation to model training to inference. It achieves high training throughput for advanced generative AI models by incorporating the latest training techniques, such as model parallelism and optimized attention mechanisms. For inference, the NeMo Framework provides a path that leverages TensorRT-LLM, a specialized library for accelerating and optimizing LLM inference on NVIDIA GPUs.
## Performance Summary for Large Language Models
Below are performance benchmarks for various large language models. These results were obtained using a version of the performance recipes available here.
Abbreviations:

- GBS: Global Batch Size
- MBS: Micro Batch Size
- FSDP: Fully Sharded Data Parallel (FSDP = 1: use FSDP; FSDP = 0: use DDP, Distributed Data Parallel)
- TP: Tensor Parallel Size
- PP: Pipeline Parallel Size
- CP: Context Parallel Size
- VP: Virtual Pipeline Parallel Size
- EP: Expert Parallel Size
- GA: Number of Gradient Accumulations
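These quantities are not independent: the data-parallel size is DP = #-GPUs / (TP × PP × CP), and the number of gradient accumulation steps follows as GA = GBS / (MBS × DP). As an illustrative check (not part of the recipes themselves), the snippet below reproduces the GA values for the LLAMA3-70B row of the DGX-GB200 table:

```python
# Illustrative check of the GBS / MBS / GA relationship; not NeMo code.
def gradient_accumulation_steps(gbs: int, mbs: int, num_gpus: int, tp: int, pp: int, cp: int) -> int:
    dp = num_gpus // (tp * pp * cp)       # data-parallel size
    assert gbs % (mbs * dp) == 0, "GBS must be divisible by MBS * DP"
    return gbs // (mbs * dp)              # micro-batches accumulated per optimizer step

# LLAMA3-70B on DGX-GB200, FP8 configuration: 64 GPUs, TP = PP = CP = 1 -> DP = 64 -> GA = 2
print(gradient_accumulation_steps(gbs=128, mbs=1, num_gpus=64, tp=1, pp=1, cp=1))   # 2
# Same row, MXFP8 configuration: TP = 2, PP = 4, CP = 2 -> DP = 4 -> GA = 32
print(gradient_accumulation_steps(gbs=128, mbs=1, num_gpus=64, tp=2, pp=4, cp=2))   # 32
```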
### Pre-Training Performance
The tables below summarize the pre-training performance of various models at FP8 precision. For both pre-training and fine-tuning, we apply per-tensor FP8 quantization with scaling factors computed in the current step.
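With current-step scaling, each tensor gets a single FP8 scale factor derived from that tensor's absolute maximum in the current step, rather than from a history of earlier steps. The NumPy sketch below illustrates the idea only; it is not the Transformer Engine implementation, and the FP8 E4M3 maximum of 448 is the one assumed constant:

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest representable magnitude in FP8 E4M3

def quantize_per_tensor_current_scaling(x: np.ndarray):
    """Simplified per-tensor FP8 quantization with current (per-step) scaling."""
    amax = np.abs(x).max()                   # amax of this tensor in the current step
    scale = FP8_E4M3_MAX / max(amax, 1e-12)  # one scale for the whole tensor
    x_scaled = np.clip(x * scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)  # values would then be cast to FP8
    return x_scaled, scale                   # dequantize with x_scaled / scale

x = np.random.randn(4096, 4096).astype(np.float32)
x_q, s = quantize_per_tensor_current_scaling(x)
print(float(np.abs(x - x_q / s).max()))  # tiny here; a real FP8 cast adds rounding error
```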
Container: NeMo 25.07
System: DGX-GB200
| Model | #-GPUs | GBS | MBS | Sequence Length | FSDP | TP | PP | CP | VP | EP | GA | Tokens / sec / GPU | Model TFLOP / sec / GPU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLAMA3-8B | 8 | 128 | 2 | 8192 | 0 | 1 | 1 | 1 | 1 | 1 | 8 | 31737 (30341) | 1838 (1757) |
| LLAMA3-70B | 64 | 128 | 1 | 8192 | 1 (0) | 1 (2) | 1 (4) | 1 (2) | 1 (5) | 1 | 2 (32) | 4066 (2605) | 1957 (1254) |
| LLAMA3.1-405B | 128 | 64 | 1 | 8192 | 1 (0) | 2 (4) | 1 (8) | 1 (2) | 1 (8) | 1 | 1 (32) | 706 (638) | 1854 (1675) |
| LLAMA4-Scout-LLM | 64 | 1024 | 1 | 8192 | 0 | 1 | 1 | 1 | 1 | 16 | 16 | 13783 (10717) | 1501 (1167) |
| LLAMA4-Maverick-LLM | 128 | 1024 | 1 | 8192 | 0 | 1 | 2 | 1 | 12 | 64 | 16 | 11682 (12025) | 1272 (1309) |
| Mixtral-8x7B | 64 | 256 | 2 | 4096 | 0 | 1 | 1 | 1 | 1 | 8 | 2 | 17246 (16549) | 1430 (1372) |
| Mixtral-8x22B | 256 | 64 | 1 | 65536 | 0 | 2 | 4 | 8 | 14 | 8 | 16 | 2869 (3097) | 1059 (1143) |
| Nemotron5-H-56B | 64 | 192 | 1 | 8192 | 0 | 2 | 1 | 1 | 1 | 1 | 6 | 4690 | 1988 |
| DeepSeekV3 | 256 | 2048 | 1 | 4096 | 0 | 2 | 4 | 1 | 1 | 64 | 64 | 2327 (2312) | 606 (601) |
The numbers in parentheses represent the configuration and performance for MXFP8, which uses 1x32 block quantization for both activations and weights.
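In contrast to per-tensor scaling, 1x32 block quantization assigns one scale to each contiguous block of 32 values, so an outlier only affects the 32 values that share its block. The sketch below is a simplified NumPy illustration, not the Transformer Engine MXFP8 kernel; in particular, real MXFP8 restricts the shared scales to powers of two (E8M0), which is omitted here:

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest representable magnitude in FP8 E4M3
BLOCK = 32            # MXFP8 block size along the inner dimension

def quantize_1x32_blocks(x: np.ndarray):
    """Simplified 1x32 block quantization: one scale per 32 contiguous values."""
    rows, cols = x.shape
    assert cols % BLOCK == 0
    blocks = x.reshape(rows, cols // BLOCK, BLOCK)
    amax = np.abs(blocks).max(axis=-1, keepdims=True)                # per-block amax
    scale = FP8_E4M3_MAX / np.maximum(amax, 1e-12)                   # per-block scale
    x_scaled = np.clip(blocks * scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)  # values would then be cast to FP8
    return x_scaled.reshape(rows, cols), scale                       # dequantize with / scale

x = np.random.randn(8, 128).astype(np.float32)
x_q, scales = quantize_1x32_blocks(x)
print(scales.shape)  # (8, 4, 1): one scale per row per block of 32
```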
System: DGX-B200
| Model | #-GPUs | GBS | MBS | Sequence Length | FSDP | TP | PP | CP | VP | EP | GA | Tokens / sec / GPU | Model TFLOP / sec / GPU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLAMA3-8B | 8 | 128 | 2 | 8192 | 0 | 1 | 1 | 1 | 1 | 1 | 8 | 30131 (28934) | 1745 (1676) |
| LLAMA3-70B | 64 | 128 | 1 | 8192 | 1 (0) | 1 (2) | 1 (4) | 1 (2) | 1 (5) | 1 | 2 (32) | 3690 (3133) | 1777 (1508) |
| LLAMA3.1-405B | 128 | 64 | 1 | 8192 | 0 | 4 | 8 | 2 | 8 | 1 | 32 | 674 (624) | 1769 (1639) |
| LLAMA4-Scout-LLM | 64 | 1024 | 1 | 8192 | 0 | 1 | 2 | 1 | 24 | 8 | 32 | 11260 (10806) | 1226 (1177) |
| LLAMA4-Maverick-LLM | 128 | 1024 | 1 | 8192 | 0 | 1 | 2 | 1 | 12 | 64 | 16 | 9811 (9870) | 1068 (1075) |
| Mixtral-8x7B | 64 | 256 | 2 | 4096 | 0 | 1 | 1 | 1 | 1 | 8 | 2 | 16384 (15170) | 1359 (1258) |
| Mixtral-8x22B | 256 | 64 | 1 | 65536 | 0 | 2 | 4 | 8 | 14 | 8 | 16 | 2548 (2475) | 940 (913) |
| Nemotron5-H-56B | 64 | 192 | 2 | 8192 | 0 | 4 | 1 | 1 | 1 | 1 | 6 | 4123 | 1748 |
| DeepSeekV3 | 256 | 2048 | 1 | 4096 | 0 | 2 | 16 | 1 | 1 | 8 | 256 | 1640 (1577) | 426 (410) |
The numbers in parentheses represent the configuration and performance for MXFP8, which uses 1x32 block quantization for both activations and weights.
System: DGX-H100
| Model | #-GPUs | GBS | MBS | Sequence Length | TP | PP | CP | VP | EP | GA | Tokens / sec / GPU | Model TFLOP / sec / GPU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLAMA3-8B | 8 | 128 | 1 | 8192 | 1 | 1 | 2 | 1 | 1 | 32 | 14340 | 830 |
| LLAMA3-70B | 64 | 128 | 1 | 8192 | 4 | 8 | 1 | 5 | 1 | 64 | 1663 | 801 |
| LLAMA3.1-405B | 1024 | 512 | 1 | 8192 | 8 | 8 | 2 | 8 | 1 | 64 | 320 | 840 |
| LLAMA4-Scout-LLM | 256 | 1024 | 1 | 8192 | 4 | 1 | 1 | 1 | 16 | 16 | 4239 | 462 |
| LLAMA4-Maverick-LLM | 512 | 1024 | 1 | 8192 | 4 | 1 | 1 | 1 | 128 | 8 | 4602 | 501 |
| Mixtral-8x7B | 64 | 256 | 1 | 4096 | 1 | 4 | 1 | 8 | 8 | 16 | 8275 | 686 |
| Mixtral-8x22B | 256 | 256 | 1 | 65536 | 4 | 4 | 8 | 14 | 8 | 32 | 1280 | 472 |
| Nemotron5-H-56B | 64 | 192 | 1 | 8192 | 8 | 1 | 1 | 1 | 1 | 24 | 1980 | 839 |
| DeepSeekV3 | 1024 | 8192 | 1 | 4096 | 2 | 16 | 1 | 2 | 64 | 256 | 887 (850) | 230 (225) |
The numbers in parentheses for DeepSeekV3 indicate the use of different quantization granularities: 128×128 for weights and 1×128 for activations, which match those used in the original DeepSeekV3 pre-training.
### Fine-Tuning Performance
The tables below highlight the fine-tuning performance of Llama3 models using Supervised Fine-Tuning (SFT) and Low-Rank Adaptation (LoRA) at FP8 precision with NeMo 2.0.
Container: NeMo 25.07
For fine-tuning, we use the SQuAD-v1.1 dataset, with inputs packed to the length shown in the Packed Sequence Length column of each table.
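Sequence packing concatenates multiple short examples into one fixed-length sequence so that little compute is wasted on padding. The sketch below is a minimal greedy packer for illustration only, assuming already-tokenized examples; it is not the NeMo data pipeline.

```python
# Minimal greedy sequence-packing sketch (illustrative only, not the NeMo pipeline).
from typing import List

def pack_sequences(examples: List[List[int]], max_len: int = 4096) -> List[List[int]]:
    """Greedily concatenate tokenized examples into packed sequences of at most max_len tokens."""
    packed: List[List[int]] = []
    current: List[int] = []
    for tokens in examples:
        if len(tokens) > max_len:
            tokens = tokens[:max_len]        # truncate oversized examples
        if len(current) + len(tokens) > max_len:
            packed.append(current)           # close the current packed sequence
            current = []
        current.extend(tokens)
    if current:
        packed.append(current)
    return packed

# Example: three short "documents" packed into sequences of at most 8 tokens.
print(pack_sequences([[1, 2, 3], [4, 5, 6, 7], [8, 9]], max_len=8))
# [[1, 2, 3, 4, 5, 6, 7], [8, 9]]
```

In practice, packed fine-tuning also records per-example boundaries so that attention and loss masks do not cross from one example into the next.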
System: DGX-GB200
| Model | Task | #-GPUs | GBS | MBS | Packed Sequence Length | TP | PP | VP | GA | Tokens / sec / GPU | Model TFLOP / sec / GPU |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LLAMA3-8B | SFT | 8 | 8 | 1 | 16384 | 1 | 1 | 1 | 1 | 34133 (32768) | 1547 (1485) |
| LLAMA3-70B | SFT | 32 | 32 | 1 | 4096 | 2 | 4 | 5 | 8 | 4267 (4096) | 1779 (1708) |
| LLAMA3-70B | LoRA | 8 | 64 | 1 | 2048 | 1 | 4 | 20 | 32 | 3633 (3257) | 1008 (903) |
The numbers in parentheses represent performance for MXFP8, which uses 1x32 block quantization for both activations and weights.
System: DGX-B200
| Model | Task | #-GPUs | GBS | MBS | Packed Sequence Length | TP | PP | VP | GA | Tokens / sec / GPU | Model TFLOP / sec / GPU |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LLAMA3-8B | SFT | 8 | 8 | 1 | 16384 | 1 | 1 | 1 | 1 | 32125 (31508) | 1456 (1428) |
| LLAMA3-70B | SFT | 32 | 32 | 1 | 4096 | 2 | 4 | 5 | 8 | 4180 (3864) | 1743 (1611) |
| LLAMA3-70B | LoRA | 8 | 32 | 1 | 4096 | 1 | 4 | 20 | 16 | 5729 (5729) | 1596 (1596) |
The numbers in parentheses represent performance for MXFP8, which uses 1x32 block quantization for both activations and weights.
System: DGX-H100
| Model | Task | #-GPUs | GBS | MBS | Packed Sequence Length | TP | PP | VP | GA | Tokens / sec / GPU | Model TFLOP / sec / GPU |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LLAMA3-8B | SFT | 8 | 32 | 1 | 4096 | 1 | 1 | 1 | 4 | 18618 | 841 |
| LLAMA3-70B | SFT | 32 | 32 | 1 | 4096 | 4 | 4 | 5 | 16 | 1862 | 776 |
| LLAMA3-70B | LoRA | 8 | 32 | 1 | 4096 | 2 | 4 | 20 | 32 | 2754 | 767 |