# Performance
The NVIDIA NeMo Framework accelerates the entire AI workflow end to end, from data preparation to model training to inference. For training, it achieves high throughput on advanced generative AI models by incorporating recent optimizations such as model parallelism and optimized attention mechanisms. For inference, the NeMo Framework provides a path that leverages TensorRT-LLM, a specialized library for accelerating and optimizing LLM inference on NVIDIA GPUs.
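As an illustration of that inference path, the sketch below exports a trained NeMo checkpoint to a TensorRT-LLM engine with the `nemo.export` module and runs a test generation. The checkpoint path and engine directory are placeholders, and keyword-argument names can vary between NeMo releases, so treat this as a sketch rather than a drop-in command.

```python
from nemo.export.tensorrt_llm import TensorRTLLM

# Directory where the compiled TensorRT-LLM engine will be written (placeholder path).
exporter = TensorRTLLM(model_dir="/tmp/llama3_trt_llm_engine")

# Build a TensorRT-LLM engine from a trained NeMo checkpoint (placeholder path;
# argument names may differ slightly across NeMo releases).
exporter.export(
    nemo_checkpoint_path="/ckpts/llama3-8b.nemo",
    model_type="llama",
)

# Run a quick generation to confirm the engine works.
print(exporter.forward(["NeMo Framework accelerates"]))
```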
## Performance Summary for Large Language Models
Below are performance benchmarks for various large language models. These results were obtained using a version of the performance recipes available here.
Abbreviations:

- MBS: Micro Batch Size
- GBS: Global Batch Size
- TP: Tensor Parallel Size
- PP: Pipeline Parallel Size
- CP: Context Parallel Size
- VP: Virtual Pipeline Parallel Size
- EP: Expert Parallel Size
- GA: Number of Gradient Accumulations
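In the Megatron-style decomposition NeMo uses, these values are tied together: the data-parallel size is #-GPUs / (TP × PP × CP), and GBS = DP × MBS × GA. The short sketch below (an illustrative helper, not part of the NeMo API) recovers the GA column of a table row from the other columns.

```python
from dataclasses import dataclass


@dataclass
class ParallelConfig:
    """One row of the benchmark tables (illustrative helper, not a NeMo API)."""
    num_gpus: int
    gbs: int     # global batch size
    mbs: int     # micro batch size
    tp: int      # tensor parallel size
    pp: int      # pipeline parallel size
    cp: int = 1  # context parallel size


def gradient_accumulation_steps(cfg: ParallelConfig) -> int:
    # Data-parallel replicas left after tensor, pipeline, and context parallelism.
    dp = cfg.num_gpus // (cfg.tp * cfg.pp * cfg.cp)
    # GBS = DP * MBS * GA  =>  GA = GBS / (DP * MBS)
    assert cfg.gbs % (dp * cfg.mbs) == 0, "GBS must be divisible by DP * MBS"
    return cfg.gbs // (dp * cfg.mbs)


# LLAMA3-70B pre-training row on DGX-B200: 64 GPUs, GBS 128, MBS 1, TP 2, PP 4, CP 2.
print(gradient_accumulation_steps(ParallelConfig(64, 128, 1, 2, 4, 2)))  # -> 32, matching the GA column
```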
### Pretraining
The tables below show the pre-training performance of various models at FP8 precision (using NeMo 2.0).
Container: NeMo 25.02
System: DGX-B200
| Model | #-GPUs | GBS | MBS | Sequence Length | TP | PP | CP | VP | EP | GA | Tokens / sec / GPU | Model TFLOP / sec / GPU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT3-175B | 512 | 2048 | 2 | 2048 | 4 | 4 | 1 | 6 | 1 | 32 | 1600 | |
| LLAMA3-8B | 8 | 128 | 2 | 8192 | 1 | 1 | 1 | 1 | 1 | 8 | 26006 | 1506 |
| LLAMA3-70B | 64 | 128 | 1 | 8192 | 2 | 4 | 2 | 5 | 1 | 32 | 3062 | 1474 |
| LLAMA3-405B | 128 | 64 | 1 | 8192 | 4 | 8 | 2 | 8 | 1 | 32 | 625 | 1658 |
| Nemotron-15B | 64 | 256 | 2 | 4096 | 1 | 1 | 1 | 1 | 1 | 2 | 14760 | 1387 |
| Nemotron-340B | 128 | 32 | 1 | 4096 | 8 | 4 | 1 | 12 | 1 | 8 | 602 | 1268 |
| Mixtral-8x7B | 64 | 256 | 2 | 4096 | 1 | 1 | 1 | 1 | 8 | 2 | 15457 | 1282 |
| Mixtral-8x22B | 256 | 64 | 1 | 65536 | 2 | 4 | 8 | 14 | 8 | 16 | 2232 | 824 |
System: DGX-H100
| Model | #-GPUs | GBS | MBS | Sequence Length | TP | PP | CP | VP | EP | GA | Tokens / sec / GPU | Model TFLOP / sec / GPU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT3-175B | 512 | 2048 | 2 | 2048 | 4 | 8 | 1 | 6 | 1 | 64 | 866 | |
| LLAMA3-8B | 8 | 128 | 1 | 8192 | 1 | 1 | 2 | 1 | 1 | 32 | 14201 | 822 |
| LLAMA3-70B | 64 | 128 | 1 | 8192 | 4 | 8 | 1 | 5 | 1 | 64 | 1662 | 800 |
| LLAMA3-405B | 1024 | 512 | 1 | 8192 | 8 | 8 | 2 | 8 | 1 | 64 | 327 | 866 |
| Nemotron-15B | 64 | 256 | 2 | 4096 | 2 | 1 | 1 | 1 | 1 | 4 | 8233 | 774 |
| Nemotron-340B | 256 | 64 | 1 | 4096 | 8 | 8 | 1 | 12 | 1 | 16 | 346 | 728 |
| Mixtral-8x7B | 64 | 256 | 1 | 4096 | 1 | 4 | 1 | 8 | 8 | 16 | 8233 | 683 |
| Mixtral-8x22B | 256 | 256 | 1 | 65536 | 4 | 4 | 8 | 14 | 8 | 32 | 1278 | 471 |
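The configurations above map onto the NeMo 2.0 recipe API. The sketch below assumes the `nemo.collections.llm` recipe collection and `nemo_run` are installed and shows how one might override the parallelism and batch settings to approximate the single-node DGX-B200 LLAMA3-8B row; directory and run names are placeholders, and the FP8 and other tuning details of the official performance recipes are not reproduced here.

```python
import nemo_run as run
from nemo.collections import llm

# Start from the stock LLAMA3-8B pre-training recipe (NeMo 2.0).
recipe = llm.llama3_8b.pretrain_recipe(
    dir="/checkpoints/llama3_8b",   # placeholder output directory
    name="llama3_8b_perf_sketch",   # placeholder run name
    num_nodes=1,
    num_gpus_per_node=8,
)

# Parallelism knobs from the DGX-B200 LLAMA3-8B row: TP=1, PP=1, CP=1 (VP and EP left at 1).
recipe.trainer.strategy.tensor_model_parallel_size = 1
recipe.trainer.strategy.pipeline_model_parallel_size = 1
recipe.trainer.strategy.context_parallel_size = 1

# Batch and sequence settings from the same row: GBS=128, MBS=2, sequence length 8192
# (the model config's sequence length should match the data module's).
recipe.data.global_batch_size = 128
recipe.data.micro_batch_size = 2
recipe.data.seq_length = 8192

# Launch locally with torchrun; cluster executors (e.g. Slurm) follow the same pattern.
if __name__ == "__main__":
    run.run(recipe, executor=run.LocalExecutor(ntasks_per_node=8, launcher="torchrun"))
```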
### Fine-Tuning
The tables below present the fine-tuning performance of LLaMA3 models using Supervised Fine-Tuning (SFT) and Low-Rank Adaptation (LoRA) at FP8 precision (using NeMo 2.0).
Container: NeMo 25.02
For fine-tuning, we use the SQuAD-v1.1 dataset, with inputs packed to the sequence lengths shown in the Packed Sequence Length column.
System: DGX-B200
| Model | Task | #-GPUs | GBS | MBS | Packed Sequence Length | TP | PP | VP | GA | Tokens / sec / GPU | Model TFLOP / sec / GPU |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LLAMA3-8B | SFT | 8 | 8 | 1 | 16384 | 1 | 1 | 1 | 1 | 31508 | 1428 |
| LLAMA3-70B | SFT | 32 | 32 | 1 | 4096 | 2 | 4 | 5 | 8 | 3357 | 1400 |
| LLAMA3-8B | LoRA | 8 | 8 | 1 | 16384 | 1 | 1 | 1 | 1 | 43116 | 1307 |
| LLAMA3-70B | LoRA | 8 | 32 | 1 | 4096 | 1 | 4 | 20 | 16 | 5669 | 1579 |
| LLAMA3-405B | LoRA | 32 | 32 | 1 | 2048 | 4 | 4 | 4 | 16 | 759 | 1231 |
System: DGX-H100
| Model | Task | #-GPUs | GBS | MBS | Packed Sequence Length | TP | PP | VP | GA | Tokens / sec / GPU | Model TFLOP / sec / GPU |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LLAMA3-8B | SFT | 8 | 32 | 1 | 4096 | 1 | 1 | 1 | 4 | 17246 | 779 |
| LLAMA3-70B | SFT | 32 | 32 | 1 | 4096 | 4 | 4 | 5 | 16 | 1789 | 746 |
| LLAMA3-8B | LoRA | 8 | 32 | 1 | 4096 | 1 | 1 | 1 | 4 | 23406 | 707 |
| LLAMA3-70B | LoRA | 8 | 32 | 1 | 4096 | 2 | 4 | 20 | 32 | 2768 | 771 |
| LLAMA3-405B | LoRA | 32 | 32 | 1 | 2048 | 4 | 8 | 8 | 32 | 521 | 846 |
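As with pre-training, the fine-tuning rows correspond to NeMo 2.0 fine-tuning recipes. The sketch below assumes the `llm.llama3_8b.finetune_recipe` entry point and its `peft_scheme` argument; directory and run names are placeholders, and the packed-sequence and FP8 settings of the official performance recipes are not reproduced here.

```python
import nemo_run as run
from nemo.collections import llm

# LoRA fine-tuning recipe for LLAMA3-8B (NeMo 2.0); peft_scheme="lora" adds LoRA adapters,
# while the SFT rows correspond to full-parameter fine-tuning.
recipe = llm.llama3_8b.finetune_recipe(
    dir="/checkpoints/llama3_8b_lora",  # placeholder output directory
    name="llama3_8b_lora_sketch",       # placeholder run name
    num_nodes=1,
    num_gpus_per_node=8,
    peft_scheme="lora",
)

# Batch settings from the DGX-B200 LLAMA3-8B LoRA row: GBS=8, MBS=1 (TP=PP=1 on a single node).
recipe.data.global_batch_size = 8
recipe.data.micro_batch_size = 1

# Launch locally with torchrun.
if __name__ == "__main__":
    run.run(recipe, executor=run.LocalExecutor(ntasks_per_node=8, launcher="torchrun"))
```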