# Performance
The NVIDIA NeMo Framework accelerates the entire AI workflow end to end, from data preparation to model training to inference. For training, it achieves high throughput on advanced generative AI models by incorporating the latest techniques, such as model parallelism and optimized attention mechanisms. For inference, it provides an export path to TensorRT-LLM, a specialized library for accelerating and optimizing LLM inference on NVIDIA GPUs.
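As a quick orientation, the sketch below shows what that inference path can look like in code. It is a minimal, hedged example: the `TensorRTLLM` exporter class and its methods follow the NeMo export documentation, but the paths and exact argument names are assumptions and may differ between container versions.

```python
# Minimal sketch of exporting a .nemo checkpoint to a TensorRT-LLM engine and
# running generation on it. Paths and some argument names are assumptions.
from nemo.export.tensorrt_llm import TensorRTLLM

exporter = TensorRTLLM(model_dir="/tmp/trtllm_engine")    # directory where the engine is built
exporter.export(
    nemo_checkpoint_path="/models/llama3-8b.nemo",        # hypothetical checkpoint path
    model_type="llama",
)
output = exporter.forward(["Summarize the NeMo performance summary in one sentence."])
print(output)
```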
Below, you can see performance benchmarks for various large language models.
## Performance Summary for Large Language Models

### Pre-Training Performance
The table below shows pre-training performance for various models at FP8 precision with NeMo 2.0. GBS and MBS are the global and micro batch sizes; TP, PP, CP, and VP are the tensor-, pipeline-, context-, and virtual-pipeline-parallel sizes.
Container: NeMo 24.12
System: DGX-H100
| Model | #-GPUs | GBS | MBS | Sequence Length | TP | PP | CP | VP | Tokens / sec / GPU | Model TFLOP / sec / GPU | Est. time to train in days (10T tokens, 1K GPUs) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT3-175B | 128 | 256 | 1 | 2048 | 4 | 8 | 1 | 6 | 794 | 854 (dropout > 0) | 142 |
| GPT3-175B | 512 | 2048 | 2 | 2048 | 4 | 8 | 1 | 6 | 850 | | 133 |
| LLAMA3-8B | 8 | 128 | 1 | 8192 | 1 | 1 | 2 | 1 | 14064 | 814 | 8 |
| LLAMA3-70B | 64 | 128 | 1 | 8192 | 4 | 4 | 2 | 5 | 1633 | 786 | 69 |
| LLAMA3-405B | 576 | 252 | 1 | 8192 | 8 | 9 | 2 | 7 | 312 | 827 | 362 |
| Nemotron-8B | 64 | 256 | 4 | 4096 | 2 | 1 | 1 | 1 | 13003 | 668 | 9 |
| Nemotron-15B | 64 | 256 | 4 | 4096 | 4 | 1 | 1 | 1 | 7550 | 710 | 15 |
| Nemotron-22B | 64 | 256 | 2 | 4096 | 2 | 4 | 1 | 10 | 5831 | 759 | 19 |
| Nemotron-340B | 128 | 32 | 1 | 4096 | 8 | 8 | 1 | 12 | 367 | 773 | 308 |
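To give the parallelism columns concrete meaning, the sketch below maps the LLAMA3-8B row onto a NeMo 2.0 pre-training recipe. It assumes the recipe API from the NeMo 2.0 quickstart (`nemo.collections.llm` together with NeMo-Run); the exact attribute paths on the recipe object are assumptions and may vary between releases.

```python
# Sketch: configure a NeMo 2.0 pre-training recipe to match the LLAMA3-8B row
# above (GBS=128, MBS=1, sequence length 8192, TP=1, PP=1, CP=2, VP disabled).
# Attribute paths are assumptions based on the NeMo 2.0 recipe API.
import nemo_run as run
from nemo.collections import llm

recipe = llm.llama3_8b.pretrain_recipe(
    name="llama3_8b_pretrain",
    dir="/checkpoints/llama3_8b",   # hypothetical output directory
    num_nodes=1,
    num_gpus_per_node=8,
)

recipe.data.global_batch_size = 128                                   # GBS
recipe.data.micro_batch_size = 1                                      # MBS
recipe.data.seq_length = 8192                                         # Sequence Length
recipe.trainer.strategy.tensor_model_parallel_size = 1                # TP
recipe.trainer.strategy.pipeline_model_parallel_size = 1              # PP
recipe.trainer.strategy.context_parallel_size = 2                     # CP
recipe.trainer.strategy.virtual_pipeline_model_parallel_size = None   # VP=1 (interleaving disabled)

run.run(recipe, executor=run.LocalExecutor(ntasks_per_node=8, launcher="torchrun"))
```

For reference, the last column is consistent with `10T tokens / (1024 GPUs × Tokens/sec/GPU × 86,400 s/day)`; for example, 794 tokens/sec/GPU works out to roughly 142 days.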
### Fine-Tuning Performance
The table below highlights the fine-tuning performance for Llama3 models using Supervised Fine-Tuning (SFT) and Low-Rank Adaptation (LoRA) at FP8 precision with NeMo 2.0.
Container: NeMo 24.12
System: DGX-H100
For fine-tuning, we use the SQuAD v1.1 dataset with inputs packed to the sequence length shown in the Packed Sequence Length column.
| Model | Task | #-GPUs | GBS | MBS | Packed Sequence Length | TP | PP | VP | Tokens / sec / GPU | Model TFLOP / sec / GPU | Est. time to complete in mins (10M tokens) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LLAMA3-8B | SFT | 8 | 32 | 1 | 4096 | 1 | 1 | 1 | 16891 | 763 | 1.23 |
| LLAMA3-70B | SFT | 32 | 32 | 1 | 4096 | 4 | 4 | 5 | 1672 | 697 | 3.12 |
| LLAMA3-8B | LoRA | 8 | 32 | 1 | 4096 | 1 | 1 | 1 | 23406 | 707 | 0.89 |
| LLAMA3-70B | LoRA | 8 | 32 | 1 | 4096 | 2 | 4 | 20 | 2758 | 768 | 7.55 |
| LLAMA3-405B | LoRA | 24 | 24 | 1 | 2048 | 4 | 6 | 7 | 509 | 827 | 13.63 |
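The same recipe API exposes fine-tuning entry points. The sketch below approximates the LLAMA3-8B LoRA row; the `peft_scheme` and `packed_sequence` arguments follow the NeMo 2.0 documentation, but the exact names and defaults are assumptions for your installed version.

```python
# Sketch: LoRA fine-tuning roughly matching the LLAMA3-8B LoRA row above
# (8 GPUs, GBS=32, MBS=1, packed sequence length 4096, TP=PP=VP=1).
# Argument names are assumptions based on the NeMo 2.0 fine-tuning recipes.
import nemo_run as run
from nemo.collections import llm

recipe = llm.llama3_8b.finetune_recipe(
    name="llama3_8b_lora",
    dir="/checkpoints/llama3_8b_lora",  # hypothetical output directory
    num_nodes=1,
    num_gpus_per_node=8,
    peft_scheme="lora",                 # use "none" for full-parameter SFT
    packed_sequence=True,               # pack short SQuAD examples to the full sequence length
)

recipe.data.global_batch_size = 32      # GBS
recipe.data.micro_batch_size = 1        # MBS

run.run(recipe, executor=run.LocalExecutor(ntasks_per_node=8, launcher="torchrun"))
```

The last column is consistent with `10M tokens / (#-GPUs × Tokens/sec/GPU × 60 s/min)`; for example, 16,891 tokens/sec/GPU on 8 GPUs works out to about 1.23 minutes.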