Important
NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Please refer to the NeMo 2.0 overview for information on getting started.
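If you want to try it, a minimal sketch for pulling the dev container (assuming Docker is installed and you are already logged in to nvcr.io) is:

```bash
# Pull the NeMo dev container that ships NeMo 2.0 (requires access to nvcr.io)
docker pull nvcr.io/nvidia/nemo:dev
```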
Performance
The NVIDIA NeMo Framework accelerates the entire AI workflow end-to-end, from data preparation to model training to inference. It achieves high training throughput for advanced generative AI models by incorporating the latest training techniques, such as model parallelism and optimized attention mechanisms. For inference, the NeMo Framework provides a path that leverages TensorRT-LLM, a specialized library for accelerating and optimizing LLM inference on NVIDIA GPUs.
Below, you can see performance benchmarks for various large language models.
Performance Summary for Large Language Models
Pretraining
The table below shows the pretraining performance of various models at FP8 precision.
Container: NeMo 24.07
System: DGX-H100
Model | #-GPUs | GBS | MBS | Sequence Length | TP | PP | CP | VP | Tokens / sec / GPU | Model TFLOP / sec / GPU | Est. time to train in days (10T tokens, 1K GPUs)
---|---|---|---|---|---|---|---|---|---|---|---
GPT3-5B | 64 | 2048 | 4 | 2048 | 1 | 1 | 1 | 1 | 23406 | 765 | 5
GPT3-20B | 64 | 256 | 2 | 2048 | 2 | 1 | 1 | 1 | 5851 | 750 | 19
GPT3-175B | 128 | 256 | 1 | 2048 | 4 | 8 | 1 | 6 | 716 | 771 | 158
GPT3-175B | 512 | 2048 | 2 | 2048 | 4 | 8 | 1 | 6 | 825 | | 137
LLAMA2-7B | 8 | 128 | 1 | 4096 | 1 | 1 | 1 | 1 | 16934 | 780 | 7
LLAMA2-13B | 16 | 128 | 1 | 4096 | 1 | 4 | 1 | 10 | 8715 | 760 | 13
LLAMA2-70B | 64 | 128 | 1 | 4096 | 4 | 4 | 1 | 20 | 1728 | 768 | 65
Nemotron-8B | 64 | 256 | 4 | 4096 | 2 | 1 | 1 | 1 | 12507 | 643 | 9
Nemotron-22B | 64 | 256 | 2 | 4096 | 2 | 4 | 1 | 10 | 4312 | 562 | 26
Nemotron-340B | 128 | 32 | 1 | 4096 | 8 | 8 | 1 | 12 | 326 | 686 | 347
LLAMA3-8B | 8 | 128 | 1 | 8192 | 1 | 1 | 2 | 1 | 12273 | 711 | 9
LLAMA3-70B | 64 | 128 | 1 | 8192 | 4 | 4 | 2 | 5 | 1524 | 734 | 74
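The last column can be reproduced from the throughput column: at a fixed token budget and GPU count, the estimated days are total tokens divided by aggregate tokens per second. A minimal sketch for the GPT3-5B row, using the 10T-token budget and 1K GPUs from the column header:

```bash
# Estimated days ≈ total tokens / (tokens/sec/GPU * #GPUs * 86400 sec/day)
# GPT3-5B row: 23406 tokens/sec/GPU, 1,000 GPUs, 10T-token budget
awk 'BEGIN { printf "%.1f days\n", 10e12 / (23406 * 1000 * 86400) }'
# prints 4.9 days, which rounds to the 5 days listed above
```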
Fine-Tuning
The table below presents the fine-tuning performance of LLaMA2 models using Supervised Fine-Tuning (SFT) and Low-Rank Adaptation (LoRA) at FP8 precision.
Container: NeMo 24.07
System: DGX-H100
For fine-tuning, we use the SQuAD-v1.1 dataset, with inputs packed to 4096 tokens.
Model | Task | #-GPUs | GBS | MBS | Packed Sequence Length | TP | PP | Tokens / sec / GPU | Model TFLOP / sec / GPU | Est. time to complete in mins (10M tokens)
---|---|---|---|---|---|---|---|---|---|---
LLAMA2-7B | SFT | 8 | 32 | 1 | 4096 | 1 | 1 | 16891 | 673 | 1.2
LLAMA2-13B | SFT | 8 | 32 | 1 | 4096 | 1 | 4 | 10176 | 787 | 2.0
LLAMA2-70B | SFT | 16 | 32 | 1 | 4096 | 4 | 4 | 1816 | 749 | 5.7
LLAMA2-7B | LoRA | 8 | 32 | 1 | 4096 | 1 | 1 | 24824 | 663 | 0.8
LLAMA2-13B | LoRA | 8 | 32 | 1 | 4096 | 1 | 1 | 14629 | 757 | 1.4
LLAMA2-70B | LoRA | 8 | 32 | 1 | 4096 | 2 | 4 | 2621 | 722 | 7.9
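The same arithmetic applies to the fine-tuning estimates, this time using each row's GPU count and the 10M-token budget from the column header. For example, for the LLAMA2-7B SFT row:

```bash
# Estimated minutes ≈ total tokens / (tokens/sec/GPU * #GPUs * 60 sec/min)
# LLAMA2-7B SFT row: 16891 tokens/sec/GPU on 8 GPUs, 10M-token budget
awk 'BEGIN { printf "%.1f minutes\n", 10e6 / (16891 * 8 * 60) }'
# prints 1.2 minutes, matching the table
```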
Example Scripts for Pretraining and Fine-tuning
These scripts run a recommended configuration for GPT-3, LLaMA-2, and Nemotron pretraining and fine-tuning for various model sizes on A100 and H100. For example, for GPT-3 pretraining, the following folders provide sample scripts:
- A100: Scripts to run GPT pretraining on NVIDIA A100 GPUs, in the bf16 data type.
- H100: Scripts to run GPT pretraining on NVIDIA H100 GPUs, in the fp8 data type.
Set Up Example Scripts
To run these scripts, you must have access to the NeMo Framework Container. Please sign in at NGC (user = ea-bignlp/ga-participants) to access the catalog.
Update the following bash variables in the example run scripts:
- `NEMO_MEGATRON_LAUNCHER_DIR`: the directory where this repository is located.
- `DATA_DIR`: the directory of the dataset used for pretraining; by default this is `NEMO_MEGATRON_LAUNCHER_DIR/data`.
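For example, the top of a run script might set these variables as follows (the repository path below is a placeholder, not a shipped default):

```bash
# Placeholder path to your clone of the launcher repository
NEMO_MEGATRON_LAUNCHER_DIR="/path/to/NeMo-Megatron-Launcher"
# Pretraining dataset location; by default the scripts expect <launcher dir>/data
DATA_DIR="${NEMO_MEGATRON_LAUNCHER_DIR}/data"
```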
Enter your cluster environment settings in config.yaml.
For bcm-type clusters, update the job name, partition, and account in bcm.yaml.
For testing performance with synthetic data on an interactive node, you need to add the following options to your bash script:
cluster_type=interactive \
++training.cluster_type=BCP \
training.model.data.data_impl="mock" \
training.model.data.data_prefix=[]
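As a sketch of how these overrides fit together, they are simply appended to the launcher invocation in the run script. The entry point and the gpt3/5b training config below are assumptions for illustration; keep whatever your chosen example script already invokes:

```bash
# Hypothetical invocation: pretraining with synthetic (mock) data on an interactive node.
# "training=gpt3/5b" is only an example config name; substitute the one from your run script.
cd "${NEMO_MEGATRON_LAUNCHER_DIR}/launcher_scripts"  # assumed repository layout
python3 main.py \
    training=gpt3/5b \
    cluster_type=interactive \
    ++training.cluster_type=BCP \
    training.model.data.data_impl="mock" \
    training.model.data.data_prefix=[]
```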
Collect Results
For a quick read on workload performance, check the `step_time_per_sec` variable in the console output.
For more details and graphics, you can use TensorBoard or Weights & Biases. To do so, use the results stored in `NEMO_MEGATRON_LAUNCHER_DIR/results/<experiment_name>`, which has the following structure:
- `NEMO_MEGATRON_LAUNCHER_DIR/results/<experiment_name>/<experiment_name>.yaml`: the config of the pretrained model.
- `NEMO_MEGATRON_LAUNCHER_DIR/results/<experiment_name>/<jobname>_<experiment_name>.sh`: the autogenerated .sh file that was run.
- `NEMO_MEGATRON_LAUNCHER_DIR/results/<experiment_name>/results/`: directory containing per-rank logs and TensorBoard data.
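For instance, to browse the TensorBoard data from a run, you could point TensorBoard at that results directory (a minimal sketch; replace <experiment_name> with the name your run used):

```bash
# Point TensorBoard at the per-rank logs and event files written by the run
tensorboard --logdir "${NEMO_MEGATRON_LAUNCHER_DIR}/results/<experiment_name>/results/"
```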