
# Performance

The NVIDIA NeMo Framework accelerates the AI workflow end to end, from data preparation through model training to inference. It achieves high training throughput for advanced generative AI models by incorporating recent training techniques such as model parallelism and optimized attention mechanisms. For inference, the NeMo Framework provides a path through TensorRT-LLM, a specialized library for accelerating and optimizing LLM inference on NVIDIA GPUs.
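
As an illustration of that inference path, the sketch below exports a NeMo checkpoint to a TensorRT-LLM engine and runs a prompt through it. It is a minimal, hedged example: the checkpoint path, engine directory, and `model_type` value are placeholders, and the exporter's keyword arguments can vary between NeMo releases.

```python
# Minimal sketch of the TensorRT-LLM inference path.
# Paths and model_type are placeholders; argument names may differ between NeMo releases.
from nemo.export.tensorrt_llm import TensorRTLLM

exporter = TensorRTLLM(model_dir="/workspace/trtllm_engine")   # directory where the built engine is stored
exporter.export(
    nemo_checkpoint_path="/workspace/llama3_8b.nemo",          # placeholder NeMo checkpoint
    model_type="llama",                                        # architecture family of the checkpoint
)

# Run a quick generation against the freshly built engine.
print(exporter.forward(["What is the capital of France?"]))
```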

Below, you can see performance benchmarks for various large language models.

## Performance Summary for Large Language Models

### Pretraining

The tables below show the pre-training performance of various models at FP8 precision (using NeMo 2.0) on DGX-B200 and DGX-H100 systems. A sketch of how the batch and parallelism settings in these tables map onto a NeMo 2.0 recipe follows the tables.

- System: DGX-B200

| Model | #-GPUs | GBS | MBS | Sequence Length | TP | PP | CP | VP | EP | Tokens / sec / GPU | Model TFLOP / sec / GPU | Est. time to train in days (10T tokens, 1K GPUs) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT3-175B | 512 | 2048 | 2 | 2048 | 4 | 4 | 1 | 6 | 1 | 1600 | 1722 | 71 |
| LLAMA3-8B | 8 | 128 | 2 | 8192 | 1 | 1 | 1 | 1 | 1 | 26006 | 1506 | 4 |
| LLAMA3-70B | 64 | 128 | 1 | 8192 | 2 | 4 | 2 | 5 | 1 | 3062 | 1474 | 37 |
| LLAMA3-405B | 128 | 64 | 1 | 8192 | 4 | 8 | 2 | 8 | 1 | 625 | 1658 | 181 |
| Nemotron-15B | 64 | 256 | 2 | 4096 | 1 | 1 | 1 | 1 | 1 | 14760 | 1387 | 8 |
| Nemotron-340B | 128 | 32 | 1 | 4096 | 8 | 4 | 1 | 12 | 1 | 602 | 1268 | 188 |
| Mixtral-8x7B | 64 | 256 | 2 | 4096 | 1 | 1 | 1 | 1 | 8 | 15457 | 1282 | 7 |
| Mixtral-8x22B | 256 | 64 | 1 | 65536 | 2 | 4 | 8 | 14 | 8 | 2232 | 824 | 51 |
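
The last column is not an independent measurement; it is consistent with dividing the 10T-token budget by the aggregate throughput of "1K" GPUs. The small sanity check below uses the LLAMA3-8B row above; the 1024-GPU and 86,400-seconds-per-day factors are assumptions that reproduce the published numbers.

```python
# Sanity check of the "Est. time to train" column: 10T tokens on 1K (assumed 1024) GPUs.
tokens_per_sec_per_gpu = 26_006   # LLAMA3-8B row, DGX-B200 table
total_tokens = 10e12              # 10T-token training budget
num_gpus = 1024                   # "1K GPUs" in the column header
seconds_per_day = 86_400

days = total_tokens / (tokens_per_sec_per_gpu * num_gpus * seconds_per_day)
print(round(days, 1))             # ~4.3 days, matching the "4" in the table
```

The fine-tuning tables further below follow the same arithmetic with a 10M-token budget and the listed GPU count, reported in minutes.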

- System: DGX-H100

| Model | #-GPUs | GBS | MBS | Sequence Length | TP | PP | CP | VP | EP | Tokens / sec / GPU | Model TFLOP / sec / GPU | Est. time to train in days (10T tokens, 1K GPUs) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT3-175B | 512 | 2048 | 2 | 2048 | 4 | 8 | 1 | 6 | 1 | 866 | 932 | 131 |
| LLAMA3-8B | 8 | 128 | 1 | 8192 | 1 | 1 | 2 | 1 | 1 | 14201 | 822 | 8 |
| LLAMA3-70B | 64 | 128 | 1 | 8192 | 4 | 8 | 1 | 5 | 1 | 1662 | 800 | 68 |
| LLAMA3-405B | 1024 | 512 | 1 | 8192 | 8 | 8 | 2 | 8 | 1 | 327 | 866 | 346 |
| Nemotron-15B | 64 | 256 | 2 | 4096 | 2 | 1 | 1 | 1 | 1 | 8233 | 774 | 14 |
| Nemotron-340B | 256 | 64 | 1 | 4096 | 8 | 8 | 1 | 12 | 1 | 346 | 728 | 327 |
| Mixtral-8x7B | 64 | 256 | 1 | 4096 | 1 | 4 | 1 | 8 | 8 | 8233 | 683 | 14 |
| Mixtral-8x22B | 256 | 256 | 1 | 65536 | 4 | 4 | 8 | 14 | 8 | 1278 | 471 | 88 |
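
The GBS/MBS, sequence-length, and TP/PP/CP/VP/EP columns map directly onto a NeMo 2.0 pretraining recipe. The sketch below reproduces the LLAMA3-8B row of the DGX-B200 table under stated assumptions: recipe and attribute names follow the `nemo.collections.llm` recipes and `nemo_run` in recent releases and may differ in yours, and the output directory is a placeholder.

```python
# Sketch: map one table row (LLAMA3-8B, 8 GPUs, GBS 128, MBS 2, seq 8192, TP/PP/CP/VP/EP = 1)
# onto a NeMo 2.0 recipe. Attribute names may vary between NeMo releases; paths are placeholders.
import nemo_run as run
from nemo.collections import llm

recipe = llm.llama3_8b.pretrain_recipe(
    name="llama3_8b_perf_run",
    dir="/results",                 # placeholder checkpoint/log directory
    num_nodes=1,
    num_gpus_per_node=8,
)

# Batch and sequence settings (GBS / MBS / Sequence Length columns)
recipe.data.global_batch_size = 128
recipe.data.micro_batch_size = 2
recipe.data.seq_length = 8192

# Parallelism settings (TP / PP / CP / VP / EP columns)
recipe.trainer.strategy.tensor_model_parallel_size = 1
recipe.trainer.strategy.pipeline_model_parallel_size = 1
recipe.trainer.strategy.context_parallel_size = 1
recipe.trainer.strategy.virtual_pipeline_model_parallel_size = None  # VP = 1 disables interleaving
recipe.trainer.strategy.expert_model_parallel_size = 1

run.run(recipe, executor=run.LocalExecutor(ntasks_per_node=8, launcher="torchrun"))
```

For the larger multi-node rows, the same recipe fields apply; only `num_nodes`, `num_gpus_per_node`, and the executor change.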

### Fine-Tuning

The tables below present the fine-tuning performance of LLaMA3 models using Supervised Fine-Tuning (SFT) and Low-Rank Adaptation (LoRA) at FP8 precision (using NeMo 2.0).

For fine-tuning, we use the SQuAD-v1.1 dataset, with inputs packed to the sequence lengths shown in the tables. A LoRA configuration sketch follows the tables.

- System: DGX-B200

| Model | Task | #-GPUs | GBS | MBS | Packed Sequence Length | TP | PP | VP | Tokens / sec / GPU | Model TFLOP / sec / GPU | Est. time to complete in mins (10M tokens) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LLAMA3-8B | SFT | 8 | 8 | 1 | 16384 | 1 | 1 | 1 | 31508 | 1428 | 0.66 |
| LLAMA3-70B | SFT | 32 | 32 | 1 | 4096 | 2 | 4 | 5 | 3357 | 1400 | 1.55 |
| LLAMA3-8B | LoRA | 8 | 8 | 1 | 16384 | 1 | 1 | 1 | 43116 | 1307 | 0.48 |
| LLAMA3-70B | LoRA | 8 | 32 | 1 | 4096 | 1 | 4 | 20 | 5669 | 1579 | 3.67 |
| LLAMA3-405B | LoRA | 32 | 32 | 1 | 2048 | 4 | 4 | 4 | 759 | 1231 | 6.87 |

- System: DGX-H100

| Model | Task | #-GPUs | GBS | MBS | Packed Sequence Length | TP | PP | VP | Tokens / sec / GPU | Model TFLOP / sec / GPU | Est. time to complete in mins (10M tokens) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LLAMA3-8B | SFT | 8 | 32 | 1 | 4096 | 1 | 1 | 1 | 17246 | 779 | 1.21 |
| LLAMA3-70B | SFT | 32 | 32 | 1 | 4096 | 4 | 4 | 5 | 1789 | 746 | 2.91 |
| LLAMA3-8B | LoRA | 8 | 32 | 1 | 4096 | 1 | 1 | 1 | 23406 | 707 | 0.89 |
| LLAMA3-70B | LoRA | 8 | 32 | 1 | 4096 | 2 | 4 | 20 | 2768 | 771 | 7.53 |
| LLAMA3-405B | LoRA | 32 | 32 | 1 | 2048 | 4 | 8 | 8 | 521 | 846 | 9.99 |
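
As referenced above, the sketch below shows how a LoRA run like the LLAMA3-8B row (8 GPUs, GBS 32, MBS 1 on the DGX-H100 table) could be configured with a NeMo 2.0 fine-tuning recipe, which defaults to SQuAD data. It is a hedged example: the `peft_scheme` and `packed_sequence` arguments and the attribute names reflect recent `nemo.collections.llm` recipes and may differ across releases; the output directory is a placeholder.

```python
# Sketch: LoRA fine-tuning of LLAMA3-8B on 8 GPUs with packed sequences.
# Argument and attribute names may vary between NeMo releases; paths are placeholders.
import nemo_run as run
from nemo.collections import llm

recipe = llm.llama3_8b.finetune_recipe(
    name="llama3_8b_lora",
    dir="/results",               # placeholder output directory
    num_nodes=1,
    num_gpus_per_node=8,
    peft_scheme="lora",           # "none" would run full SFT instead
    packed_sequence=True,         # pack short SQuAD examples up to the packed sequence length
)

# Batch settings from the table row (GBS 32, MBS 1); the packed sequence length itself
# is configured on the recipe's data module.
recipe.data.global_batch_size = 32
recipe.data.micro_batch_size = 1

run.run(recipe, executor=run.LocalExecutor(ntasks_per_node=8, launcher="torchrun"))
```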

## Performance Numbers for previous NeMo Container Releases

- 24.12 NeMo container