Performance#

The NVIDIA NeMo Framework accelerates the entire AI workflow end-to-end, from data preparation to model training to inference. It delivers high training throughput for advanced generative AI models by incorporating the latest training techniques, such as model parallelism and optimized attention mechanisms. For inference, the NeMo Framework provides a path that leverages TensorRT-LLM, a specialized library for accelerating and optimizing LLM inference on NVIDIA GPUs.

Performance Summary for Large Language Models#

Below are performance benchmarks for various large language models. These results were obtained using a version of the performance recipes available here.

  • Abbreviations:

    • GBS: Global Batch Size

    • MBS: Micro Batch Size

    • FSDP: Fully Sharded Data Parallel

      • FSDP = 1: use FSDP

      • FSDP = 0: use DDP (Distributed Data Parallel)

    • TP: Tensor Parallel Size

    • PP: Pipeline Parallel Size

    • CP: Context Parallel Size

    • VP: Virtual Pipeline Parallel Size

    • EP: Expert Parallel Size

    • GA: Number of Gradient Accumulations (see the sketch after this list for how GA follows from GBS, MBS, and the parallelism sizes)
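
The GA column is not tuned independently; it follows from the other configuration columns. The plain-Python sketch below (the helper name is illustrative, not a NeMo API) shows the standard Megatron-style relation that reproduces the GA values in the tables, assuming expert-parallel ranks reuse data-parallel ranks so EP does not enter the data-parallel size: DP = #-GPUs / (TP × PP × CP) and GA = GBS / (MBS × DP).

```python
def gradient_accumulation_steps(num_gpus, gbs, mbs, tp=1, pp=1, cp=1):
    """Illustrative helper: derive the data-parallel size and the number of
    gradient-accumulation steps from a table row's configuration. Expert
    parallelism reuses data-parallel ranks, so EP is not in the denominator."""
    model_parallel = tp * pp * cp      # GPUs consumed by one model replica
    dp = num_gpus // model_parallel    # number of data-parallel replicas
    ga = gbs // (mbs * dp)             # micro-batches accumulated per optimizer step
    return dp, ga

# Example: LLAMA3-70B pre-training row (64 GPUs, GBS=128, MBS=1, TP=PP=CP=1)
print(gradient_accumulation_steps(64, gbs=128, mbs=1))             # -> (64, 2)
# MXFP8 configuration from the same row: TP=2, PP=4, CP=2
print(gradient_accumulation_steps(64, 128, 1, tp=2, pp=4, cp=2))   # -> (4, 32)
```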

Pre-Training Performance#

The tables below summarize the pre-training performance of various models at FP8 precision. For both pre-training and fine-tuning, we apply per-tensor FP8 quantization, with scaling factors computed in the current step.

  • System: DGX-GB200

| Model | #-GPUs | GBS | MBS | Sequence Length | FSDP | TP | PP | CP | VP | EP | GA | Tokens / sec / GPU | Model TFLOP / sec / GPU |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLAMA3-8B | 8 | 128 | 2 | 8192 | 0 | 1 | 1 | 1 | 1 | 1 | 8 | 31737 (30341) | 1838 (1757) |
| LLAMA3-70B | 64 | 128 | 1 | 8192 | 1 (0) | 1 (2) | 1 (4) | 1 (2) | 1 (5) | 1 | 2 (32) | 4066 (2605) | 1957 (1254) |
| LLAMA3.1-405B | 128 | 64 | 1 | 8192 | 1 (0) | 2 (4) | 1 (8) | 1 (2) | 1 (8) | 1 | 1 (32) | 706 (638) | 1854 (1675) |
| LLAMA4-Scout-LLM | 64 | 1024 | 1 | 8192 | 0 | 1 | 1 | 1 | 1 | 16 | 16 | 13783 (10717) | 1501 (1167) |
| LLAMA4-Maverick-LLM | 128 | 1024 | 1 | 8192 | 0 | 1 | 2 | 1 | 12 | 64 | 16 | 11682 (12025) | 1272 (1309) |
| Mixtral-8x7B | 64 | 256 | 2 | 4096 | 0 | 1 | 1 | 1 | 1 | 8 | 2 | 17246 (16549) | 1430 (1372) |
| Mixtral-8x22B | 256 | 64 | 1 | 65536 | 0 | 2 | 4 | 8 | 14 | 8 | 16 | 2869 (3097) | 1059 (1143) |
| Nemotron5-H-56B | 64 | 192 | 1 | 8192 | 0 | 2 | 1 | 1 | 1 | 1 | 6 | 4690 | 1988 |
| DeepSeekV3 | 256 | 2048 | 1 | 4096 | 0 | 2 | 4 | 1 | 1 | 64 | 64 | 2327 (2312) | 606 (601) |

  • The numbers in parentheses represent the configuration and performance for MXFP8.

  • MXFP8 uses 1x32 block quantization for both activations and weights.

  • System: DGX-B200

| Model | #-GPUs | GBS | MBS | Sequence Length | FSDP | TP | PP | CP | VP | EP | GA | Tokens / sec / GPU | Model TFLOP / sec / GPU |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLAMA3-8B | 8 | 128 | 2 | 8192 | 0 | 1 | 1 | 1 | 1 | 1 | 8 | 30131 (28934) | 1745 (1676) |
| LLAMA3-70B | 64 | 128 | 1 | 8192 | 1 (0) | 1 (2) | 1 (4) | 1 (2) | 1 (5) | 1 | 2 (32) | 3690 (3133) | 1777 (1508) |
| LLAMA3.1-405B | 128 | 64 | 1 | 8192 | 0 | 4 | 8 | 2 | 8 | 1 | 32 | 674 (624) | 1769 (1639) |
| LLAMA4-Scout-LLM | 64 | 1024 | 1 | 8192 | 0 | 1 | 2 | 1 | 24 | 8 | 32 | 11260 (10806) | 1226 (1177) |
| LLAMA4-Maverick-LLM | 128 | 1024 | 1 | 8192 | 0 | 1 | 2 | 1 | 12 | 64 | 16 | 9811 (9870) | 1068 (1075) |
| Mixtral-8x7B | 64 | 256 | 2 | 4096 | 0 | 1 | 1 | 1 | 1 | 8 | 2 | 16384 (15170) | 1359 (1258) |
| Mixtral-8x22B | 256 | 64 | 1 | 65536 | 0 | 2 | 4 | 8 | 14 | 8 | 16 | 2548 (2475) | 940 (913) |
| Nemotron5-H-56B | 64 | 192 | 2 | 8192 | 0 | 4 | 1 | 1 | 1 | 1 | 6 | 4123 | 1748 |
| DeepSeekV3 | 256 | 2048 | 1 | 4096 | 0 | 2 | 16 | 1 | 1 | 8 | 256 | 1640 (1577) | 426 (410) |

  • The numbers in parentheses represent the configuration and performance for MXFP8.

  • MXFP8 uses 1x32 block quantization for both activations and weights.

  • System: DGX-H100

| Model | #-GPUs | GBS | MBS | Sequence Length | TP | PP | CP | VP | EP | GA | Tokens / sec / GPU | Model TFLOP / sec / GPU |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLAMA3-8B | 8 | 128 | 1 | 8192 | 1 | 1 | 2 | 1 | 1 | 32 | 14340 | 830 |
| LLAMA3-70B | 64 | 128 | 1 | 8192 | 4 | 8 | 1 | 5 | 1 | 64 | 1663 | 801 |
| LLAMA3.1-405B | 1024 | 512 | 1 | 8192 | 8 | 8 | 2 | 8 | 1 | 64 | 320 | 840 |
| LLAMA4-Scout-LLM | 256 | 1024 | 1 | 8192 | 4 | 1 | 1 | 1 | 16 | 16 | 4239 | 462 |
| LLAMA4-Maverick-LLM | 512 | 1024 | 1 | 8192 | 4 | 1 | 1 | 1 | 128 | 8 | 4602 | 501 |
| Mixtral-8x7B | 64 | 256 | 1 | 4096 | 1 | 4 | 1 | 8 | 8 | 16 | 8275 | 686 |
| Mixtral-8x22B | 256 | 256 | 1 | 65536 | 4 | 4 | 8 | 14 | 8 | 32 | 1280 | 472 |
| Nemotron5-H-56B | 64 | 192 | 1 | 8192 | 8 | 1 | 1 | 1 | 1 | 24 | 1980 | 839 |
| DeepSeekV3 | 1024 | 8192 | 1 | 4096 | 2 | 16 | 1 | 2 | 64 | 256 | 887 (850) | 230 (225) |

  • The numbers in parentheses for DeepSeekV3 indicate the use of different quantization granularities: 128×128 for weights and 1×128 for activations, which match those used in the original DeepSeekV3 pre-training.
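
The quantization schemes referenced in the notes above differ mainly in the granularity at which FP8 scale factors are computed: one scale per tensor (per-tensor current scaling), one scale per 1x32 block (MXFP8), or 1x128 activation and 128x128 weight blocks (the DeepSeekV3 granularity noted above). The PyTorch sketch below is purely illustrative, independent of the Transformer Engine kernels NeMo actually uses, and ignores details such as MXFP8's power-of-two (E8M0) scale encoding; it only shows how the scale granularities differ.

```python
import torch

E4M3_MAX = 448.0  # largest representable magnitude in FP8 E4M3

def per_tensor_scale(x):
    """Per-tensor current scaling: one scale per tensor, recomputed each step."""
    return x.abs().max() / E4M3_MAX

def block_scales(x, block_rows, block_cols):
    """One scale per (block_rows x block_cols) tile of a 2-D tensor.
    Assumes the tensor dimensions are divisible by the block shape."""
    r, c = x.shape
    tiles = x.reshape(r // block_rows, block_rows, c // block_cols, block_cols)
    return tiles.abs().amax(dim=(1, 3)) / E4M3_MAX

w = torch.randn(256, 256)   # toy weight
a = torch.randn(128, 256)   # toy activation

s_tensor = per_tensor_scale(w)        # per-tensor current scaling: 1 scale
s_mxfp8  = block_scales(w, 1, 32)     # MXFP8-style 1x32 blocks: 256 x 8 scales
s_ds_w   = block_scales(w, 128, 128)  # DeepSeekV3-style 128x128 weight blocks
s_ds_a   = block_scales(a, 1, 128)    # DeepSeekV3-style 1x128 activation blocks
```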

Fine-Tuning Performance#

The tables below highlight the fine-tuning performance for Llama3 models using Supervised Fine-Tuning (SFT) and Low-Rank Adaptation (LoRA) at FP8 precision with NeMo 2.0.
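
With LoRA, the pretrained weights stay frozen and only a low-rank update added to selected linear layers is trained, which is why the LoRA rows below run on fewer GPUs than full SFT of the same model. A minimal, generic PyTorch sketch of a LoRA-augmented linear layer follows (not NeMo's implementation; the layer size, rank, and scaling are illustrative).

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Generic LoRA illustration: a frozen base projection W is augmented with a
    trainable low-rank update (alpha / r) * B @ A, so only A and B receive
    gradients during fine-tuning."""
    def __init__(self, in_features, out_features, r=16, alpha=32):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)           # frozen pretrained weight
        self.lora_a = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_a.T) @ self.lora_b.T

layer = LoRALinear(4096, 4096)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 131072 trainable parameters vs. ~16.8M frozen base parameters
```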

For fine-tuning, we use the SQuAD-v1.1 dataset, with inputs packed to the sequence lengths shown in the Packed Sequence Length column.
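
Sequence packing concatenates several short tokenized examples into one sequence of the packed length shown in the tables, which reduces padding waste on a short-prompt dataset like SQuAD. The sketch below is a minimal greedy first-fit packer in plain Python (not NeMo's packing utility); in practice each packed sequence also carries position IDs and a block-diagonal attention mask so examples do not attend to each other.

```python
def pack_examples(example_lengths, packed_seq_len):
    """Greedy first-fit packing: place each tokenized example into the first bin
    that still has room, opening a new bin when none fits. Returns a list of
    bins, each a list of example indices whose total length <= packed_seq_len."""
    bins, remaining = [], []
    for idx, length in enumerate(example_lengths):
        for b, free in enumerate(remaining):
            if length <= free:
                bins[b].append(idx)
                remaining[b] -= length
                break
        else:
            bins.append([idx])
            remaining.append(packed_seq_len - length)
    return bins

# Toy example: token counts of individual samples packed into 4096-token bins
lengths = [900, 1500, 700, 2100, 350, 1200]
print(pack_examples(lengths, 4096))  # -> [[0, 1, 2, 4], [3, 5]]
```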

  • System: DGX-GB200

| Model | Task | #-GPUs | GBS | MBS | Packed Sequence Length | TP | PP | VP | GA | Tokens / sec / GPU | Model TFLOP / sec / GPU |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLAMA3-8B | SFT | 8 | 8 | 1 | 16384 | 1 | 1 | 1 | 1 | 34133 (32768) | 1547 (1485) |
| LLAMA3-70B | SFT | 32 | 32 | 1 | 4096 | 2 | 4 | 5 | 8 | 4267 (4096) | 1779 (1708) |
| LLAMA3-70B | LoRA | 8 | 64 | 1 | 2048 | 1 | 4 | 20 | 32 | 3633 (3257) | 1008 (903) |

  • The numbers in parentheses represent the performance for MXFP8.

  • MXFP8 uses 1x32 block quantization for both activations and weights.

  • System: DGX-B200

| Model | Task | #-GPUs | GBS | MBS | Packed Sequence Length | TP | PP | VP | GA | Tokens / sec / GPU | Model TFLOP / sec / GPU |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLAMA3-8B | SFT | 8 | 8 | 1 | 16384 | 1 | 1 | 1 | 1 | 32125 (31508) | 1456 (1428) |
| LLAMA3-70B | SFT | 32 | 32 | 1 | 4096 | 2 | 4 | 5 | 8 | 4180 (3864) | 1743 (1611) |
| LLAMA3-70B | LoRA | 8 | 32 | 1 | 4096 | 1 | 4 | 20 | 16 | 5729 (5729) | 1596 (1596) |

  • The numbers in parentheses represent the performance for MXFP8.

  • MXFP8 uses 1x32 block quantization for both activations and weights.

  • System: DGX-H100

| Model | Task | #-GPUs | GBS | MBS | Packed Sequence Length | TP | PP | VP | GA | Tokens / sec / GPU | Model TFLOP / sec / GPU |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLAMA3-8B | SFT | 8 | 32 | 1 | 4096 | 1 | 1 | 1 | 4 | 18618 | 841 |
| LLAMA3-70B | SFT | 32 | 32 | 1 | 4096 | 4 | 4 | 5 | 16 | 1862 | 776 |
| LLAMA3-70B | LoRA | 8 | 32 | 1 | 4096 | 2 | 4 | 20 | 32 | 2754 | 767 |

Performance Numbers for Previous NeMo Container Releases#

Performance Summary Archive