Performance#

As part of the NVIDIA NeMo Framework, NeMo RL provides optimal performance for reinforcement learning on generative AI models by incorporating the latest optimizations, such as refit optimizations, mixed-precision training, and off-policy training.

This page provides performance benchmarks for LLMs and VLMs using NeMo RL across different GPU systems and configurations.

Nomenclature#

  • GBS: Global Batch Size

  • MBS: Micro Batch Size

  • TP: Tensor Parallel Size

  • PP: Pipeline Parallel Size

  • CP: Context Parallel Size

  • VP: Virtual Pipeline Parallel Size

  • EP: Expert Parallel Size

  • T-: Training related

  • G-: Generation related

  • Training backend: NeMo RL has two training backends: Megatron and PyTorch DTensor. This performance summary currently only shows numbers from the Megatron backend. A brief sketch after this list illustrates how the parallelism sizes above appear in the benchmark tables.
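
The bracketed parallelism columns in the benchmark tables below (Generation [TP,PP] and Training [TP,CP,EP,PP,VPP]) pack these sizes into short tuples. The following is a minimal illustrative sketch of how such a tuple unpacks into named sizes; the class and field names are hypothetical and are not actual NeMo RL configuration keys.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical containers for illustration only; NeMo RL's real configuration
# uses its own recipe schema, not these names.

@dataclass
class GenerationParallelism:
    tp: int  # Tensor Parallel Size
    pp: int  # Pipeline Parallel Size

@dataclass
class TrainingParallelism:
    tp: int             # Tensor Parallel Size
    cp: int             # Context Parallel Size
    ep: int             # Expert Parallel Size
    pp: int             # Pipeline Parallel Size
    vpp: Optional[int]  # Virtual Pipeline Parallel Size (None where the table shows "n/a")

# Example: the Qwen3-235B on-policy row below reports Generation [16,1]
# and Training [2,2,16,8,n/a].
generation = GenerationParallelism(tp=16, pp=1)
training = TrainingParallelism(tp=2, cp=2, ep=16, pp=8, vpp=None)
```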

Performance Metrics#

Since reinforcement learning consists of training, generation, and the transitions between the two, performance measurement reflects all of these stages. Specifically, we track the following metrics:

  • Step time: Time for each step, which includes training, generation, policy logprobs, and refit time.

  • Tokens/sec/GPU: The rate at which tokens are processed by a stage (such as training, generation, or refitting) on a single GPU (see the sketch after this list):

    \[ \text{Tokens/sec/GPU} = \frac{\text{Total Tokens Processed}}{\text{Time for Stage} \times \text{Number of GPUs}} \]
  • Training MFU: Model FLOPs Utilization for the training stage, i.e., the achieved model floating-point operations per second per GPU relative to the GPU's peak.
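
As a minimal sketch of these definitions, the step time is the sum of the per-stage times and Tokens/sec/GPU follows directly from the formula above. The stage names and numbers below are hypothetical placeholders, not values produced by NeMo RL.

```python
# Minimal sketch of the metric definitions above. All stage names and numbers
# here are hypothetical placeholders, not values measured with NeMo RL.

def tokens_per_sec_per_gpu(total_tokens: int, stage_time_s: float, num_gpus: int) -> float:
    """Tokens/sec/GPU = total tokens processed / (time for stage * number of GPUs)."""
    return total_tokens / (stage_time_s * num_gpus)

# Hypothetical per-stage wall-clock times for one RL step (seconds).
stage_times_s = {"generation": 55.0, "policy_logprobs": 8.0, "training": 30.0, "refit": 5.0}

# Step time: the sum of all stages in the step.
step_time_s = sum(stage_times_s.values())
print(f"Step time: {step_time_s:.1f} s")  # 98.0 s

# Example: 16 GPUs processing ~2.4M tokens during the training stage.
rate = tokens_per_sec_per_gpu(total_tokens=2_400_000,
                              stage_time_s=stage_times_s["training"],
                              num_gpus=16)
print(f"Training Tokens/sec/GPU: {rate:,.0f}")  # 5,000
```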

Performance Summary for Large Language Models#

Below are performance benchmarks for various large language models organized by release version. These results were obtained using performance recipes available here.

The performance data includes:

  • RL Performance: Performance metrics for various model sizes and architectures across RL algorithms (currently GRPO, with DAPO and PPO planned), covering both on-policy and asynchronous (off-policy) runs.

  • System Configurations: Results across different GPU systems (currently DGX-H100, with DGX-GB200 and DGX-B200 planned).

  • Precision Options: Performance comparisons between different precision modes (BF16, FP8)


NeMo RL v0.4#

  • GRPO Dataset: OpenMathInstruct-2

  • System: DGX-H100

  • Precision: Training BF16, Generation BF16

  • Training Backend: Megatron-core.

| Model | On/Off policy | T-Max Sequence Length | G-Average Seq Len | # GPUs | G-GBS | T-GBS | Generation [TP,PP] | Training [TP,CP,EP,PP,VPP] | Tokens/sec/GPU | Total Step Time (s) |
|---|---|---|---|---|---|---|---|---|---|---|
| LLAMA3.1_8B | On policy | 4,096 | 1,060 | 16 | 2,048 | 512 | [1,1] | [1,1,1,1,1,2,n/a] | 1,562 | 97.7 |
| LLAMA3.1_8B | 1-step Off | 4,096 | 1,129 | 16 | 2,048 | 512 | [1,1] | [1,1,1,1,1,2,n/a] | 2,161 | 74.6 |
| DeepSeek V3 | On policy | 1,536 | 745 | 256 | 512 | 512 | [32,1] | [1,1,16,16,n/a] | 11 | 154 |
| DeepSeek V3 | 1-step Off | 1,536 | 744 | 512 | 512 | 512 | [32,1] | [1,1,16,16,n/a] | 11.0 | 77.9 |
| Qwen3-235B | On policy | 8,192 | 5,671 | 128 | 512 | 512 | [16,1] | [2,2,16,8,n/a] | 45.7 | 506 |
| Qwen3-235B | 1-step Off | 8,192 | 5,691 | 256 | 512 | 512 | [8,1] | [4,1,16,8,n/a] | 52.2 | 241 |
| Qwen3-30B3A | On policy | 4,096 | 3,154 | 32 | 2,048 | 512 | [4,1] | [2,1,8,1,n/a] | 925 | 225 |
| Qwen3-30B3A | 1-step Off | 4,096 | 3,158 | 32 | 2,048 | 512 | [4,1] | [2,1,8,1,n/a] | 864 | 244 |
| Qwen3-32B | On policy | 4,096 | 3,206 | 32 | 2,048 | 512 | [4,1] | [4,1,1,4,n/a] | 540 | 393 |
| Qwen3-32B | 1-step Off | 4,096 | 3,207 | 64 | 2,048 | 512 | [4,1] | [4,1,1,4,n/a] | 494 | 215 |

Note:

  • All Mixture-of-Experts (MoE) model training uses dropless token routing (no token dropping).

  • The following metrics are averaged over 5 steps: G-Average Seq Len, Tokens/sec/GPU, and Total Step Time (s). Because of this averaging, the numbers in the table do not exactly match the equation stated in Performance Metrics above, but the difference is small. A worked cross-check against one table row is sketched below.
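
For reference, the Tokens/sec/GPU definition can be roughly cross-checked against a table row by rearranging the formula. The sketch below uses the LLAMA3.1_8B on-policy row; because the table values are 5-step averages and the per-stage token accounting is not reproduced here, this is only an approximate consistency check.

```python
# Approximate cross-check of the Tokens/sec/GPU formula against the
# LLAMA3.1_8B on-policy row. Table values are 5-step averages, so the
# comparison is rough by design.

tokens_per_sec_per_gpu = 1_562  # from the table
num_gpus = 16                   # from the table
step_time_s = 97.7              # Total Step Time (s) from the table

# Rearranged formula: total tokens processed ~= rate * time * number of GPUs.
approx_total_tokens = tokens_per_sec_per_gpu * step_time_s * num_gpus
print(f"Approximate tokens processed per step: {approx_total_tokens:,.0f}")  # ~2.44M

# For comparison, generation alone emits roughly G-GBS * G-Average Seq Len tokens.
# Prompt tokens are not counted in this rough estimate, which, together with the
# averaging, may account for part of the gap.
gen_gbs, avg_gen_len = 2_048, 1_060
print(f"Generated tokens per step (responses only): {gen_gbs * avg_gen_len:,}")  # 2,170,880
```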