# Performance

As part of the NVIDIA NeMo Framework, NeMo RL delivers high performance for reinforcement learning on generative AI models by incorporating the latest optimizations, such as refit optimizations, mixed-precision training, and off-policy training.

This page provides performance benchmarks for LLMs and VLMs using NeMo RL across different GPU systems and configurations. The YAML recipes to reproduce these runs can be found under this folder.

## Nomenclature

- GBS: Global Batch Size
- MBS: Micro Batch Size
- TP: Tensor Parallel Size
- PP: Pipeline Parallel Size
- CP: Context Parallel Size
- VP: Virtual Pipeline Parallel Size
- EP: Expert Parallel Size

- T-: prefix for training-related columns
- G-: prefix for generation-related columns
- Training backend: NeMo RL has two training backends, Megatron and PyTorch DTensor. This performance summary currently reports numbers only for the Megatron backend.
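
To make this nomenclature concrete, the sketch below shows how the training (T-) and generation (G-) parallelism sizes from the tables might appear in a recipe YAML. The key names are illustrative assumptions rather than the authoritative NeMo RL schema; consult the recipe files for the exact layout.

```yaml
# Illustrative sketch only: key names approximate a NeMo RL recipe and may
# not match the actual schema. The values mirror the table nomenclature:
# Training [TP,CP,EP,PP,VPP] = [2,2,16,8,n/a] and Generation [TP,PP] = [16,1].
policy:
  megatron_cfg:                       # T- settings (Megatron training backend)
    tensor_model_parallel_size: 2     # TP
    context_parallel_size: 2          # CP
    expert_model_parallel_size: 16    # EP
    pipeline_model_parallel_size: 8   # PP
  generation:                         # G- settings (inference engine)
    vllm_cfg:
      tensor_parallel_size: 16        # TP
      pipeline_parallel_size: 1       # PP
```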

## Performance Metrics

Since a reinforcement learning step consists of training, generation, and the transitions between the two, performance measurement reflects each of these stages. Specifically, we track the following metrics:

- Step time: the time for each step, which includes training, generation, policy log-prob computation, and refit time.

- Tokens/sec/GPU: the rate at which tokens are processed by a stage (such as training, generation, or refitting) on a single GPU, as shown in the sketch after this list:

  \[ \text{Tokens/sec/GPU} = \frac{\text{Total Tokens Processed}}{\text{Time for Stage} \times \text{Number of GPUs}} \]
- Training MFU: Model FLOPs Utilization during training, that is, the achieved model FLOPs per second per GPU as a fraction of the GPU's peak FLOPs
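
The following minimal sketch, using hypothetical numbers that are not taken from the benchmark tables below, shows how these metrics are computed:

```python
# Minimal sketch of the metric definitions above. All numbers here are
# hypothetical and do not come from the benchmark tables below.

def tokens_per_sec_per_gpu(total_tokens: int, stage_time_s: float, num_gpus: int) -> float:
    """Tokens/sec/GPU = Total Tokens Processed / (Time for Stage * Number of GPUs)."""
    return total_tokens / (stage_time_s * num_gpus)

def step_time_s(train_s: float, gen_s: float, logprobs_s: float, refit_s: float) -> float:
    """Step time is the sum of training, generation, policy-logprob, and refit time."""
    return train_s + gen_s + logprobs_s + refit_s

def training_mfu(achieved_model_flops_per_sec_per_gpu: float, peak_flops_per_gpu: float) -> float:
    """MFU: achieved model FLOPs per second per GPU as a fraction of peak FLOPs."""
    return achieved_model_flops_per_sec_per_gpu / peak_flops_per_gpu

# A hypothetical stage: 512 samples averaging 700 tokens each, taking 120 s on 256 GPUs.
total_tokens = 512 * 700
print(f"{tokens_per_sec_per_gpu(total_tokens, 120.0, 256):.1f} tokens/sec/GPU")  # ~11.7
print(f"{step_time_s(60.0, 45.0, 10.0, 5.0):.0f} s per step")                    # 120 s
```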

## Performance Summary for Large Language Models

Below are performance benchmarks for various large language models, organized by release version. These results were obtained using the performance recipes available here.

The performance data includes:

- RL Performance: performance metrics for various model sizes and architectures on different RL algorithms (currently GRPO, with DAPO and PPO to come), for both on-policy and asynchronous (off-policy) training
- System Configurations: results across different GPU systems (DGX-H100 and GB200-NVL72, with DGX-B200 to come)
- Precision Options: performance comparisons between different precision modes (BF16, FP8)


### NeMo RL v0.6

#### H100 BF16 Benchmarks

- GRPO Dataset: OpenMathInstruct-2
- System: DGX-H100
- Precision: Training BF16, Generation BF16
- Training Backend: Megatron-core

| Algorithm | Model | On/Off Policy | T-Max Sequence Length | G-Average Seq Len | #-GPUs | G-GBS | T-GBS | Generation [TP,PP] | Training [TP,CP,EP,PP,VPP] | Tokens/sec/GPU | Total Step Time (s) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GRPO | DeepSeek V3 | On policy | 1,536 | 701 | 256 | 512 | 512 | [32,1] | [1,1,16,16,n/a] | 12.1 | 134 |
| GRPO | DeepSeek V3 | On policy | 1,536 | 697 | 512 | 512 | 512 | [32,1] | [1,1,16,16,n/a] | 7.24 | 111 |
| GRPO | DeepSeek V3 | 1-step Off | 1,536 | 710 | 512 | 512 | 512 | [32,1] | [1,1,16,16,n/a] | 12.8 | 64.1 |
| GRPO | Qwen3-235B | On policy | 8,192 | 5,698 | 128 | 512 | 512 | [16,1] | [2,2,16,8,n/a] | 58.9 | 395 |
| GRPO | Qwen3-235B | On policy | 8,192 | 5,713 | 256 | 512 | 512 | [16,1] | [2,2,16,8,n/a] | 37.4 | 312 |
| GRPO | Qwen3-235B | 1-step Off | 8,192 | 5,721 | 256 | 512 | 512 | [8,1] | [4,1,16,8,n/a] | 58.7 | 231 |
| GRPO | Qwen3-30B3A | On policy | 4,096 | 3,203 | 32 | 2,048 | 512 | [2,1] | [1,1,8,1,n/a] | 1,102 | 192 |
| GRPO | Qwen3-30B3A | 1-step Off | 4,096 | 3,201 | 32 | 2,048 | 512 | [2,1] | [1,1,8,2,n/a] | 1,414 | 152 |
| GRPO | Qwen3-30B3A | 8-step Off | 4,096 | 3,206 | 192 | 2,048 | 512 | [2,1] | [1,1,8,1,n/a] | 1,025 | 34.5 |

#### H100 FP8 Benchmarks

- GRPO Dataset: OpenMathInstruct-2
- System: DGX-H100
- Precision: Training FP8, Generation FP8
- Training Backend: Megatron-core

| Algorithm | Model | On/Off Policy | T-Max Sequence Length | G-Average Seq Len | #-GPUs | G-GBS | T-GBS | Generation [TP,PP] | Training [TP,CP,EP,PP,VPP] | Tokens/sec/GPU | Total Step Time (s) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GRPO | DeepSeek V3 | 1-step Off | 1,536 | 721 | 512 | 512 | 512 | [16,1] | [1,1,16,16,n/a] | 14.1 | 59.2 |

#### GB200 BF16 Benchmarks

- GRPO Dataset: OpenMathInstruct-2
- System: GB200-NVL72
- Precision: Training BF16, Generation BF16
- Training Backend: Megatron-core

| Algorithm | Model | On/Off Policy | T-Max Sequence Length | G-Average Seq Len | #-GPUs | G-GBS | T-GBS | Generation [TP,PP] | Training [TP,CP,EP,PP,VPP] | Tokens/sec/GPU | Total Step Time (s) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GRPO | DeepSeek V3 | On policy | 1,536 | 711 | 128 | 512 | 512 | [32,1] | [1,1,16,8,n/a] | 30.2 | 108 |
| GRPO | DeepSeek V3 | On policy | 1,536 | 700 | 256 | 512 | 512 | [32,1] | [1,1,16,8,n/a] | 16.4 | 98.7 |
| GRPO | DeepSeek V3 | 1-step Off | 1,536 | 708 | 256 | 512 | 512 | [16,1] | [1,1,16,8,n/a] | 26.7 | 61.7 |
| GRPO | Qwen3-235B | On policy | 8,192 | 5,709 | 64 | 512 | 512 | [8,1] | [2,2,16,4,n/a] | 163 | 286 |
| GRPO | Qwen3-235B | On policy | 8,192 | 5,693 | 128 | 512 | 512 | [8,1] | [2,2,16,4,n/a] | 67.4 | 345 |
| GRPO | Qwen3-235B | 1-step Off | 8,192 | 5,705 | 128 | 512 | 512 | [8,1] | [4,1,16,4,n/a] | 85.5 | 278 |
| GRPO | Qwen3-30B3A | On policy | 4,096 | 3,199 | 16 | 2,048 | 512 | [1,1] | [1,1,16,1,n/a] | 1,910 | 221 |
| GRPO | Qwen3-30B3A | 1-step Off | 4,096 | 3,197 | 16 | 2,048 | 512 | [1,1] | [1,1,16,1,n/a] | 1,406 | 301 |
| SWE | Nemotron-3-Nano-30B-A3B | 1-step Off | 131,072 | 31,599 | 128 | 512 | 512 | [8,1] | [8,8,8,1,n/a] | 37.5 | 430 |

Notes:

- All mixture-of-experts (MoE) model training uses dropless token routing (no tokens are dropped).
- The following metrics are averaged over 5 steps: G-Average Seq Len, Tokens/sec/GPU, and Total Step Time (s). Because of this averaging, the numbers in the tables do not exactly match the equation stated in Performance Metrics above, but the difference is small.
- The pretrained checkpoint for DeepSeek V3 changed (see docs/guides/deepseek.md), which lowers the average sequence length, so reported throughput is not comparable across versions; compare only runs that use equivalent checkpoints. For example, with the newer checkpoint, on-policy DeepSeek V3 GRPO on 128 GPUs achieves 26.1 Tokens/sec/GPU in v0.5.0 versus 30.2 Tokens/sec/GPU in v0.6.0.