Performance#
As part of the NVIDIA NeMo Framework, NeMo RL provides optimal performance for reinforcement learning on generative AI models by incorporating the latest optimizations, such as refit optimizations, mixed-precision training, and off-policy training.
This page provides performance benchmarks for LLMs and VLMs using NeMo RL across different GPU systems and configurations.
Nomenclature#
GBS: Global Batch Size
MBS: Micro Batch Size
TP: Tensor Parallel Size
PP: Pipeline Parallel Size
CP: Context Parallel Size
VPP: Virtual Pipeline Parallel Size
EP: Expert Parallel Size
T-: Training related
G-: Generation related
Training backend: NeMo RL has two training backends: Megatron and PyTorch DTensor. This performance summary currently shows numbers only from the Megatron backend. The tables below encode parallelism settings as bracketed tuples in this nomenclature; a small decoding sketch follows this list.
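The sketch below is illustrative plain Python (not part of NeMo RL); it decodes a training-parallelism cell from the tables into named fields, with the field order taken from the table header.

```python
# Illustrative helper (not part of NeMo RL): decode a bracketed
# training-parallelism tuple from the tables below into named fields.
# Field order follows the table header: [TP,CP,EP,PP,VPP].

def decode_training_parallelism(cell: str) -> dict[str, str]:
    fields = ["TP", "CP", "EP", "PP", "VPP"]
    values = [v.strip() for v in cell.strip("[]").split(",")]
    return dict(zip(fields, values))

print(decode_training_parallelism("[2,2,16,8,n/a]"))
# {'TP': '2', 'CP': '2', 'EP': '16', 'PP': '8', 'VPP': 'n/a'}
```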
Performance Metrics#
Since reinforcement learning consists of training, generation, and the transitions between the two, performance measurement reflects all of these stages. Specifically, we track the following metrics:
Step time: Time for each step, which includes training, generation, policy logprobs, and refit time.
Tokens/sec/GPU: The rate at which tokens are processed by a stage (such as training, generation, or refitting) on a single GPU (a worked sketch follows this list):
\[ \text{Tokens/sec/GPU} = \frac{\text{Total Tokens Processed}}{\text{Time for Stage} \times \text{Number of GPUs}} \]
Training MFU: Model FLOPs Utilization during training
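To make the Tokens/sec/GPU definition concrete, here is a minimal sketch in plain Python. The stage numbers are made-up placeholders, not benchmark data.

```python
# Minimal sketch of the Tokens/sec/GPU metric defined above.
# All numbers are made-up placeholders, not benchmark results.

def tokens_per_sec_per_gpu(total_tokens: int, stage_time_s: float, num_gpus: int) -> float:
    """Tokens/sec/GPU = Total Tokens Processed / (Time for Stage x Number of GPUs)."""
    return total_tokens / (stage_time_s * num_gpus)

# Hypothetical generation stage: 2,048 sequences averaging 1,000 tokens,
# processed in 80 s on 16 GPUs.
total_tokens = 2_048 * 1_000
print(f"{tokens_per_sec_per_gpu(total_tokens, 80.0, 16):,.0f} tokens/sec/GPU")  # 1,600
```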
Performance Summary for Large Language Models#
Below are performance benchmarks for various large language models organized by release version. These results were obtained using performance recipes available here.
The performance data includes:
RL Performance: Performance metrics for various model sizes and architectures on different RL algorithms (GRPO today; DAPO and PPO in the future), covering both on-policy and asynchronous (off-policy) runs.
System Configurations: Results across different GPU systems (DGX-H100 today; DGX-GB200 and DGX-B200 in the future).
Precision Options: Performance comparisons between different precision modes (BF16, FP8)
NeMo RL v0.4#
GRPO Dataset: OpenMathInstruct-2
System: DGX-H100
Precision: Training BF16, Generation BF16
Training Backend: Megatron-core.
| Model | On/Off policy | T-Max Sequence Length | G-Average Seq len | #-GPUs | G-GBS | T-GBS | Generation [TP,PP] | Training [TP,CP,EP,PP,VPP] | Tokens/sec/GPU | Total Step time (s) |
|---|---|---|---|---|---|---|---|---|---|---|
| LLAMA3.1_8B | On policy | 4,096 | 1,060 | 16 | 2,048 | 512 | [1,1] | [1,1,1,1,1,2,n/a] | 1,562 | 97.7 |
| LLAMA3.1_8B | 1-step Off | 4,096 | 1,129 | 16 | 2,048 | 512 | [1,1] | [1,1,1,1,1,2,n/a] | 2,161 | 74.6 |
| DeepSeek V3 | On policy | 1,536 | 745 | 256 | 512 | 512 | [32,1] | [1,1,16,16,n/a] | 11 | 154 |
| DeepSeek V3 | 1-step Off | 1,536 | 744 | 512 | 512 | 512 | [32,1] | [1,1,16,16,n/a] | 11.0 | 77.9 |
| Qwen3-235B | On policy | 8,192 | 5,671 | 128 | 512 | 512 | [16,1] | [2,2,16,8,n/a] | 45.7 | 506 |
| Qwen3-235B | 1-step Off | 8,192 | 5,691 | 256 | 512 | 512 | [8,1] | [4,1,16,8,n/a] | 52.2 | 241 |
| Qwen3-30B3A | On policy | 4,096 | 3,154 | 32 | 2,048 | 512 | [4,1] | [2,1,8,1,n/a] | 925 | 225 |
| Qwen3-30B3A | 1-step Off | 4,096 | 3,158 | 32 | 2,048 | 512 | [4,1] | [2,1,8,1,n/a] | 864 | 244 |
| Qwen3-32B | On policy | 4,096 | 3,206 | 32 | 2,048 | 512 | [4,1] | [4,1,1,4,n/a] | 540 | 393 |
| Qwen3-32B | 1-step Off | 4,096 | 3,207 | 64 | 2,048 | 512 | [4,1] | [4,1,1,4,n/a] | 494 | 215 |
Note:
All Mixture-of-Experts (MoE) model training uses dropless token routing (no tokens are dropped).
The following metrics are averaged over 5 steps: G-Average Seq len, Tokens/sec/GPU, and Total Step time (s). Because of this averaging, the numbers in the table do not exactly match the equation stated in Performance Metrics above, but the difference is small; the sketch below illustrates why.
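As a small illustration of the averaging note (plain Python, with made-up numbers), the mean of the per-step Tokens/sec/GPU values generally differs slightly from Tokens/sec/GPU recomputed from the averaged columns:

```python
# Illustrative only: why averaging per-step metrics over 5 steps does not
# exactly reproduce Tokens/sec/GPU recomputed from the averaged columns.
# All numbers are made up, not benchmark data.

step_tokens = [2.1e6, 1.9e6, 2.0e6, 2.2e6, 1.8e6]  # tokens processed per step
step_times = [100.0, 90.0, 95.0, 110.0, 85.0]      # seconds per step
num_gpus = 16

# What the table reports: the mean of per-step Tokens/sec/GPU values.
per_step = [t / (s * num_gpus) for t, s in zip(step_tokens, step_times)]
mean_of_ratios = sum(per_step) / len(per_step)

# What the equation gives when applied to the averaged columns.
avg_tokens = sum(step_tokens) / len(step_tokens)
avg_time = sum(step_times) / len(step_times)
ratio_of_means = avg_tokens / (avg_time * num_gpus)

print(f"mean of per-step ratios: {mean_of_ratios:.1f}")  # ~1304.3
print(f"equation on averages:    {ratio_of_means:.1f}")  # ~1302.1 (close, not equal)
```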