nemo_rl.algorithms.advantage_estimator

Advantage Estimators for RL algorithms.

This module provides different advantage estimation strategies:

  • GRPOAdvantageEstimator: Standard GRPO advantage with leave-one-out baseline

  • GDPOAdvantageEstimator: Multi-reward GDPO (per-component baselines, sum then normalize)

  • ReinforcePlusPlusAdvantageEstimator: Reinforce++ with optional baseline subtraction (minus_baseline) and KL penalty in reward

Reference papers:

  • ProRLv2: https://developer.nvidia.com/blog/scaling-llm-reinforcement-learning-with-prolonged-training-using-prorl-v2/

  • Reinforce++: https://arxiv.org/abs/2501.03262

Module Contents

Classes

GRPOAdvantageEstimator

GRPO-style advantage estimator with leave-one-out baseline.

GDPOAdvantageEstimator

GDPO-style advantage estimator with leave-one-out baseline.

ReinforcePlusPlusAdvantageEstimator

Reinforce++ advantage estimator with optional baseline subtraction and KL penalty in reward.

API

class nemo_rl.algorithms.advantage_estimator.GRPOAdvantageEstimator(estimator_config: dict, loss_config: dict)

GRPO-style advantage estimator with leave-one-out baseline.

Note: GRPO computes advantages over all responses for each prompt.

Initialization

compute_advantage(prompt_ids, rewards, mask, **kwargs)

Compute GRPO advantages.

Parameters:
  • prompt_ids – Tensor of shape [batch_size] identifying which prompt each sample belongs to.

  • rewards – Tensor of shape [batch_size] containing reward for each sample.

  • mask – Response token mask of shape [batch_size, seq_len], 1 for valid response tokens, 0 for padding. Used only for expanding advantages to token-level shape.

  • **kwargs – Additional arguments (unused).

Returns:

Advantages tensor of shape [batch_size, seq_len].
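
The leave-one-out baseline described above can be sketched as follows. This is a minimal illustration, not the library's implementation: plain Python lists stand in for tensors, the function name `leave_one_out_advantages` is hypothetical, and three behaviors are assumptions of the sketch (a prompt with a single response gets a zero baseline, padded positions are zeroed, and no further normalization is applied).

```python
from collections import defaultdict


def leave_one_out_advantages(prompt_ids, rewards, mask):
    """Illustrative only: reward minus the mean reward of the *other*
    responses to the same prompt, expanded to token-level shape."""
    # Group sample indices by the prompt they belong to.
    groups = defaultdict(list)
    for idx, pid in enumerate(prompt_ids):
        groups[pid].append(idx)

    sample_adv = [0.0] * len(rewards)
    for indices in groups.values():
        total = sum(rewards[i] for i in indices)
        n = len(indices)
        for i in indices:
            # Leave-one-out baseline: mean of the other n - 1 rewards.
            # Assumption: a lone response falls back to a zero baseline.
            baseline = (total - rewards[i]) / (n - 1) if n > 1 else 0.0
            sample_adv[i] = rewards[i] - baseline

    # Expand to [batch_size, seq_len]; zeroing padding is an assumption.
    return [[adv * m for m in row] for adv, row in zip(sample_adv, mask)]


prompt_ids = [0, 0, 0, 1, 1]
rewards = [1.0, 0.0, 0.5, 2.0, 4.0]
mask = [[1, 1, 0], [1, 1, 1], [1, 0, 0], [1, 1, 0], [1, 1, 1]]
advantages = leave_one_out_advantages(prompt_ids, rewards, mask)
```

In this toy batch the first response to prompt 0 has reward 1.0 and the other two average 0.25, so its advantage is 0.75 on each valid token.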

class nemo_rl.algorithms.advantage_estimator.GDPOAdvantageEstimator(estimator_config: dict, loss_config: dict)

GDPO-style advantage estimator with leave-one-out baseline.

Note: GDPO computes advantages for each reward separately over all responses for each prompt.

Initialization

compute_advantage(prompt_ids, rewards, mask, repeated_batch, **kwargs)

Compute GDPO advantages.

Parameters:
  • prompt_ids – Tensor identifying which prompt each sample belongs to (for per-prompt baselines).

  • rewards – Unused; accepted only for interface consistency.

  • mask – Response token mask of shape [batch_size, seq_len], 1 for valid response tokens, 0 for padding.

  • repeated_batch – Batch containing the per-component rewards under the keys reward1, reward2, …

  • **kwargs – Additional arguments (unused).

Returns:

Advantages tensor of shape [batch_size, seq_len].
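
The "per-component baselines, sum then normalize" recipe can be sketched as follows. This is an illustration under stated assumptions rather than the library's code: lists stand in for tensors, the helper names `loo_centered` and `multi_reward_advantages` are hypothetical, and the final step (zero mean and unit standard deviation over the whole batch, with a small `eps`) is an assumed form of normalization that the documentation does not specify.

```python
from collections import defaultdict
from statistics import fmean, pstdev


def loo_centered(prompt_ids, values):
    """Subtract a per-prompt leave-one-out mean from each value."""
    groups = defaultdict(list)
    for idx, pid in enumerate(prompt_ids):
        groups[pid].append(idx)
    out = [0.0] * len(values)
    for indices in groups.values():
        total, n = sum(values[i] for i in indices), len(indices)
        for i in indices:
            baseline = (total - values[i]) / (n - 1) if n > 1 else 0.0
            out[i] = values[i] - baseline
    return out


def multi_reward_advantages(prompt_ids, reward_components, mask, eps=1e-6):
    # 1) Center each reward component against its own per-prompt baseline.
    per_component = [loo_centered(prompt_ids, c) for c in reward_components]
    # 2) Sum the centered components for each sample.
    summed = [sum(vals) for vals in zip(*per_component)]
    # 3) Normalize the sum over the batch (assumed: zero mean, unit std).
    mean, std = fmean(summed), pstdev(summed)
    normalized = [(s - mean) / (std + eps) for s in summed]
    # 4) Expand to token level; zeroing padding is an assumption.
    return [[a * m for m in row] for a, row in zip(normalized, mask)]


# Hypothetical batch with two reward components under reward1/reward2 keys.
repeated_batch = {"reward1": [1.0, 0.0, 1.0, 0.0],
                  "reward2": [0.0, 1.0, 1.0, 1.0]}
prompt_ids = [0, 0, 1, 1]
mask = [[1, 1], [1, 0], [1, 1], [1, 1]]
components = [repeated_batch[key] for key in sorted(repeated_batch)]
advantages = multi_reward_advantages(prompt_ids, components, mask)
```

Centering each component before summing keeps a reward with a large scale or offset from dominating the baseline of the others; here the two components cancel for both responses to prompt 0, so those samples end up with zero advantage.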

class nemo_rl.algorithms.advantage_estimator.ReinforcePlusPlusAdvantageEstimator(estimator_config: dict, loss_config: dict)

Reinforce++ advantage estimator with optional baseline subtraction and KL penalty in reward.

Parameters:
  • minus_baseline – If True, subtract per-prompt mean baseline from rewards.

  • use_kl_in_reward – If True, apply the KL penalty to the reward instead of the loss.

Initialization

compute_advantage(prompt_ids, rewards, mask, logprobs_policy=None, logprobs_reference=None, **kwargs)

Compute Reinforce++ advantages with optional KL penalty.

Parameters:
  • prompt_ids – Tensor of shape [batch_size] identifying which prompt each sample belongs to.

  • rewards – Tensor of shape [batch_size] containing reward for each sample.

  • mask – Response token mask of shape [batch_size, seq_len], 1 for valid response tokens, 0 for padding. Used for: (1) expanding advantages to token-level shape, (2) global normalization that only considers valid tokens.

  • logprobs_policy – Policy log probabilities of shape [batch_size, seq_len]. Required if use_kl_in_reward is True.

  • logprobs_reference – Reference policy log probabilities of shape [batch_size, seq_len]. Required if use_kl_in_reward is True.

  • **kwargs – Additional arguments (unused).

Returns:

Advantages tensor of shape [batch_size, seq_len], globally normalized across valid tokens.
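
The three steps described above (optional per-prompt mean baseline, optional KL penalty applied to the reward, global normalization over valid tokens) can be sketched as follows. This is a hedged illustration, not the library's implementation: lists stand in for tensors, the function name `reinforce_pp_advantages` and the `kl_coef` and `eps` parameters are hypothetical, and the per-token KL estimate `logprobs_policy - logprobs_reference` is an assumed choice of estimator.

```python
from collections import defaultdict
from math import sqrt


def reinforce_pp_advantages(prompt_ids, rewards, mask, minus_baseline=True,
                            logprobs_policy=None, logprobs_reference=None,
                            kl_coef=0.0, eps=1e-8):
    # Optionally subtract the per-prompt mean reward.
    sample_adv = list(rewards)
    if minus_baseline:
        groups = defaultdict(list)
        for idx, pid in enumerate(prompt_ids):
            groups[pid].append(idx)
        for indices in groups.values():
            mean = sum(rewards[i] for i in indices) / len(indices)
            for i in indices:
                sample_adv[i] = rewards[i] - mean

    # Expand per-sample values to token-level shape.
    token_adv = [[a] * len(row) for a, row in zip(sample_adv, mask)]

    # Optionally fold a per-token KL penalty into the reward signal.
    # Assumption: KL is estimated as log pi(token) - log pi_ref(token).
    if logprobs_policy is not None and logprobs_reference is not None:
        for b, row in enumerate(token_adv):
            for t in range(len(row)):
                kl = logprobs_policy[b][t] - logprobs_reference[b][t]
                row[t] -= kl_coef * kl

    # Global normalization using statistics of valid tokens only.
    valid = [a for adv_row, mask_row in zip(token_adv, mask)
             for a, m in zip(adv_row, mask_row) if m]
    if not valid:
        return [[0.0] * len(row) for row in mask]
    mean = sum(valid) / len(valid)
    var = sum((v - mean) ** 2 for v in valid) / len(valid)
    scale = 1.0 / sqrt(var + eps)
    return [[(a - mean) * scale * m for a, m in zip(adv_row, mask_row)]
            for adv_row, mask_row in zip(token_adv, mask)]


prompt_ids = [0, 0, 1, 1]
rewards = [1.0, 0.0, 3.0, 1.0]
mask = [[1, 1, 0], [1, 1, 1], [1, 0, 0], [1, 1, 1]]
advantages = reinforce_pp_advantages(prompt_ids, rewards, mask)
```

Because the mean and variance are computed over valid tokens only, longer responses contribute more terms to the statistics, and padded positions neither shift the mean nor receive a nonzero advantage.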