nemo_rl.algorithms.advantage_estimator#
Advantage Estimators for RL algorithms.
This module provides different advantage estimation strategies:
GRPOAdvantageEstimator: Standard GRPO advantage with leave-one-out baseline
GDPOAdvantageEstimator: Multi-reward GDPO (per-component baselines, sum then normalize)
ReinforcePlusPlusAdvantageEstimator: Reinforce++ with optional baseline subtraction (minus_baseline) and KL penalty in reward
Reference papers:
ProRLv2: https://developer.nvidia.com/blog/scaling-llm-reinforcement-learning-with-prolonged-training-using-prorl-v2/
Reinforce++: https://arxiv.org/abs/2501.03262
Module Contents#
Classes#
| GRPOAdvantageEstimator | GRPO-style advantage estimator with leave-one-out baseline. |
| GDPOAdvantageEstimator | GDPO-style advantage estimator with leave-one-out baseline. |
| ReinforcePlusPlusAdvantageEstimator | Reinforce++ advantage estimator with optional baseline subtraction and KL penalty in reward. |
API#
- class nemo_rl.algorithms.advantage_estimator.GRPOAdvantageEstimator(estimator_config: dict, loss_config: dict)#
GRPO-style advantage estimator with leave-one-out baseline.
Note: GRPO computes advantages over all responses for each prompt.
Initialization
- compute_advantage(prompt_ids, rewards, mask, **kwargs)#
Compute GRPO advantages.
- Parameters:
prompt_ids – Tensor of shape [batch_size] identifying which prompt each sample belongs to.
rewards – Tensor of shape [batch_size] containing reward for each sample.
mask – Response token mask of shape [batch_size, seq_len], 1 for valid response tokens, 0 for padding. Used only for expanding advantages to token-level shape.
**kwargs – Additional arguments (unused).
- Returns:
Advantages tensor of shape [batch_size, seq_len].
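The leave-one-out baseline described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the NeMo RL implementation: the function name `loo_advantages` is hypothetical, nested lists stand in for tensors, any standard-deviation normalization the estimator may apply is omitted, and a prompt with a single response is given a baseline of 0.

```python
from collections import defaultdict


def loo_advantages(prompt_ids, rewards, mask):
    """Hypothetical sketch: leave-one-out advantages broadcast to token level."""
    # Group sample indices by the prompt they belong to.
    groups = defaultdict(list)
    for i, pid in enumerate(prompt_ids):
        groups[pid].append(i)

    seq_adv = [0.0] * len(rewards)
    for idxs in groups.values():
        total = sum(rewards[i] for i in idxs)
        for i in idxs:
            if len(idxs) > 1:
                # Baseline = mean reward of the *other* responses to this prompt.
                baseline = (total - rewards[i]) / (len(idxs) - 1)
            else:
                baseline = 0.0  # assumption: no peers, so no baseline
            seq_adv[i] = rewards[i] - baseline

    # The mask only supplies the [batch_size, seq_len] shape here.
    return [[a] * len(row) for a, row in zip(seq_adv, mask)]


prompt_ids = [0, 0, 0, 1, 1]
rewards = [1.0, 0.0, 2.0, 1.0, 3.0]
mask = [[1, 1, 0]] * 5
advantages = loo_advantages(prompt_ids, rewards, mask)
```

For the first prompt, the response with reward 2.0 is compared against the mean of the other two responses (0.5), so it receives an advantage of 1.5 at every token position. Whether padding positions are zeroed afterwards is not stated in the source, so the sketch leaves them untouched.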
- class nemo_rl.algorithms.advantage_estimator.GDPOAdvantageEstimator(estimator_config: dict, loss_config: dict)#
GDPO-style advantage estimator with leave-one-out baseline.
Note: GDPO computes advantages for each reward separately over all responses for each prompt.
Initialization
- compute_advantage(prompt_ids, rewards, mask, repeated_batch, **kwargs)#
Compute GDPO advantages.
- Parameters:
prompt_ids – Tensor identifying which prompt each sample belongs to (for per-prompt baselines).
rewards – Unused; accepted only for interface consistency with the other estimators.
repeated_batch – Batch containing reward1, reward2, … keys.
mask – Response token mask of shape [batch_size, seq_len], 1 for valid response tokens, 0 for padding.
**kwargs – Additional arguments (unused).
- Returns:
Advantages tensor of shape [batch_size, seq_len].
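The "per-component baselines, sum then normalize" recipe from the module summary can be sketched as follows. This is an illustration under assumptions rather than the library's code: `loo_center` and `gdpo_like_advantages` are hypothetical names, nested lists stand in for tensors, the final normalization is assumed to be a batch-wide mean and standard deviation with a hypothetical `eps`, and a prompt with a single response is given a baseline of 0.

```python
import re
from collections import defaultdict
from statistics import fmean, pstdev


def loo_center(prompt_ids, rewards):
    """Subtract a per-prompt leave-one-out baseline from one reward component."""
    groups = defaultdict(list)
    for i, pid in enumerate(prompt_ids):
        groups[pid].append(i)
    centered = [0.0] * len(rewards)
    for idxs in groups.values():
        total = sum(rewards[i] for i in idxs)
        for i in idxs:
            # assumption: a prompt with a single response gets baseline 0
            base = (total - rewards[i]) / (len(idxs) - 1) if len(idxs) > 1 else 0.0
            centered[i] = rewards[i] - base
    return centered


def gdpo_like_advantages(prompt_ids, repeated_batch, mask, eps=1e-6):
    """Hypothetical sketch: per-component baselines, sum, then normalize."""
    # Pick up reward1, reward2, ... and ignore any other keys in the batch.
    keys = [k for k in repeated_batch if re.fullmatch(r"reward\d+", k)]
    summed = [0.0] * len(prompt_ids)
    for key in keys:
        for i, value in enumerate(loo_center(prompt_ids, repeated_batch[key])):
            summed[i] += value
    # assumption: normalize the summed advantages over the whole batch
    mean, std = fmean(summed), pstdev(summed)
    normed = [(s - mean) / (std + eps) for s in summed]
    # The mask only supplies the [batch_size, seq_len] shape here.
    return [[a] * len(row) for a, row in zip(normed, mask)]


batch = {"reward1": [1.0, 0.0, 1.0, 0.0], "reward2": [0.0, 0.0, 2.0, 0.0]}
advantages = gdpo_like_advantages([0, 0, 1, 1], batch, [[1, 1]] * 4)
```

Centering each component before summing keeps a reward with a large scale from dominating the baseline of the others; the sum is rescaled only once, at the end.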
- class nemo_rl.algorithms.advantage_estimator.ReinforcePlusPlusAdvantageEstimator(estimator_config: dict, loss_config: dict)#
Reinforce++ advantage estimator with optional baseline subtraction and KL penalty in reward.
- Parameters:
minus_baseline – If True, subtract per-prompt mean baseline from rewards.
use_kl_in_reward – If True, apply the KL penalty to the reward instead of adding it to the loss.
Initialization
- compute_advantage(prompt_ids, rewards, mask, logprobs_policy=None, logprobs_reference=None, **kwargs)#
Compute Reinforce++ advantages with optional KL penalty.
- Parameters:
prompt_ids – Tensor of shape [batch_size] identifying which prompt each sample belongs to.
rewards – Tensor of shape [batch_size] containing reward for each sample.
mask – Response token mask of shape [batch_size, seq_len], 1 for valid response tokens, 0 for padding. Used for: (1) expanding advantages to token-level shape, (2) global normalization that only considers valid tokens.
logprobs_policy – Policy log probabilities of shape [batch_size, seq_len], required if use_kl_in_reward.
logprobs_reference – Reference policy log probabilities of shape [batch_size, seq_len], required if use_kl_in_reward.
**kwargs – Additional arguments (unused).
- Returns:
Advantages tensor of shape [batch_size, seq_len], globally normalized across valid tokens.
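The three steps described for this estimator (optional per-prompt mean baseline, optional KL penalty in the reward, global normalization over valid tokens) can be sketched in plain Python. This is a hedged illustration, not the library's implementation: the function name, the `kl_coef` and `eps` values, the per-token penalty form `kl_coef * (logprob_policy - logprob_reference)`, and the zeroing of padding positions in the output are all assumptions, and nested lists stand in for tensors.

```python
from collections import defaultdict
from math import sqrt


def reinforce_pp_like_advantages(
    prompt_ids, rewards, mask,
    logprobs_policy=None, logprobs_reference=None,
    minus_baseline=True, use_kl_in_reward=False,
    kl_coef=0.01, eps=1e-8,  # hypothetical hyperparameters
):
    """Hypothetical sketch of Reinforce++-style advantages on nested lists."""
    seq_adv = [float(r) for r in rewards]
    if minus_baseline:
        # Subtract the mean reward of all responses to the same prompt.
        groups = defaultdict(list)
        for i, pid in enumerate(prompt_ids):
            groups[pid].append(i)
        for idxs in groups.values():
            mean = sum(rewards[i] for i in idxs) / len(idxs)
            for i in idxs:
                seq_adv[i] = rewards[i] - mean

    # Expand sequence-level values to [batch_size, seq_len].
    adv = [[a] * len(row) for a, row in zip(seq_adv, mask)]

    if use_kl_in_reward:
        if logprobs_policy is None or logprobs_reference is None:
            raise ValueError("logprobs are required when use_kl_in_reward is set")
        # assumption: per-token penalty kl_coef * (log pi - log pi_ref)
        for row, lp, lr in zip(adv, logprobs_policy, logprobs_reference):
            for t in range(len(row)):
                row[t] -= kl_coef * (lp[t] - lr[t])

    # Global normalization using statistics from valid tokens only.
    valid = [x for row, m in zip(adv, mask) for x, keep in zip(row, m) if keep]
    mean = sum(valid) / len(valid)
    var = sum((x - mean) ** 2 for x in valid) / len(valid)
    scale = sqrt(var + eps)
    # assumption: padding positions are zeroed in the output
    return [[(x - mean) / scale * keep for x, keep in zip(row, m)]
            for row, m in zip(adv, mask)]


mask = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [1, 1, 0]]
advantages = reinforce_pp_like_advantages([0, 0, 1, 1], [1.0, 0.0, 1.0, 0.0], mask)
```

Because the statistics are taken over valid tokens rather than over sequences, longer responses contribute more to the global mean and variance than shorter ones.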