PPO#

Proximal Policy Optimization (PPO) is an actor-critic reinforcement learning algorithm that jointly trains a policy (actor) and a value function (critic). The value function estimates per-token state values, enabling Generalized Advantage Estimation (GAE) for lower-variance advantage signals compared to reward-only baselines.

Key Differences from GRPO#

  • Value Model (Critic): PPO trains a separate value model alongside the policy. GRPO uses a leave-one-out baseline and has no value model.

  • GAE Advantage Estimation: PPO uses temporal-difference bootstrapping via GAE. GRPO normalizes group rewards.

  • Critic Warmup: PPO supports training the value model for a configurable number of steps before starting policy training (policy_training_start_step).

  • VAPO Decoupled GAE: Supports separate lambda parameters for policy advantages and value returns (gae_lambda_policy, gae_lambda_value).

PPO Single Node#

uv run examples/run_ppo.py \
  --config examples/configs/ppo_math_1B_megatron.yaml \
  policy.model_name="Qwen/Qwen2.5-1.5B" \
  cluster.gpus_per_node=8 \
  checkpointing.checkpoint_dir="results/ppo_math" \
  logger.wandb_enabled=True \
  logger.wandb.name="ppo-math"

For Megatron-Core backend:

uv run examples/run_ppo.py \
  --config examples/configs/ppo_math_1B_megatron.yaml \
  policy.model_name="Qwen/Qwen2.5-1.5B" \
  cluster.gpus_per_node=8 \
  checkpointing.checkpoint_dir="results/ppo_megatron" \
  logger.wandb_enabled=True \
  logger.wandb.name="ppo-megatron"

PPO Multi-node#

NUM_ACTOR_NODES=8

COMMAND="uv run ./examples/run_ppo.py \
  --config examples/configs/ppo_math_1B_megatron.yaml \
  cluster.num_nodes=8 \
  cluster.gpus_per_node=8 \
  checkpointing.checkpoint_dir='results/ppo_8nodes' \
  logger.wandb_enabled=True \
  logger.wandb.name='ppo-multinode'" \
CONTAINER=YOUR_CONTAINER \
MOUNTS="$PWD:$PWD" \
sbatch \
    --nodes=${NUM_ACTOR_NODES} \
    --account=YOUR_ACCOUNT \
    --job-name=YOUR_JOBNAME \
    --partition=YOUR_PARTITION \
    --time=4:0:0 \
    --gres=gpu:8 \
    ray.sub

Note

For GB200 systems with 4 GPUs per node, use --gres=gpu:4 instead.

Configuration#

PPO uses two base configurations:

Key PPO-specific parameters:

ppo:
  adv_estimator:
    name: "gae"
    gae_lambda: 0.95
    gae_gamma: 1.0
  ppo_epochs: 4
  policy_training_start_step: 0

value_loss_fn:
  scale: 0.4
  cliprange: 0.2

value:
  model_name: "Qwen/Qwen2.5-1.5B"

Additional Resources#