CISPO#
Clipped Importance Sampling Policy Optimization (CISPO) is a GRPO-family policy-gradient objective that clips the importance-sampling weight as a detached coefficient instead of using PPO-style hard ratio clipping.
For each generated token, CISPO computes the policy ratio
r_t(theta) = pi_theta(o_t | q, o_<t) / pi_old(o_t | q, o_<t)
and uses a clipped, stop-gradient importance weight in the policy loss:
L_CISPO = -A_t * sg(clip(r_t(theta), 1 - eps_low, 1 + eps_high)) * log pi_theta(o_t | q, o_<t)
This keeps gradients flowing through log pi_theta for every token while bounding the scalar importance weight. In contrast, standard GRPO/PPO-style clipping can zero out the gradient contribution for tokens whose ratios leave the clip range.
Configuration#
CISPO uses the same GRPO training path and ClippedPGLossFn as GRPO. Enable it in the loss_fn block:
loss_fn:
use_cispo: true
token_level_loss: true
sequence_level_importance_ratios: false
force_on_policy_ratio: false
ratio_clip_min: 1.0
ratio_clip_max: 5.0
ratio_clip_c: null
ratio_clip_min and ratio_clip_max follow the paper’s additive epsilon convention. The effective clamp range is:
[1 - ratio_clip_min, 1 + ratio_clip_max]
For example, ratio_clip_min: 1.0 and ratio_clip_max: 5.0 clamp ratios to [0, 6]. Since policy ratios are non-negative, this is effectively an upper-only clamp at 6.
When use_importance_sampling_correction: true, the shared GRPO loss path additionally multiplies the CISPO token loss by the actor-vs-generation correction exp(prev_logprobs - generation_logprobs). This correction is separate from CISPO’s clipped pi_theta / pi_old weight.
Async Lag-1 Recipe#
The nightly CISPO recipe validates the objective in a high-off-policy setting with repeated updates per rollout and non-colocated async vLLM generation:
bash tests/test_suites/llm/grpo-cispo-mm1-async-lag1-highoffpolicy-qwen3-30ba3b-3n8g-megatron-cispo.sh
The corresponding config is:
examples/configs/recipes/llm/grpo-cispo-mm1-async-lag1-highoffpolicy-qwen3-30ba3b-3n8g-megatron-cispo.yaml
The recipe uses Qwen/Qwen3-30B-A3B, Megatron policy training, async GRPO with max_trajectory_age_steps: 1, and a separate non-colocated vLLM generation node.