On-Policy Corrections
When using an OpenAI-compatible HTTP server for RL training, fundamental issues arise in multi-step and multi-turn scenarios. This page explains these problems and the corrections required for on-policy training.
Policy optimization algorithms compute a loss from token log probabilities (logprobs) and backpropagate through it. When the logprobs and token selections seen during rollout differ from those computed at train time, training becomes off-policy. Algorithms can tolerate a small amount of off-policyness, but excessive mismatch typically causes training runs to crash.
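One common way to quantify this mismatch is the per-token probability ratio between train-time and rollout-time logprobs. The sketch below is illustrative only: the function name `importance_ratios` and the logprob values are hypothetical, and a real training loop would take these numbers from the inference server and the training forward pass.

```python
import math


def importance_ratios(rollout_logprobs, train_logprobs):
    """Per-token ratio p_train(token) / p_rollout(token).

    A ratio of 1.0 everywhere means the two passes agree (on-policy).
    """
    if len(rollout_logprobs) != len(train_logprobs):
        # Different lengths mean the token sequences themselves differ.
        raise ValueError("rollout and train token sequences differ in length")
    return [math.exp(t - r) for r, t in zip(rollout_logprobs, train_logprobs)]


# Hypothetical logprobs for a three-token generation.
rollout = [-0.10, -1.20, -0.35]

# Train-time logprobs identical to rollout: every ratio is exactly 1.0.
on_policy = importance_ratios(rollout, rollout)

# Train-time logprob of the second token differs: its ratio moves away from 1.0.
off_policy = importance_ratios(rollout, [-0.10, -2.50, -0.35])
```

Ratios far from 1.0 on many tokens are the symptom that the problems described on this page tend to produce.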
This page covers:
A single OpenAI HTTP request follows this lifecycle, where each step produces a single output:
A multi-step or multi-turn rollout makes multiple sequential requests to the model endpoint. For example, a multi-step multi-turn rollout with two turns:
Abbreviated notation: U R C TC T R C TC T C U
Most model endpoints return [R C TC] messages in a single response, so the rollout can be viewed as:
U [R C TC] T [R C TC] T [C] U, where brackets indicate a single model call.
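The bracketed view can be derived mechanically from the abbreviated notation. The sketch below assumes only what the text states, namely that R, C, and TC items are produced by the model and that adjacent model-produced items belong to one call; the helper name `group_model_calls` is hypothetical.

```python
# Items produced by the model endpoint (assumption taken from the notation above).
MODEL_ITEMS = {"R", "C", "TC"}


def group_model_calls(notation: str) -> str:
    """Wrap each run of consecutive model-produced items in brackets."""
    out, current = [], []
    for item in notation.split():
        if item in MODEL_ITEMS:
            current.append(item)
            continue
        if current:
            # A non-model item closes the current model call.
            out.append("[" + " ".join(current) + "]")
            current = []
        out.append(item)
    if current:
        out.append("[" + " ".join(current) + "]")
    return " ".join(out)


grouped = group_model_calls("U R C TC T R C TC T C U")
```

Note that this grouping would merge two back-to-back model calls with nothing in between, so it is only a reading aid for rollouts shaped like the one above.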
Three problems cause train-generation log probability mismatch when using an OpenAI-compatible HTTP server.
Cause: Information loss when token IDs are de-tokenized to text (LF5) and that text is re-tokenized into token IDs (LF3) in the next model call.
In the previous model call, the model may produce token IDs 1 and 2 which de-tokenize to _Skinny in LF5. Then in LF3 of the next call, _Skinny might re-tokenize to token ID 3.
At generation time, logprobs for tokens following _Skinny are calculated using token IDs 1 and 2. At train time, the same logprobs are calculated using token ID 3, creating a mismatch.
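The round trip can be reproduced with a toy tokenizer. This is a minimal sketch with a hypothetical three-entry vocabulary and a greedy longest-match encoder; real tokenizers are more involved, but the failure mode is the same: decoding and then encoding does not always return the original token IDs.

```python
# Hypothetical vocabulary mirroring the token IDs used in the text above.
VOCAB = {"_Ski": 1, "nny": 2, "_Skinny": 3}
ID_TO_PIECE = {token_id: piece for piece, token_id in VOCAB.items()}


def decode(token_ids):
    """De-tokenize: concatenate the string piece of each token ID."""
    return "".join(ID_TO_PIECE[i] for i in token_ids)


def encode(text):
    """Tokenize greedily, always taking the longest matching piece."""
    token_ids, pos = [], 0
    while pos < len(text):
        candidates = [p for p in VOCAB if text.startswith(p, pos)]
        if not candidates:
            raise ValueError(f"cannot tokenize text at position {pos}")
        piece = max(candidates, key=len)
        token_ids.append(VOCAB[piece])
        pos += len(piece)
    return token_ids


generated_ids = [1, 2]                 # what the model actually sampled
text = decode(generated_ids)           # "_Skinny"
retokenized_ids = encode(text)         # what the next call's prompt contains
```

Here `generated_ids` and `retokenized_ids` spell the same string but are different sequences, so logprobs conditioned on one do not match logprobs conditioned on the other.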
Observed scenarios:
"_Ski" + "nny" → "_Skinny""_Ski" + "nny" → "_Skin" + "ny"Cause: Information loss when converting from generation string (LF6) back to templated string (LF2) across model calls.
At LF6, the model may produce token IDs that de-tokenize to:
This converts to an OpenAI tool call object:
At LF2 in the next call, the chat template may render this differently:
The deterministic chat template cannot match the stochastic model output format exactly.
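A small illustration of this effect, using only the standard `json` module. The `<tool_call>` wrapper, the `get_weather` tool, and the `render_tool_call` function are hypothetical stand-ins for a model's output format and a chat template; they are not taken from any particular model or template.

```python
import json

OPEN, CLOSE = "<tool_call>", "</tool_call>"

# Hypothetical string the model sampled: compact JSON, no newlines.
generated = OPEN + '{"name":"get_weather","arguments":{"city":"Paris"}}' + CLOSE


def parse_tool_call(text):
    """Extract the structured tool call, discarding the original formatting."""
    return json.loads(text[len(OPEN):-len(CLOSE)])


def render_tool_call(call):
    """Stand-in for a chat template: one fixed, deterministic layout."""
    return OPEN + "\n" + json.dumps(call) + "\n" + CLOSE


tool_call = parse_tool_call(generated)   # structured object sent to the client
rendered = render_tool_call(tool_call)   # string the next call is built from

same_meaning = parse_tool_call(rendered) == tool_call
same_string = rendered == generated
```

The structured content survives the round trip, but the whitespace the model chose does not, so the re-templated string tokenizes differently from what was generated.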
Cause: Intentional modifications to rollout history during execution.
Developers sometimes modify rollout history:
These changes alter the prompt token IDs the model sees at the current call, differing from the final prompt token IDs used for training.
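One way to detect this situation is to check that each call's prompt begins with the previous call's prompt plus generation. The sketch below is a hypothetical check, not part of any library; it assumes each model call is recorded as a `(prompt_token_ids, generation_token_ids)` pair of lists.

```python
def first_non_monotonic_call(calls):
    """Return the index of the first call whose prompt does not extend
    the previous call's prompt + generation, or None if history is intact."""
    for i in range(1, len(calls)):
        prev_prompt, prev_generation = calls[i - 1]
        expected_prefix = prev_prompt + prev_generation
        prompt, _ = calls[i]
        if prompt[: len(expected_prefix)] != expected_prefix:
            return i
    return None


# History kept intact: the second prompt extends everything seen so far.
intact = [
    ([1, 2], [3, 4, 5]),
    ([1, 2, 3, 4, 5, 6], [7]),
]

# History modified: tokens 3 and 4 were dropped before the second call.
modified = [
    ([1, 2], [3, 4, 5]),
    ([1, 2, 5, 6], [7]),
]

intact_result = first_non_monotonic_call(intact)
modified_result = first_non_monotonic_call(modified)
```

A non-`None` result means the token IDs used for training cannot be a simple concatenation of what the model saw at each call.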
Two components address these problems:
For Problems 1 and 2, implement the on-policy token ID fix in the vLLM OpenAI HTTP server. Refer to the NeMo RL implementation.
Prerequisites:
model_prefix_token_ids: Ground truth prompt token IDs concatenated with generation token IDs from the previous model call
template_prefix_token_ids: Re-templated and re-tokenized token IDs up to (not including) the final assistant message
template_token_ids: Re-templated and re-tokenized token IDs for the entire rollout
Assumption: template_prefix_token_ids is a strict prefix of template_token_ids (requires circumventing Problem 3).
Algorithm:
The fix finds the position of the correct EOS token ID in template_token_ids and splices in model_prefix_token_ids.
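A minimal sketch of this splice follows. It is written from the description on this page rather than copied from the NeMo RL code, so the function name `splice_model_prefix`, the explicit `eos_token_id` parameter, and the handling of a trailing EOS in `model_prefix_token_ids` are assumptions.

```python
def splice_model_prefix(
    model_prefix_token_ids,
    template_prefix_token_ids,
    template_token_ids,
    eos_token_id,
):
    """Replace the re-tokenized prefix with the ground truth token IDs."""
    if not model_prefix_token_ids:
        # First model call: there is no earlier generation to preserve.
        return list(template_token_ids)

    n = len(template_prefix_token_ids)
    if template_token_ids[:n] != template_prefix_token_ids:
        raise ValueError("template_prefix_token_ids must prefix template_token_ids")

    # Assumption: the template always emits an EOS after the assistant turn,
    # so a trailing EOS in the ground truth prefix is dropped to avoid a duplicate.
    model_cut_end = len(model_prefix_token_ids)
    if model_prefix_token_ids[-1] == eos_token_id:
        model_cut_end -= 1

    # Find the last EOS inside the template prefix; everything from there on
    # is kept from the re-templated sequence.
    template_cut_start = -1
    for pos in reversed(range(n)):
        if template_token_ids[pos] == eos_token_id:
            template_cut_start = pos
            break
    if template_cut_start < 0:
        raise ValueError("no EOS token found in template_prefix_token_ids")

    return model_prefix_token_ids[:model_cut_end] + template_token_ids[template_cut_start:]


EOS = 0
# Ground truth used IDs 1 and 2 where re-tokenization produced ID 3.
spliced = splice_model_prefix(
    model_prefix_token_ids=[10, 1, 2, EOS],
    template_prefix_token_ids=[10, 3, EOS],
    template_token_ids=[10, 3, EOS, 20, 21],
    eos_token_id=EOS,
)
```

In this toy input the result keeps the ground truth IDs `1, 2` in place of the re-tokenized ID `3`, followed by the EOS and the new tokens from the template.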
Example:
Rollout structure: U A T A U A T
model_prefix_token_ids: Ground truth token IDs for U A T A U A
template_prefix_token_ids: Re-templated token IDs for U A T A U A
template_token_ids: Re-templated token IDs for U A T A U A T
Use template_prefix_token_ids to find the EOS token position corresponding to model_prefix_token_ids
Replace the template_token_ids prefix with model_prefix_token_ids
For Problem 3, disable reasoning truncation across turns using the chat template. Handling non-monotonic history during training remains an open research question.