On-Policy Corrections
On-Policy Corrections
When using an OpenAI-compatible HTTP server for RL training, fundamental issues arise in multi-step and multi-turn scenarios. This page explains these problems and the corrections required for on-policy training.
Overview
Policy optimization algorithms calculate and backpropagate through a loss calculated using log probabilities (logprobs). When rollout logprobs and token selections differ from those calculated at train time, training becomes off-policy. While algorithms can tolerate small amounts of off-policyness, excessive mismatch typically causes training runs to crash.
This page covers:
- Preliminaries: Understanding HTTP request lifecycles and rollout structure
- Problems: Three causes of train-generation mismatch
- Solutions: Token ID fixes implemented in the generation server
- Data interface: How token IDs cross the boundary between NeMo Gym and the training framework
Preliminaries
HTTP Request Lifecycle
A single OpenAI HTTP request follows this lifecycle, where each step produces a single output:
Rollout Structure
A multi-step or multi-turn rollout makes multiple sequential requests to the model endpoint. For example, a multi-step multi-turn rollout with two turns:
Example: Multi-Turn Rollout Sequence
- [First turn] User message
- Assistant Reasoning message
- Assistant Chat
- Assistant Tool Call
- Tool response
- Reasoning
- Chat
- Tool Call
- Tool
- [First turn, third step] Chat
- [Second turn] User message
- …
Abbreviated notation: U R C TC T R C TC T C U
- U: User message (independent of model)
- T: Tool response (independent of model)
- R, C, TC: Reasoning, Chat, Tool Call (from model endpoint)
Most model endpoints return [R C TC] messages in a single response, so the rollout can be viewed as:
U [R C TC] T [R C TC] T [C] U, where brackets indicate a single model call.
Problems
Three problems cause train-generation log probability mismatch when using an OpenAI-compatible HTTP server.
Problem 1: Re-Tokenization
Cause: Information loss when converting from token IDs (LF5) back to token IDs (LF3) across model calls.
Technical Details: Re-Tokenization
In the previous model call, the model may produce token IDs 1 and 2 which de-tokenize to _Skinny in LF5. Then in LF3 of the next call, _Skinny might re-tokenize to token ID 3.
At generation time, logprobs for tokens following _Skinny are calculated using token IDs 1 and 2. At train time, the same logprobs are calculated using token ID 3, creating a mismatch.
Observed scenarios:
- Merging: Token IDs 1 and 2 re-tokenize to single token ID 3
- Example:
"_Ski" + "nny"→"_Skinny"
- Example:
- Different split: Token IDs 1 and 2 re-tokenize to different token IDs 3 and 4
- Example:
"_Ski" + "nny"→"_Skin" + "ny"
- Example:
Problem 2: Re-Chat Templating
Cause: Information loss when converting from generation string (LF6) back to templated string (LF2) across model calls.
Technical Details: Re-Chat Templating
At LF6, the model may produce token IDs that de-tokenize to:
This converts to an OpenAI tool call object:
At LF2 in the next call, the chat template may render this differently:
The deterministic chat template cannot match the stochastic model output format exactly.
Problem 3: Non-Monotonically Increasing History
Cause: Intentional modifications to rollout history during execution.
Technical Details: History Modification
Developers sometimes modify rollout history:
- Agentic coding harnesses: Summarize or truncate prior history as rollouts grow longer
- Model chat templates: Remove reasoning from input prompt across turns
These changes alter the prompt token IDs the model sees at the current call, differing from the final prompt token IDs used for training.
Solution
Two components address these problems:
On-Policy Token ID Fix
For Problems 1 and 2, implement the on-policy token ID fix in the vLLM OpenAI HTTP server. Refer to the NeMo RL implementation.
Implementation Details: Token ID Fix
Prerequisites:
model_prefix_token_ids: Ground truth prompt token IDs concatenated with generation token IDs from the previous model calltemplate_prefix_token_ids: Re-templated and re-tokenized token IDs up to (not including) the final assistant messagetemplate_token_ids: Re-templated and re-tokenized token IDs for the entire rollout
Assumption: template_prefix_token_ids is a strict prefix of template_token_ids (requires circumventing Problem 3).
Algorithm:
The fix finds the position of the correct EOS token ID in template_token_ids and splices in model_prefix_token_ids.
Example:
- Current request (LF1) contains rollout structure:
U A T A U A T - Variables:
model_prefix_token_ids: Ground truth token IDs forU A T A U Atemplate_prefix_token_ids: Re-templated token IDs forU A T A U Atemplate_token_ids: Re-templated token IDs forU A T A U A T
- Use
template_prefix_token_idsto find the EOS token position corresponding tomodel_prefix_token_ids - Splice
template_token_idsprefix withmodel_prefix_token_ids
Reasoning Truncation Handling
For Problem 3, disable reasoning truncation across turns using the chat template. Handling non-monotonic history during training remains an open research question.
Data Interface
The on-policy token ID fix depends on model_prefix_token_ids — the exact token IDs the model was prompted with and sampled on previous calls. This section describes how those token IDs cross the boundary between NeMo Gym and the training framework, and the single rule that keeps them consistent across turns.
What the model server returns
When NeMo Gym runs against a training-enabled model server (return_token_id_information: true, provided by vllm_model_for_training.yaml), every model response carries three additional fields on its output items:
These fields attach to the final output item of each model call, and generation_token_ids covers that call’s entire generation. Environment-produced items, such as tool-call outputs, carry no token IDs because the model did not generate them. For the field definitions and the configuration flag that enables them, see VLLMModel.
Message-level token IDs are the single source of truth
A multi-step or multi-turn rollout sends the accumulated history back to the model on every call. To stay on-policy (Problems 1 and 2), the framework must reconstruct each new prompt from the token IDs the model actually emitted on prior calls — not from a re-tokenization of their de-tokenized text.
NeMo Gym carries those token IDs forward on the messages themselves: the three fields returned on one call’s output item are sent back, unchanged, on that same message in the next call’s request. The on-policy fix then derives model_prefix_token_ids directly from these message-level fields. This is the entire interface contract — token IDs are produced once, by the model server, and propagated turn-to-turn on the message they belong to.
An agent harness or custom client integrating with the framework is responsible for passing these three fields through unchanged, both on the messages it sends in subsequent requests and on the responses it returns.
Do not construct prompt or prefix token IDs yourself and inject them out of band (for example, as a separate top-level request field). The framework reconstructs the prefix from the message-level token IDs described above; a separately supplied prefix is a second, competing source of truth that can silently diverge from the message history or be dropped as the request is validated and forwarded. Forward the model server’s per-message token IDs unchanged and let the framework build the prefix from them.
For guidance on propagating these fields when building an environment or a custom client, see Integrate External Environments and Create a New Environment.
Related Topics
- Generation Backend And Openai Compatible Http Server - Generation backend requirements
- Gym Integration Footprint And Form Factor - Full integration component breakdown