Multi-Step Environment
Multi-Step Environment
This tutorial focuses on the Resources Server implementation for a multi-step tool-calling environment. The full workflow — task data preparation, agent/model configuration, rollout collection, and training — follows the same steps as the single-step tutorial. What changes here is the complexity of the tools and verification logic.
← Single-Step EnvironmentWhat You’ll Build
Many real tasks require a model to call tools in sequence, where the output of one call informs the next. For example: look up several pieces of information, then combine them into a final answer. This tutorial shows how to build and verify that kind of environment.
The example environment has two tools: get_synonym_value (lookup a numeric value for a word) and extract_synonym_values (submit the collected values). The agent must look up values for each synonym one by one, then submit the complete list. Reward is 1.0 for exact match, 0.0 otherwise.
Episode Flow
This is the simple_agent loop from the single-step tutorial in action: the model is called, it emits a function_call, the tool executes, the result is appended back into the conversation, and the model is called again with the updated context. This cycle repeats until the model produces a final text response or hits max_steps. In a multi-step environment, this iterative loop is where the agent learns to chain tool calls correctly.
Implementation
File (simplified from resources_servers/example_multi_step/app.py, with added defensive guards):
Key Insight
The verify() function parses the agent’s tool calls from body.response.output and computes a reward by comparing against ground truth (body.expected_synonym_values). The ground truth fields come from the JSONL dataset row and are passed through the ExampleMultiStepVerifyRequest.
The json.loads(output.arguments) call is wrapped in try/except to handle cases where the model produces malformed JSON. Always guard against unparseable model output in your verify function.
Multi-step environments where tool outputs depend on earlier calls pair naturally with parallel_tool_calls: false in the JSONL data, which forces the model to call tools sequentially rather than in parallel.
Rollout Transcript
Stateful Environment →