Multi-Step Environment

This tutorial focuses on the Resources Server implementation for a multi-step tool-calling environment. The full workflow — task data preparation, agent/model configuration, rollout collection, and training — follows the same steps as the single-step tutorial. What changes here is the complexity of the tools and verification logic.

What You’ll Build

Many real tasks require a model to call tools in sequence, where the output of one call informs the next. For example: look up several pieces of information, then combine them into a final answer. This tutorial shows how to build and verify that kind of environment.

The example environment has two tools: get_synonym_value (look up the numeric value for a word) and extract_synonym_values (submit the collected values). The agent must look up the value for each synonym one by one, then submit the complete list. The reward is 1.0 for an exact match, 0.0 otherwise.
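
Each episode is driven by one JSONL dataset row. A hypothetical complete row is sketched below; the field names mirror the ExampleMultiStepRunRequest model shown later in this tutorial, and the concrete values are illustrative:

```python
import json

# Hypothetical dataset row; field names mirror ExampleMultiStepRunRequest,
# values are illustrative.
row = {
    "id": 0,
    "expected_synonyms": ["Warm", "Blazing"],
    "expected_synonym_values": [407, 711],  # ground truth for grading
    "minefield_label": "Hot",               # optional failure-mode tracking
    "minefield_label_value": 299,
}

# One JSON object per line in the JSONL file.
line = json.dumps(row)
```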

Episode Flow

Goal (what the agent is learning)
- Learn a multi-step tool workflow: call the right tool(s), carry values forward, and submit them in the required format.
- It is *not* "learning ASCII math" as a capability. The ASCII-sum is just a deterministic placeholder tool so we can grade behavior reliably.
Inputs (from one JSONL row)
- expected_synonyms: ["Warm", "Blazing", ...]
- expected_synonym_values: [407, 711, ...] # ground truth for grading
- minefield_label/value: ("Hot", 299) # optional failure-mode tracking
What does synonym_value mean?
- `synonym_value` is the numeric output returned by the tool `/get_synonym_value`.
- In this example implementation, it's computed as the sum of character code points for the synonym string (e.g., "Warm" -> 407).
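
The rule is easy to check by hand; summing code points for the two example synonyms reproduces the values used throughout this page:

```python
def synonym_value(synonym: str) -> int:
    # Sum of character code points, matching the tool's rule.
    return sum(map(ord, synonym))

print(synonym_value("Warm"))     # 407
print(synonym_value("Blazing"))  # 711
```
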
+--------------------------- ResponsesAPIAgent (/run) ---------------------------+
|                                                                                |
| 1) Initialize episode state                                                    |
|    POST ResourcesServer /seed_session                                          |
|                                                                                |
| 2) Interaction loop (repeat up to max_steps)                                   |
|    POST ModelServer /v1/responses                                              |
|    +- if output contains text: keep it in the conversation                     |
|    +- if output contains function_call(name=TOOL, arguments=...):              |
|         POST ResourcesServer /{TOOL}                                           |
|           - /get_synonym_value(synonym="Warm")    -> synonym_value=407         |
|           - /get_synonym_value(synonym="Blazing") -> synonym_value=711         |
|         append tool result back into the conversation                          |
|                                                                                |
|    (agent eventually submits)                                                  |
|    function_call: extract_synonym_values(synonym_values=[407, 711, ...])       |
|                                                                                |
| 3) Grade the rollout (reward)                                                  |
|    POST ResourcesServer /verify                                                |
|      - parse the final extract_synonym_values(...) arguments from the response |
|      - compare to expected_synonym_values                                      |
|      - reward = 1.0 if exact match else 0.0 (plus extra metrics)               |
+--------------------------------------------------------------------------------+

This is the simple_agent loop from the single-step tutorial in action: the model is called, it emits a function_call, the tool executes, the result is appended back into the conversation, and the model is called again with the updated context. This cycle repeats until the model produces a final text response or hits max_steps. In a multi-step environment, this iterative loop is where the agent learns to chain tool calls correctly.
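
The loop can be sketched in a few lines. This is a toy reconstruction, not the actual simple_agent implementation: call_model and call_tool are stand-ins for the POSTs to /v1/responses and /{TOOL}, and the message format is simplified.

```python
def run_episode(call_model, call_tool, messages, max_steps=10):
    # Toy reconstruction of the interaction loop; call_model/call_tool are
    # stand-ins for POST /v1/responses and POST /{TOOL}.
    for _ in range(max_steps):
        output = call_model(messages)          # one model turn
        messages.append(output)                # keep it in the conversation
        if output["type"] != "function_call":  # final text answer: done
            break
        result = call_tool(output["name"], output["arguments"])
        messages.append({"type": "tool_result", "content": result})
    return messages

# Scripted stub model: one lookup, then a final text answer.
turns = iter([
    {"type": "function_call", "name": "get_synonym_value",
     "arguments": {"synonym": "Warm"}},
    {"type": "text", "content": "The value is 407."},
])
transcript = run_episode(
    call_model=lambda messages: next(turns),
    call_tool=lambda name, args: {"synonym_value": sum(map(ord, args["synonym"]))},
    messages=[],
)
```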


Implementation

File (simplified from resources_servers/example_multi_step/app.py, with added defensive guards):

```python
# simplified
import json
from typing import List

from fastapi import FastAPI
from pydantic import BaseModel

from nemo_gym.base_resources_server import (
    BaseResourcesServerConfig,
    BaseRunRequest,
    BaseVerifyRequest,
    BaseVerifyResponse,
    SimpleResourcesServer,
)


class ExampleMultiStepResourcesServerConfig(BaseResourcesServerConfig):
    pass


# Custom request types with task-specific metadata
class ExampleMultiStepRunRequest(BaseRunRequest):
    id: int
    expected_synonym_values: List[int]
    expected_synonyms: List[str]
    minefield_label: str
    minefield_label_value: int


class ExampleMultiStepVerifyRequest(ExampleMultiStepRunRequest, BaseVerifyRequest):
    pass


# Extended verify response with detailed metrics
class ExampleMultiStepVerifyResponse(BaseVerifyResponse):
    parsed_synonym_values: List[int]
    accuracy: bool
    set_overlap: float
    original_term_minefield_hit: bool
    order_instruction_following_failure: bool


# Tool request/response models
class GetSynonymValueRequest(BaseModel):
    synonym: str


class GetSynonymValueResponse(BaseModel):
    synonym_value: int


class ExtractSynonymValuesRequest(BaseModel):
    synonym_values: List[int]


class ExtractSynonymValuesResponse(BaseModel):
    success: bool


class ExampleMultiStepResourcesServer(SimpleResourcesServer):
    config: ExampleMultiStepResourcesServerConfig

    def setup_webserver(self) -> FastAPI:
        app = super().setup_webserver()

        # Register multiple tool endpoints
        app.post("/get_synonym_value")(self.get_synonym_value)
        app.post("/extract_synonym_values")(self.extract_synonym_values)

        return app

    # Tool 1: Get the numeric value for a synonym
    async def get_synonym_value(
        self, body: GetSynonymValueRequest
    ) -> GetSynonymValueResponse:
        # Simple deterministic function: sum of character code points
        return GetSynonymValueResponse(synonym_value=sum(map(ord, body.synonym)))

    # Tool 2: Extract/submit the final answer
    async def extract_synonym_values(
        self, body: ExtractSynonymValuesRequest
    ) -> ExtractSynonymValuesResponse:
        return ExtractSynonymValuesResponse(success=True)

    # THE REWARD FUNCTION - This is where RL magic happens
    async def verify(
        self, body: ExampleMultiStepVerifyRequest
    ) -> ExampleMultiStepVerifyResponse:
        expected = body.expected_synonym_values  # Pulls the ground truth

        # Parse the agent's final answer from its response
        actual = []
        for output in reversed(body.response.output):
            if output.type == "function_call" and output.name == "extract_synonym_values":
                try:
                    actual = json.loads(output.arguments)["synonym_values"]
                except (json.JSONDecodeError, KeyError):
                    actual = []
                break

        # Compute reward based on exact match
        accuracy = expected == actual
        set_overlap = len(set(actual) & set(expected)) / len(expected) if expected else 0.0

        return ExampleMultiStepVerifyResponse(
            **body.model_dump(),
            reward=float(accuracy),  # 1.0 if correct, 0.0 otherwise
            parsed_synonym_values=actual,
            accuracy=accuracy,
            set_overlap=set_overlap,
            original_term_minefield_hit=body.minefield_label in actual
            or body.minefield_label_value in actual,
            order_instruction_following_failure=not accuracy and set_overlap == 1.0,
        )


if __name__ == "__main__":
    ExampleMultiStepResourcesServer.run_webserver()
```
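
The derived metrics in verify() can be sanity-checked in isolation. The expressions below are copied from the listing; the sample values assume the agent returned the right numbers in the wrong order:

```python
expected = [407, 711]
actual = [711, 407]  # correct values, wrong order

accuracy = expected == actual
set_overlap = len(set(actual) & set(expected)) / len(expected) if expected else 0.0
# Full set overlap without an exact match flags an ordering failure.
order_instruction_following_failure = not accuracy and set_overlap == 1.0
```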

Key Insight

The verify() function parses the agent’s tool calls from body.response.output and computes a reward by comparing against ground truth (body.expected_synonym_values). The ground truth fields come from the JSONL dataset row and are passed through the ExampleMultiStepVerifyRequest.

The json.loads(output.arguments) call is wrapped in try/except to handle cases where the model produces malformed JSON. Always guard against unparseable model output in your verify function.
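
The guard can be exercised on its own; the second call below feeds it a deliberately truncated arguments string:

```python
import json

def parse_synonym_values(arguments: str) -> list:
    # Same guard as in verify(): fall back to [] on malformed output.
    try:
        return json.loads(arguments)["synonym_values"]
    except (json.JSONDecodeError, KeyError):
        return []

good = parse_synonym_values('{"synonym_values": [407, 711]}')
bad = parse_synonym_values('{"synonym_values": [407, 71')  # truncated JSON
```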

Multi-step environments where tool outputs depend on earlier calls pair naturally with parallel_tool_calls: false in the JSONL data, which forces the model to call tools sequentially rather than in parallel.
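
Where exactly the flag sits depends on your dataset schema; as a sketch, a hypothetical row fragment carrying it could look like:

```python
import json

# Hypothetical JSONL row fragment: pinning parallel_tool_calls to false
# forces the model to emit tool calls one at a time.
row = {
    "expected_synonyms": ["Warm", "Blazing"],
    "parallel_tool_calls": False,
}
line = json.dumps(row)
```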


Rollout Transcript

[Episode start]

Agent -> ResourcesServer: POST /seed_session
  (environment is initialized for this episode)

User: "For the synonyms ['Warm', 'Blazing'], look up each synonym_value
       and then submit the list."

Agent -> ModelServer: POST /v1/responses
  (tools available: get_synonym_value, extract_synonym_values)
Model decides to call a tool:
  function_call: get_synonym_value({"synonym": "Warm"})

Agent -> ResourcesServer: POST /get_synonym_value {"synonym": "Warm"}
ResourcesServer -> Agent: {"synonym_value": 407}

Agent -> ModelServer: POST /v1/responses (now includes tool output 407)
Model calls the next tool:
  function_call: get_synonym_value({"synonym": "Blazing"})

Agent -> ResourcesServer: POST /get_synonym_value {"synonym": "Blazing"}
ResourcesServer -> Agent: {"synonym_value": 711}

Agent -> ModelServer: POST /v1/responses (now includes tool output 711)
Model submits the final answer via the "submit" tool:
  function_call: extract_synonym_values({"synonym_values": [407, 711]})

[Episode end -> grading]

Agent -> ResourcesServer: POST /verify
  (includes the full response trace + ground truth fields)
ResourcesServer:
  - parses the extract_synonym_values(...) arguments -> actual=[407, 711]
  - compares to expected_synonym_values from the dataset row
  - returns reward: 1.0 if exact match else 0.0