Multi-Step Environment

This tutorial focuses on the Resources Server implementation for a multi-step tool-calling environment. The full workflow — task data preparation, agent/model configuration, rollout collection, and training — follows the same steps as the single-step tutorial. What changes here is the complexity of the tools and verification logic.

What You’ll Build

Many real tasks require a model to call tools in sequence, where the output of one call informs the next. For example: look up several pieces of information, then combine them into a final answer. This tutorial shows how to build and verify that kind of environment.

The example environment has two tools: get_synonym_value (look up the numeric value for a word) and extract_synonym_values (submit the collected values). The agent must look up the value for each synonym one by one, then submit the complete list. The reward is 1.0 for an exact match, 0.0 otherwise.
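
Each episode is driven by one JSONL dataset row. A hypothetical complete row is sketched below; the field names mirror the ExampleMultiStepRunRequest model shown later in this tutorial, and the concrete values are illustrative:

```python
import json

# Hypothetical dataset row; field names mirror ExampleMultiStepRunRequest,
# values are illustrative.
row = {
    "id": 0,
    "expected_synonyms": ["Warm", "Blazing"],
    "expected_synonym_values": [407, 711],  # ground truth for grading
    "minefield_label": "Hot",               # optional failure-mode tracking
    "minefield_label_value": 299,
}

# One JSON object per line in the JSONL file.
line = json.dumps(row)
```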

Episode Flow

Goal (what the agent is learning)
- Learn a multi-step tool workflow: call the right tool(s), carry values forward, and submit them in the required format.
- It is *not* "learning ASCII math" as a capability. The ASCII-sum is just a deterministic placeholder tool so we can grade behavior reliably.
Inputs (from one JSONL row)
- expected_synonyms: ["Warm", "Blazing", ...]
- expected_synonym_values: [407, 711, ...] # ground truth for grading
- minefield_label/value: ("Hot", 299) # optional failure-mode tracking
What does synonym_value mean?
- `synonym_value` is the numeric output returned by the tool `/get_synonym_value`.
- In this example implementation, it's computed as the sum of character code points for the synonym string (e.g., "Warm" -> 407).
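
The rule is easy to check by hand; summing code points for the two example synonyms reproduces the values used throughout this page:

```python
def synonym_value(synonym: str) -> int:
    # Sum of character code points, matching the tool's rule.
    return sum(map(ord, synonym))

print(synonym_value("Warm"))     # 407
print(synonym_value("Blazing"))  # 711
```
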
+--------------------------- ResponsesAPIAgent (/run) ---------------------------+
|                                                                                |
| 1) Initialize episode state                                                    |
|    POST ResourcesServer /seed_session                                          |
|                                                                                |
| 2) Interaction loop (repeat up to max_steps)                                   |
|    POST ModelServer /v1/responses                                              |
|    +- if output contains text: keep it in the conversation                     |
|    +- if output contains function_call(name=TOOL, arguments=...):              |
|         POST ResourcesServer /{TOOL}                                           |
|           - /get_synonym_value(synonym="Warm")    -> synonym_value=407         |
|           - /get_synonym_value(synonym="Blazing") -> synonym_value=711         |
|         append tool result back into the conversation                          |
|                                                                                |
|    (agent eventually submits)                                                  |
|    function_call: extract_synonym_values(synonym_values=[407, 711, ...])       |
|                                                                                |
| 3) Grade the rollout (reward)                                                  |
|    POST ResourcesServer /verify                                                |
|      - parse the final extract_synonym_values(...) arguments from the response |
|      - compare to expected_synonym_values                                      |
|      - reward = 1.0 if exact match else 0.0 (plus extra metrics)               |
+--------------------------------------------------------------------------------+

This is the simple_agent loop from the single-step tutorial in action: the model is called, it emits a function_call, the tool executes, the result is appended back into the conversation, and the model is called again with the updated context. This cycle repeats until the model produces a final text response or hits max_steps. In a multi-step environment, this iterative loop is where the agent learns to chain tool calls correctly.
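
The loop can be sketched in a few lines. This is a toy reconstruction, not the actual simple_agent implementation: call_model and call_tool are stand-ins for the POSTs to /v1/responses and /{TOOL}, and the message format is simplified.

```python
def run_episode(call_model, call_tool, messages, max_steps=10):
    # Toy reconstruction of the interaction loop; call_model/call_tool are
    # stand-ins for POST /v1/responses and POST /{TOOL}.
    for _ in range(max_steps):
        output = call_model(messages)          # one model turn
        messages.append(output)                # keep it in the conversation
        if output["type"] != "function_call":  # final text answer: done
            break
        result = call_tool(output["name"], output["arguments"])
        messages.append({"type": "tool_result", "content": result})
    return messages

# Scripted stub model: one lookup, then a final text answer.
turns = iter([
    {"type": "function_call", "name": "get_synonym_value",
     "arguments": {"synonym": "Warm"}},
    {"type": "text", "content": "The value is 407."},
])
transcript = run_episode(
    call_model=lambda messages: next(turns),
    call_tool=lambda name, args: {"synonym_value": sum(map(ord, args["synonym"]))},
    messages=[],
)
```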


Implementation

File (simplified from resources_servers/example_multi_step/app.py, with added defensive guards):

```python
# simplified
import json
from typing import List

from fastapi import FastAPI
from pydantic import BaseModel

from nemo_gym.base_resources_server import (
    BaseResourcesServerConfig,
    BaseRunRequest,
    BaseVerifyRequest,
    BaseVerifyResponse,
    SimpleResourcesServer,
)


class ExampleMultiStepResourcesServerConfig(BaseResourcesServerConfig):
    pass


# Custom request types with task-specific metadata
class ExampleMultiStepRunRequest(BaseRunRequest):
    id: int
    expected_synonym_values: List[int]
    expected_synonyms: List[str]
    minefield_label: str
    minefield_label_value: int


class ExampleMultiStepVerifyRequest(ExampleMultiStepRunRequest, BaseVerifyRequest):
    pass


# Extended verify response with detailed metrics
class ExampleMultiStepVerifyResponse(BaseVerifyResponse):
    parsed_synonym_values: List[int]
    accuracy: bool
    set_overlap: float
    original_term_minefield_hit: bool
    order_instruction_following_failure: bool


# Tool request/response models
class GetSynonymValueRequest(BaseModel):
    synonym: str


class GetSynonymValueResponse(BaseModel):
    synonym_value: int


class ExtractSynonymValuesRequest(BaseModel):
    synonym_values: List[int]


class ExtractSynonymValuesResponse(BaseModel):
    success: bool


class ExampleMultiStepResourcesServer(SimpleResourcesServer):
    config: ExampleMultiStepResourcesServerConfig

    def setup_webserver(self) -> FastAPI:
        app = super().setup_webserver()

        # Register multiple tool endpoints
        app.post("/get_synonym_value")(self.get_synonym_value)
        app.post("/extract_synonym_values")(self.extract_synonym_values)

        return app

    # Tool 1: Get the numeric value for a synonym
    async def get_synonym_value(
        self, body: GetSynonymValueRequest
    ) -> GetSynonymValueResponse:
        # Simple deterministic function: sum of character code points
        return GetSynonymValueResponse(synonym_value=sum(map(ord, body.synonym)))

    # Tool 2: Extract/submit the final answer
    async def extract_synonym_values(
        self, body: ExtractSynonymValuesRequest
    ) -> ExtractSynonymValuesResponse:
        return ExtractSynonymValuesResponse(success=True)

    # THE REWARD FUNCTION - This is where RL magic happens
    async def verify(
        self, body: ExampleMultiStepVerifyRequest
    ) -> ExampleMultiStepVerifyResponse:
        expected = body.expected_synonym_values  # Pulls the ground truth

        # Parse the agent's final answer from its response
        actual = []
        for output in reversed(body.response.output):
            if output.type == "function_call" and output.name == "extract_synonym_values":
                try:
                    actual = json.loads(output.arguments)["synonym_values"]
                except (json.JSONDecodeError, KeyError):
                    actual = []
                break

        # Compute reward based on exact match
        accuracy = expected == actual
        set_overlap = len(set(actual) & set(expected)) / len(expected) if expected else 0.0

        return ExampleMultiStepVerifyResponse(
            **body.model_dump(),
            reward=float(accuracy),  # 1.0 if correct, 0.0 otherwise
            parsed_synonym_values=actual,
            accuracy=accuracy,
            set_overlap=set_overlap,
            original_term_minefield_hit=body.minefield_label in actual
            or body.minefield_label_value in actual,
            order_instruction_following_failure=not accuracy and set_overlap == 1.0,
        )


if __name__ == "__main__":
    ExampleMultiStepResourcesServer.run_webserver()
```
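
The derived metrics in verify() can be sanity-checked in isolation. The expressions below are copied from the listing; the sample values assume the agent returned the right numbers in the wrong order:

```python
expected = [407, 711]
actual = [711, 407]  # correct values, wrong order

accuracy = expected == actual
set_overlap = len(set(actual) & set(expected)) / len(expected) if expected else 0.0
# Full set overlap without an exact match flags an ordering failure.
order_instruction_following_failure = not accuracy and set_overlap == 1.0
```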

Key Insight

The verify() function parses the agent’s tool calls from body.response.output and computes a reward by comparing against ground truth (body.expected_synonym_values). The ground truth fields come from the JSONL dataset row and are passed through the ExampleMultiStepVerifyRequest.

The json.loads(output.arguments) call is wrapped in try/except to handle cases where the model produces malformed JSON. Always guard against unparseable model output in your verify function.
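
The guard can be exercised on its own; the second call below feeds it a deliberately truncated arguments string:

```python
import json

def parse_synonym_values(arguments: str) -> list:
    # Same guard as in verify(): fall back to [] on malformed output.
    try:
        return json.loads(arguments)["synonym_values"]
    except (json.JSONDecodeError, KeyError):
        return []

good = parse_synonym_values('{"synonym_values": [407, 711]}')
bad = parse_synonym_values('{"synonym_values": [407, 71')  # truncated JSON
```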

Multi-step environments where tool outputs depend on earlier calls pair naturally with parallel_tool_calls: false in the JSONL data, which forces the model to call tools sequentially rather than in parallel.
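
Where exactly the flag sits depends on your dataset schema; as a sketch, a hypothetical row fragment carrying it could look like:

```python
import json

# Hypothetical JSONL row fragment: pinning parallel_tool_calls to false
# forces the model to emit tool calls one at a time.
row = {
    "expected_synonyms": ["Warm", "Blazing"],
    "parallel_tool_calls": False,
}
line = json.dumps(row)
```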


Rollout Transcript

[Episode start]

Agent -> ResourcesServer: POST /seed_session
  (environment is initialized for this episode)

User: "For the synonyms ['Warm', 'Blazing'], look up each synonym_value
       and then submit the list."

Agent -> ModelServer: POST /v1/responses
  (tools available: get_synonym_value, extract_synonym_values)
Model decides to call a tool:
  function_call: get_synonym_value({"synonym": "Warm"})

Agent -> ResourcesServer: POST /get_synonym_value {"synonym": "Warm"}
ResourcesServer -> Agent: {"synonym_value": 407}

Agent -> ModelServer: POST /v1/responses (now includes tool output 407)
Model calls the next tool:
  function_call: get_synonym_value({"synonym": "Blazing"})

Agent -> ResourcesServer: POST /get_synonym_value {"synonym": "Blazing"}
ResourcesServer -> Agent: {"synonym_value": 711}

Agent -> ModelServer: POST /v1/responses (now includes tool output 711)
Model submits the final answer via the "submit" tool:
  function_call: extract_synonym_values({"synonym_values": [407, 711]})

[Episode end -> grading]

Agent -> ResourcesServer: POST /verify
  (includes the full response trace + ground truth fields)
ResourcesServer:
  - parses the extract_synonym_values(...) arguments -> actual=[407, 711]
  - compares to expected_synonym_values from the dataset row
  - returns reward: 1.0 if exact match else 0.0