Resources Server Implementation

This page covers the Resources Server implementation for the Workplace Assistant environment. The full workflow — task data preparation, agent/model configuration, rollout collection, and training — follows the same steps as the single-step tutorial. What changes here is the scale and complexity of the Resources Server.

← Generating Training Data

Episode Flow

Goal (what the agent is learning)
  - Learn realistic multi-step tool calling workflows (search -> decide -> act) with persistent per-episode state.
Inputs
  - user instruction + tool schemas (company_directory + email/calendar/analytics/...)
  - ground truth calls (or other grading metadata) for verify()
Flow (state is stored per session_id inside the ResourcesServer)
  1) POST ResourcesServer /seed_session
     - initializes toolkits + in-memory data for this session_id
  2) POST ModelServer /v1/responses
     - model emits one or more function_call tool invocations (e.g., email_search_emails, email_reply_email)
  3) POST ResourcesServer /{tool_name}
     - executes the tool against the session's state and returns output/errors
     - agent appends tool outputs back into the conversation
  4) POST ResourcesServer /verify
     - extracts predicted function calls from the response and grades the replayed outcome, returning reward in [0, 1]

Implementation

This Resources Server introduces three patterns not seen in the earlier tutorials:

Dynamic routing — a single /{path} catch-all endpoint dispatches to any tool function, so you don’t need to register each tool individually.
Per-session toolkit initialization — seed_session() creates an independent set of toolkits and data for each episode, so concurrent rollouts don’t interfere.
State-based verification — verify() extracts the agent’s function calls, replays them in a fresh environment alongside the ground truth, and compares the resulting state rather than the exact call sequence.

File (simplified from resources_servers/workplace_assistant/app.py):

1 # simplified
2 from typing import Any, Dict
3 
4 from fastapi import FastAPI, HTTPException, Request
5 from pydantic import BaseModel, ConfigDict, Field
6 
7 from nemo_gym.base_resources_server import (
8     BaseResourcesServerConfig,
9     BaseSeedSessionRequest,
10     BaseSeedSessionResponse,
11     BaseVerifyRequest,
12     BaseVerifyResponse,
13     SimpleResourcesServer,
14 )
15 from nemo_gym.server_utils import SESSION_ID_KEY
16 from resources_servers.workplace_assistant.utils import get_tools, is_correct
17 
18 class WorkbenchResourcesServerConfig(BaseResourcesServerConfig):
19     pass
20 
21 class WorkbenchRequest(BaseModel):
22     model_config = ConfigDict(extra="allow")
23 
24 class WorkbenchResponse(BaseModel):
25     model_config = ConfigDict(extra="allow")
26 
27 class WorkbenchVerifyRequest(BaseVerifyRequest):
28     ground_truth: list[Dict[str, str]] | str
29     id: int
30     category: str
31     environment_name: str
32 
33 class WorkbenchVerifyResponse(BaseVerifyResponse):
34     pass
35 
36 class WorkbenchResourcesServer(SimpleResourcesServer):
37     config: WorkbenchResourcesServerConfig
38     session_id_to_tool_env: Dict[str, Any] = Field(default_factory=dict)
39 
40     def setup_webserver(self) -> FastAPI:
41         app = super().setup_webserver()
42         # Dynamic routing: any path becomes a tool call
43         app.post("/{path}")(self.route_to_python_function)
44         return app
45 
46     async def seed_session(
47         self,
48         request: Request,
49         body: BaseSeedSessionRequest
50     ) -> BaseSeedSessionResponse:
51         session_id = request.session[SESSION_ID_KEY]
52 
53         # Initialize multiple toolkits for this session
54         toolkits = [
55             "email",
56             "calendar",
57             "analytics",
58             "project_management",
59             "customer_relationship_manager",
60         ]
61         self.session_id_to_tool_env[session_id] = get_tools(toolkits)
62         return BaseSeedSessionResponse()
63 
64     # Generic tool router - dispatches to Python functions dynamically
65     async def route_to_python_function(
66         self,
67         path: str,
68         body: WorkbenchRequest,
69         request: Request
70     ) -> WorkbenchResponse:
71         session_id = request.session[SESSION_ID_KEY]
72 
73         if session_id not in self.session_id_to_tool_env:
74             raise HTTPException(
75                 status_code=400,
76                 detail="Session not initialized. Please call seed_session first.",
77             )
78 
79         tool_env = self.session_id_to_tool_env[session_id]
80         args = {k: v for k, v in body.model_dump(exclude_unset=True).items() if v is not None}
81 
82         try:
83             function = tool_env["functions"][path]
84             result = function(**args)
85             return WorkbenchResponse(output=result)
86         except Exception as e:
87             # Return error to model so it can self-correct
88             return WorkbenchResponse(output=f"Error executing tool '{path}': {str(e)}")
89 
90     async def verify(self, body: WorkbenchVerifyRequest) -> WorkbenchVerifyResponse:
91         ground_truth = body.ground_truth
92         response = body.response.output
93 
94         # Extract function calls from response
95         predicted_function_calls = [
96             message.model_dump()
97             for message in response
98             if message.type == "function_call"
99         ]
100 
101         # Compute reward using custom evaluation function
102         total_score = is_correct(predicted_function_calls, ground_truth, None) * 1.0
103         return WorkbenchVerifyResponse(**body.model_dump(), reward=total_score)
104 
105 if __name__ == "__main__":
106     WorkbenchResourcesServer.run_webserver()

Key Pattern

Dynamic routing with /{path} allows the environment to expose an arbitrary number of tools without hardcoding each endpoint. The route_to_python_function method dispatches incoming requests to Python functions in the per-session tool_env["functions"] dictionary.

The /{path} catch-all route must be registered after super().setup_webserver(). The parent method registers /seed_session and /verify — if your catch-all is registered first, it will intercept those requests and break the server lifecycle.

What does get_tools() return?

get_tools(toolkits) initializes a dictionary containing:

"functions": A mapping of tool names (e.g. "email_search_emails") to Python callables
Per-toolkit in-memory data (DataFrames for emails, calendar events, analytics, etc.)

Each session gets its own independent copy of this state, so tool calls in one episode cannot affect another.

What does is_correct() do?

is_correct(predicted_calls, ground_truth, env) performs state-based verification:

Replays the predicted tool calls against a fresh environment
Replays the ground-truth calls against another fresh environment
Compares five specific mutable DataFrames: email._emails, calendar._calendar_events, analytics._plots_data, project_management._project_tasks, and customer_relationship_manager._crm_data (mostly case-insensitive)
Returns 1.0 if all five match, 0.0 otherwise

Note that read-only state (e.g. company_directory) is not compared — only mutable state that tools can modify. Tool execution errors during replay are caught and skipped rather than treated as immediate failures.

This is more flexible than trajectory matching because it rewards correct outcomes regardless of the specific tool call sequence.

Rollout Transcript

[Episode start]
Agent -> ResourcesServer: POST /seed_session
  (ResourcesServer initializes a fresh in-memory "workbench" for this session_id:
   company_directory + email/calendar/analytics/project_management/crm toolkits + their data)
User: "Reply to Carlos's last email about 'Task Update' with 'Thanks, I'll follow up tomorrow.'"
Agent -> ModelServer: POST /v1/responses (many tools available)
Model calls tools to reach the goal (one possible path):
  function_call: email_search_emails({"query": "carlos Task Update"})
Agent -> ResourcesServer: POST /email_search_emails {"query": "carlos Task Update"}
ResourcesServer -> Agent:
  {"output": {"emails": [...], "pagination": {...}}}
Agent -> ModelServer: POST /v1/responses (now includes search results)
Model calls:
  function_call: email_reply_email({"email_id": "00000057", "body": "Thanks, I'll follow up tomorrow."})
Agent -> ResourcesServer: POST /email_reply_email {"email_id": "00000057", "body": "..."}
ResourcesServer -> Agent:
  {"output": "Email replied successfully."}
[Episode end -> grading]
Agent -> ResourcesServer: POST /verify (includes response + ground truth calls for this task)
ResourcesServer:
  - extracts predicted function calls from the response (ignores text output)
  - replays predicted and ground-truth calls, compares final state
  - returns reward 1.0 or 0.0

Verification: Trajectory Matching vs State Matching

There are two common ways to grade tool-using agents:

1. Trajectory Matching (Sequence Matching)

Compare the exact tool call sequence (names + arguments, sometimes order) against a reference trajectory.

Pros: Simple to implement; easy to debug.
Cons: Brittle — penalizes alternative correct paths (different searches, different ordering, equivalent updates).

2. State Matching (Outcome Matching)

Execute the agent’s predicted calls in a fresh sandbox, execute the ground truth calls in another fresh sandbox, then compare the final environment state.

Pros: Rewards correct outcomes even when the path differs; better reflects “did the work get done?”
Cons: Requires you to define what “state” is (tables, files, DB rows, etc.) and how to compare it (case sensitivity, ordering, floating-point tolerance).

What Workplace Assistant Uses

Workplace Assistant uses state matching. Its verify() extracts only the function_call items from the response (text output is ignored for scoring), then calls is_correct(...), which:

Replays predicted calls and ground truth calls separately (fresh tool env each time)
Compares five mutable DataFrames (email, calendar, analytics plots, project management tasks, CRM data) mostly case-insensitive

This choice makes sense because workplace tasks often have multiple valid tool sequences that reach the same correct final state.

← Back to Workplace Assistant