Resources Server Implementation
This page covers the Resources Server implementation for the Workplace Assistant environment. The full workflow — task data preparation, agent/model configuration, rollout collection, and training — follows the same steps as the single-step tutorial. What changes here is the scale and complexity of the Resources Server.
Implementation
This Resources Server introduces three patterns not seen in the earlier tutorials:
- Dynamic routing: a single /{path} catch-all endpoint dispatches to any tool function, so you don’t need to register each tool individually.
- Per-session toolkit initialization: seed_session() creates an independent set of toolkits and data for each episode, so concurrent rollouts don’t interfere.
- State-based verification: verify() extracts the agent’s function calls, replays them in a fresh environment alongside the ground truth, and compares the resulting state rather than the exact call sequence.
File (simplified from resources_servers/workplace_assistant/app.py):
Key Pattern
Dynamic routing with /{path} allows the environment to expose an arbitrary number of tools without hardcoding each endpoint. The route_to_python_function method dispatches incoming requests to Python functions in the per-session tool_env["functions"] dictionary.
The /{path} catch-all route must be registered after super().setup_webserver(). The parent method registers /seed_session and /verify — if your catch-all is registered first, it will intercept those requests and break the server lifecycle.
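The ordering rule can be illustrated with a small standard-library sketch. This is not the server's actual code and does not use its web framework: TinyRouter, the handler bodies, and the sample tool are hypothetical stand-ins that only mimic first-match routing, where routes are tried in the order they were registered.

```python
import re
from typing import Any, Callable, Dict, List, Tuple

Handler = Callable[..., Dict[str, Any]]


class TinyRouter:
    """Toy first-match router (hypothetical): routes are tried in registration order."""

    def __init__(self) -> None:
        self.routes: List[Tuple[Any, Handler]] = []

    def add(self, pattern: str, handler: Handler) -> None:
        # "{path}" becomes a named group matching a single path segment.
        regex = re.sub(r"\{(\w+)\}", r"(?P<\1>[^/]+)", pattern)
        self.routes.append((re.compile(f"^{regex}$"), handler))

    def handle(self, url: str, body: Dict[str, Any]) -> Dict[str, Any]:
        for regex, handler in self.routes:
            match = regex.match(url)
            if match:
                return handler(body, **match.groupdict())
        return {"error": "not found"}


# Stand-in for the per-session tool_env["functions"] dictionary.
TOOL_ENV = {"functions": {"email_search_emails": lambda query: [f"match for {query}"]}}


def seed_session(body: Dict[str, Any]) -> Dict[str, Any]:
    return {"status": "seeded"}


def verify(body: Dict[str, Any]) -> Dict[str, Any]:
    return {"reward": 1.0}


def route_to_python_function(body: Dict[str, Any], path: str) -> Dict[str, Any]:
    # Look the tool up by the path segment and call it with the request body.
    func = TOOL_ENV["functions"].get(path)
    if func is None:
        return {"error": f"unknown tool: {path}"}
    return {"output": func(**body)}


# Correct order: lifecycle routes first, catch-all last.
correct = TinyRouter()
correct.add("/seed_session", seed_session)
correct.add("/verify", verify)
correct.add("/{path}", route_to_python_function)

# Wrong order: the catch-all shadows /verify.
broken = TinyRouter()
broken.add("/{path}", route_to_python_function)
broken.add("/verify", verify)
```

With the correct order, a request to /verify reaches the verify handler. With the wrong order, the catch-all receives it and looks for a tool named "verify".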
What does get_tools() return?
get_tools(toolkits) initializes a dictionary containing:
- "functions": A mapping of tool names (e.g. "email_search_emails") to Python callables
- Per-toolkit in-memory data (DataFrames for emails, calendar events, analytics, etc.)
Each session gets its own independent copy of this state, so tool calls in one episode cannot affect another.
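A minimal sketch of that isolation, under some assumptions: the real toolkits hold pandas DataFrames, while this version uses a plain list as a stand-in to stay dependency-free. The email_send_email tool, the "email" key, and the function body are hypothetical. Only the get_tools(toolkits) signature, the "functions" key, and the email_search_emails name come from the text above.

```python
from typing import Any, Dict, List


def get_tools(toolkits: List[str]) -> Dict[str, Any]:
    """Build a fresh tool environment; nothing is shared between calls."""
    env: Dict[str, Any] = {"functions": {}}

    if "email" in toolkits:
        # Stand-in for the per-session email DataFrame (assumption).
        emails: List[Dict[str, str]] = []
        env["email"] = emails

        def email_send_email(recipient: str, subject: str) -> str:
            emails.append({"recipient": recipient, "subject": subject})
            return "Email sent successfully."

        def email_search_emails(query: str) -> List[Dict[str, str]]:
            return [e for e in emails if query.lower() in e["subject"].lower()]

        # Tools close over this call's own data, so sessions stay independent.
        env["functions"]["email_send_email"] = email_send_email
        env["functions"]["email_search_emails"] = email_search_emails

    return env


# Two sessions, as seed_session() might create for two concurrent episodes.
session_a = get_tools(["email"])
session_b = get_tools(["email"])
session_a["functions"]["email_send_email"]("sam@example.com", "Quarterly report")
```

After the send in session_a, a search in session_b still returns nothing, because each call to get_tools builds its own data and its own closures.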
What does is_correct() do?
is_correct(predicted_calls, ground_truth, env) performs state-based verification:
- Replays the predicted tool calls against a fresh environment
- Replays the ground-truth calls against another fresh environment
- Compares five specific mutable DataFrames (mostly case-insensitive): email._emails, calendar._calendar_events, analytics._plots_data, project_management._project_tasks, and customer_relationship_manager._crm_data
- Returns 1.0 if all five match, 0.0 otherwise
Note that read-only state (e.g. company_directory) is not compared — only mutable state that tools can modify. Tool execution errors during replay are caught and skipped rather than treated as immediate failures.
This is more flexible than trajectory matching because it rewards correct outcomes regardless of the specific tool call sequence.
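The replay-and-compare steps can be sketched as follows. This is an illustration, not the real is_correct(): it tracks a single hypothetical table instead of five DataFrames, and make_env, project_create_task, and the call format are assumptions made for the example.

```python
from typing import Any, Dict, List

Call = Dict[str, Any]  # assumed shape: {"name": str, "arguments": dict}


def make_env() -> Dict[str, Any]:
    """Hypothetical fresh environment with one mutable table."""
    tasks: List[Dict[str, str]] = []

    def project_create_task(title: str, assignee: str) -> None:
        tasks.append({"title": title, "assignee": assignee})

    return {"functions": {"project_create_task": project_create_task}, "tasks": tasks}


def replay(calls: List[Call]) -> List[Dict[str, str]]:
    """Run the calls in a fresh environment and return the final table."""
    env = make_env()
    for call in calls:
        try:
            env["functions"][call["name"]](**call["arguments"])
        except Exception:
            continue  # a failing call is skipped rather than failing the episode
    return env["tasks"]


def lowered(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Lowercase every value so the comparison ignores case."""
    return [{key: str(value).lower() for key, value in row.items()} for row in rows]


def is_correct_sketch(predicted_calls: List[Call], ground_truth_calls: List[Call]) -> float:
    same = lowered(replay(predicted_calls)) == lowered(replay(ground_truth_calls))
    return 1.0 if same else 0.0


ground_truth = [
    {"name": "project_create_task", "arguments": {"title": "Draft Q3 plan", "assignee": "Sam"}},
]
predicted = [
    {"name": "no_such_tool", "arguments": {}},  # raises during replay, gets skipped
    {"name": "project_create_task", "arguments": {"title": "draft q3 plan", "assignee": "sam"}},
]
reward = is_correct_sketch(predicted, ground_truth)
```

Here the predicted trajectory contains one failing call and differs in letter case, yet the final table matches, so the sketch returns full reward.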
Verification: Trajectory Matching vs State Matching
There are two common ways to grade tool-using agents:
1. Trajectory Matching (Sequence Matching)
Compare the exact tool call sequence (names + arguments, sometimes order) against a reference trajectory.
- Pros: Simple to implement; easy to debug.
- Cons: Brittle — penalizes alternative correct paths (different searches, different ordering, equivalent updates).
2. State Matching (Outcome Matching)
Execute the agent’s predicted calls in a fresh sandbox, execute the ground truth calls in another fresh sandbox, then compare the final environment state.
- Pros: Rewards correct outcomes even when the path differs; better reflects “did the work get done?”
- Cons: Requires you to define what “state” is (tables, files, DB rows, etc.) and how to compare it (case sensitivity, ordering, floating-point tolerance).
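To make the contrast concrete, here is a small sketch that grades the same prediction both ways. The calendar data, the call format, and both grader functions are hypothetical and chosen only to show two updates applied in a different order.

```python
import copy
from typing import Any, Dict, List

Call = Dict[str, Any]  # assumed shape: {"event_id", "field", "new_value"}

# Hypothetical starting state shared by every fresh sandbox.
INITIAL_EVENTS = {"evt-1": {"title": "Sync", "time": "09:00"}}


def run(calls: List[Call]) -> Dict[str, Dict[str, str]]:
    """Apply update calls to a fresh copy of the initial state."""
    events = copy.deepcopy(INITIAL_EVENTS)
    for call in calls:
        events[call["event_id"]][call["field"]] = call["new_value"]
    return events


def trajectory_match(predicted: List[Call], reference: List[Call]) -> float:
    # Exact sequence comparison: same calls, same arguments, same order.
    return 1.0 if predicted == reference else 0.0


def state_match(predicted: List[Call], reference: List[Call]) -> float:
    # Outcome comparison: only the final state matters.
    return 1.0 if run(predicted) == run(reference) else 0.0


reference = [
    {"event_id": "evt-1", "field": "title", "new_value": "Planning"},
    {"event_id": "evt-1", "field": "time", "new_value": "10:00"},
]
predicted = list(reversed(reference))  # same updates, opposite order

trajectory_score = trajectory_match(predicted, reference)
state_score = state_match(predicted, reference)
```

The reordered prediction scores 0.0 under trajectory matching and 1.0 under state matching, since both sequences leave the event with the same title and time.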
What Workplace Assistant Uses
Workplace Assistant uses state matching. Its verify() extracts only the function_call items from the response (text output is ignored for scoring), then calls is_correct(...), which:
- Replays predicted calls and ground truth calls separately (fresh tool env each time)
- Compares five mutable DataFrames (email, calendar, analytics plots, project management tasks, CRM data) using a mostly case-insensitive comparison
This choice makes sense because workplace tasks often have multiple valid tool sequences that reach the same correct final state.
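The extraction step in verify() might look roughly like the sketch below. It assumes each response output item is a dictionary with a "type" field, and that function_call items carry a "name" and a JSON-encoded "arguments" string. That item shape and the extract_function_calls helper are assumptions for illustration, not taken from the source.

```python
import json
from typing import Any, Dict, List


def extract_function_calls(output_items: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only function_call items; text output is ignored for scoring."""
    calls = []
    for item in output_items:
        if item.get("type") != "function_call":
            continue
        calls.append({
            "name": item["name"],
            # Arguments are assumed to arrive as a JSON string.
            "arguments": json.loads(item["arguments"]),
        })
    return calls


output_items = [
    {"type": "message", "content": "I found the email and replied."},
    {"type": "function_call", "name": "email_search_emails",
     "arguments": "{\"query\": \"budget\"}"},
]
predicted_calls = extract_function_calls(output_items)
```

The resulting list of name and argument pairs is the kind of input a state-matching grader would then replay.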