This page covers the Resources Server implementation for the Workplace Assistant environment. The full workflow — task data preparation, agent/model configuration, rollout collection, and training — follows the same steps as the single-step tutorial. What changes here is the scale and complexity of the Resources Server.
This Resources Server introduces three patterns not seen in the earlier tutorials:
- A /{path} catch-all endpoint dispatches to any tool function, so you don’t need to register each tool individually.
- seed_session() creates an independent set of toolkits and data for each episode, so concurrent rollouts don’t interfere.
- verify() extracts the agent’s function calls, replays them in a fresh environment alongside the ground truth, and compares the resulting state rather than the exact call sequence.

File (simplified from resources_servers/workplace_assistant/app.py):
Dynamic routing with /{path} allows the environment to expose an arbitrary number of tools without hardcoding each endpoint. The route_to_python_function method dispatches incoming requests to Python functions in the per-session tool_env["functions"] dictionary.
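To make the per-session dispatch concrete, here is a minimal sketch that uses plain Python instead of a web framework. It is not the code from app.py: `get_tools_sketch`, `seed_session_sketch`, `route_to_python_function_sketch`, the `session_envs` dictionary, the `SEED_EMAILS` data, and the `email_delete_email` tool are all hypothetical names chosen for illustration. Only the `"functions"` key and the `email_search_emails` tool name come from this page.

```python
import copy
from typing import Any, Dict, List

# Hypothetical seed data shared by every new session.
SEED_EMAILS = [{"id": 1, "subject": "Q3 budget"}]


def get_tools_sketch() -> Dict[str, Any]:
    """Build a fresh tool environment with its own private copy of the data."""
    emails = copy.deepcopy(SEED_EMAILS)

    def search_emails(query: str) -> List[dict]:
        return [e for e in emails if query.lower() in e["subject"].lower()]

    def delete_email(email_id: int) -> str:
        for i, e in enumerate(emails):
            if e["id"] == email_id:
                del emails[i]
                return "deleted"
        return "not found"

    return {
        "functions": {
            "email_search_emails": search_emails,
            "email_delete_email": delete_email,  # hypothetical tool name
        }
    }


# One tool environment per session id (hypothetical storage layout).
session_envs: Dict[str, Dict[str, Any]] = {}


def seed_session_sketch(session_id: str) -> None:
    session_envs[session_id] = get_tools_sketch()


def route_to_python_function_sketch(
    session_id: str, path: str, body: Dict[str, Any]
) -> Dict[str, Any]:
    """Look up the tool named by the URL path in this session's function table."""
    fn = session_envs[session_id]["functions"].get(path)
    if fn is None:
        return {"error": f"unknown tool: {path}"}
    return {"output": fn(**body)}


seed_session_sketch("episode-a")
seed_session_sketch("episode-b")
route_to_python_function_sketch("episode-a", "email_delete_email", {"email_id": 1})
after_a = route_to_python_function_sketch("episode-a", "email_search_emails", {"query": "budget"})
after_b = route_to_python_function_sketch("episode-b", "email_search_emails", {"query": "budget"})
```

In this sketch, deleting the email in `episode-a` leaves `episode-b` untouched because each session closes over its own deep copy of the seed data.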
The /{path} catch-all route must be registered after super().setup_webserver(). The parent method registers /seed_session and /verify — if your catch-all is registered first, it will intercept those requests and break the server lifecycle.
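The ordering rule above can be illustrated with a toy router. This sketch assumes the web framework matches routes in registration order (first match wins), which is what the warning implies; `MiniRouter` and the handler functions are hypothetical stand-ins, not part of the real server.

```python
from typing import Any, Callable, Dict, List, Tuple


class MiniRouter:
    """Toy first-match-wins route table (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self.routes: List[Tuple[str, Callable[..., Any]]] = []

    def add_route(self, pattern: str, handler: Callable[..., Any]) -> None:
        self.routes.append((pattern, handler))

    def handle(self, path: str, body: Dict[str, Any]) -> Any:
        # Routes are tried in the order they were registered.
        for pattern, handler in self.routes:
            if pattern == "/{path}":  # the catch-all matches every path
                return handler(path.lstrip("/"), body)
            if pattern == path:
                return handler(body)
        raise LookupError(path)


def seed_session(body: Dict[str, Any]) -> Dict[str, str]:
    return {"handled_by": "seed_session"}


def verify(body: Dict[str, Any]) -> Dict[str, str]:
    return {"handled_by": "verify"}


def catch_all(tool_name: str, body: Dict[str, Any]) -> Dict[str, str]:
    return {"handled_by": "catch_all", "tool": tool_name}


# Correct order: lifecycle routes first (what the parent setup does), catch-all last.
good = MiniRouter()
good.add_route("/seed_session", seed_session)
good.add_route("/verify", verify)
good.add_route("/{path}", catch_all)

# Wrong order: the catch-all shadows /verify.
bad = MiniRouter()
bad.add_route("/{path}", catch_all)
bad.add_route("/verify", verify)

good_result = good.handle("/verify", {})
bad_result = bad.handle("/verify", {})
```

With the wrong order, a request to `/verify` is treated as a call to a tool named `verify`, so the real verification handler never runs.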
What does get_tools() return?
get_tools(toolkits) initializes a dictionary containing:
- "functions": A mapping of tool names (e.g. "email_search_emails") to Python callables

Each session gets its own independent copy of this state, so tool calls in one episode cannot affect another.
What does is_correct() do?
is_correct(predicted_calls, ground_truth, env) performs state-based verification:
- Compares five pieces of mutable state after replay: email._emails, calendar._calendar_events, analytics._plots_data, project_management._project_tasks, and customer_relationship_manager._crm_data (mostly case-insensitive)
- Returns 1.0 if all five match, 0.0 otherwise

Note that read-only state (e.g. company_directory) is not compared; only mutable state that tools can modify is. Tool execution errors during replay are caught and skipped rather than treated as immediate failures.
This is more flexible than trajectory matching because it rewards correct outcomes regardless of the specific tool call sequence.
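The replay-and-compare idea can be sketched in a few lines. This is a simplified illustration, not the real is_correct(): the toy environment tracks only two state collections instead of five, and `make_env`, `replay`, `normalize`, `is_correct_sketch`, and the tool names `email_send_email` and `calendar_create_event` are hypothetical. The case-insensitive comparison here lowercases every value, which is only an approximation of the "mostly case-insensitive" behavior described above.

```python
from typing import Any, Dict, List

Call = Dict[str, Any]  # {"name": str, "arguments": dict}


def make_env() -> Dict[str, Any]:
    """Create a fresh toy environment with its own mutable state."""
    state: Dict[str, List[dict]] = {"emails": [], "events": []}

    def send_email(recipient: str, subject: str) -> None:
        state["emails"].append({"recipient": recipient, "subject": subject})

    def create_event(title: str) -> None:
        state["events"].append({"title": title})

    return {
        "state": state,
        "functions": {
            "email_send_email": send_email,          # hypothetical tool name
            "calendar_create_event": create_event,   # hypothetical tool name
        },
    }


def replay(calls: List[Call], env: Dict[str, Any]) -> None:
    """Execute calls in order; failed calls are skipped, not fatal."""
    for call in calls:
        try:
            env["functions"][call["name"]](**call["arguments"])
        except Exception:
            continue


def normalize(records: List[dict]) -> List[tuple]:
    """Lowercase every value so the comparison ignores case (record order is kept)."""
    return [tuple(sorted((k, str(v).lower()) for k, v in r.items())) for r in records]


def is_correct_sketch(predicted: List[Call], ground_truth: List[Call]) -> float:
    pred_env, truth_env = make_env(), make_env()
    replay(predicted, pred_env)
    replay(ground_truth, truth_env)
    same = all(
        normalize(pred_env["state"][key]) == normalize(truth_env["state"][key])
        for key in truth_env["state"]
    )
    return 1.0 if same else 0.0


truth = [
    {"name": "email_send_email", "arguments": {"recipient": "sam@example.com", "subject": "Status Update"}},
    {"name": "calendar_create_event", "arguments": {"title": "Sync"}},
]
pred = [
    {"name": "calendar_create_event", "arguments": {"title": "sync"}},
    {"name": "no_such_tool", "arguments": {}},  # raises KeyError, gets skipped
    {"name": "email_send_email", "arguments": {"recipient": "Sam@example.com", "subject": "status update"}},
]
score_match = is_correct_sketch(pred, truth)
score_miss = is_correct_sketch(pred[:1], truth)
```

Here `pred` uses a different call order, different casing, and one invalid call, yet it still reaches the same final state as `truth`; `pred[:1]` never sends the email, so its state differs.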
There are two common ways to grade tool-using agents:
- Trajectory matching: compare the exact tool call sequence (names + arguments, sometimes order) against a reference trajectory.
- State matching: execute the agent’s predicted calls in a fresh sandbox, execute the ground truth calls in another fresh sandbox, then compare the final environment state.
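Workplace Assistant uses state matching. Its verify() extracts only the function_call items from the response (text output is ignored for scoring), then calls is_correct(...), which replays the predicted and ground-truth calls in separate fresh environments and compares the final state.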
This choice makes sense because workplace tasks often have multiple valid tool sequences that reach the same correct final state.
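A short sketch can show the difference between the two grading styles on one response. The shape of the response items (a `type` field, a `name`, and `arguments` as a JSON string) is an assumption made for illustration, as are the helper names and the `project_management_update_task` tool; the real server may represent these differently.

```python
import json
from typing import Any, Dict, List

Call = Dict[str, Any]


def extract_function_calls(output_items: List[Dict[str, Any]]) -> List[Call]:
    """Keep only function_call items; text messages are ignored for scoring."""
    calls = []
    for item in output_items:
        if item.get("type") == "function_call":
            calls.append({"name": item["name"], "arguments": json.loads(item["arguments"])})
    return calls


def trajectory_match(predicted: List[Call], reference: List[Call]) -> bool:
    """Strict grading: same calls, same arguments, same order."""
    return predicted == reference


def final_state(calls: List[Call]) -> Dict[str, str]:
    """Replay calls against a fresh toy task board (task id -> status)."""
    tasks = {"T1": "open", "T2": "open"}
    for call in calls:
        if call["name"] == "project_management_update_task":  # hypothetical tool
            tasks[call["arguments"]["task_id"]] = call["arguments"]["status"]
    return tasks


def state_match(predicted: List[Call], reference: List[Call]) -> bool:
    """Outcome grading: only the final state has to agree."""
    return final_state(predicted) == final_state(reference)


response_output = [
    {"type": "message", "content": "Closing both tasks now."},
    {"type": "function_call", "name": "project_management_update_task",
     "arguments": '{"task_id": "T2", "status": "closed"}'},
    {"type": "function_call", "name": "project_management_update_task",
     "arguments": '{"task_id": "T1", "status": "closed"}'},
]
reference = [
    {"name": "project_management_update_task", "arguments": {"task_id": "T1", "status": "closed"}},
    {"name": "project_management_update_task", "arguments": {"task_id": "T2", "status": "closed"}},
]

predicted = extract_function_calls(response_output)
strict_ok = trajectory_match(predicted, reference)
outcome_ok = state_match(predicted, reference)
```

The agent closes the two tasks in the opposite order from the reference, so strict trajectory matching rejects the response while state matching accepts it.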