Aggregate Metrics


After rollout collection, NeMo Gym computes aggregate metrics for each agent by calling the /aggregate_metrics endpoint on the agent server. The results are written to a single _aggregate_metrics.json file.


How It Works

  1. Rollouts complete — ng_collect_rollouts gathers verify responses (reward + custom fields) for every task/rollout pair.
  2. Group by agent — responses are partitioned by agent name.
  3. Call /aggregate_metrics — for each agent, the stripped verify responses are POSTed to the agent’s /aggregate_metrics endpoint.
  4. Compute stats — per-task and overall statistics (mean, max, min, median, std) are computed for every numeric field (see the sketch after this list). If the resources server overrides compute_metrics() or get_key_metrics(), those are called to add additional metrics.
  5. Write results — all per-agent metrics are written to <output>_aggregate_metrics.json.
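
To make step 4 concrete, here is a minimal sketch of the default statistics pass using Python's standard statistics module. The function name and the field-collection details are assumptions for illustration; only the five statistics and the per-field grouping come from the description above.

```python
import statistics

def summarize_numeric_fields(responses: list[dict]) -> dict[str, float]:
    """Illustrative sketch, not NeMo Gym's actual implementation.

    `responses` is a list of verify-response dicts, e.g. [{"reward": 1.0}, ...].
    """
    # Collect values per numeric field across all responses.
    fields: dict[str, list[float]] = {}
    for resp in responses:
        for key, value in resp.items():
            # Exclude bools, which Python treats as ints.
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                fields.setdefault(key, []).append(float(value))

    metrics: dict[str, float] = {}
    for key, values in fields.items():
        metrics[f"mean/{key}"] = statistics.mean(values)
        metrics[f"max/{key}"] = max(values)
        metrics[f"min/{key}"] = min(values)
        metrics[f"median/{key}"] = statistics.median(values)
        # Sample standard deviation (consistent with 0.522 in the example below).
        metrics[f"std/{key}"] = statistics.stdev(values) if len(values) > 1 else 0.0
    return metrics
```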

Output Format

The output file is a JSON array with one entry per agent:

```json
[
  {
    "agent_ref": {"name": "my_agent"},
    "agent_metrics": {
      "mean/reward": 0.75,
      "max/reward": 1.0,
      "min/reward": 0.0,
      "median/reward": 1.0,
      "std/reward": 0.433
    },
    "key_metrics": {
      "mean/reward": 0.75
    },
    "group_level_metrics": [
      {"mean/reward": 1.0, "sample": {"...": "..."}},
      {"mean/reward": 0.5, "sample": {"...": "..."}}
    ]
  }
]
```
| Field | Description |
| --- | --- |
| agent_ref | Agent identity ({"name": "..."}) |
| agent_metrics | Overall stats across all rollouts, plus any custom metrics from compute_metrics() |
| key_metrics | Headline numbers (default: all mean/* entries from agent_metrics) |
| group_level_metrics | Per-task breakdown — one entry per task with stats across that task’s rollouts |
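
A quick way to inspect the file, assuming the <output>_aggregate_metrics.json naming described above (the path here is a placeholder):

```python
import json

# Placeholder path; substitute your run's actual output prefix.
with open("my_run_aggregate_metrics.json") as f:
    per_agent = json.load(f)

# One entry per agent; print each agent's headline numbers.
for entry in per_agent:
    print(entry["agent_ref"]["name"], entry["key_metrics"])
```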

Custom Metrics

Override two hooks on your resources server to add custom metrics.

compute_metrics(tasks)

Receives all verify responses grouped by task. Use this for metrics that need the full dataset — pass@k, confidence intervals, cross-task statistics.
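
The exact shape of tasks isn't spelled out here, but the pass@k example below iterates it as a list with one inner list of verify-response dicts per task. That inferred shape, shown as a sketch:

```python
# Inferred from the example below, not a formal spec:
# `tasks` is a list of tasks; each task is a list of verify-response dicts.
tasks = [
    [{"reward": 1.0}, {"reward": 0.0}],  # task 0: two rollouts
    [{"reward": 1.0}, {"reward": 1.0}],  # task 1: two rollouts
]
```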

get_key_metrics(agent_metrics)

Selects headline numbers from the final agent_metrics dict. Default returns all mean/* entries.
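
Going by that description, the default behaves roughly like this sketch (the actual implementation may differ):

```python
def get_key_metrics(self, agent_metrics):
    # Documented default: keep every headline "mean/*" statistic.
    return {k: v for k, v in agent_metrics.items() if k.startswith("mean/")}
```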

Example: pass@k

```python
from nemo_gym.base_resources_server import (
    BaseVerifyRequest,
    BaseVerifyResponse,
    SimpleResourcesServer,
)

class MathServer(SimpleResourcesServer):
    async def verify(self, body: BaseVerifyRequest) -> BaseVerifyResponse:
        # ... verification logic ...
        pass

    def compute_metrics(self, tasks):
        n_tasks = len(tasks)
        # pass@k: fraction of tasks where at least one rollout got reward=1
        pass_at_k = sum(
            1 for rollouts in tasks if any(r["reward"] >= 1.0 for r in rollouts)
        ) / n_tasks

        # pass@1: average of per-task mean rewards
        pass_at_1 = sum(
            sum(r["reward"] for r in rollouts) / len(rollouts)
            for rollouts in tasks
        ) / n_tasks

        return {"pass@k": pass_at_k, "pass@1": pass_at_1}

    def get_key_metrics(self, agent_metrics):
        return {
            k: agent_metrics[k]
            for k in ("pass@k", "pass@1")
            if k in agent_metrics
        }
```

Given 3 tasks with 4 rollouts each (task 0: all correct, task 1: all wrong, task 2: half correct), tasks 0 and 2 each have at least one correct rollout, so pass@k = 2/3 ≈ 0.667, and pass@1 is the average of the per-task means, (1.0 + 0.0 + 0.5) / 3 = 0.5. The resulting file:

```json
[
  {
    "agent_ref": {"name": "math_simple_agent"},
    "agent_metrics": {
      "mean/reward": 0.5,
      "max/reward": 1.0,
      "min/reward": 0.0,
      "median/reward": 0.5,
      "std/reward": 0.522,
      "pass@k": 0.667,
      "pass@1": 0.5
    },
    "key_metrics": {
      "pass@k": 0.667,
      "pass@1": 0.5
    },
    "group_level_metrics": [
      {
        "mean/reward": 1.0,
        "max/reward": 1.0,
        "min/reward": 1.0,
        "median/reward": 1.0,
        "std/reward": 0.0
      },
      {
        "mean/reward": 0.0,
        "max/reward": 0.0,
        "min/reward": 0.0,
        "median/reward": 0.0,
        "std/reward": 0.0
      },
      {
        "mean/reward": 0.5,
        "max/reward": 1.0,
        "min/reward": 0.0,
        "median/reward": 0.5,
        "std/reward": 0.577
      }
    ]
  }
]
```