Aggregate Metrics


After rollout collection, NeMo Gym computes aggregate metrics for each agent by calling the /aggregate_metrics endpoint on the agent server. The results are written to a single _aggregate_metrics.json file.


How It Works

  1. Rollouts complete — ng_collect_rollouts gathers verify responses (reward + custom fields) for every task/rollout pair.
  2. Group by agent — responses are partitioned by agent name.
  3. Call /aggregate_metrics — for each agent, the stripped verify responses are POSTed to the agent’s /aggregate_metrics endpoint.
  4. Compute stats — per-task and overall statistics (mean, max, min, median, std) are computed for every numeric field (see the sketch after this list). If the resources server overrides compute_metrics() or get_key_metrics(), those are called to add additional metrics.
  5. Write results — all per-agent metrics are written to <output>_aggregate_metrics.json.
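
To make step 4 concrete, here is a minimal sketch of the default statistics pass using Python's standard statistics module. The function name and the field-collection details are assumptions for illustration; only the five statistics and the per-field grouping come from the description above.

```python
import statistics

def summarize_numeric_fields(responses: list[dict]) -> dict[str, float]:
    """Illustrative sketch, not NeMo Gym's actual implementation.

    `responses` is a list of verify-response dicts, e.g. [{"reward": 1.0}, ...].
    """
    # Collect values per numeric field across all responses.
    fields: dict[str, list[float]] = {}
    for resp in responses:
        for key, value in resp.items():
            # Exclude bools, which Python treats as ints.
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                fields.setdefault(key, []).append(float(value))

    metrics: dict[str, float] = {}
    for key, values in fields.items():
        metrics[f"mean/{key}"] = statistics.mean(values)
        metrics[f"max/{key}"] = max(values)
        metrics[f"min/{key}"] = min(values)
        metrics[f"median/{key}"] = statistics.median(values)
        # Sample standard deviation (consistent with 0.522 in the example below).
        metrics[f"std/{key}"] = statistics.stdev(values) if len(values) > 1 else 0.0
    return metrics
```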

Output Format

The output file is a JSON array with one entry per agent:

```json
[
  {
    "agent_ref": {"name": "my_agent"},
    "agent_metrics": {
      "mean/reward": 0.75,
      "max/reward": 1.0,
      "min/reward": 0.0,
      "median/reward": 1.0,
      "std/reward": 0.433
    },
    "key_metrics": {
      "mean/reward": 0.75
    },
    "group_level_metrics": [
      {"mean/reward": 1.0, "sample": {"...": "..."}},
      {"mean/reward": 0.5, "sample": {"...": "..."}}
    ]
  }
]
```
| Field | Description |
| --- | --- |
| agent_ref | Agent identity ({"name": "..."}) |
| agent_metrics | Overall stats across all rollouts, plus any custom metrics from compute_metrics() |
| key_metrics | Headline numbers (default: all mean/* entries from agent_metrics) |
| group_level_metrics | Per-task breakdown — one entry per task with stats across that task’s rollouts |
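
A quick way to inspect the file, assuming the <output>_aggregate_metrics.json naming described above (the path here is a placeholder):

```python
import json

# Placeholder path; substitute your run's actual output prefix.
with open("my_run_aggregate_metrics.json") as f:
    per_agent = json.load(f)

# One entry per agent; print each agent's headline numbers.
for entry in per_agent:
    print(entry["agent_ref"]["name"], entry["key_metrics"])
```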

Custom Metrics

Override two hooks on your resources server to add custom metrics.

compute_metrics(tasks)

Receives all verify responses grouped by task. Use this for metrics that need the full dataset — pass@k, confidence intervals, cross-task statistics.
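
The exact shape of tasks isn't spelled out here, but the pass@k example below iterates it as a list with one inner list of verify-response dicts per task. That inferred shape, shown as a sketch:

```python
# Inferred from the example below, not a formal spec:
# `tasks` is a list of tasks; each task is a list of verify-response dicts.
tasks = [
    [{"reward": 1.0}, {"reward": 0.0}],  # task 0: two rollouts
    [{"reward": 1.0}, {"reward": 1.0}],  # task 1: two rollouts
]
```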

get_key_metrics(agent_metrics)

Selects headline numbers from the final agent_metrics dict. Default returns all mean/* entries.
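
Going by that description, the default behaves roughly like this sketch (the actual implementation may differ):

```python
def get_key_metrics(self, agent_metrics):
    # Documented default: keep every headline "mean/*" statistic.
    return {k: v for k, v in agent_metrics.items() if k.startswith("mean/")}
```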

Example: pass@k

```python
from nemo_gym.base_resources_server import (
    BaseVerifyRequest,
    BaseVerifyResponse,
    SimpleResourcesServer,
)

class MathServer(SimpleResourcesServer):
    async def verify(self, body: BaseVerifyRequest) -> BaseVerifyResponse:
        # ... verification logic ...
        pass

    def compute_metrics(self, tasks):
        n_tasks = len(tasks)
        # pass@k: fraction of tasks where at least one rollout got reward=1
        pass_at_k = sum(
            1 for rollouts in tasks if any(r["reward"] >= 1.0 for r in rollouts)
        ) / n_tasks

        # pass@1: average of per-task mean rewards
        pass_at_1 = sum(
            sum(r["reward"] for r in rollouts) / len(rollouts)
            for rollouts in tasks
        ) / n_tasks

        return {"pass@k": pass_at_k, "pass@1": pass_at_1}

    def get_key_metrics(self, agent_metrics):
        return {
            k: agent_metrics[k]
            for k in ("pass@k", "pass@1")
            if k in agent_metrics
        }
```

Given 3 tasks with 4 rollouts each (task 0: all correct, task 1: all wrong, task 2: half correct), tasks 0 and 2 each have at least one correct rollout, so pass@k = 2/3 ≈ 0.667, and pass@1 is the average of the per-task means, (1.0 + 0.0 + 0.5) / 3 = 0.5. The resulting file:

```json
[
  {
    "agent_ref": {"name": "math_simple_agent"},
    "agent_metrics": {
      "mean/reward": 0.5,
      "max/reward": 1.0,
      "min/reward": 0.0,
      "median/reward": 0.5,
      "std/reward": 0.522,
      "pass@k": 0.667,
      "pass@1": 0.5
    },
    "key_metrics": {
      "pass@k": 0.667,
      "pass@1": 0.5
    },
    "group_level_metrics": [
      {
        "mean/reward": 1.0,
        "max/reward": 1.0,
        "min/reward": 1.0,
        "median/reward": 1.0,
        "std/reward": 0.0
      },
      {
        "mean/reward": 0.0,
        "max/reward": 0.0,
        "min/reward": 0.0,
        "median/reward": 0.0,
        "std/reward": 0.0
      },
      {
        "mean/reward": 0.5,
        "max/reward": 1.0,
        "min/reward": 0.0,
        "median/reward": 0.5,
        "std/reward": 0.577
      }
    ]
  }
]
```