Aggregate Metrics
After rollout collection, NeMo Gym computes aggregate metrics for each agent by calling the /aggregate_metrics endpoint on the agent server. The results for all agents are written to a single <output>_aggregate_metrics.json file.
How It Works
- Rollouts complete: ng_collect_rollouts gathers verify responses (reward + custom fields) for every task/rollout pair.
- Group by agent: responses are partitioned by agent name.
- Call /aggregate_metrics: for each agent, the stripped verify responses are POSTed to the agent’s /aggregate_metrics endpoint.
- Compute stats: per-task and overall statistics (mean, max, min, median, std) are computed for every numeric field. If the resources server overrides compute_metrics() or get_key_metrics(), those are called to add additional metrics.
- Write results: all per-agent metrics are written to <output>_aggregate_metrics.json.
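The grouping and statistics steps above can be illustrated with a small standalone sketch. It is not NeMo Gym's implementation: the `agent` key, the function names, the `max/`, `min/`, `median/`, and `std/` key prefixes, and the choice of population standard deviation are all assumptions made for illustration. Only the `mean/` prefix is taken from this page.

```python
from collections import defaultdict
from statistics import mean, median, pstdev


def group_by_agent(responses):
    """Partition verify responses by agent name (hypothetical 'agent' key)."""
    groups = defaultdict(list)
    for resp in responses:
        groups[resp["agent"]].append(resp)
    return dict(groups)


def numeric_stats(responses):
    """Compute mean/max/min/median/std for every numeric field."""
    columns = defaultdict(list)
    for resp in responses:
        for key, value in resp.items():
            # bool is a subclass of int; skip it so flags are not averaged.
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                columns[key].append(value)
    stats = {}
    for key, values in columns.items():
        stats[f"mean/{key}"] = mean(values)
        stats[f"max/{key}"] = max(values)
        stats[f"min/{key}"] = min(values)
        stats[f"median/{key}"] = median(values)
        stats[f"std/{key}"] = pstdev(values)  # population std: an assumption
    return stats


# Made-up verify responses for two hypothetical agents.
responses = [
    {"agent": "agent_a", "reward": 1.0, "steps": 3},
    {"agent": "agent_a", "reward": 0.0, "steps": 5},
    {"agent": "agent_b", "reward": 1.0, "steps": 2},
]
per_agent = {
    name: numeric_stats(group)
    for name, group in group_by_agent(responses).items()
}
```

Here `per_agent["agent_a"]["mean/reward"]` is 0.5, since the two made-up rewards for that agent are 1.0 and 0.0.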
Output Format
The output file is a JSON array with one entry per agent.
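A minimal sketch of reading that file back, relying only on the fact that it is a JSON array with one entry per agent. The helper names and the example path are hypothetical, and the sketch deliberately makes no assumption about the keys inside each entry.

```python
import json
from pathlib import Path


def load_aggregate_metrics(path):
    """Load the aggregate-metrics file and check that it is a JSON array."""
    entries = json.loads(Path(path).read_text())
    if not isinstance(entries, list):
        raise ValueError("expected a JSON array with one entry per agent")
    return entries


def summarize(entries):
    """Return the sorted top-level keys of each per-agent entry."""
    return [sorted(entry) for entry in entries]
```

For example, `summarize(load_aggregate_metrics("run_aggregate_metrics.json"))` lists which fields each agent entry carries, which is a quick way to inspect the structure before writing downstream tooling.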
Custom Metrics
Override two hooks on your resources server to add custom metrics.
compute_metrics(tasks)
Receives all verify responses grouped by task. Use this for metrics that need the full dataset — pass@k, confidence intervals, cross-task statistics.
get_key_metrics(agent_metrics)
Selects headline numbers from the final agent_metrics dict. The default implementation returns all mean/* entries.
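The default behaviour described above amounts to a prefix filter. The sketch below is a standalone illustration, not the library's code; the function name and the sample metric keys are hypothetical.

```python
def default_key_metrics(agent_metrics):
    """Keep only the entries whose key starts with 'mean/'."""
    return {
        key: value
        for key, value in agent_metrics.items()
        if key.startswith("mean/")
    }


# Made-up metrics dict for illustration.
agent_metrics = {"mean/reward": 0.5, "max/reward": 1.0, "std/reward": 0.5}
key_metrics = default_key_metrics(agent_metrics)
```

An override would follow the same shape but pick whichever entries matter for the benchmark, for instance a single custom metric instead of every mean.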
Example: pass@k
Given 3 tasks with 4 rollouts each (task 0: all correct, task 1: all wrong, task 2: half correct), this produces:
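Worked through with the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k) for n rollouts of which c are correct, that scenario gives a mean pass@1 of 0.5, pass@2 of about 0.611, and pass@4 of about 0.667. The sketch below shows the arithmetic. It is an assumption-laden illustration rather than NeMo Gym's own hook: the function names, the `reward` key, the rule that a reward of 1.0 or more counts as correct, the choice of k values, and the `pass@k` metric names are all hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k for n rollouts with c correct ones."""
    if n - c < k:
        # Fewer than k incorrect rollouts: any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(tasks, ks=(1, 2, 4)):
    """Average pass@k over tasks; `tasks` is a list of per-task rollout lists."""
    metrics = {}
    for k in ks:
        scores = []
        for rollouts in tasks:
            n = len(rollouts)
            # Assumption: a rollout is correct when its reward is at least 1.0.
            c = sum(1 for r in rollouts if r["reward"] >= 1.0)
            scores.append(pass_at_k(n, c, k))
        metrics[f"pass@{k}"] = sum(scores) / len(scores)
    return metrics


tasks = [
    [{"reward": 1.0}] * 4,                          # task 0: all correct
    [{"reward": 0.0}] * 4,                          # task 1: all wrong
    [{"reward": 1.0}] * 2 + [{"reward": 0.0}] * 2,  # task 2: half correct
]
result = mean_pass_at_k(tasks)
```

Per task, pass@2 is 1, 0, and 1 - 1/6 respectively, so the mean is 11/18. A metric like this needs every rollout of a task at once, which is why it belongs in compute_metrics(tasks) rather than in per-response statistics.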