nemo_gym.reward_profile
Module Contents
Classes
Functions
API
Mixin providing compute_metrics/get_key_metrics hooks and the aggregate_metrics endpoint.
Inherited by both SimpleResourcesServer and SimpleResponsesAPIAgent so that benchmark-specific metric logic can live on either server type.
Override to compute custom metrics from all verify responses.
Receives verify responses grouped by task: tasks[i] is a list of rollout dicts for task i. Each dict has at minimum reward, plus any custom fields from the verify response (e.g. symbolic_correct, judgement-gen-base).
Use for metrics that need the full dataset at once:
- Confidence intervals (ArenaMetrics)
- Cross-task statistics (std_dev_across_runs)
- pass@k with proper combinatorial computation
The returned dict is merged into agent_metrics. Default: empty dict (no additional metrics).
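A minimal sketch of such an override, assuming only the grouped tasks structure described above (the subclass name and the hook signature with self and tasks are assumptions)::

    import statistics

    class MathBenchmarkServer(SimpleResourcesServer):
        def compute_metrics(self, tasks):
            # tasks[i] is the list of rollout dicts for task i.
            per_task_means = [
                statistics.mean(r["reward"] for r in rollouts)
                for rollouts in tasks
            ]
            # A cross-task statistic that needs the full dataset at once.
            return {"std_dev_across_tasks": statistics.pstdev(per_task_means)}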
Override to select headline metrics for this benchmark.
Default: all mean/* entries from agent_metrics.
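A corresponding sketch for the headline hook, reusing highest_k_metrics from this module (the hook signature is again an assumption)::

    class MathBenchmarkServer(SimpleResourcesServer):
        def get_key_metrics(self, agent_metrics):
            # Promote only the highest-k pass@k accuracy as the headline metric.
            return highest_k_metrics(agent_metrics, "pass@{k}", score_names=["accuracy"])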
Bases: BaseNeMoGymCLIConfig
Non-destructively cleans the metrics output by RewardProfiler for downstream serialization.
Group verify responses by task index, returning a list of per-task rollout lists.
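Conceptually, the grouping looks like the sketch below; the helper name and the index field are hypothetical, since the real function reads whatever task identifier the verify responses carry::

    from collections import defaultdict

    def group_by_task_index(responses, index_field="task_idx"):
        grouped = defaultdict(list)
        for r in responses:
            grouped[r[index_field]].append(r)
        # tasks[i] is the list of rollout dicts for task i.
        return [grouped[i] for i in sorted(grouped)]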
Add avg_sample_std_dev statistics to an existing metrics dict.
Computes the average of per-task standard deviations across k rollouts — a measure of within-task variance that complements the across-run variance (std_dev_across_runs).
Modifies metrics in place.
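A sketch of the computation being described; the helper name and signature are assumptions, while the metric key mirrors the docstring::

    import statistics

    def add_avg_sample_std_dev(metrics, tasks):
        # Std dev of rewards within each task's k rollouts, averaged across tasks.
        per_task_stds = [
            statistics.pstdev([r["reward"] for r in rollouts])
            for rollouts in tasks
            if len(rollouts) > 1
        ]
        if per_task_stds:
            metrics["avg_sample_std_dev"] = statistics.mean(per_task_stds)  # in place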
Shared aggregation logic for /aggregate_metrics.
RewardProfiler runs with defaults to produce baseline stats (mean/max/min/median/std) for both group-level (per-task) and agent-level metrics.
Compute pass@k, majority@k, no_answer, and variance statistics from grouped task results.
Shared utility for any resource server’s compute_metrics() override.
Parameters:
tasks[i] is a list of rollout dicts for task i.
Callable(result_dict) -> Dict[str, float|bool] returning named scores.
Defaults to lambda r: {"accuracy": r["reward"]}.
Field name for extracted answer (enables majority@k and no_answer). If None, majority@k and no_answer are skipped.
Returns: Dict[str, Any]
A dict containing metrics, all_score_dicts, score_names, and max_k.
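A hedged usage sketch; the keyword names score_fn and answer_field are assumptions based on the parameter descriptions above, and the answer field value is a hypothetical example::

    result = compute_pass_majority_metrics(
        tasks,
        score_fn=lambda r: {"accuracy": r["reward"]},  # the documented default
        answer_field="extracted_answer",               # hypothetical field name
    )
    # Keys per the Returns description above:
    agent_metrics.update(result["metrics"])  # pass@k, majority@k, no_answer, variance stats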
Group tasks by a field and compute pass@k metrics per subset.
Returns flat dict with subset-prefixed keys, e.g. "easy/pass@1/accuracy".
Omits the per_sample_aggregate key from each subset's metrics.
Parameters:
tasks[i] is a list of rollout dicts for task i.
Field name in rollout dicts to group by (e.g. "difficulty").
Passed through to compute_pass_majority_metrics.
Passed through to compute_pass_majority_metrics.
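For illustration (the function name and the group-by keyword are hypothetical; only the behavior and key format follow the description above)::

    subset_metrics = compute_grouped_pass_majority_metrics(tasks, group_field="difficulty")
    # e.g. {"easy/pass@1/accuracy": 97.0, "hard/pass@1/accuracy": 61.5, ...}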
Select the highest-k entries matching a metric pattern.
Finds all keys matching pattern (with {k} as the k placeholder), determines the highest k value, and returns all entries at that k.
Example::

    # Get highest-k pass@k for accuracy only
    highest_k_metrics(am, "pass@{k}", score_names=["accuracy"])
    # → {"pass@32/accuracy": 95.0}

    # Get highest-k pass@1[avg-of-k] for all scores except no_answer, without stats
    highest_k_metrics(am, "pass@1[avg-of-{k}]", exclude_names=["no_answer"])
    # → {"pass@1[avg-of-32]/accuracy": 94.5, "pass@1[avg-of-32]/symbolic_accuracy": 93.2}
Parameters:
Full agent metrics dict.
Pattern with {k} placeholder, e.g. "pass@{k}" or "pass@1[avg-of-{k}]".
If provided, only return entries whose score name (after the last /) is in this list. Stat suffixes (std_dev, std_err, avg_sample) are always excluded.
Score names to exclude (e.g. ["no_answer"]). Applied after score_names.
Returns: Dict[str, Any]
Dict of matching metrics at the highest k, e.g. {"pass@32/accuracy": 95.0}.