nemo_gym.reward_profile
Module Contents
Classes

| Class | Description |
|---|---|
| `RewardProfileConfig` | |
| `RewardProfiler` | |
| `AggregateMetricsMixin` | Mixin providing compute_metrics/get_key_metrics hooks and the aggregate_metrics endpoint. |
Functions

| Function | Description |
|---|---|
| `compute_pass_majority_metrics` | Compute pass@k, majority@k, no_answer, and variance statistics from grouped task results. |
| `add_avg_sample_std_dev` | Add avg_sample_std_dev statistics to an existing metrics dict. |
| `compute_subset_metrics` | Group tasks by a field and compute pass@k metrics per subset. |
| `highest_k_metrics` | Select the highest-k entries matching a metric pattern. |
| `_group_by_task` | Group verify responses by task index, returning a list of per-task rollout lists. |
| `compute_aggregate_metrics` | Shared aggregation logic for /aggregate_metrics. |
API
- class nemo_gym.reward_profile.RewardProfileConfig(/, **data: typing.Any) [source]

  Bases: nemo_gym.config_types.BaseNeMoGymCLIConfig

  - materialized_inputs_jsonl_fpath: str
    'Field(…)'

  - rollouts_jsonl_fpath: str
    'Field(…)'
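A minimal instantiation sketch; the file paths below are placeholders:

```python
from nemo_gym.reward_profile import RewardProfileConfig

config = RewardProfileConfig(
    materialized_inputs_jsonl_fpath="materialized_inputs.jsonl",  # placeholder path
    rollouts_jsonl_fpath="rollouts.jsonl",                        # placeholder path
)
```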
- class nemo_gym.reward_profile.RewardProfiler [source]

  - calculate_metrics_single_df(grouped_df: pandas.core.groupby.generic.DataFrameGroupBy)

  - profile_from_data(rows: List[Dict[str, Any]], results: List[Dict[str, Any]])
- nemo_gym.reward_profile.compute_pass_majority_metrics(tasks: List[List[Dict[str, Any]]], score_fn: Optional[Any] = None, answer_key: Optional[str] = None)
Compute pass@k, majority@k, no_answer, and variance statistics from grouped task results.
Shared utility for any resource server’s compute_metrics() override.
- Parameters:
  - tasks – tasks[i] is a list of rollout dicts for task i.
  - score_fn – Callable(result_dict) -> Dict[str, float|bool] returning named scores. Defaults to lambda r: {"accuracy": r["reward"]}.
  - answer_key – Field name for the extracted answer (enables majority@k and no_answer). If None, majority@k and no_answer are skipped.
- Returns:
  A tuple (metrics, all_score_dicts, score_names, max_k). metrics is a flat dict keyed as {agg_mode}/{score_name}:
  - pass@{k}/{name}: combinatorial pass@k (binary) or max-of-k (continuous)
  - pass@1[avg-of-{k}]/{name}: mean score across the first k rollouts, averaged across tasks
  - majority@{k}/{name}: majority-vote accuracy (only if answer_key is set)
  - pass@{k}/no_answer, majority@{k}/no_answer: fraction with no extracted answer
  - pass@1[avg-of-{k}]/{name}/std_dev_across_runs, …/std_err_across_runs: variance stats
All accuracy values are percentages (0-100).
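A usage sketch based on the signature above; the rollout dicts and the extracted_answer field name are illustrative:

```python
from nemo_gym.reward_profile import compute_pass_majority_metrics

# tasks[i] holds the rollout dicts for task i (k = 2 rollouts per task here).
tasks = [
    [{"reward": 1.0, "extracted_answer": "42"}, {"reward": 0.0, "extracted_answer": None}],
    [{"reward": 1.0, "extracted_answer": "7"}, {"reward": 1.0, "extracted_answer": "7"}],
]

metrics, all_score_dicts, score_names, max_k = compute_pass_majority_metrics(
    tasks,
    score_fn=lambda r: {"accuracy": r["reward"]},  # the documented default
    answer_key="extracted_answer",                 # enables majority@k and no_answer
)
# metrics holds flat keys such as "pass@2/accuracy" and "majority@2/accuracy",
# with accuracy values as percentages (0-100).
```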
- nemo_gym.reward_profile.add_avg_sample_std_dev(metrics: Dict[str, Any], all_score_dicts: List[List[Dict[str, float]]], score_names: list, max_k: int)
Add avg_sample_std_dev statistics to an existing metrics dict.
Computes the average of per-task standard deviations across k rollouts — a measure of within-task variance that complements the across-run variance (std_dev_across_runs).
Modifies metrics in place.
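The arguments line up with the return values of compute_pass_majority_metrics, so the two compose directly (sketch; rollout dicts are illustrative):

```python
from nemo_gym.reward_profile import add_avg_sample_std_dev, compute_pass_majority_metrics

tasks = [
    [{"reward": 1.0}, {"reward": 0.0}],
    [{"reward": 1.0}, {"reward": 1.0}],
]

metrics, all_score_dicts, score_names, max_k = compute_pass_majority_metrics(tasks)
add_avg_sample_std_dev(metrics, all_score_dicts, score_names, max_k)
# metrics is updated in place with avg_sample_std_dev entries alongside the
# std_dev_across_runs / std_err_across_runs variance stats.
```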
- nemo_gym.reward_profile.compute_subset_metrics(tasks: List[List[Dict[str, Any]]], subset_key: str, score_fn: Optional[Any] = None, answer_key: Optional[str] = None)
Group tasks by a field and compute pass@k metrics per subset.
Returns a flat dict with subset-prefixed keys, e.g. "easy/pass@1/accuracy". Skips the per_sample_aggregate key from each subset's metrics.

- Parameters:
  - tasks – tasks[i] is a list of rollout dicts for task i.
  - subset_key – Field name in rollout dicts to group by (e.g. "difficulty").
  - score_fn – Passed through to compute_pass_majority_metrics.
  - answer_key – Passed through to compute_pass_majority_metrics.
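A usage sketch; "difficulty" follows the documented example, and the rollout dicts are illustrative:

```python
from nemo_gym.reward_profile import compute_subset_metrics

tasks = [
    [{"reward": 1.0, "difficulty": "easy"}, {"reward": 0.0, "difficulty": "easy"}],
    [{"reward": 1.0, "difficulty": "hard"}, {"reward": 1.0, "difficulty": "hard"}],
]

subset_metrics = compute_subset_metrics(tasks, subset_key="difficulty")
# Flat, subset-prefixed keys, e.g. "easy/pass@2/accuracy" and "hard/pass@2/accuracy".
```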
- nemo_gym.reward_profile.highest_k_metrics(agent_metrics: Dict[str, Any], pattern: str, score_names: Optional[List[str]] = None, exclude_names: Optional[List[str]] = None)
Select the highest-k entries matching a metric pattern.
Finds all keys matching pattern (with {k} as the k placeholder), determines the highest k value, and returns all entries at that k.

- Parameters:
  - agent_metrics – Full agent metrics dict.
  - pattern – Pattern with a {k} placeholder, e.g. "pass@{k}" or "pass@1[avg-of-{k}]".
  - score_names – If provided, only return entries whose score name (after the last /) is in this list. Stat suffixes (std_dev, std_err, avg_sample) are always excluded.
  - exclude_names – Score names to exclude (e.g. ["no_answer"]). Applied after score_names.
- Returns:
  Dict of matching metrics at the highest k, e.g. {"pass@32/accuracy": 95.0}.
Example:

```python
# Get highest-k pass@k for accuracy only
highest_k_metrics(am, "pass@{k}", score_names=["accuracy"])
# → {"pass@32/accuracy": 95.0}

# Get highest-k pass@1[avg-of-k] for all scores except no_answer, without stats
highest_k_metrics(am, "pass@1[avg-of-{k}]", exclude_names=["no_answer"])
# → {"pass@1[avg-of-32]/accuracy": 94.5, "pass@1[avg-of-32]/symbolic_accuracy": 93.2}
```
- class nemo_gym.reward_profile.AggregateMetricsMixin [source]
Mixin providing compute_metrics/get_key_metrics hooks and the aggregate_metrics endpoint.
Inherited by both SimpleResourcesServer and SimpleResponsesAPIAgent so that benchmark-specific metric logic can live on either server type.
- compute_metrics(tasks: List[List[Dict[str, Any]]])
Override to compute custom metrics from all verify responses.
Receives verify responses grouped by task: tasks[i] is a list of rollout dicts for task i. Each dict has at minimum reward, plus any custom fields from the verify response (e.g. symbolic_correct, judgement-gen-base).
Use for metrics that need the full dataset at once:

- Confidence intervals (ArenaMetrics)
- Cross-task statistics (std_dev_across_runs)
- pass@k with proper combinatorial computation
The returned dict is merged into agent_metrics. Default: empty dict (no additional metrics).
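A sketch of an override; the server class name is hypothetical, and building on compute_pass_majority_metrics is one option rather than a requirement:

```python
from typing import Any, Dict, List

from nemo_gym.reward_profile import AggregateMetricsMixin, compute_pass_majority_metrics


class MyResourcesServer(AggregateMetricsMixin):  # hypothetical server class
    def compute_metrics(self, tasks: List[List[Dict[str, Any]]]) -> Dict[str, Any]:
        # Full-dataset metrics: combinatorial pass@k plus variance stats.
        metrics, _, _, _ = compute_pass_majority_metrics(tasks)
        return metrics  # merged into agent_metrics
```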
- nemo_gym.reward_profile._group_by_task(verify_responses: List[Dict[str, Any]])
Group verify responses by task index, returning a list of per-task rollout lists.
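An input/output shape sketch for this private helper; the task-index field name (task_idx here) is hypothetical, as the grouping key is an implementation detail:

```python
from nemo_gym.reward_profile import _group_by_task

verify_responses = [
    {"task_idx": 0, "reward": 1.0},  # "task_idx" is a hypothetical field name
    {"task_idx": 0, "reward": 0.0},
    {"task_idx": 1, "reward": 1.0},
]

tasks = _group_by_task(verify_responses)
# tasks[0] → the two task-0 rollouts, tasks[1] → the single task-1 rollout.
```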
- nemo_gym.reward_profile.compute_aggregate_metrics(verify_responses: List[Dict[str, Any]], compute_metrics_fn=None, get_key_metrics_fn=None)
Shared aggregation logic for /aggregate_metrics.
RewardProfiler always runs with its defaults to produce baseline stats (mean/max/min/median/std) for both group-level (per-task) and agent-level metrics.
Optionally accepts custom functions for benchmark-specific behavior (sketched below):

- compute_metrics_fn – receives ALL verify responses grouped by task (List[List[Dict]]) for metrics that need the full dataset (e.g. confidence intervals, cross-task statistics, pass@k). The returned dict is merged into agent_metrics.
- get_key_metrics_fn – selects headline metrics from agent_metrics.
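A wiring sketch for both hooks; the hook bodies, the verify-response shape, and the result variable name are illustrative:

```python
from nemo_gym.reward_profile import (
    compute_aggregate_metrics,
    compute_pass_majority_metrics,
    highest_k_metrics,
)

def my_compute_metrics(tasks):
    # tasks: List[List[Dict]] — all verify responses grouped by task.
    metrics, _, _, _ = compute_pass_majority_metrics(tasks)
    return metrics  # merged into agent_metrics

def my_get_key_metrics(agent_metrics):
    # Headline metrics: keep only the highest-k pass@k accuracy.
    return highest_k_metrics(agent_metrics, "pass@{k}", score_names=["accuracy"])

verify_responses = [  # flat list of verify-response dicts (shape illustrative)
    {"task_idx": 0, "reward": 1.0},
    {"task_idx": 0, "reward": 0.0},
    {"task_idx": 1, "reward": 1.0},
]

aggregated = compute_aggregate_metrics(
    verify_responses,
    compute_metrics_fn=my_compute_metrics,
    get_key_metrics_fn=my_get_key_metrics,
)
```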