nemo_gym.reward_profile#

Module Contents#

Classes#

RewardProfileConfig

RewardProfiler

AggregateMetricsMixin

Mixin providing compute_metrics/get_key_metrics hooks and the aggregate_metrics endpoint.

Functions#

compute_pass_majority_metrics

Compute pass@k, majority@k, no_answer, and variance statistics from grouped task results.

add_avg_sample_std_dev

Add avg_sample_std_dev statistics to an existing metrics dict.

compute_subset_metrics

Group tasks by a field and compute pass@k metrics per subset.

highest_k_metrics

Select the highest-k entries matching a metric pattern.

_group_by_task

Group verify responses by task index, returning a list of per-task rollout lists.

compute_aggregate_metrics

Shared aggregation logic for /aggregate_metrics.

reward_profile

API#

class nemo_gym.reward_profile.RewardProfileConfig(/, **data: typing.Any)[source]#

Bases: nemo_gym.config_types.BaseNeMoGymCLIConfig

materialized_inputs_jsonl_fpath: str#

‘Field(…)’

rollouts_jsonl_fpath: str#

‘Field(…)’
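
A minimal construction sketch; the paths below are placeholders, not defaults:

from nemo_gym.reward_profile import RewardProfileConfig

config = RewardProfileConfig(
    materialized_inputs_jsonl_fpath="inputs.jsonl",
    rollouts_jsonl_fpath="rollouts.jsonl",
)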

class nemo_gym.reward_profile.RewardProfiler[source]#

histogram(data: pandas.Series) → Optional[wandb.Histogram][source]#

describe_dataframe(df: pandas.DataFrame) → pandas.DataFrame[source]#

calculate_metrics_single_df(
grouped_df: pandas.core.groupby.generic.DataFrameGroupBy,
) → List[Dict[str, Any]][source]#

profile_from_data(
rows: List[Dict[str, Any]],
results: List[Dict[str, Any]],
) → Tuple[List[Dict[str, Any]], List[Dict[str, Any]]][source]#

prepare_for_serialization(
metrics: List[Dict],
) → List[Dict][source]#

Non-destructively cleans metrics output by RewardProfiler for downstream serialization.

write_to_disk(
group_level_metrics: List[Dict[str, Any]],
agent_level_metrics: List[Dict[str, Any]],
base_output_fpath: pathlib.Path,
) → Tuple[pathlib.Path, pathlib.Path][source]#
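
A hedged sketch of the end-to-end profiling flow. The file names are illustrative, and rows/results are assumed to be the parsed records of the two JSONL inputs from RewardProfileConfig:

import json
import pathlib

from nemo_gym.reward_profile import RewardProfiler

with open("inputs.jsonl") as f:
    rows = [json.loads(line) for line in f]
with open("rollouts.jsonl") as f:
    results = [json.loads(line) for line in f]

profiler = RewardProfiler()
group_metrics, agent_metrics = profiler.profile_from_data(rows, results)
# calling prepare_for_serialization before writing is an assumed ordering
group_fpath, agent_fpath = profiler.write_to_disk(
    profiler.prepare_for_serialization(group_metrics),
    profiler.prepare_for_serialization(agent_metrics),
    pathlib.Path("reward_profile"),
)
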
nemo_gym.reward_profile.compute_pass_majority_metrics(
tasks: List[List[Dict[str, Any]]],
score_fn: Optional[Any] = None,
answer_key: Optional[str] = None,
) → Tuple[Dict[str, Any], List[List[Dict[str, float]]], List[str], int][source]#

Compute pass@k, majority@k, no_answer, and variance statistics from grouped task results.

Shared utility for any resource server’s compute_metrics() override.

Parameters:
  • tasks – tasks[i] is a list of rollout dicts for task i.

  • score_fn – Callable(result_dict) -> Dict[str, float|bool] returning named scores. Defaults to lambda r: {"accuracy": r["reward"]}.

  • answer_key – Field name for extracted answer (enables majority@k and no_answer). If None, majority@k and no_answer are skipped.

Returns:

A tuple (metrics, all_score_dicts, score_names, max_k), where metrics is a flat dict keyed as {agg_mode}/{score_name}:

  • pass@{k}/{name}: combinatorial pass@k (binary) or max-of-k (continuous)

  • pass@1[avg-of-{k}]/{name}: mean score across first k rollouts, averaged across tasks

  • majority@{k}/{name}: majority-vote accuracy (only if answer_key is set)

  • pass@{k}/no_answer, majority@{k}/no_answer: fraction with no extracted answer

  • pass@1[avg-of-{k}]/{name}/std_dev_across_runs, …/std_err_across_runs: variance stats

All accuracy values are percentages (0-100).
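
Example (a sketch; the "answer" field name is an assumption passed via answer_key, and each rollout dict carries at least reward):

from nemo_gym.reward_profile import compute_pass_majority_metrics

tasks = [
    [{"reward": 1.0, "answer": "42"}, {"reward": 0.0, "answer": "41"}],
    [{"reward": 1.0, "answer": "7"}, {"reward": 1.0, "answer": "7"}],
]
metrics, all_score_dicts, score_names, max_k = compute_pass_majority_metrics(
    tasks,
    answer_key="answer",
)
# metrics holds keys such as "pass@2/accuracy" and "majority@2/accuracy"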

nemo_gym.reward_profile.add_avg_sample_std_dev(
metrics: Dict[str, Any],
all_score_dicts: List[List[Dict[str, float]]],
score_names: list,
max_k: int,
) → None[source]#

Add avg_sample_std_dev statistics to an existing metrics dict.

Computes the average of per-task standard deviations across k rollouts — a measure of within-task variance that complements the across-run variance (std_dev_across_runs).

Modifies metrics in place.
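
Typically chained after compute_pass_majority_metrics, reusing its return values (a sketch; the exact names of the added keys are not shown here):

add_avg_sample_std_dev(metrics, all_score_dicts, score_names, max_k)
# metrics now also carries avg_sample_std_dev entries for each score name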

nemo_gym.reward_profile.compute_subset_metrics(
tasks: List[List[Dict[str, Any]]],
subset_key: str,
score_fn: Optional[Any] = None,
answer_key: Optional[str] = None,
) → Dict[str, Any][source]#

Group tasks by a field and compute pass@k metrics per subset.

Returns a flat dict with subset-prefixed keys, e.g. "easy/pass@1/accuracy". Omits the per_sample_aggregate key from each subset’s metrics.

Parameters:
  • tasks – tasks[i] is a list of rollout dicts for task i.

  • subset_key – Field name in rollout dicts to group by (e.g. "difficulty").

  • score_fn – Passed through to compute_pass_majority_metrics.

  • answer_key – Passed through to compute_pass_majority_metrics.
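
Example (a sketch; the "difficulty" field and the exact output keys are illustrative):

from nemo_gym.reward_profile import compute_subset_metrics

tasks = [
    [{"reward": 1.0, "difficulty": "easy"}],
    [{"reward": 0.0, "difficulty": "hard"}],
]
subset_metrics = compute_subset_metrics(tasks, subset_key="difficulty")
# e.g. {"easy/pass@1/accuracy": 100.0, "hard/pass@1/accuracy": 0.0, ...}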

nemo_gym.reward_profile.highest_k_metrics(
agent_metrics: Dict[str, Any],
pattern: str,
score_names: Optional[List[str]] = None,
exclude_names: Optional[List[str]] = None,
) → Dict[str, Any][source]#

Select the highest-k entries matching a metric pattern.

Finds all keys matching pattern (with {k} as the k placeholder), determines the highest k value, and returns all entries at that k.

Parameters:
  • agent_metrics – Full agent metrics dict.

  • pattern – Pattern with {k} placeholder, e.g. "pass@{k}" or "pass@1[avg-of-{k}]".

  • score_names – If provided, only return entries whose score name (after the last /) is in this list. Stat suffixes (std_dev, std_err, avg_sample) are always excluded.

  • exclude_names – Score names to exclude (e.g. ["no_answer"]). Applied after score_names.

Returns:

Dict of matching metrics at the highest k, e.g. {"pass@32/accuracy": 95.0}.

Example:

# Get highest-k pass@k for accuracy only
highest_k_metrics(am, "pass@{k}", score_names=["accuracy"])
# → {"pass@32/accuracy": 95.0}

# Get highest-k pass@1[avg-of-k] for all scores except no_answer, without stats
highest_k_metrics(am, "pass@1[avg-of-{k}]", exclude_names=["no_answer"])
# → {"pass@1[avg-of-32]/accuracy": 94.5, "pass@1[avg-of-32]/symbolic_accuracy": 93.2}

class nemo_gym.reward_profile.AggregateMetricsMixin[source]#

Mixin providing compute_metrics/get_key_metrics hooks and the aggregate_metrics endpoint.

Inherited by both SimpleResourcesServer and SimpleResponsesAPIAgent so that benchmark-specific metric logic can live on either server type.

compute_metrics(
tasks: List[List[Dict[str, Any]]],
) → Dict[str, Any][source]#

Override to compute custom metrics from all verify responses.

Receives verify responses grouped by task: tasks[i] is a list of rollout dicts for task i. Each dict has at minimum reward, plus any custom fields from the verify response (e.g. symbolic_correct, judgement-gen-base).

Use for metrics that need the full dataset at once:

  • Confidence intervals (ArenaMetrics)

  • Cross-task statistics (std_dev_across_runs)

  • pass@k with proper combinatorial computation

The returned dict is merged into agent_metrics. Default: empty dict (no additional metrics).
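
A sketch of a benchmark-specific override built from helpers in this module; the base class combination and the "answer" field name are assumptions:

from nemo_gym.reward_profile import (
    AggregateMetricsMixin,
    compute_pass_majority_metrics,
)

class MyBenchmarkServer(AggregateMetricsMixin):
    # in practice this would also subclass SimpleResourcesServer or
    # SimpleResponsesAPIAgent (import paths omitted here)
    def compute_metrics(self, tasks):
        metrics, _, _, _ = compute_pass_majority_metrics(tasks, answer_key="answer")
        return metrics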

get_key_metrics(
agent_metrics: Dict[str, Any],
) → Dict[str, Any][source]#

Override to select headline metrics for this benchmark.

Default: all mean/* entries from agent_metrics.
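
Continuing the sketch above, one possible override reuses highest_k_metrics from this module to surface only the top-k accuracy:

    def get_key_metrics(self, agent_metrics):
        # highest_k_metrics is imported from nemo_gym.reward_profile
        return highest_k_metrics(agent_metrics, "pass@{k}", score_names=["accuracy"])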

nemo_gym.reward_profile._group_by_task(
verify_responses: List[Dict[str, Any]],
) → List[List[Dict[str, Any]]][source]#

Group verify responses by task index, returning a list of per-task rollout lists.

nemo_gym.reward_profile.compute_aggregate_metrics(
verify_responses: List[Dict[str, Any]],
compute_metrics_fn=None,
get_key_metrics_fn=None,
) → nemo_gym.config_types.AggregateMetrics[source]#

Shared aggregation logic for /aggregate_metrics.

RewardProfiler runs with defaults to produce baseline stats (mean/max/min/median/std) for both group-level (per-task) and agent-level metrics.

Optionally accepts custom functions for benchmark-specific customization:

  • compute_metrics_fn: receives ALL verify responses grouped by task (List[List[Dict]]) for metrics that need the full dataset (e.g. confidence intervals, cross-task statistics, pass@k). Returned dict is merged into agent_metrics.

  • get_key_metrics_fn: selects headline metrics from agent_metrics.
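
A wiring sketch; verify_responses and server are placeholders for the flat response list and a mixin-bearing server instance:

from nemo_gym.reward_profile import compute_aggregate_metrics

aggregate = compute_aggregate_metrics(
    verify_responses,                            # flat List[Dict[str, Any]]
    compute_metrics_fn=server.compute_metrics,   # optional hook
    get_key_metrics_fn=server.get_key_metrics,   # optional hook
)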

nemo_gym.reward_profile.reward_profile()[source]#