nemo_gym.reward_profile#

Module Contents#

Classes#

RewardProfileConfig

RewardProfiler

AggregateMetricsMixin

Mixin providing compute_metrics/get_key_metrics hooks and the aggregate_metrics endpoint.

Functions#

compute_pass_majority_metrics

Compute pass@k, majority@k, no_answer, and variance statistics from grouped task results.

add_avg_sample_std_dev

Add avg_sample_std_dev statistics to an existing metrics dict.

compute_subset_metrics

Group tasks by a field and compute pass@k metrics per subset.

highest_k_metrics

Select the highest-k entries matching a metric pattern.

_group_by_task

Group verify responses by task index, returning a list of per-task rollout lists.

compute_aggregate_metrics

Shared aggregation logic for /aggregate_metrics.

reward_profile

API#

class nemo_gym.reward_profile.RewardProfileConfig(/, **data: typing.Any)[source]#

Bases: nemo_gym.config_types.BaseNeMoGymCLIConfig

materialized_inputs_jsonl_fpath: str#

‘Field(…)’

rollouts_jsonl_fpath: str#

‘Field(…)’
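
A minimal construction sketch; the paths below are placeholders, not defaults:

from nemo_gym.reward_profile import RewardProfileConfig

config = RewardProfileConfig(
    materialized_inputs_jsonl_fpath="inputs.jsonl",
    rollouts_jsonl_fpath="rollouts.jsonl",
)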

class nemo_gym.reward_profile.RewardProfiler[source]#

histogram(data: pandas.Series) → Optional[wandb.Histogram][source]#

describe_dataframe(df: pandas.DataFrame) → pandas.DataFrame[source]#

calculate_metrics_single_df(
grouped_df: pandas.core.groupby.generic.DataFrameGroupBy,
) → List[Dict[str, Any]][source]#

profile_from_data(
rows: List[Dict[str, Any]],
results: List[Dict[str, Any]],
) → Tuple[List[Dict[str, Any]], List[Dict[str, Any]]][source]#

prepare_for_serialization(
metrics: List[Dict],
) → List[Dict][source]#

Non-destructively cleans metrics output by RewardProfiler for downstream serialization.

write_to_disk(
group_level_metrics: List[Dict[str, Any]],
agent_level_metrics: List[Dict[str, Any]],
base_output_fpath: pathlib.Path,
) → Tuple[pathlib.Path, pathlib.Path][source]#
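
A hedged sketch of the end-to-end profiling flow. The file names are illustrative, and rows/results are assumed to be the parsed records of the two JSONL inputs from RewardProfileConfig:

import json
import pathlib

from nemo_gym.reward_profile import RewardProfiler

with open("inputs.jsonl") as f:
    rows = [json.loads(line) for line in f]
with open("rollouts.jsonl") as f:
    results = [json.loads(line) for line in f]

profiler = RewardProfiler()
group_metrics, agent_metrics = profiler.profile_from_data(rows, results)
# calling prepare_for_serialization before writing is an assumed ordering
group_fpath, agent_fpath = profiler.write_to_disk(
    profiler.prepare_for_serialization(group_metrics),
    profiler.prepare_for_serialization(agent_metrics),
    pathlib.Path("reward_profile"),
)
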
nemo_gym.reward_profile.compute_pass_majority_metrics(
tasks: List[List[Dict[str, Any]]],
score_fn: Optional[Any] = None,
answer_key: Optional[str] = None,
) → Tuple[Dict[str, Any], List[List[Dict[str, float]]], List[str], int][source]#

Compute pass@k, majority@k, no_answer, and variance statistics from grouped task results.

Shared utility for any resource server’s compute_metrics() override.

Parameters:
  • tasks – tasks[i] is a list of rollout dicts for task i.

  • score_fn – Callable(result_dict) -> Dict[str, float|bool] returning named scores. Defaults to lambda r: {"accuracy": r["reward"]}.

  • answer_key – Field name for extracted answer (enables majority@k and no_answer). If None, majority@k and no_answer are skipped.

Returns:

A tuple (metrics, all_score_dicts, score_names, max_k), where metrics is a flat dict keyed as {agg_mode}/{score_name}:

  • pass@{k}/{name}: combinatorial pass@k (binary) or max-of-k (continuous)

  • pass@1[avg-of-{k}]/{name}: mean score across first k rollouts, averaged across tasks

  • majority@{k}/{name}: majority-vote accuracy (only if answer_key is set)

  • pass@{k}/no_answer, majority@{k}/no_answer: fraction with no extracted answer

  • pass@1[avg-of-{k}]/{name}/std_dev_across_runs, …/std_err_across_runs: variance stats

All accuracy values are percentages (0-100).
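
Example (a sketch; the "answer" field name is an assumption passed via answer_key, and each rollout dict carries at least reward):

from nemo_gym.reward_profile import compute_pass_majority_metrics

tasks = [
    [{"reward": 1.0, "answer": "42"}, {"reward": 0.0, "answer": "41"}],
    [{"reward": 1.0, "answer": "7"}, {"reward": 1.0, "answer": "7"}],
]
metrics, all_score_dicts, score_names, max_k = compute_pass_majority_metrics(
    tasks,
    answer_key="answer",
)
# metrics holds keys such as "pass@2/accuracy" and "majority@2/accuracy"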

nemo_gym.reward_profile.add_avg_sample_std_dev(
metrics: Dict[str, Any],
all_score_dicts: List[List[Dict[str, float]]],
score_names: list,
max_k: int,
) → None[source]#

Add avg_sample_std_dev statistics to an existing metrics dict.

Computes the average of per-task standard deviations across k rollouts — a measure of within-task variance that complements the across-run variance (std_dev_across_runs).

Modifies metrics in place.
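
Typically chained after compute_pass_majority_metrics, reusing its return values (a sketch; the exact names of the added keys are not shown here):

add_avg_sample_std_dev(metrics, all_score_dicts, score_names, max_k)
# metrics now also carries avg_sample_std_dev entries for each score name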

nemo_gym.reward_profile.compute_subset_metrics(
tasks: List[List[Dict[str, Any]]],
subset_key: str,
score_fn: Optional[Any] = None,
answer_key: Optional[str] = None,
) → Dict[str, Any][source]#

Group tasks by a field and compute pass@k metrics per subset.

Returns a flat dict with subset-prefixed keys, e.g. "easy/pass@1/accuracy". Omits the per_sample_aggregate key from each subset’s metrics.

Parameters:
  • tasks – tasks[i] is a list of rollout dicts for task i.

  • subset_key – Field name in rollout dicts to group by (e.g. "difficulty").

  • score_fn – Passed through to compute_pass_majority_metrics.

  • answer_key – Passed through to compute_pass_majority_metrics.
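
Example (a sketch; the "difficulty" field and the exact output keys are illustrative):

from nemo_gym.reward_profile import compute_subset_metrics

tasks = [
    [{"reward": 1.0, "difficulty": "easy"}],
    [{"reward": 0.0, "difficulty": "hard"}],
]
subset_metrics = compute_subset_metrics(tasks, subset_key="difficulty")
# e.g. {"easy/pass@1/accuracy": 100.0, "hard/pass@1/accuracy": 0.0, ...}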

nemo_gym.reward_profile.highest_k_metrics(
agent_metrics: Dict[str, Any],
pattern: str,
score_names: Optional[List[str]] = None,
exclude_names: Optional[List[str]] = None,
) → Dict[str, Any][source]#

Select the highest-k entries matching a metric pattern.

Finds all keys matching pattern (with {k} as the k placeholder), determines the highest k value, and returns all entries at that k.

Parameters:
  • agent_metrics – Full agent metrics dict.

  • pattern – Pattern with {k} placeholder, e.g. "pass@{k}" or "pass@1[avg-of-{k}]".

  • score_names – If provided, only return entries whose score name (after the last /) is in this list. Stat suffixes (std_dev, std_err, avg_sample) are always excluded.

  • exclude_names – Score names to exclude (e.g. ["no_answer"]). Applied after score_names.

Returns:

Dict of matching metrics at the highest k, e.g. {"pass@32/accuracy": 95.0}.

Example:

# Get highest-k pass@k for accuracy only
highest_k_metrics(am, "pass@{k}", score_names=["accuracy"])
# → {"pass@32/accuracy": 95.0}

# Get highest-k pass@1[avg-of-k] for all scores except no_answer, without stats
highest_k_metrics(am, "pass@1[avg-of-{k}]", exclude_names=["no_answer"])
# → {"pass@1[avg-of-32]/accuracy": 94.5, "pass@1[avg-of-32]/symbolic_accuracy": 93.2}

class nemo_gym.reward_profile.AggregateMetricsMixin[source]#

Mixin providing compute_metrics/get_key_metrics hooks and the aggregate_metrics endpoint.

Inherited by both SimpleResourcesServer and SimpleResponsesAPIAgent so that benchmark-specific metric logic can live on either server type.

compute_metrics(
tasks: List[List[Dict[str, Any]]],
) → Dict[str, Any][source]#

Override to compute custom metrics from all verify responses.

Receives verify responses grouped by task: tasks[i] is a list of rollout dicts for task i. Each dict has at minimum reward, plus any custom fields from the verify response (e.g. symbolic_correct, judgement-gen-base).

Use for metrics that need the full dataset at once:

  • Confidence intervals (ArenaMetrics)

  • Cross-task statistics (std_dev_across_runs)

  • pass@k with proper combinatorial computation

The returned dict is merged into agent_metrics. Default: empty dict (no additional metrics).
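
A sketch of a benchmark-specific override built from helpers in this module; the base class combination and the "answer" field name are assumptions:

from nemo_gym.reward_profile import (
    AggregateMetricsMixin,
    compute_pass_majority_metrics,
)

class MyBenchmarkServer(AggregateMetricsMixin):
    # in practice this would also subclass SimpleResourcesServer or
    # SimpleResponsesAPIAgent (import paths omitted here)
    def compute_metrics(self, tasks):
        metrics, _, _, _ = compute_pass_majority_metrics(tasks, answer_key="answer")
        return metrics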

get_key_metrics(
agent_metrics: Dict[str, Any],
) → Dict[str, Any][source]#

Override to select headline metrics for this benchmark.

Default: all mean/* entries from agent_metrics.
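
Continuing the sketch above, one possible override reuses highest_k_metrics from this module to surface only the top-k accuracy:

    def get_key_metrics(self, agent_metrics):
        # highest_k_metrics is imported from nemo_gym.reward_profile
        return highest_k_metrics(agent_metrics, "pass@{k}", score_names=["accuracy"])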

nemo_gym.reward_profile._group_by_task(
verify_responses: List[Dict[str, Any]],
) → List[List[Dict[str, Any]]][source]#

Group verify responses by task index, returning a list of per-task rollout lists.

nemo_gym.reward_profile.compute_aggregate_metrics(
verify_responses: List[Dict[str, Any]],
compute_metrics_fn=None,
get_key_metrics_fn=None,
) → nemo_gym.config_types.AggregateMetrics[source]#

Shared aggregation logic for /aggregate_metrics.

RewardProfiler runs with defaults to produce baseline stats (mean/max/min/median/std) for both group-level (per-task) and agent-level metrics.

Optionally accepts custom functions for benchmark-specific customization:

  • compute_metrics_fn: receives ALL verify responses grouped by task (List[List[Dict]]) for metrics that need the full dataset (e.g. confidence intervals, cross-task statistics, pass@k). Returned dict is merged into agent_metrics.

  • get_key_metrics_fn: selects headline metrics from agent_metrics.
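
A wiring sketch; verify_responses and server are placeholders for the flat response list and a mixin-bearing server instance:

from nemo_gym.reward_profile import compute_aggregate_metrics

aggregate = compute_aggregate_metrics(
    verify_responses,                            # flat List[Dict[str, Any]]
    compute_metrics_fn=server.compute_metrics,   # optional hook
    get_key_metrics_fn=server.get_key_metrics,   # optional hook
)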

nemo_gym.reward_profile.reward_profile()[source]#