nemo_gym.reward_profile

Module Contents

Classes

| Name | Description |
|------|-------------|
| `AggregateMetricsMixin` | Mixin providing compute_metrics/get_key_metrics hooks and the aggregate_metrics endpoint. |
| `RewardProfileConfig` | - |
| `RewardProfiler` | - |

Functions

| Name | Description |
|------|-------------|
| `_group_by_task` | Group verify responses by task index, returning a list of per-task rollout lists. |
| `_rollout_key` | - |
| `add_avg_sample_std_dev` | Add avg_sample_std_dev statistics to an existing metrics dict. |
| `compute_aggregate_metrics` | Shared aggregation logic for /aggregate_metrics. |
| `compute_pass_majority_metrics` | Compute pass@k, majority@k, no_answer, and variance statistics from grouped task results. |
| `compute_subset_metrics` | Group tasks by a field and compute pass@k metrics per subset. |
| `highest_k_metrics` | Select the highest-k entries matching a metric pattern. |
| `reward_profile` | - |

API

class nemo_gym.reward_profile.AggregateMetricsMixin()

Mixin providing compute_metrics/get_key_metrics hooks and the aggregate_metrics endpoint.

Inherited by both SimpleResourcesServer and SimpleResponsesAPIAgent so that benchmark-specific metric logic can live on either server type.

nemo_gym.reward_profile.AggregateMetricsMixin.compute_metrics(
tasks: typing.List[typing.List[typing.Dict[str, typing.Any]]]
) -> typing.Dict[str, typing.Any]

Override to compute custom metrics from all verify responses.

Receives verify responses grouped by task: tasks[i] is a list of rollout dicts for task i. Each dict has at minimum reward, plus any custom fields from the verify response (e.g. symbolic_correct, judgement-gen-base).

Use for metrics that need the full dataset at once:

  • Confidence intervals (ArenaMetrics)
  • Cross-task statistics (std_dev_across_runs)
  • pass@k with proper combinatorial computation

The returned dict is merged into agent_metrics. Default: empty dict (no additional metrics).
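A minimal override sketch is shown below. `AggregateMetricsMixin` and `compute_pass_majority_metrics` are the real names from this module; the server class name and the `symbolic_correct` score field are illustrative:

```python
from typing import Any, Dict, List

from nemo_gym.reward_profile import (
    AggregateMetricsMixin,
    compute_pass_majority_metrics,
)


class MyBenchmarkServer(AggregateMetricsMixin):  # illustrative subclass
    def compute_metrics(
        self, tasks: List[List[Dict[str, Any]]]
    ) -> Dict[str, Any]:
        # tasks[i] holds the verify-response dicts for task i; score each
        # rollout on reward plus a custom verify field.
        metrics, _, _, _ = compute_pass_majority_metrics(
            tasks,
            score_fn=lambda r: {
                "accuracy": r["reward"],
                # "symbolic_correct" is a hypothetical custom verify field
                "symbolic": float(r.get("symbolic_correct", 0.0)),
            },
        )
        return metrics  # merged into agent_metrics
```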

nemo_gym.reward_profile.AggregateMetricsMixin.get_key_metrics(
agent_metrics: typing.Dict[str, typing.Any]
) -> typing.Dict[str, typing.Any]

Override to select headline metrics for this benchmark.

Default: all mean/* entries from agent_metrics.
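A matching override sketch, again with an illustrative class name; `highest_k_metrics` is the real helper documented later in this module:

```python
from typing import Any, Dict

from nemo_gym.reward_profile import AggregateMetricsMixin, highest_k_metrics


class MyBenchmarkServer(AggregateMetricsMixin):  # illustrative subclass
    def get_key_metrics(
        self, agent_metrics: Dict[str, Any]
    ) -> Dict[str, Any]:
        # Report only the highest-k pass@k accuracy as the headline number,
        # instead of the default mean/* entries.
        return highest_k_metrics(
            agent_metrics, "pass@{k}", score_names=["accuracy"]
        )
```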

class nemo_gym.reward_profile.RewardProfileConfig()

Bases: BaseNeMoGymCLIConfig

allow_partial_rollouts
bool

materialized_inputs_jsonl_fpath
str

rollouts_jsonl_fpath
str

class nemo_gym.reward_profile.RewardProfiler()

nemo_gym.reward_profile.RewardProfiler._index_by_rollout_key(
rows: typing.List[typing.Dict[str, typing.Any]],
name: str
) -> typing.Dict[typing.Tuple[int, int], typing.Dict[str, typing.Any]]

nemo_gym.reward_profile.RewardProfiler.align_rows_and_results(
rows: typing.List[typing.Dict[str, typing.Any]],
results: typing.List[typing.Dict[str, typing.Any]],
allow_partial_rollouts: bool = False
) -> typing.List[typing.Tuple[typing.Dict[str, typing.Any], typing.Dict[str, typing.Any]]]

nemo_gym.reward_profile.RewardProfiler.calculate_metrics_single_df(
grouped_df: pandas.core.groupby.generic.DataFrameGroupBy
) -> typing.List[typing.Dict[str, typing.Any]]

nemo_gym.reward_profile.RewardProfiler.describe_dataframe(
df: pandas.DataFrame
) -> pandas.DataFrame

nemo_gym.reward_profile.RewardProfiler.histogram(
data: pandas.Series
) -> typing.Optional[wandb.Histogram]

nemo_gym.reward_profile.RewardProfiler.prepare_for_serialization(
metrics: typing.List[typing.Dict]
) -> typing.List[typing.Dict]

Non-destructively cleans metrics output by RewardProfiler for downstream serialization.

nemo_gym.reward_profile.RewardProfiler.profile_completion_summary(
rows: typing.List[typing.Dict[str, typing.Any]],
results: typing.List[typing.Dict[str, typing.Any]]
) -> typing.Dict[str, typing.Any]

nemo_gym.reward_profile.RewardProfiler.profile_from_data(
rows: typing.List[typing.Dict[str, typing.Any]],
results: typing.List[typing.Dict[str, typing.Any]],
allow_partial_rollouts: bool = False
) -> typing.Tuple[typing.List[typing.Dict[str, typing.Any]], typing.List[typing.Dict[str, typing.Any]]]

nemo_gym.reward_profile.RewardProfiler.rollout_info_from_result(
result: typing.Dict[str, typing.Any]
) -> typing.Dict[str, typing.Any]

nemo_gym.reward_profile.RewardProfiler.write_to_disk(
group_level_metrics: typing.List[typing.Dict[str, typing.Any]],
agent_level_metrics: typing.List[typing.Dict[str, typing.Any]],
base_output_fpath: pathlib.Path
) -> typing.Tuple[pathlib.Path, pathlib.Path]

nemo_gym.reward_profile._group_by_task(
verify_responses: typing.List[typing.Dict[str, typing.Any]]
) -> typing.List[typing.List[typing.Dict[str, typing.Any]]]

Group verify responses by task index, returning a list of per-task rollout lists.
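The effect is a reshaping from a flat rollout list into per-task lists. A behavioral sketch (the `task_idx` key is hypothetical; the real implementation derives its key via `_rollout_key`):

```python
from collections import defaultdict
from typing import Any, Dict, List


def group_by_task_sketch(
    verify_responses: List[Dict[str, Any]],
) -> List[List[Dict[str, Any]]]:
    # Bucket rollouts by their task index, then emit buckets in index order
    # so that result[i] holds the rollout dicts for task i.
    buckets: Dict[int, List[Dict[str, Any]]] = defaultdict(list)
    for resp in verify_responses:
        buckets[resp["task_idx"]].append(resp)  # "task_idx" is hypothetical
    return [buckets[i] for i in sorted(buckets)]
```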

nemo_gym.reward_profile._rollout_key(
row: typing.Dict[str, typing.Any]
) -> typing.Tuple[int, int]

nemo_gym.reward_profile.add_avg_sample_std_dev(
metrics: typing.Dict[str, typing.Any],
all_score_dicts: typing.List[typing.List[typing.Dict[str, float]]],
score_names: list,
max_k: int
) -> None

Add avg_sample_std_dev statistics to an existing metrics dict.

Computes the average of per-task standard deviations across k rollouts — a measure of within-task variance that complements the across-run variance (std_dev_across_runs).

Modifies metrics in place.
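A simplified sketch of the statistic itself; whether the library uses population or sample standard deviation, and its exact metric key names, are assumptions here:

```python
import statistics
from typing import Dict, List


def avg_sample_std_dev_sketch(
    all_score_dicts: List[List[Dict[str, float]]],
    score_name: str,
    k: int,
) -> float:
    # Std dev of one score across the first k rollouts of each task,
    # averaged over all tasks that have at least k rollouts.
    per_task = [
        statistics.pstdev(d[score_name] for d in task[:k])
        for task in all_score_dicts
        if len(task) >= k
    ]
    return sum(per_task) / len(per_task) if per_task else 0.0
```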

nemo_gym.reward_profile.compute_aggregate_metrics(
verify_responses: typing.List[typing.Dict[str, typing.Any]],
compute_metrics_fn = None,
get_key_metrics_fn = None
) -> nemo_gym.config_types.AggregateMetrics

Shared aggregation logic for /aggregate_metrics.

RewardProfiler runs with defaults to produce baseline stats (mean/max/min/median/std) for both group-level (per-task) and agent-level metrics.
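A typical call with defaults, assuming `verify_responses` is a flat list of per-rollout dicts (each with at least a `reward` field) and that, as the hook docs above suggest, grouping by task happens internally:

```python
from nemo_gym.reward_profile import compute_aggregate_metrics

# One dict per rollout of each task.
aggregate = compute_aggregate_metrics(verify_responses)
# Pass compute_metrics_fn / get_key_metrics_fn to mirror the
# AggregateMetricsMixin hooks when custom metrics are needed.
```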

nemo_gym.reward_profile.compute_pass_majority_metrics(
tasks: typing.List[typing.List[typing.Dict[str, typing.Any]]],
score_fn: typing.Optional[typing.Any] = None,
answer_key: typing.Optional[str] = None
) -> typing.Tuple[typing.Dict[str, typing.Any], typing.List[typing.List[typing.Dict[str, float]]], typing.List[str], int]

Compute pass@k, majority@k, no_answer, and variance statistics from grouped task results.

Shared utility for any resource server’s compute_metrics() override.

Parameters:

tasks
List[List[Dict[str, Any]]]

tasks[i] is a list of rollout dicts for task i.

score_fn
Optional[Any], defaults to None

Callable(result_dict) -> Dict[str, float|bool] returning named scores. Defaults to lambda r: {"accuracy": r["reward"]}.

answer_key
Optional[str], defaults to None

Field name for extracted answer (enables majority@k and no_answer). If None, majority@k and no_answer are skipped.

Returns: Tuple[Dict[str, Any], List[List[Dict[str, float]]], List[str], int]

A tuple of (metrics, all_score_dicts, score_names, max_k).
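A hedged usage sketch; the `extracted_answer` field name is an illustrative assumption:

```python
from nemo_gym.reward_profile import (
    add_avg_sample_std_dev,
    compute_pass_majority_metrics,
)

# tasks[i] is the list of rollout dicts for task i.
metrics, all_score_dicts, score_names, max_k = compute_pass_majority_metrics(
    tasks,
    answer_key="extracted_answer",  # hypothetical field; enables majority@k
)
# Optionally enrich with within-task variance (modifies metrics in place).
add_avg_sample_std_dev(metrics, all_score_dicts, score_names, max_k)
```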

nemo_gym.reward_profile.compute_subset_metrics(
tasks: typing.List[typing.List[typing.Dict[str, typing.Any]]],
subset_key: str,
score_fn: typing.Optional[typing.Any] = None,
answer_key: typing.Optional[str] = None
) -> typing.Dict[str, typing.Any]

Group tasks by a field and compute pass@k metrics per subset.

Returns flat dict with subset-prefixed keys, e.g. "easy/pass@1/accuracy". Skips the per_sample_aggregate key from each subset’s metrics.

Parameters:

tasks
List[List[Dict[str, Any]]]

tasks[i] is a list of rollout dicts for task i.

subset_key
str

Field name in rollout dicts to group by (e.g. "difficulty").

score_fn
Optional[Any], defaults to None

Passed through to compute_pass_majority_metrics.

answer_key
Optional[str], defaults to None

Passed through to compute_pass_majority_metrics.
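A sketch, assuming each rollout dict carries a `difficulty` field to group by:

```python
from nemo_gym.reward_profile import compute_subset_metrics

# Per-subset pass@k metrics, keyed like "easy/pass@1/accuracy".
subset_metrics = compute_subset_metrics(tasks, subset_key="difficulty")
```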

nemo_gym.reward_profile.highest_k_metrics(
agent_metrics: typing.Dict[str, typing.Any],
pattern: str,
score_names: typing.Optional[typing.List[str]] = None,
exclude_names: typing.Optional[typing.List[str]] = None
) -> typing.Dict[str, typing.Any]

Select the highest-k entries matching a metric pattern.

Finds all keys matching pattern (with {k} as the k placeholder), determines the highest k value, and returns all entries at that k.

Example:

```python
# Get highest-k pass@k for accuracy only
highest_k_metrics(am, "pass@{k}", score_names=["accuracy"])
# → {"pass@32/accuracy": 95.0}

# Get highest-k pass@1[avg-of-k] for all scores except no_answer, without stats
highest_k_metrics(am, "pass@1[avg-of-{k}]", exclude_names=["no_answer"])
# → {"pass@1[avg-of-32]/accuracy": 94.5, "pass@1[avg-of-32]/symbolic_accuracy": 93.2}
```

Parameters:

agent_metrics
Dict[str, Any]

Full agent metrics dict.

pattern
str

Pattern with {k} placeholder, e.g. "pass@{k}" or "pass@1[avg-of-{k}]".

score_names
Optional[List[str]], defaults to None

If provided, only return entries whose score name (after the last /) is in this list. Stat suffixes (std_dev, std_err, avg_sample) are always excluded.

exclude_names
Optional[List[str]], defaults to None

Score names to exclude (e.g. ["no_answer"]). Applied after score_names.

Returns: Dict[str, Any]

Dict of matching metrics at the highest k, e.g. {"pass@32/accuracy": 95.0}.

nemo_gym.reward_profile.reward_profile()