Benchmark Analysis


Benchmark analysis skills are reusable instructions that help agents interpret evaluation artifacts consistently. After collecting rollouts, an analysis skill turns raw artifacts into structured reports — explaining what changed, which tasks failed, and what intervention is most likely to improve the next run.

NeMo Gym supports skills in two ways:

  • Skills evaluation: Compare agent performance across different skill sets by varying skills.path between runs.
  • Using skills in model evaluation: The skills folder in Gym contains built-in skills that can be added to any Gym run.

BLADE Analysis

BLADE (Benchmark Level Analysis and Diagnostics Engine) is NeMo Gym’s built-in benchmark analysis skill. It is a post-run workflow — it does not run the benchmark or replace the verifier. Instead, it reads rollout artifacts produced by gym eval run and answers the question: why did the score change?

Use BLADE when you need to go beyond a headline score. It shows which tasks failed, what the dominant failure modes are, and the intervention most likely to close the gap — harness work, training, verifier repair, or prompt change.

The BLADE analysis skill turns NeMo Gym rollout artifacts into evidence-backed benchmark reports, model comparisons, and benchmark-improvement recommendations. Its workflow starts from rollout JSONL, aggregate metrics, reward profiles, configs, model names, repeat counts, and sampling settings. It computes pass@1/pass@k, separates always-pass, sometimes-pass, never-pass, and missing rows, reads trajectories in order, assigns root-cause labels, and maps findings to concrete interventions.
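The metric and bucketing steps above can be sketched in a few lines. This is an illustration under stated assumptions, not BLADE's actual implementation: the source does not say which pass@k estimator BLADE uses, so the sketch uses the common unbiased estimator 1 - C(n-c, k) / C(n, k). The function names, the `pass_threshold` parameter, and the rule that a task with no rollouts counts as "missing" are all hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task from n rollouts, c of which passed.

    Assumption: the standard unbiased estimator; BLADE may compute it differently.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k-sample contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def classify_tasks(expected_task_ids, rewards_by_task, pass_threshold=1.0):
    """Assign each expected task to one of the four buckets.

    Hypothetical rule: a rollout passes when its reward >= pass_threshold,
    and a task with no rollouts at all is reported as "missing".
    """
    buckets = {}
    for task_id in expected_task_ids:
        rewards = rewards_by_task.get(task_id, [])
        if not rewards:
            buckets[task_id] = "missing"
            continue
        passes = sum(1 for r in rewards if r >= pass_threshold)
        if passes == len(rewards):
            buckets[task_id] = "always-pass"
        elif passes == 0:
            buckets[task_id] = "never-pass"
        else:
            buckets[task_id] = "sometimes-pass"
    return buckets
```

With `k = 1` the estimator reduces to the plain pass rate `c / n`. Keeping "missing" separate from "never-pass" matters because a missing row usually points at infrastructure rather than model behavior.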

BLADE-ready benchmark packages include three deliverables:

Deliverable              Purpose
Analysis skill           Teaches an agent how to analyze the benchmark’s rollout data.
Rollout data             Provides comparable runs for analysis and calibration.
Golden report package    Provides curated reports, metric sidecars, and anchor facts for review.

For public Gym users, the in-repo BLADE helper provides local validation and calibration utilities when external BLADE infrastructure is unavailable:

$ uv run python .codex/skills/nemo-gym-blade-analysis/scripts/blade_toolkit.py validate \
    --benchmark-dir benchmarks/<benchmark_name> --phase all

Inputs

Analysis skills usually need the same evidence that should be preserved for any evaluation run:

  • Rollout JSONL
  • Aggregate metrics JSON
  • Reward profile JSONL
  • Model, harness, resources server, and sampling configs
  • Model names or checkpoint references
  • Repeat counts and task limits
  • Notes on missing rows, timeouts, flaky infrastructure, or verifier behavior
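As a rough illustration of how the first input might be consumed, the sketch below groups per-rollout rewards by task from JSONL lines. The field names `task_id` and `reward` are assumptions made for this example; the actual rollout schema is not described here, so adjust the keys to match your artifacts.

```python
import json
from collections import defaultdict


def group_rewards(lines, task_key="task_id", reward_key="reward"):
    """Group rewards by task from an iterable of JSONL lines.

    task_key and reward_key are hypothetical field names, not a documented schema.
    Returns (rewards_by_task, skipped), where skipped counts rows that lack
    either key, so incomplete rows are reported instead of silently dropped.
    """
    rewards_by_task = defaultdict(list)
    skipped = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        row = json.loads(line)
        if task_key not in row or reward_key not in row:
            skipped += 1
            continue
        rewards_by_task[row[task_key]].append(float(row[reward_key]))
    return dict(rewards_by_task), skipped
```

Because the function accepts any iterable of strings, an open file handle can be passed directly, for example `with open(path) as f: group_rewards(f)`, where `path` points at a rollout JSONL file.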

Outputs

A useful skill output connects metrics to task-level evidence. It should identify what changed, which examples support the conclusion, and what intervention is most likely to improve the next run.

Common outputs include:

  • Score-change explanations
  • Task-level failure taxonomies
  • Benchmark health reports
  • Model comparison summaries
  • Recommended data, verifier, prompt, harness, or training interventions
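Two of these outputs, a score-change explanation and a failure taxonomy, can be approximated with a short sketch. Everything here is hypothetical: the function names, the `tolerance` parameter, and the example root-cause labels are illustrative and do not come from BLADE.

```python
from collections import Counter


def diff_runs(baseline, candidate, tolerance=0.0):
    """Compare per-task pass rates between two runs.

    baseline and candidate map task id -> pass rate. Tasks present in only
    one run are listed as not comparable rather than counted as changes.
    """
    report = {"improved": [], "regressed": [], "unchanged": [], "not_comparable": []}
    for task_id in sorted(set(baseline) | set(candidate)):
        if task_id not in baseline or task_id not in candidate:
            report["not_comparable"].append(task_id)
            continue
        delta = candidate[task_id] - baseline[task_id]
        if delta > tolerance:
            report["improved"].append(task_id)
        elif delta < -tolerance:
            report["regressed"].append(task_id)
        else:
            report["unchanged"].append(task_id)
    return report


def failure_taxonomy(labels_by_task):
    """Count root-cause labels, most frequent first.

    The labels themselves are assigned by the analyst or agent.
    """
    return Counter(labels_by_task.values()).most_common()
```

The regressed list gives the task-level evidence behind a headline score drop, and the label counts show which failure mode dominates and therefore which intervention to consider first.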