Benchmark Analysis

Benchmark analysis skills are reusable instructions that help agents interpret evaluation artifacts consistently. Once rollouts have been collected, an analysis skill turns the raw artifacts into structured reports that explain what changed, which tasks failed, and which intervention is most likely to improve the next run.

NeMo Gym supports skills in two ways:

Skills evaluation: Compare agent performance across different skill sets by varying skills.path between runs.
Using skills in model evaluation: The skills folder in Gym contains built-in skills that can be added to any Gym run.

BLADE Analysis

BLADE (Benchmark Level Analysis and Diagnostics Engine) is NeMo Gym’s built-in benchmark analysis skill. It is a post-run workflow: it does not run the benchmark or replace the verifier. Instead, it reads the rollout artifacts produced by gym eval run and answers the question: why did the score change?

Use BLADE when you need to go beyond a headline score. It shows which tasks failed, what the dominant failure modes are, and which intervention is most likely to close the gap: harness work, training, verifier repair, or a prompt change.

The BLADE analysis skill turns NeMo Gym rollout artifacts into evidence-backed benchmark reports, model comparisons, and benchmark-improvement recommendations. Its workflow starts from rollout JSONL, aggregate metrics, reward profiles, configs, model names, repeat counts, and sampling settings. It computes pass@1/pass@k, separates always-pass, sometimes-pass, never-pass, and missing rows, reads trajectories in order, assigns root-cause labels, and maps findings to concrete interventions.
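
The pass@k and bucketing steps described above can be sketched in a few lines of Python. This is a minimal illustration, not BLADE's implementation: it uses the standard unbiased pass@k estimator, and the row fields `task_id` and `reward`, the pass threshold, and the function names are all assumptions made for the example.

```python
from collections import defaultdict
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task with n samples and c passes."""
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    if n - c < k:
        # Every size-k subset of the samples contains at least one pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def bucket_tasks(rows, expected_task_ids, threshold=1.0):
    """Label each expected task by how its repeated rollouts turned out.

    rows: iterable of dicts with hypothetical keys "task_id" and "reward".
    A rollout counts as a pass when its reward is at least `threshold`.
    """
    outcomes = defaultdict(list)
    for row in rows:
        outcomes[row["task_id"]].append(row["reward"] >= threshold)

    buckets = {}
    for task_id in expected_task_ids:
        results = outcomes.get(task_id, [])
        if not results:
            buckets[task_id] = "missing"
        elif all(results):
            buckets[task_id] = "always-pass"
        elif any(results):
            buckets[task_id] = "sometimes-pass"
        else:
            buckets[task_id] = "never-pass"
    return buckets
```

Passing the list of expected task IDs separately from the rollout rows is what makes missing rows visible; a task that never appears in the JSONL would otherwise silently drop out of the denominator.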

BLADE-ready benchmark packages use three deliverables:

Deliverable	Purpose
Analysis skill	Teaches an agent how to analyze the benchmark’s rollout data.
Rollout data	Provides comparable runs for analysis and calibration.
Golden report package	Provides curated reports, metric sidecars, and anchor facts for review.

For public Gym users, the in-repo BLADE helper provides local validation and calibration utilities when external BLADE infrastructure is unavailable:

$ uv run python .codex/skills/nemo-gym-blade-analysis/scripts/blade_toolkit.py validate \
>   --benchmark-dir benchmarks/<benchmark_name> --phase all

Inputs

Analysis skills usually need the same evidence that should be preserved for any evaluation run:

Rollout JSONL
Aggregate metrics JSON
Reward profile JSONL
Model, harness, resources server, and sampling configs
Model names or checkpoint references
Repeat counts and task limits
Notes on missing rows, timeouts, flaky infrastructure, or verifier behavior
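
Before starting an analysis, it can help to confirm that the file-based evidence listed above is actually present. The sketch below checks a run directory for the three artifact files; the file names in `EXPECTED_ARTIFACTS` are hypothetical placeholders, since the actual names depend on how the run was configured.

```python
from pathlib import Path

# Hypothetical file names, used only for illustration.
# Replace them with the names your evaluation run actually writes.
EXPECTED_ARTIFACTS = {
    "rollouts": "rollouts.jsonl",
    "aggregate_metrics": "aggregate_metrics.json",
    "reward_profile": "reward_profile.jsonl",
}


def missing_artifacts(run_dir, expected=EXPECTED_ARTIFACTS):
    """Return the sorted names of expected artifacts not found in run_dir."""
    run_path = Path(run_dir)
    return sorted(
        name
        for name, filename in expected.items()
        if not (run_path / filename).is_file()
    )
```

An empty return value means every expected file exists; it says nothing about whether the contents are complete, so missing rows still need to be checked against the task list.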

Outputs

A useful skill output connects metrics to task-level evidence. It should identify what changed, which examples support the conclusion, and what intervention is most likely to improve the next run.

Common outputs include:

Score-change explanations
Task-level failure taxonomies
Benchmark health reports
Model comparison summaries
Recommended data, verifier, prompt, harness, or training interventions
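
As a rough illustration of how a task-level failure taxonomy can feed a recommendation, the sketch below tallies root-cause labels and attaches an intervention category and supporting task IDs to each. The label names and the label-to-intervention mapping are invented for this example and are not part of any built-in skill.

```python
from collections import Counter, defaultdict

# Hypothetical root-cause labels mapped to intervention categories.
INTERVENTION_BY_LABEL = {
    "tool_call_error": "harness",
    "wrong_final_answer": "training",
    "verifier_false_negative": "verifier",
    "misread_instructions": "prompt",
}


def summarize_failures(labeled_failures, max_examples=3):
    """Summarize (task_id, label) pairs, most frequent label first."""
    counts = Counter(label for _, label in labeled_failures)
    examples = defaultdict(list)
    for task_id, label in labeled_failures:
        if len(examples[label]) < max_examples:
            examples[label].append(task_id)

    return [
        {
            "label": label,
            "count": count,
            # Labels without a mapping are flagged rather than guessed.
            "intervention": INTERVENTION_BY_LABEL.get(label, "unmapped"),
            "examples": examples[label],
        }
        for label, count in counts.most_common()
    ]
```

Keeping example task IDs next to each count ties the recommendation back to evidence a reviewer can open and check.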
