BLADE Analysis Skill

BLADE (Benchmark Level Analysis and Diagnostics Engine) is NeMo Gym’s built-in benchmark analysis skill. It turns raw rollout artifacts into evidence-backed benchmark reports and answers the question: why did the score change?

Given two or more scored runs, BLADE identifies which tasks failed and what the dominant failure modes are. It also points to the intervention most likely to close the gap: harness work, training, verifier repair, or a prompt change. It works with any benchmark that produces NeMo Gym rollout JSONL.

BLADE is a post-run analysis workflow. It does not run the benchmark or replace the verifier. It reads rollout artifacts, aggregate metrics, reward profiles, configs, and report files produced by gym eval run.

What This Tutorial Builds

This tutorial builds a BLADE-ready package for one benchmark:

| ID | Deliverable | Purpose | Typical path |
| --- | --- | --- | --- |
| D1 | Analysis skill | Teaches an agent how to analyze this benchmark’s rollout data. | benchmarks/<benchmark_name>/skill/SKILL.md |
| D2 | Rollout data | Provides comparable model or agent runs for analysis and calibration. | benchmarks/<benchmark_name>/rollouts/*.jsonl |
| D3 | Golden report package | Provides curated reports, metric sidecars, shallow baselines, and anchor facts. | benchmarks/<benchmark_name>/golden_reports/ |

The bundled helper script validates this package shape and provides public fallback utilities when external BLADE infrastructure is unavailable.

The main helper commands are blade_toolkit.py validate, blade_toolkit.py extract-anchor-facts, blade_toolkit.py make-shallow, and blade_toolkit.py calibrate.

Prerequisites

Collect rollout artifacts for the benchmark you want to analyze. For a quick source of rollout data, first complete the Evaluate EvalPlus tutorial, then repeat the run with a second model or harness.

A complete BLADE-ready package should include at least two scored rollout files with comparable task sets. One rollout file is enough to draft the skill and report, but full validation will fail the rollout-data check until there are two scored runs.
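
One way to check that two rollout files cover comparable task sets is to compare their task ids before running full validation. The sketch below is a minimal illustration and is not part of blade_toolkit.py. It assumes each row exposes its task id under `_ng_task_index` or `task_id`; the helper names `load_rows`, `task_ids`, and `compare_task_sets` are hypothetical.

```python
import json
from pathlib import Path

# Assumption: the task id lives under one of these keys; adjust for your benchmark.
TASK_ID_KEYS = ("_ng_task_index", "task_id")


def load_rows(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def task_ids(rows):
    """Collect the task ids found in an iterable of rollout rows."""
    ids = set()
    for row in rows:
        for key in TASK_ID_KEYS:
            if key in row:
                ids.add(row[key])
                break
    return ids


def compare_task_sets(rows_a, rows_b):
    """Return the shared task ids and the ids unique to each run."""
    ids_a, ids_b = task_ids(rows_a), task_ids(rows_b)
    return {
        "shared": ids_a & ids_b,
        "only_a": ids_a - ids_b,
        "only_b": ids_b - ids_a,
    }
```

A large `only_a` or `only_b` set suggests the two runs are not directly comparable and that task-level comparisons should be restricted to the shared ids.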

Create The Package Directory

Run from the repository root:

$ benchmark_name=evalplus
$ benchmark_dir="benchmarks/${benchmark_name}"

$ mkdir -p "${benchmark_dir}/skill"
$ mkdir -p "${benchmark_dir}/rollouts"
$ mkdir -p "${benchmark_dir}/golden_reports"

Set benchmark_name to the real benchmark name. benchmark_dir can point at an existing benchmark directory or at a new package directory created for BLADE review.

Add Rollout Data

Copy comparable rollout files into rollouts/:

$ cp results/evalplus_model_a_rollouts.jsonl \
>   "${benchmark_dir}/rollouts/model_a_rollouts.jsonl"

$ cp results/evalplus_model_b_rollouts.jsonl \
>   "${benchmark_dir}/rollouts/model_b_rollouts.jsonl"

Each rollout row should include a task identifier, a rollout or repeat identifier when repeats exist, a reward or score field, the model or agent output, verifier output, and useful task metadata.
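
Before copying a file into rollouts/, it can help to confirm that its rows carry the fields described above. The following is a rough sketch, not the validator's actual logic. The key names `_ng_task_index`, `task_id`, `reward`, and `score` are assumptions about where these values live, and `summarize_rows` is a hypothetical helper.

```python
import json

# Assumptions: candidate key names for the task id and the reward.
TASK_ID_KEYS = ("_ng_task_index", "task_id")
REWARD_KEYS = ("reward", "score")


def summarize_rows(lines):
    """Count parsed rows, scored rows, rows without a task id, and bad lines."""
    summary = {"rows": 0, "scored": 0, "missing_task_id": 0, "malformed": 0}
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            summary["malformed"] += 1
            continue
        if not isinstance(row, dict):
            summary["malformed"] += 1
            continue
        summary["rows"] += 1
        # A reward of 0 still counts as scored; only a missing or null value does not.
        if any(row.get(key) is not None for key in REWARD_KEYS):
            summary["scored"] += 1
        if not any(key in row for key in TASK_ID_KEYS):
            summary["missing_task_id"] += 1
    return summary
```

Passing an open file handle works because iterating a text file yields lines, for example `summarize_rows(open(path, encoding="utf-8"))`.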

Keep the sidecar artifacts with the run notes even if they are not copied into rollouts/:

  • _aggregate_metrics.json
  • materialized inputs JSONL
  • reward profile JSONL
  • config files
  • model names or checkpoint references
  • sampling settings and repeat counts

Write The Analysis Skill

Create benchmarks/<benchmark_name>/skill/SKILL.md. This is a benchmark-specific skill, not a generic BLADE overview. It should explain what this benchmark measures and how to read its rollout data.

Include these sections:

  • Overview: task shape, target capability, reward computation, and verifier behavior.
  • Input schema: task id, rollout id, reward field, output fields, verifier fields, and useful slicing metadata.
  • Workflow funnel: ordered stages such as started, attempted the core action, reached verifier, received feedback, and passed.
  • Failure taxonomy: benchmark-specific labels and detection rules.
  • Analysis workflow: deterministic metrics first, qualitative trajectory reading second, causal synthesis third.
  • Report template: the markdown sections expected in a BLADE report.
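
The workflow funnel in the list above can be computed as a count of rollouts that reach each ordered stage. The sketch below assumes each row carries one boolean flag per stage; the flag names in `STAGES` are hypothetical and would normally be derived from benchmark-specific fields such as tool calls or verifier output.

```python
# Hypothetical stage flags, in funnel order; replace with benchmark-specific ones.
STAGES = ("started", "attempted", "reached_verifier", "received_feedback", "passed")


def funnel_counts(rows, stages=STAGES):
    """Count rows reaching each stage; a row stops at its first missing stage."""
    counts = {stage: 0 for stage in stages}
    for row in rows:
        for stage in stages:
            if not row.get(stage, False):
                break  # later stages cannot be reached once one is missed
            counts[stage] += 1
    return counts
```

Because a row stops counting at its first missed stage, the counts never increase from one stage to the next, which makes the largest drop between adjacent stages easy to spot.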

The package validator checks for a real SKILL.md with YAML frontmatter and substantive content. Do not leave this file as a short placeholder.
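
A quick self-check of SKILL.md before running the validator might look like the sketch below. It only approximates the checks described here: the section keywords come from the list above, while the 500-character threshold and the `check_skill` helper are assumptions, not the validator's real rules.

```python
# Section keywords taken from the list above, matched case-insensitively.
EXPECTED_SECTIONS = (
    "overview",
    "input schema",
    "workflow funnel",
    "failure taxonomy",
    "analysis workflow",
    "report template",
)


def check_skill(text, min_body_chars=500):
    """Report frontmatter presence, missing sections, and whether the body is substantive."""
    lines = text.splitlines()
    has_frontmatter = False
    body = text
    # YAML frontmatter is a block delimited by '---' lines at the top of the file.
    if lines and lines[0].strip() == "---":
        for index in range(1, len(lines)):
            if lines[index].strip() == "---":
                has_frontmatter = True
                body = "\n".join(lines[index + 1:])
                break
    lowered = body.lower()
    return {
        "has_frontmatter": has_frontmatter,
        "missing_sections": [s for s in EXPECTED_SECTIONS if s not in lowered],
        "substantive": len(body.strip()) >= min_body_chars,  # assumed threshold
    }
```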

Create A Golden Report

Write a curated report for one model and save it at the path that the later commands read:

$ golden_report="${benchmark_dir}/golden_reports/model_a_golden_report.md"

A good golden report is not just metric tables. It should include:

  • executive summary
  • artifact inventory
  • aggregate results
  • workflow funnel
  • task outcome buckets
  • dominant failure modes
  • sometimes-pass deep dives when repeats exist
  • never-pass deep dives
  • cross-model comparison when available
  • recommendations
  • reproducibility notes

Tie claims to task ids, rollout ids, verifier messages, logs, tool calls, or generated outputs. Mark missing rows, timeouts, malformed outputs, and redacted evidence explicitly.
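
The task outcome buckets and the sometimes-pass and never-pass deep dives listed above start from a grouping of rewards by task. A minimal sketch follows, assuming rows expose `task_id` and a numeric `reward`, and that a reward of at least 1.0 counts as a pass. Both the field names and the threshold are assumptions to adapt per benchmark.

```python
from collections import defaultdict


def bucket_tasks(rows, pass_threshold=1.0):
    """Group tasks into always-pass, sometimes-pass, and never-pass buckets."""
    rewards = defaultdict(list)
    for row in rows:
        rewards[row["task_id"]].append(float(row["reward"]))

    buckets = {"always_pass": [], "sometimes_pass": [], "never_pass": []}
    # Sort by the string form so mixed id types still order deterministically.
    for task_id, values in sorted(rewards.items(), key=lambda item: str(item[0])):
        passes = sum(value >= pass_threshold for value in values)
        if passes == len(values):
            buckets["always_pass"].append(task_id)
        elif passes == 0:
            buckets["never_pass"].append(task_id)
        else:
            buckets["sometimes_pass"].append(task_id)
    return buckets
```

With a single rollout per task the sometimes-pass bucket is always empty, which is why those deep dives apply only when repeats exist.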

Add Metrics Sidecar

Create a metrics sidecar next to the report:

$ cat > "${benchmark_dir}/golden_reports/model_a_golden_report_metrics.json" <<'JSON'
> {
>   "model_name": "model_a",
>   "benchmark": "evalplus",
>   "pass_at_1": 0.42,
>   "pass_at_k": 0.58,
>   "total_tasks": 164,
>   "total_rollouts": 656
> }
> JSON

Use the real metric values from the rollout and aggregate metrics artifacts. Add benchmark-specific fields such as consistency, coverage, pipeline counts, per-category breakdowns, or token statistics when they matter to the analysis.
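
If the aggregate metrics artifact is not at hand, the sidecar fields can be approximated directly from rollout rows. The sketch below is illustrative only. It assumes rows expose `task_id` and a numeric `reward`, treats a reward of at least 1.0 as a pass, takes pass_at_1 as the mean per-task pass rate, and takes pass_at_k as the fraction of tasks with at least one passing rollout. The benchmark's own metric definitions take precedence if they differ.

```python
from collections import defaultdict


def sidecar_metrics(rows, model_name, benchmark, pass_threshold=1.0):
    """Build a metrics dict shaped like the sidecar example from rollout rows."""
    per_task = defaultdict(list)
    for row in rows:
        per_task[row["task_id"]].append(float(row["reward"]) >= pass_threshold)

    total_tasks = len(per_task)
    total_rollouts = sum(len(outcomes) for outcomes in per_task.values())
    if total_tasks:
        # Mean over tasks of the fraction of passing rollouts.
        pass_at_1 = sum(sum(o) / len(o) for o in per_task.values()) / total_tasks
        # Fraction of tasks solved by at least one rollout.
        pass_at_k = sum(any(o) for o in per_task.values()) / total_tasks
    else:
        pass_at_1 = pass_at_k = 0.0

    return {
        "model_name": model_name,
        "benchmark": benchmark,
        "pass_at_1": pass_at_1,
        "pass_at_k": pass_at_k,
        "total_tasks": total_tasks,
        "total_rollouts": total_rollouts,
    }
```

The result can be written with `json.dump(metrics, handle, indent=2)` and compared against the numbers quoted in the report.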

Extract Anchor Facts

Anchor facts are concrete, non-guessable findings that the BLADE judge can use to check whether a candidate report found the same evidence-backed patterns as the golden report.

Use the local helper to draft them:

$ tool=".codex/skills/nemo-gym-blade-analysis/scripts/blade_toolkit.py"

$ uv run python "${tool}" extract-anchor-facts \
>   --golden "${benchmark_dir}/golden_reports/model_a_golden_report.md" \
>   --benchmark "${benchmark_name}" \
>   --model-name model_a \
>   --output "${benchmark_dir}/golden_reports/model_a_anchor_facts.json"

Review the generated anchor facts before using them. Remove weak facts, aggregate-only facts, or facts that do not require reading benchmark evidence.
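
A simple heuristic can help with this review: facts that cite no task id are likely aggregate-only and deserve a second look. The sketch below is not tied to the real anchor-facts file format. It assumes the facts have already been extracted as plain strings, and `split_anchor_facts` is a hypothetical helper. The sample task id in the usage note is illustrative data only.

```python
import re


def split_anchor_facts(facts, known_task_ids):
    """Separate facts citing at least one known task id from those citing none."""
    # Match each id as a whole token so "HumanEval/12" does not hit "HumanEval/123".
    patterns = [
        re.compile(rf"(?<!\w){re.escape(str(task_id))}(?!\w)")
        for task_id in known_task_ids
    ]
    grounded, needs_review = [], []
    for fact in facts:
        if any(pattern.search(fact) for pattern in patterns):
            grounded.append(fact)
        else:
            needs_review.append(fact)
    return grounded, needs_review
```

For example, with `known_task_ids={"HumanEval/12"}`, the fact "HumanEval/12 times out in every rollout" lands in `grounded`, while "pass@1 is 0.42" lands in `needs_review`. The heuristic only flags candidates; a human still decides which facts to keep.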

Create A Shallow Baseline

A shallow baseline is a negative control. It looks like a report, but it should lack the causal diagnosis, examples, and recommendations that make the golden report useful.

$ uv run python "${tool}" make-shallow \
>   --input "${benchmark_dir}/golden_reports/model_a_golden_report.md" \
>   --output "${benchmark_dir}/golden_reports/model_a_shallow.md"

Use this to check that calibration can distinguish a real diagnostic report from a script-style metric dump.

Validate The Package

Run validation across D1, D2, and D3:

$ uv run python "${tool}" validate \
>   --benchmark-dir "${benchmark_dir}" \
>   --phase all

Common validation failures are useful action items:

| Failure | Meaning | Fix |
| --- | --- | --- |
| D1 skill missing | No benchmark-specific analysis skill was found. | Add skill/SKILL.md or SKILL.md. |
| At least two scored rollout files | Only one rollout file has reward or score. | Add another comparable model or agent run. |
| task id field present | Rows do not expose a stable task identifier. | Preserve _ng_task_index, task_id, or an equivalent field. |
| missing metrics sidecar | Golden report has no structured metric JSON. | Add <model>_golden_report_metrics.json. |
| missing anchor facts | Golden report has no judge anchor facts. | Generate and review <model>_anchor_facts.json. |

Calibrate Locally

After validation passes, run local proxy calibration:

$ uv run python "${tool}" calibrate \
>   --golden-report "${benchmark_dir}/golden_reports/model_a_golden_report.md" \
>   --anchor-facts "${benchmark_dir}/golden_reports/model_a_anchor_facts.json" \
>   --shallow-report "${benchmark_dir}/golden_reports/model_a_shallow.md"

The local proxy is a public fallback, not a replacement for official BLADE scoring when that infrastructure is available. Treat failures as review signals: weak anchor facts, shallow report too close to golden, missing evidence, or report sections that need more diagnostic detail.

Readiness Checklist

Before marking the benchmark BLADE-ready:

  • D1 skill has overview, schema, taxonomy, workflow, and report template.
  • D2 rollouts include at least two scored, comparable runs.
  • D3 golden reports cite task-level evidence, not only aggregate tables.
  • Metrics sidecars parse and match the report.
  • Anchor facts exist for each golden report and are concrete.
  • Shallow baselines exist for calibration.
  • Sanitization removes private source, endpoints, credentials, user names, unreleased benchmark names, and raw data that is not cleared for sharing.
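
For the sanitization item, a coarse text scan can surface obvious leaks before a manual pass. The sketch below is a starting point under stated assumptions, not a complete sanitizer. The three patterns (URLs, credential-like assignments, and home-directory paths) are illustrative choices, `scan_text` is a hypothetical helper, and unreleased benchmark names or uncleared raw data still need human review.

```python
import re

# Illustrative patterns only; extend them for your environment.
PATTERNS = {
    "url": re.compile(r"https?://[^\s\"']+"),
    "credential": re.compile(
        r"(?i)\b(api[_-]?key|secret|token|password)\b[\"']?\s*[:=]"
    ),
    "home_path": re.compile(r"/(?:home|Users)/[^/\s\"']+"),
}


def scan_text(text):
    """Return (line_number, label) pairs for lines matching any pattern."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((line_number, label))
    return findings
```

Running `scan_text` over each rollout file, report, and sidecar gives a list of lines to inspect. An empty result does not prove the package is clean.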