BLADE Analysis Skill
BLADE (Benchmark Level Analysis and Diagnostics Engine) is NeMo Gym’s built-in benchmark analysis skill. It turns raw rollout artifacts into evidence-backed benchmark reports and answers the question: why did the score change?
Given two or more scored runs, BLADE identifies which tasks failed and what the dominant failure modes are. It also maps the gap to the most likely intervention for closing it: harness work, training, verifier repair, or a prompt change. It works with any benchmark that produces NeMo Gym rollout JSONL.
BLADE is a post-run analysis workflow. It does not run the benchmark or replace the verifier. It reads rollout artifacts, aggregate metrics, reward profiles, configs, and report files produced by gym eval run.
What This Tutorial Builds
This tutorial builds a BLADE-ready package for one benchmark:
The bundled helper script validates this package shape and provides public fallback utilities when external BLADE infrastructure is unavailable.
The main helper commands are blade_toolkit.py validate, blade_toolkit.py extract-anchor-facts, blade_toolkit.py make-shallow, and blade_toolkit.py calibrate.
Prerequisites
Collect rollout artifacts for the benchmark you want to analyze. For a quick source of rollout data, first complete Evaluate EvalPlus, then repeat the run with a second model or harness.
A complete BLADE-ready package should include at least two scored rollout files with comparable task sets. One rollout file is enough to draft the skill and report, but full validation will fail the rollout-data check until there are two scored runs.
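Whether two rollout files have comparable task sets can be checked before packaging. The following is a minimal sketch, not part of blade_toolkit.py: it assumes each JSONL row carries its task identifier under a field named `task_id`, which is a hypothetical name to replace with the benchmark's real one.

```python
import json
from pathlib import Path


def task_ids(rollout_path, id_field="task_id"):
    """Collect the set of task identifiers found in one rollout JSONL file.

    `id_field` is an assumed field name; adjust it to the real schema.
    """
    ids = set()
    with Path(rollout_path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                ids.add(json.loads(line)[id_field])
    return ids


def compare_task_sets(path_a, path_b, id_field="task_id"):
    """Return the tasks shared by both runs and the tasks unique to each."""
    a = task_ids(path_a, id_field)
    b = task_ids(path_b, id_field)
    return {"shared": a & b, "only_a": a - b, "only_b": b - a}
```

If `only_a` or `only_b` is large relative to `shared`, the two runs are probably not comparable enough for a cross-run diagnosis.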
Create The Package Directory
Run from the repository root:
Use the real benchmark name for benchmark_name. The directory can point at an existing benchmark directory or a new package directory created for BLADE review.
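As an illustration only, the directory setup could be scripted in Python. This sketch assumes the package lives under benchmarks/<benchmark_name>/ and creates just the skill/ and rollouts/ subdirectories named in this tutorial; the helper function name is hypothetical, and any other subdirectories the validator expects are not shown.

```python
from pathlib import Path


def create_package_dirs(repo_root, benchmark_name):
    """Create the skill/ and rollouts/ directories for one benchmark package.

    Only the two subdirectories mentioned in this tutorial are created;
    existing directories are left untouched.
    """
    package = Path(repo_root) / "benchmarks" / benchmark_name
    for sub in ("skill", "rollouts"):
        (package / sub).mkdir(parents=True, exist_ok=True)
    return package
```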
Add Rollout Data
Copy comparable rollout files into rollouts/:
Each rollout row should include a task identifier, a rollout or repeat identifier when repeats exist, a reward or score field, the model or agent output, verifier output, and useful task metadata.
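A quick audit can flag rows that lack these fields before they reach validation. The sketch below is an assumption-laden example rather than the toolkit's own check: the field names `task_id`, `reward`, `output`, and `verifier_output` are hypothetical placeholders for whatever the benchmark's rollout schema actually uses.

```python
import json

# Hypothetical field names; replace with the benchmark's real schema.
REQUIRED_FIELDS = ("task_id", "reward", "output", "verifier_output")


def missing_fields(row, required=REQUIRED_FIELDS):
    """List required fields that are absent or null in one rollout row."""
    return [name for name in required if row.get(name) is None]


def audit_rollout_lines(lines, required=REQUIRED_FIELDS):
    """Map 1-based line numbers to a problem description for bad rows."""
    problems = {}
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            problems[number] = "malformed JSON"
            continue
        if not isinstance(row, dict):
            problems[number] = "not a JSON object"
            continue
        missing = missing_fields(row, required)
        if missing:
            problems[number] = "missing: " + ", ".join(missing)
    return problems
```

Passing an open file handle as `lines` works, because iterating a text file yields one line at a time.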
Keep the sidecar artifacts with the run notes even if they are not copied into rollouts/:
- _aggregate_metrics.json
- materialized inputs JSONL
- reward profile JSONL
- config files
- model names or checkpoint references
- sampling settings and repeat counts
Write The Analysis Skill
Create benchmarks/<benchmark_name>/skill/SKILL.md. This is a benchmark-specific skill, not a generic BLADE overview. It should explain what this benchmark measures and how to read its rollout data.
Include these sections:
- Overview: task shape, target capability, reward computation, and verifier behavior.
- Input schema: task id, rollout id, reward field, output fields, verifier fields, and useful slicing metadata.
- Workflow funnel: ordered stages such as started, attempted the core action, reached verifier, received feedback, and passed.
- Failure taxonomy: benchmark-specific labels and detection rules.
- Analysis workflow: deterministic metrics first, qualitative trajectory reading second, causal synthesis third.
- Report template: the markdown sections expected in a BLADE report.
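The workflow funnel in the list above can be computed deterministically once each rollout row records which stages it reached. A minimal sketch, assuming each row exposes one boolean flag per stage; the flag names are hypothetical and would normally be derived from benchmark-specific logs or verifier output:

```python
# Hypothetical stage flags, ordered from earliest to latest.
STAGES = ("started", "attempted", "reached_verifier", "received_feedback", "passed")


def funnel_counts(rows, stages=STAGES):
    """Count how many rollouts reach each stage in order.

    A rollout counts toward a stage only if it also reached every
    earlier stage, so the counts never increase down the funnel.
    """
    counts = {stage: 0 for stage in stages}
    for row in rows:
        for stage in stages:
            if not row.get(stage, False):
                break  # the rollout dropped out at this stage
            counts[stage] += 1
    return counts
```

The largest drop between adjacent stages is usually the first place to read trajectories.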
The package validator checks for a real SKILL.md with YAML frontmatter and substantive content. Do not leave this file as a short placeholder.
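A rough self-check can catch a placeholder file before running the validator. This is only a sketch of the kind of test involved, not the validator's actual logic: the word-count threshold is an arbitrary assumption, and the frontmatter test only looks for a block delimited by `---` lines at the top of the file.

```python
def check_skill_text(text, min_body_words=200):
    """Report whether SKILL.md text has frontmatter and a non-trivial body.

    `min_body_words` is an assumed threshold, not the validator's rule.
    """
    lines = text.splitlines()
    has_frontmatter = False
    body_start = 0
    if lines and lines[0].strip() == "---":
        # Look for the closing delimiter of the frontmatter block.
        for index in range(1, len(lines)):
            if lines[index].strip() == "---":
                has_frontmatter = True
                body_start = index + 1
                break
    body_words = len(" ".join(lines[body_start:]).split())
    return {
        "has_frontmatter": has_frontmatter,
        "body_words": body_words,
        "substantive": body_words >= min_body_words,
    }
```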
Create A Golden Report
Write a curated report for one model:
A good golden report is not just metric tables. It should include:
- executive summary
- artifact inventory
- aggregate results
- workflow funnel
- task outcome buckets
- dominant failure modes
- sometimes-pass deep dives when repeats exist
- never-pass deep dives
- cross-model comparison when available
- recommendations
- reproducibility notes
Tie claims to task ids, rollout ids, verifier messages, logs, tool calls, or generated outputs. Mark missing rows, timeouts, malformed outputs, and redacted evidence explicitly.
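The task outcome buckets and the sometimes-pass and never-pass deep dives depend on grouping repeats by task. A minimal sketch, assuming hypothetical `task_id` and `reward` fields and treating a reward of at least 1.0 as a pass; benchmarks with partial credit would need a different threshold:

```python
from collections import defaultdict


def bucket_tasks(rows, id_field="task_id", reward_field="reward", pass_threshold=1.0):
    """Group tasks into always-pass, sometimes-pass, and never-pass buckets.

    Field names and the pass threshold are assumptions for illustration.
    """
    outcomes = defaultdict(list)
    for row in rows:
        outcomes[row[id_field]].append(row[reward_field] >= pass_threshold)

    buckets = {"always_pass": [], "sometimes_pass": [], "never_pass": []}
    for task, passes in sorted(outcomes.items()):
        if all(passes):
            buckets["always_pass"].append(task)
        elif any(passes):
            buckets["sometimes_pass"].append(task)
        else:
            buckets["never_pass"].append(task)
    return buckets
```

With a single repeat per task, the sometimes-pass bucket is always empty, which is why those deep dives only apply when repeats exist.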
Add Metrics Sidecar
Create a metrics sidecar next to the report:
Use the real metric values from the rollout and aggregate metrics artifacts. Add benchmark-specific fields such as consistency, coverage, pipeline counts, per-category breakdowns, or token statistics when they matter to the analysis.
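The core values can be computed directly from the rollout rows so that the sidecar and the report cannot drift apart. The sketch below is illustrative: the key names (`num_rollouts`, `num_tasks`, `mean_reward`, `pass_rate`), the `task_id` and `reward` fields, and the 1.0 pass threshold are all assumptions, not the sidecar schema that the validator expects.

```python
import json
from pathlib import Path


def build_metrics(rows, id_field="task_id", reward_field="reward", pass_threshold=1.0):
    """Compute basic aggregate metrics from scored rollout rows."""
    rewards = [row[reward_field] for row in rows]
    if not rewards:
        raise ValueError("no scored rows")
    return {
        "num_rollouts": len(rewards),
        "num_tasks": len({row[id_field] for row in rows}),
        "mean_reward": sum(rewards) / len(rewards),
        "pass_rate": sum(r >= pass_threshold for r in rewards) / len(rewards),
    }


def write_sidecar(metrics, path):
    """Write the metrics as stable, human-readable JSON."""
    text = json.dumps(metrics, indent=2, sort_keys=True) + "\n"
    Path(path).write_text(text, encoding="utf-8")
```

Sorting keys keeps diffs small when the sidecar is regenerated after a rerun.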
Extract Anchor Facts
Anchor facts are concrete, non-guessable findings that the BLADE judge can use to check whether a candidate report found the same evidence-backed patterns as the golden report.
Use the local helper to draft them:
Review the generated anchor facts before using them. Remove weak facts, aggregate-only facts, or facts that do not require reading benchmark evidence.
Create A Shallow Baseline
A shallow baseline is a negative control. It looks like a report, but it should lack the causal diagnosis, examples, and recommendations that make the golden report useful.
Use this to check that calibration can distinguish a real diagnostic report from a script-style metric dump.
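To make the contrast concrete, the sketch below renders the kind of script-style metric dump that a shallow baseline resembles. It is a hand-written illustration with a hypothetical function name and layout, and it does not reproduce what blade_toolkit.py make-shallow generates.

```python
def render_shallow_report(model_name, metrics):
    """Render a metrics-only report with no diagnosis or examples."""
    lines = [f"Benchmark report: {model_name}", "", "Aggregate results"]
    for key in sorted(metrics):
        lines.append(f"- {key}: {metrics[key]}")
    return "\n".join(lines) + "\n"
```

A report like this restates aggregate numbers but cites no task ids, verifier messages, or failure modes, which is what calibration should penalize.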
Validate The Package
Run validation across D1 (the skill), D2 (the rollout data), and D3 (the golden reports):
Common validation failures are useful action items:
Calibrate Locally
After validation passes, run local proxy calibration:
The local proxy is a public fallback, not a replacement for official BLADE scoring when that infrastructure is available. Treat failures as review signals: weak anchor facts, shallow report too close to golden, missing evidence, or report sections that need more diagnostic detail.
Readiness Checklist
Before marking the benchmark BLADE-ready:
- D1 skill has overview, schema, taxonomy, workflow, and report template.
- D2 rollouts include at least two scored, comparable runs.
- D3 golden reports cite task-level evidence, not only aggregate tables.
- Metrics sidecars parse and match the report.
- Anchor facts exist for each golden report and are concrete.
- Shallow baselines exist for calibration.
- Sanitization removes private source, endpoints, credentials, user names, unreleased benchmark names, and raw data that is not cleared for sharing.