
# Benchmark Analysis

> Use BLADE and other analysis skills to interpret evaluation artifacts, diagnose benchmark changes, and understand why scores change.

Benchmark analysis skills are reusable instructions that help agents interpret evaluation artifacts consistently. Once rollouts have been collected, an analysis skill turns the raw artifacts into structured reports that explain what changed, which tasks failed, and which intervention is most likely to improve the next run.

NeMo Gym supports skills in two ways:

* **Skills evaluation**: Compare agent performance across different skill sets by varying `skills.path` between runs.
* **Using skills in model evaluation**: The [skills](https://github.com/NVIDIA-NeMo/Gym/tree/main/.codex/skills) folder in Gym contains built-in skills that can be added to any Gym run.

## BLADE Analysis

BLADE (**B**enchmark **L**evel **A**nalysis and **D**iagnostics **E**ngine) is NeMo Gym's built-in benchmark analysis skill. It is a post-run workflow — it does not run the benchmark or replace the verifier. Instead, it reads rollout artifacts produced by `gym eval run` and answers the question: **why did the score change?**

Use BLADE when you need to go beyond a headline score. It shows which tasks failed, what the dominant failure modes are, and which intervention (harness work, training, verifier repair, or a prompt change) is most likely to close the gap.

The [BLADE analysis](https://github.com/NVIDIA-NeMo/Gym/tree/main/.codex/skills/nemo-gym-blade-analysis) skill turns NeMo Gym rollout artifacts into evidence-backed benchmark reports, model comparisons, and benchmark-improvement recommendations. Its workflow starts from rollout JSONL, aggregate metrics, reward profiles, configs, model names, repeat counts, and sampling settings. It computes pass@1 and pass@k, separates always-pass, sometimes-pass, never-pass, and missing rows, reads trajectories in order, assigns root-cause labels, and maps findings to concrete interventions.
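To make the bookkeeping in that workflow concrete, the following is a minimal Python sketch of a pass@k calculation and the always-pass / sometimes-pass / never-pass / missing split. It is not BLADE's implementation. The field names `task_id` and `reward`, the pass threshold of `1.0`, and the choice of the commonly used unbiased pass@k estimator are all assumptions made for illustration.

```python
from collections import defaultdict
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: the chance that at least one of k samples,
    drawn without replacement from n attempts with c passes, is a pass."""
    if not 0 < k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        return 1.0  # fewer than k failures, so any draw of k contains a pass
    return 1.0 - comb(n - c, k) / comb(n, k)


def bucket_tasks(rows, expected_task_ids, threshold=1.0):
    """Label every expected task by how often its rollouts passed."""
    outcomes = defaultdict(list)
    for row in rows:
        # `task_id` and `reward` are hypothetical field names.
        outcomes[row["task_id"]].append(row["reward"] >= threshold)

    buckets = {}
    for task_id in expected_task_ids:
        passes = outcomes.get(task_id, [])
        if not passes:
            buckets[task_id] = "missing"  # expected task with no rollouts
        elif all(passes):
            buckets[task_id] = "always-pass"
        elif any(passes):
            buckets[task_id] = "sometimes-pass"
        else:
            buckets[task_id] = "never-pass"
    return buckets


# Illustrative rows only; real rows would be read from the rollout JSONL.
rows = [
    {"task_id": "t1", "reward": 1.0}, {"task_id": "t1", "reward": 1.0},
    {"task_id": "t2", "reward": 1.0}, {"task_id": "t2", "reward": 0.0},
    {"task_id": "t3", "reward": 0.0}, {"task_id": "t3", "reward": 0.0},
]
buckets = bucket_tasks(rows, expected_task_ids=["t1", "t2", "t3", "t4"])
print(buckets)
print(pass_at_k(n=2, c=1, k=1))
```

Passing the expected task IDs separately from the rows is what lets the sketch report missing tasks: a task that produced no rollouts never appears in the JSONL, so it cannot be detected from the rows alone.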

BLADE-ready benchmark packages use three deliverables:

| Deliverable           | Purpose                                                                 |
| --------------------- | ----------------------------------------------------------------------- |
| Analysis skill        | Teaches an agent how to analyze the benchmark's rollout data.           |
| Rollout data          | Provides comparable runs for analysis and calibration.                  |
| Golden report package | Provides curated reports, metric sidecars, and anchor facts for review. |

For public Gym users, the in-repo BLADE helper provides local validation and calibration utilities when external BLADE infrastructure is unavailable:

```bash
uv run python .codex/skills/nemo-gym-blade-analysis/scripts/blade_toolkit.py validate \
  --benchmark-dir benchmarks/<benchmark_name> --phase all
```

## Inputs

Analysis skills usually need the same evidence that should be preserved for any evaluation run:

* Rollout JSONL
* Aggregate metrics JSON
* Reward profile JSONL
* Model, harness, resources server, and sampling configs
* Model names or checkpoint references
* Repeat counts and task limits
* Notes on missing rows, timeouts, flaky infrastructure, or verifier behavior
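Before handing a run to an analysis skill, it can help to confirm that the file-based evidence is actually present. The sketch below is one way to do that in Python. The file names in `EXPECTED_ARTIFACTS` are hypothetical placeholders, not names that NeMo Gym is known to write, so replace them with whatever your run produces.

```python
from pathlib import Path

# Hypothetical file names; substitute the names your run actually writes.
EXPECTED_ARTIFACTS = {
    "rollouts": "rollouts.jsonl",
    "aggregate_metrics": "metrics.json",
    "reward_profile": "reward_profile.jsonl",
}


def missing_evidence(run_dir):
    """Return the labels of expected artifacts that are absent or empty."""
    run_path = Path(run_dir)
    missing = []
    for label, filename in EXPECTED_ARTIFACTS.items():
        artifact = run_path / filename
        # An empty file is treated as missing: it carries no evidence.
        if not artifact.is_file() or artifact.stat().st_size == 0:
            missing.append(label)
    return missing
```

Calling `missing_evidence("path/to/run")` returns an empty list when every expected file exists and is non-empty. Configs, checkpoint references, and notes about infrastructure or verifier behavior still need a manual check.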

## Outputs

A useful skill output connects metrics to task-level evidence. It should identify what changed, which examples support the conclusion, and what intervention is most likely to improve the next run.

Common outputs include:

* Score-change explanations
* Task-level failure taxonomies
* Benchmark health reports
* Model comparison summaries
* Recommended data, verifier, prompt, harness, or training interventions
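As an illustration of tying a score change back to task-level evidence, here is a minimal Python sketch that compares per-task pass rates between two runs and lists the tasks that moved. It is an assumption-laden example, not the output format of any NeMo Gym skill: the `task_id` and `reward` fields, the pass threshold, and the report keys are all hypothetical.

```python
from collections import defaultdict


def pass_rates(rows, threshold=1.0):
    """Map each task ID to the fraction of its rollouts that passed."""
    totals = defaultdict(lambda: [0, 0])  # task_id -> [passes, attempts]
    for row in rows:
        counts = totals[row["task_id"]]  # hypothetical field names
        counts[0] += int(row["reward"] >= threshold)
        counts[1] += 1
    return {task: passes / attempts for task, (passes, attempts) in totals.items()}


def score_change(baseline_rows, candidate_rows, min_delta=0.0):
    """List tasks whose pass rate moved between a baseline and a candidate run."""
    base = pass_rates(baseline_rows)
    cand = pass_rates(candidate_rows)
    shared = sorted(base.keys() & cand.keys())
    deltas = {task: cand[task] - base[task] for task in shared}
    return {
        "regressed": [t for t in shared if deltas[t] < -min_delta],
        "improved": [t for t in shared if deltas[t] > min_delta],
        # Tasks present in only one run are reported separately, because
        # they change the aggregate score without any behavior change.
        "only_in_baseline": sorted(base.keys() - cand.keys()),
        "only_in_candidate": sorted(cand.keys() - base.keys()),
    }


# Illustrative rows only.
baseline = [
    {"task_id": "t1", "reward": 1.0}, {"task_id": "t1", "reward": 1.0},
    {"task_id": "t2", "reward": 1.0}, {"task_id": "t2", "reward": 0.0},
    {"task_id": "t3", "reward": 0.0},
]
candidate = [
    {"task_id": "t1", "reward": 1.0}, {"task_id": "t1", "reward": 0.0},
    {"task_id": "t2", "reward": 1.0}, {"task_id": "t2", "reward": 1.0},
    {"task_id": "t4", "reward": 1.0},
]
report = score_change(baseline, candidate)
print(report)
```

The regressed and improved lists are the starting point for reading trajectories. The delta alone says that a task moved, not why it moved.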
