Benchmark Analysis
Benchmark analysis skills are reusable instructions that help agents interpret evaluation artifacts consistently. After collecting rollouts, an analysis skill turns raw artifacts into structured reports — explaining what changed, which tasks failed, and what intervention is most likely to improve the next run.
NeMo Gym supports skills in two ways:
- Skills evaluation: Compare agent performance across different skill sets by varying skills.path between runs.
- Using skills in model evaluation: The skills folder in Gym contains built-in skills that can be added to any Gym run.
BLADE Analysis
BLADE (Benchmark Level Analysis and Diagnostics Engine) is NeMo Gym’s built-in benchmark analysis skill. It is a post-run workflow — it does not run the benchmark or replace the verifier. Instead, it reads rollout artifacts produced by gym eval run and answers the question: why did the score change?
Use BLADE when you need to go beyond a headline score. It shows which tasks failed, what the dominant failure modes are, and the intervention most likely to close the gap — harness work, training, verifier repair, or prompt change.
The BLADE analysis skill turns NeMo Gym rollout artifacts into evidence-backed benchmark reports, model comparisons, and benchmark-improvement recommendations. Its workflow starts from rollout JSONL, aggregate metrics, reward profiles, configs, model names, repeat counts, and sampling settings. It computes pass@1/pass@k, separates always-pass, sometimes-pass, never-pass, and missing rows, reads trajectories in order, assigns root-cause labels, and maps findings to concrete interventions.
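As a rough illustration of the pass@k and bucketing steps described above, the sketch below assumes each rollout row carries hypothetical `task_id` and `reward` fields and that a reward at or above a threshold counts as a pass. It uses the widely used unbiased pass@k estimator; the field names, threshold, and estimator that BLADE actually uses are not stated here and may differ.

```python
from collections import defaultdict
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task: n samples, c of which passed."""
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    if n - c < k:
        # Every size-k subset must contain at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def bucket_tasks(rows, expected_task_ids, pass_threshold=1.0):
    """Label each expected task as always-pass, sometimes-pass, never-pass, or missing.

    `task_id` and `reward` are assumed (hypothetical) row keys.
    """
    rewards = defaultdict(list)
    for row in rows:
        rewards[row["task_id"]].append(row["reward"])

    buckets = {}
    for task_id in expected_task_ids:
        task_rewards = rewards.get(task_id)
        if not task_rewards:
            buckets[task_id] = "missing"
            continue
        passes = sum(r >= pass_threshold for r in task_rewards)
        if passes == len(task_rewards):
            buckets[task_id] = "always-pass"
        elif passes == 0:
            buckets[task_id] = "never-pass"
        else:
            buckets[task_id] = "sometimes-pass"
    return buckets
```

Passing the expected task IDs explicitly, rather than inferring them from the rows, is what lets missing tasks show up as their own bucket instead of silently dropping out of the totals.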
BLADE-ready benchmark packages use three deliverables:
For public Gym users, the in-repo BLADE helper provides local validation and calibration utilities when external BLADE infrastructure is unavailable:
Inputs
Analysis skills usually need the same evidence that you would preserve for any evaluation run:
- Rollout JSONL
- Aggregate metrics JSON
- Reward profile JSONL
- Model, harness, resources server, and sampling configs
- Model names or checkpoint references
- Repeat counts and task limits
- Notes on missing rows, timeouts, flaky infrastructure, or verifier behavior
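Before handing these inputs to an analysis skill, it can help to check that the rollout file is complete. The sketch below is one possible pre-check, not part of NeMo Gym: it reads a JSONL file with the standard library and reports tasks with fewer rows than the expected repeat count. The `task_id` key and both function names are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def load_jsonl(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    rows = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                rows.append(json.loads(line))
    return rows


def find_incomplete_tasks(rows, expected_task_ids, expected_repeats):
    """Return {task_id: observed_count} for tasks with fewer rows than expected.

    `task_id` is an assumed (hypothetical) row key.
    """
    counts = Counter(row["task_id"] for row in rows)
    return {
        task_id: counts.get(task_id, 0)
        for task_id in expected_task_ids
        if counts.get(task_id, 0) < expected_repeats
    }
```

Any tasks this reports are worth recording in the notes on missing rows, so that a later score change is not mistaken for a change in model behavior.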
Outputs
A useful skill output connects metrics to task-level evidence. It should identify what changed, which examples support the conclusion, and what intervention is most likely to improve the next run.
Common outputs include:
- Score-change explanations
- Task-level failure taxonomies
- Benchmark health reports
- Model comparison summaries
- Recommended data, verifier, prompt, harness, or training interventions