# NeMo Skills Integration

Use NeMo Skills benchmarks as `SkillsEnvironment` instances with full per-request observability.
## Setup

```bash
pip install -e ".[skills]"  # or: pip install nemo-skills
ns prepare_data gpqa
ns prepare_data aime24
```
## CLI

```bash
nel eval run --bench skills://gpqa --repeats 4 --max-problems 100
nel eval run --bench skills://aime24 --repeats 8
nel eval run --bench skills://mmlu --repeats 1
```
The `skills://` URI scheme resolves to a `SkillsEnvironment`, which:

- Loads the dataset via `nemo_skills.dataset.utils`
- Auto-prepares missing datasets
- Selects the correct scoring method from the benchmark's `METRICS_TYPE`
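Conceptually, the resolution just strips the scheme and passes the rest to `SkillsEnvironment`. The sketch below illustrates this; `resolve_benchmark` is a hypothetical helper for illustration, not NEL's actual registry code.

```python
from nemo_evaluator.environments.skills import SkillsEnvironment

def resolve_benchmark(uri: str) -> SkillsEnvironment:
    """Hypothetical helper showing what skills:// resolution amounts to."""
    prefix = "skills://"
    if not uri.startswith(prefix):
        raise ValueError(f"Not a skills benchmark URI: {uri}")
    # Everything after the scheme is the nemo_skills benchmark name.
    return SkillsEnvironment(uri[len(prefix):])

env = resolve_benchmark("skills://gpqa")  # equivalent to SkillsEnvironment("gpqa")
```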
## How It Works

```mermaid
flowchart LR
    NEL["nel eval run --bench skills://gpqa"] --> REG["Registry"]
    REG --> SE["SkillsEnvironment"]
    SE --> DS["nemo_skills dataset"]
    SE --> EVAL["nemo_skills evaluator"]
    NEL --> SOLVER["Solver"]
    SOLVER --> MC["ModelClient"]
    NEL --> OBS["Full trajectories<br/>latency, tokens, failures"]
```
`SkillsEnvironment` is a standard `EvalEnvironment`: the eval loop calls `seed(idx)` and `verify(response, expected)` like it does for any other environment, and NEL owns the model call, so you get full observability. The sequence is shown below, followed by a minimal code sketch.
```mermaid
sequenceDiagram
    participant E as eval_loop
    participant S as SkillsEnvironment
    participant NS as nemo_skills
    participant M as Model
    E->>S: seed(idx)
    S->>NS: dataset[idx]
    S-->>E: SeedResult(prompt, expected)
    E->>M: solver.solve(task)
    M-->>E: SolveResult
    E->>S: verify(response, expected)
    S->>NS: evaluator.eval_single()
    S-->>E: VerifyResult(reward, details)
```
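In code, one iteration of that loop looks roughly like the sketch below. The `SeedResult` and `VerifyResult` field names come from the diagram; treating `solver.solve()` as the only async call, passing the prompt as the task, and reading the model output from a `response` field are assumptions made for illustration.

```python
# Minimal sketch of a single eval-loop iteration against a SkillsEnvironment.
# Field names follow the diagram above; sync/async boundaries and the
# SolveResult "response" field are illustrative assumptions, not NEL's API.
async def run_one(env, solver, idx: int):
    seed = env.seed(idx)                        # SeedResult(prompt, expected)
    result = await solver.solve(seed.prompt)    # model call is owned by NEL
    verdict = env.verify(result.response, seed.expected)
    return verdict                              # VerifyResult(reward, details)
```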
## Scoring

The environment automatically selects the correct scoring method based on the benchmark's `METRICS_TYPE` (a toy sketch of this dispatch follows the table):

| Metrics type | Scoring | Benchmarks |
|---|---|---|
| `math` | Symbolic comparison | GSM8K, AIME, MATH, HMMT |
| `multichoice` | Letter extraction + exact match | GPQA, MMLU, MMLU-Pro, ARC |
| `simpleqa` | Exact match | SimpleQA, TriviaQA |
| `code_metrics` | Code execution sandbox | LiveCodeBench, EvalPlus |
|  | LLM judge | Arena, AlpacaEval |
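The mapping is effectively a dispatch table keyed on `METRICS_TYPE`. The toy sketch below illustrates the idea only; the real environment delegates to the matching nemo_skills evaluator rather than these stand-in scorers.

```python
# Toy illustration of METRICS_TYPE-based dispatch. The actual environment
# calls the corresponding nemo_skills evaluator; these scorers are stand-ins.
def exact_match(response: str, expected: str) -> float:
    return float(response.strip().lower() == expected.strip().lower())

def letter_match(response: str, expected: str) -> float:
    # Naive letter extraction: take the last A-D character in the response.
    letters = [c for c in response.upper() if c in "ABCD"]
    return float(bool(letters) and letters[-1] == expected.strip().upper())

TOY_SCORERS = {"multichoice": letter_match, "simpleqa": exact_match}

def score(metrics_type: str, response: str, expected: str) -> float:
    if metrics_type not in TOY_SCORERS:
        raise ValueError(f"No toy scorer for METRICS_TYPE={metrics_type!r}")
    return TOY_SCORERS[metrics_type](response, expected)
```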
## Custom Prompt Templates
Override the default prompt for any Skills benchmark:
```bash
nel eval run --bench skills://gpqa --system-prompt "Think step by step."
```
Or via Python:
```python
from nemo_evaluator.environments.skills import SkillsEnvironment

env = SkillsEnvironment(
    "gpqa",
    prompt_template="Answer the following question.\n\n{problem}\n\nAnswer:",
)
```
## Available Benchmarks

```bash
nel list  # shows built-in + skills if installed
```
### Math Reasoning

| Benchmark | Skills name | Type | Problems |
|---|---|---|---|
| AIME 2024 | `aime24` | math | ~30 |
| AIME 2025 |  | math | ~30 |
| HMMT Feb 2025 |  | math | ~30 |
| GSM8K |  | math | 1,319 |
| MATH |  | math | 5,000 |
### Knowledge and Reasoning

| Benchmark | Skills name | Type | Problems |
|---|---|---|---|
| GPQA Diamond | `gpqa` | multichoice | ~198 |
| MMLU | `mmlu` | multichoice | ~14,000 |
| MMLU-Pro |  | multichoice | ~12,000 |
| HLE |  | varies | ~500 |
| SimpleQA |  | simpleqa | ~4,000 |
### Code

| Benchmark | Skills name | Type | Problems |
|---|---|---|---|
| LiveCodeBench |  | code_metrics | varies |
| SciCode |  | code_metrics | varies |
| SWE-bench |  | code_metrics | varies |
### Instruction Following

| Benchmark | Skills name | Type | Problems |
|---|---|---|---|
| IFBench |  | varies | varies |
| IFEval |  | varies | varies |
### Tool Calling

| Benchmark | Skills name | Type | Problems |
|---|---|---|---|
| BFCL v3 |  | varies | varies |
| BFCL v4 |  | varies | varies |
### Long Context

| Benchmark | Skills name | Type | Problems |
|---|---|---|---|
| RULER |  | varies | varies |
| AA-LCR |  | varies | varies |
## Distributed Evaluation

Combine with sharding for large benchmarks: use SLURM sharding via a config file (see Distributed Evaluation), or set the shard environment variables directly:

```bash
NEL_SHARD_IDX=0 NEL_TOTAL_SHARDS=8 nel eval run --bench skills://mmlu --repeats 4
```
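Each shard then evaluates a disjoint slice of the benchmark. The sketch below shows one common way such a split can work, round-robin over problem indices; it is a conceptual illustration of what `NEL_SHARD_IDX` / `NEL_TOTAL_SHARDS` imply, not NEL's actual partitioning code.

```python
import os

# Conceptual round-robin partitioning of problem indices across shards.
# NEL's real sharding strategy may differ; this only illustrates the idea.
shard_idx = int(os.environ.get("NEL_SHARD_IDX", "0"))
total_shards = int(os.environ.get("NEL_TOTAL_SHARDS", "1"))

n_problems = 14000  # e.g. roughly MMLU-sized, per the table above
my_indices = [i for i in range(n_problems) if i % total_shards == shard_idx]
print(f"shard {shard_idx}/{total_shards} handles {len(my_indices)} problems")
```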
## Python API

```python
import asyncio

from nemo_evaluator.environments.skills import SkillsEnvironment
from nemo_evaluator import run_evaluation, ChatSolver, ModelClient

env = SkillsEnvironment("gpqa")
client = ModelClient(base_url="https://api.example.com/v1", model="my-model")
solver = ChatSolver(client)
bundle = asyncio.run(run_evaluation(env, solver, n_repeats=4, max_problems=100))
```