# NeMo Skills Integration

Use NeMo Skills benchmarks as `SkillsEnvironment` instances with full per-request observability.
## Setup

```bash
pip install -e ".[skills]"  # or: pip install nemo-skills
ns prepare_data gpqa
ns prepare_data aime24
```
## CLI

```bash
nel eval run --bench skills://gpqa --repeats 4 --max-problems 100
nel eval run --bench skills://aime24 --repeats 8
nel eval run --bench skills://mmlu --repeats 1
```
The `skills://` URI scheme resolves to a `SkillsEnvironment`, which:

- Loads the dataset via `nemo_skills.dataset.utils`
- Auto-prepares missing datasets
- Selects the correct scoring method from the benchmark's `METRICS_TYPE`
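Conceptually, the resolution just strips the scheme and passes the rest to `SkillsEnvironment`. The sketch below illustrates this; `resolve_benchmark` is a hypothetical helper for illustration, not NEL's actual registry code.

```python
from nemo_evaluator.environments.skills import SkillsEnvironment

def resolve_benchmark(uri: str) -> SkillsEnvironment:
    """Hypothetical helper showing what skills:// resolution amounts to."""
    prefix = "skills://"
    if not uri.startswith(prefix):
        raise ValueError(f"Not a skills benchmark URI: {uri}")
    # Everything after the scheme is the nemo_skills benchmark name.
    return SkillsEnvironment(uri[len(prefix):])

env = resolve_benchmark("skills://gpqa")  # equivalent to SkillsEnvironment("gpqa")
```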
## How It Works

```mermaid
flowchart LR
    NEL["nel eval run --bench skills://gpqa"] --> REG["Registry"]
    REG --> SE["SkillsEnvironment"]
    SE --> DS["nemo_skills dataset"]
    SE --> EVAL["nemo_skills evaluator"]
    NEL --> SOLVER["Solver"]
    SOLVER --> MC["ModelClient"]
    NEL --> OBS["Full trajectories<br/>latency, tokens, failures"]
```
`SkillsEnvironment` is a standard `EvalEnvironment`: the eval loop calls `seed(idx)` and `verify(response, expected)` like it does for any other environment, and NEL owns the model call, so you get full observability. The sequence is shown below, followed by a minimal code sketch.
```mermaid
sequenceDiagram
    participant E as eval_loop
    participant S as SkillsEnvironment
    participant NS as nemo_skills
    participant M as Model
    E->>S: seed(idx)
    S->>NS: dataset[idx]
    S-->>E: SeedResult(prompt, expected)
    E->>M: solver.solve(task)
    M-->>E: SolveResult
    E->>S: verify(response, expected)
    S->>NS: evaluator.eval_single()
    S-->>E: VerifyResult(reward, details)
```
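In code, one iteration of that loop looks roughly like the sketch below. The `SeedResult` and `VerifyResult` field names come from the diagram; treating `solver.solve()` as the only async call, passing the prompt as the task, and reading the model output from a `response` field are assumptions made for illustration.

```python
# Minimal sketch of a single eval-loop iteration against a SkillsEnvironment.
# Field names follow the diagram above; sync/async boundaries and the
# SolveResult "response" field are illustrative assumptions, not NEL's API.
async def run_one(env, solver, idx: int):
    seed = env.seed(idx)                        # SeedResult(prompt, expected)
    result = await solver.solve(seed.prompt)    # model call is owned by NEL
    verdict = env.verify(result.response, seed.expected)
    return verdict                              # VerifyResult(reward, details)
```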
## Scoring

The environment automatically selects the correct scoring method based on the benchmark's `METRICS_TYPE` (a toy sketch of this dispatch follows the table):

| Metrics type | Scoring | Benchmarks |
|---|---|---|
| `math` | Symbolic comparison | GSM8K, AIME, MATH, HMMT |
| `multichoice` | Letter extraction + exact match | GPQA, MMLU, MMLU-Pro, ARC |
| `simpleqa` | Exact match | SimpleQA, TriviaQA |
| `code_metrics` | Code execution sandbox | LiveCodeBench, EvalPlus |
|  | LLM judge | Arena, AlpacaEval |
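The mapping is effectively a dispatch table keyed on `METRICS_TYPE`. The toy sketch below illustrates the idea only; the real environment delegates to the matching nemo_skills evaluator rather than these stand-in scorers.

```python
# Toy illustration of METRICS_TYPE-based dispatch. The actual environment
# calls the corresponding nemo_skills evaluator; these scorers are stand-ins.
def exact_match(response: str, expected: str) -> float:
    return float(response.strip().lower() == expected.strip().lower())

def letter_match(response: str, expected: str) -> float:
    # Naive letter extraction: take the last A-D character in the response.
    letters = [c for c in response.upper() if c in "ABCD"]
    return float(bool(letters) and letters[-1] == expected.strip().upper())

TOY_SCORERS = {"multichoice": letter_match, "simpleqa": exact_match}

def score(metrics_type: str, response: str, expected: str) -> float:
    if metrics_type not in TOY_SCORERS:
        raise ValueError(f"No toy scorer for METRICS_TYPE={metrics_type!r}")
    return TOY_SCORERS[metrics_type](response, expected)
```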
## Custom Prompt Templates
Override the default prompt for any Skills benchmark:
```bash
nel eval run --bench skills://gpqa --system-prompt "Think step by step."
```
Or via Python:
```python
from nemo_evaluator.environments.skills import SkillsEnvironment

env = SkillsEnvironment(
    "gpqa",
    prompt_template="Answer the following question.\n\n{problem}\n\nAnswer:",
)
```
## Available Benchmarks

```bash
nel list  # shows built-in + skills if installed
```
### Math Reasoning

| Benchmark | Skills name | Type | Problems |
|---|---|---|---|
| AIME 2024 | `aime24` | math | ~30 |
| AIME 2025 |  | math | ~30 |
| HMMT Feb 2025 |  | math | ~30 |
| GSM8K |  | math | 1,319 |
| MATH |  | math | 5,000 |
### Knowledge and Reasoning

| Benchmark | Skills name | Type | Problems |
|---|---|---|---|
| GPQA Diamond | `gpqa` | multichoice | ~198 |
| MMLU | `mmlu` | multichoice | ~14,000 |
| MMLU-Pro |  | multichoice | ~12,000 |
| HLE |  | varies | ~500 |
| SimpleQA |  | simpleqa | ~4,000 |
### Code

| Benchmark | Skills name | Type | Problems |
|---|---|---|---|
| LiveCodeBench |  | code_metrics | varies |
| SciCode |  | code_metrics | varies |
| SWE-bench |  | code_metrics | varies |
### Instruction Following

| Benchmark | Skills name | Type | Problems |
|---|---|---|---|
| IFBench |  | varies | varies |
| IFEval |  | varies | varies |
### Tool Calling

| Benchmark | Skills name | Type | Problems |
|---|---|---|---|
| BFCL v3 |  | varies | varies |
| BFCL v4 |  | varies | varies |
### Long Context

| Benchmark | Skills name | Type | Problems |
|---|---|---|---|
| RULER |  | varies | varies |
| AA-LCR |  | varies | varies |
## Distributed Evaluation

Combine with sharding for large benchmarks: use SLURM sharding via a config file (see Distributed Evaluation), or set the shard environment variables directly:

```bash
NEL_SHARD_IDX=0 NEL_TOTAL_SHARDS=8 nel eval run --bench skills://mmlu --repeats 4
```
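Each shard then evaluates a disjoint slice of the benchmark. The sketch below shows one common way such a split can work, round-robin over problem indices; it is a conceptual illustration of what `NEL_SHARD_IDX` / `NEL_TOTAL_SHARDS` imply, not NEL's actual partitioning code.

```python
import os

# Conceptual round-robin partitioning of problem indices across shards.
# NEL's real sharding strategy may differ; this only illustrates the idea.
shard_idx = int(os.environ.get("NEL_SHARD_IDX", "0"))
total_shards = int(os.environ.get("NEL_TOTAL_SHARDS", "1"))

n_problems = 14000  # e.g. roughly MMLU-sized, per the table above
my_indices = [i for i in range(n_problems) if i % total_shards == shard_idx]
print(f"shard {shard_idx}/{total_shards} handles {len(my_indices)} problems")
```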
## Python API

```python
import asyncio

from nemo_evaluator.environments.skills import SkillsEnvironment
from nemo_evaluator import run_evaluation, ChatSolver, ModelClient

env = SkillsEnvironment("gpqa")
client = ModelClient(base_url="https://api.example.com/v1", model="my-model")
solver = ChatSolver(client)
bundle = asyncio.run(run_evaluation(env, solver, n_repeats=4, max_problems=100))
```