# API Reference

## Core Types

### SeedResult

Returned by `EvalEnvironment.seed()`.
| Field | Type | Description |
|---|---|---|
| | | The prompt to send to the model |
| | | Ground-truth answer for verification |
| | | Arbitrary metadata (category, difficulty, aliases) |
### VerifyResult

Returned by `EvalEnvironment.verify()`.
| Field | Type | Description |
|---|---|---|
| | | Score (0.0 = wrong, 1.0 = correct) |
| | | What was extracted from the model response |
| | | Method used, intermediate values |
| | | Additional verification metadata |
### EvalEnvironment

Abstract base class for all benchmarks.

```python
from nemo_evaluator.environments import EvalEnvironment, SeedResult, VerifyResult, register

@register("my_benchmark")
class MyBenchmark(EvalEnvironment):
    def __init__(self):
        super().__init__()
        self._dataset = [...]

    def __len__(self) -> int:
        return len(self._dataset)

    async def seed(self, idx: int) -> SeedResult:
        ...

    async def verify(self, response: str, expected: str, **metadata) -> VerifyResult:
        ...
```
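As an illustration only, here is a minimal concrete benchmark that seeds arithmetic questions and verifies answers by exact match. The keyword fields passed to `SeedResult` and `VerifyResult` (`prompt`, `expected`, `metadata`, `score`, `extracted`) are assumptions chosen to mirror the field descriptions above, not documented names; check the actual dataclass definitions before copying this.

```python
from nemo_evaluator.environments import EvalEnvironment, SeedResult, VerifyResult, register


@register("tiny_arithmetic")
class TinyArithmetic(EvalEnvironment):
    """Toy benchmark: single-digit addition with exact-match scoring."""

    def __init__(self):
        super().__init__()
        self._dataset = [(a, b) for a in range(3) for b in range(3)]

    def __len__(self) -> int:
        return len(self._dataset)

    async def seed(self, idx: int) -> SeedResult:
        a, b = self._dataset[idx]
        # Field names below are assumed; see the SeedResult table above.
        return SeedResult(
            prompt=f"What is {a} + {b}? Answer with a number only.",
            expected=str(a + b),
            metadata={"category": "arithmetic"},
        )

    async def verify(self, response: str, expected: str, **metadata) -> VerifyResult:
        # Take the last whitespace-separated token as the candidate answer.
        extracted = response.strip().split()[-1] if response.strip() else ""
        # Field names below are assumed; see the VerifyResult table above.
        return VerifyResult(
            score=1.0 if extracted == expected else 0.0,
            extracted=extracted,
        )
```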
## Runner

### ModelClient

Async client for OpenAI-compatible endpoints.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `base_url` | | | API base URL |
| `model` | | | Model identifier |
| `api_key` | | | API key |
| | | | Sampling temperature (0.0-2.0) |
| | | | Max completion tokens |
| | | | Nucleus sampling threshold (0.0-1.0) |
| | | | RNG seed for reproducibility |
| | | | Stop sequences |
| | | | Frequency penalty (-2.0 to 2.0) |
| | | | Presence penalty (-2.0 to 2.0) |
| | | | Request timeout in seconds |
| `max_concurrent` | | | Max parallel requests |
```python
from nemo_evaluator.engine import ModelClient

client = ModelClient(
    base_url="https://integrate.api.nvidia.com/v1",
    model="azure/openai/gpt-5.2",
    api_key="sk-...",
    max_concurrent=16,
)

response = await client.chat("What is 2+2?", system="Answer briefly.")
```
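A minimal sketch of batched usage, assuming `max_concurrent` throttles parallel requests inside the client and that `chat()` accepts a bare prompt string as shown above:

```python
import asyncio

prompts = [f"What is {i} + {i}?" for i in range(100)]

# The client caps in-flight requests at max_concurrent (16 above),
# so gathering every coroutine at once will not flood the endpoint.
answers = await asyncio.gather(
    *(client.chat(p, system="Answer briefly.") for p in prompts)
)
```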
### run_evaluation()

Core evaluation loop.

```python
async def run_evaluation(
    env: EvalEnvironment,
    solver: Solver,
    n_repeats: int = 1,
    max_problems: int | None = None,
    config: dict[str, Any] | None = None,
    progress: ProgressTracker | None = None,
    problem_range: tuple[int, int] | None = None,
    max_concurrent: int = 32,
    judge_client: Any = None,
    shard_info: tuple[int, int] | None = None,
) -> dict[str, Any]:
    ...
```
| Parameter | Description |
|---|---|
| `env` | |
| `solver` | |
| `n_repeats` | Number of times to evaluate each problem |
| `max_problems` | Limit to first N problems |
| `config` | Metadata included in the output bundle |
| `progress` | Progress tracker (default: no-op) |
| `problem_range` | |
| `max_concurrent` | Max parallel solve tasks (default: 32) |
| `judge_client` | Optional |
| `shard_info` | |
Returns a bundle dict containing metrics, results, config, and artifacts.
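An end-to-end run might look like the sketch below. It reuses `ModelClient` and `write_all` from this section and `ChatSolver` from the environment examples further down; the assumption that `run_evaluation` is importable from `nemo_evaluator.engine` alongside them, and that the bundle exposes a `"metrics"` key, is mine rather than stated above.

```python
from nemo_evaluator.engine import ModelClient, run_evaluation, write_all
from nemo_evaluator.solvers import ChatSolver

client = ModelClient(
    base_url="https://integrate.api.nvidia.com/v1",
    model="azure/openai/gpt-5.2",
    api_key="sk-...",
)
env = MyBenchmark()          # any registered EvalEnvironment subclass
solver = ChatSolver(client)

bundle = await run_evaluation(env, solver, n_repeats=4, max_problems=50)
print(bundle.get("metrics"))  # assumed key; the bundle "contains metrics, results, config, and artifacts"
write_all(bundle, "./eval_results")
```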
### write_all()

Write all artifacts to disk.

```python
from nemo_evaluator.engine import write_all

write_all(bundle, "./eval_results")
```

Writes `eval-*.json` (bundle), `results.jsonl`, `trajectories.jsonl`, `runtime_stats.json`, and `failure_analysis.json`.
### compare_runs()

Compare two evaluation bundles for regression.

```python
from nemo_evaluator.engine import compare_runs

report = compare_runs("baseline/eval-*.json", "candidate/eval-*.json")
```
## Environment Integrations

### GymEnvironment

HTTP client for consuming `nel serve` endpoints or Gym resource servers.

```python
from nemo_evaluator.environments.gym import GymEnvironment
from nemo_evaluator.solvers import ChatSolver

env = GymEnvironment("http://localhost:9090")
solver = ChatSolver(client)
bundle = await run_evaluation(env, solver, n_repeats=4)
```
### SkillsEnvironment

Wraps any NeMo Skills benchmark as an `EvalEnvironment`.

```python
from nemo_evaluator.environments.skills import SkillsEnvironment

env = SkillsEnvironment("gpqa")
solver = ChatSolver(client)
bundle = await run_evaluation(env, solver, n_repeats=4)
```
| Parameter | Type | Default | Description |
|---|---|---|---|
| | | required | Skills benchmark name (e.g., `gpqa`) |
| | | | Dataset split (default from benchmark config) |
| | | | Override data directory |
| | | | Custom prompt template |
| | | | Override scoring type (default from benchmark config) |
### VLMEvalKitEnvironment

Wraps VLMEvalKit datasets as an `EvalEnvironment`. Supports MCQ, VQA, and Y/N dataset types.

```python
from nemo_evaluator.environments.vlmevalkit import VLMEvalKitEnvironment
from nemo_evaluator.solvers import VLMSolver  # import path assumed to mirror ChatSolver

env = VLMEvalKitEnvironment("MMBench_DEV_EN")
solver = VLMSolver(client)
bundle = await run_evaluation(env, solver, n_repeats=1)
```
### generate_app()

Serve an `EvalEnvironment` as a Gym-compatible HTTP endpoint.

```python
from nemo_evaluator.serving.app import generate_app

app = generate_app(env, gym_compat=True)
# Use with uvicorn: uvicorn.run(app, port=9090)
```
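A server started this way can then be consumed from another process with `GymEnvironment("http://localhost:9090")` as shown above, or via the `--adapter gym://host:port` option of `nel eval run`.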
## Observability

### ModelResponse
| Field | Type | Description |
|---|---|---|
| | | Model output text |
| | | Model identifier |
| | | Finish reason (stop, length, etc.) |
| | | Input tokens |
| | | Output tokens |
| | | Total tokens |
| | | Reasoning tokens (if available) |
| | | Request latency |
| | | Full API response |
### StepRecord

Complete record for one seed → model → verify cycle.

| Field | Type | Description |
|---|---|---|
| | | Problem index |
| | | Repeat index |
| | | The prompt sent |
| | | Ground truth |
| | | Full model response |
| | | Verification score |
| | | Extracted from response |
| | | Method used for scoring |
| | | Scoring intermediaries |
| | | Time to seed |
| | | Time for model call |
| | | Time to verify |
| | | Total step time |
| | | Error if model call failed |
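When `write_all()` is used, per-step records land in `results.jsonl`, which makes ad-hoc analysis straightforward. The sketch below assumes only the JSON-lines layout stated above; the key names `score` and `error` are hypothetical stand-ins for whatever the serialized `StepRecord` fields are actually called.

```python
import json
from pathlib import Path

# Each line of results.jsonl is assumed to serialize one StepRecord.
records = [
    json.loads(line)
    for line in Path("eval_results/results.jsonl").read_text().splitlines()
    if line.strip()
]

# "score" and "error" are hypothetical key names standing in for the
# "Verification score" and "Error if model call failed" fields above.
scores = [r["score"] for r in records if not r.get("error")]
if scores:
    print(f"{len(records)} steps, mean score {sum(scores) / len(scores):.3f}")
```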
## CLI Commands

| Command | Description |
|---|---|
| `nel eval run` | Run evaluation (benchmark name or YAML config) |
| `nel eval report` | Generate evaluation report |
| `nel eval merge` | Merge sharded evaluation results |
| `nel serve` | Start HTTP server for an environment |
| `nel validate` | Quick validation of a benchmark |
| `nel list` | Show available benchmarks and environments |
| `nel compare` | Compare two evaluation bundles |
| `nel gate` | Apply a multi-benchmark gate policy |
| | Persistent user config |
| | Containerize BYOB benchmark |
### nel eval run

```text
nel eval run [CONFIG_FILE]

  --bench, -b TEXT        Benchmark name
  --repeats, -n INT       Repeats per problem [1]
  --max-problems, -m INT  Limit problem count
  --model-url TEXT        Model API base URL
  --model-id TEXT         Model identifier
  --system-prompt TEXT    System prompt
  --adapter TEXT          Adapter URI (gym://host:port, skills://benchmark)
  --output-dir, -o TEXT   Output directory [./eval_results]
  --resume                Resume partially completed suite
  -O, --override TEXT     Dot-path config overrides (e.g. services.model.model=foo)
  --no-progress           Disable progress bar
```
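For example, a quick smoke run against a hosted endpoint might look like this (benchmark name, URL, and model are illustrative, taken from the examples above):

```bash
nel eval run --bench gpqa \
  --model-url https://integrate.api.nvidia.com/v1 \
  --model-id azure/openai/gpt-5.2 \
  --repeats 4 --max-problems 20 \
  --output-dir ./eval_results
```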
### nel serve

```text
nel serve

  --bench, -b TEXT    Benchmark to serve
  --port INT          Port [9090]
  --host TEXT         Host [0.0.0.0]
  --gym-compat        Enable Gym-compatible endpoints
  --export-data TEXT  Export JSONL to path and exit
```
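Serving a benchmark and evaluating against it can be combined; this pairing is illustrative and uses only the options documented here:

```bash
# Terminal 1: expose the benchmark on port 9090
nel serve --bench gpqa --port 9090 --gym-compat

# Terminal 2: point the evaluation at the served endpoint
nel eval run --adapter gym://localhost:9090 \
  --model-url https://integrate.api.nvidia.com/v1
```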
### nel validate

```text
nel validate

  --bench, -b TEXT  Benchmark to validate
  --samples INT     Number of samples [10]
```
### nel list

```text
nel list

  --source TEXT  Filter by source (e.g., lm-eval)
```
### nel eval report

```text
nel eval report RESULTS_DIR

  --format, -f TEXT  Output format (markdown, html, csv, json, latex)
  --output, -o TEXT  Output report path
  --all-formats      Generate all formats
```
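For instance (paths illustrative):

```bash
nel eval report ./eval_results --format html --output ./eval_results/report.html
```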
### nel eval merge

```text
nel eval merge OUTPUT_DIR

  --repeats, -n INT  Override n_repeats (auto-detected if omitted)
```
### nel compare

```text
nel compare BASELINE CANDIDATE

  --max-drop FLOAT  Max acceptable drop [0.05]
  --strict          Exit non-zero on regression
  --output TEXT     Write JSON report
```
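Typical CI usage might look like this; whether `BASELINE` and `CANDIDATE` are result directories or bundle files is not specified above, so directories (mirroring `nel gate`) are assumed here:

```bash
nel compare ./baseline_results ./candidate_results \
  --max-drop 0.02 --strict --output compare.json
```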
### nel gate

```text
nel gate BASELINE_DIR CANDIDATE_DIR

  --policy TEXT         Gate policy YAML file [required]
  --output TEXT         Write JSON gate report
  --format [text|json]  Output format [text]
  --strict              Exit non-zero on NO-GO or INCONCLUSIVE
  --verbose             Show per-benchmark reasons
```
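A gated promotion check might then look like the following; `gate_policy.yaml` is an illustrative name for a policy file written by the user:

```bash
nel gate ./baseline_results ./candidate_results \
  --policy gate_policy.yaml --strict --verbose \
  --format json --output gate_report.json
```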