Local Deployment#
Run evaluations directly on your workstation.
Single benchmark (CLI)#
nel eval run --bench gsm8k --repeats 4 \
--model-url https://integrate.api.nvidia.com/v1/chat/completions \
--model-id nvidia/nemotron-3-super-120b-a12b \
--output-dir ./results
Config file#
# eval.yaml
services:
nemotron:
type: api
url: https://integrate.api.nvidia.com/v1/chat/completions
protocol: chat_completions
model: nvidia/nemotron-3-super-120b-a12b
api_key: ${NVIDIA_API_KEY}
benchmarks:
- name: gsm8k
repeats: 4
solver:
type: simple
service: nemotron
system_prompt: "Solve step by step. Put your final answer in \\boxed{}."
- name: triviaqa
repeats: 1
max_problems: 200
solver:
type: simple
service: nemotron
nel eval run eval.yaml
Multi-model evaluation#
Use named services to evaluate multiple models in the same config:
services:
solver:
type: api
url: https://integrate.api.nvidia.com/v1/chat/completions
protocol: chat_completions
model: nvidia/nemotron-3-super-120b-a12b
api_key: ${NVIDIA_API_KEY}
judge:
type: api
url: https://integrate.api.nvidia.com/v1/chat/completions
protocol: chat_completions
model: nvidia/nemotron-3-super-120b-a12b
api_key: ${NVIDIA_API_KEY}
benchmarks:
- name: simpleqa
solver:
type: simple
service: solver
scoring:
metrics:
- type: judge
name: correctness
service: judge
Resume a partially completed suite#
If a benchmark fails mid-suite, the remaining benchmarks still execute. Re-run with --resume to retry only the failed ones:
nel eval run eval.yaml --resume
Completed benchmarks are skipped automatically. Without --resume, all benchmarks are re-run from scratch.
Serve for Gym#
nel serve -b gsm8k --gym-compat --port 9090
Validate a benchmark#
Quick sanity check (10 samples, prints pass/fail per sample):
nel validate -b gsm8k --samples 10
Environment variables#
Variable |
Purpose |
Example |
|---|---|---|
|
API key for NVIDIA endpoints |
|
|
API key for OpenAI endpoints |
|
|
Shard index for distributed eval |
|
|
Total shards for distributed eval |
|