# Quickstart
Run a complete evaluation in under 5 minutes.
## Step 1: Set your API key
```bash
export NVIDIA_API_KEY="your-api-key-here"
```
## Step 2: Run an evaluation
```bash
nel eval run --bench gsm8k \
  --model-url https://integrate.api.nvidia.com/v1/chat/completions \
  --model-id nvidia/nemotron-3-super-120b-a12b \
  --repeats 2 --max-problems 20
```
Live progress output:
```text
gsm8k [12/20] 60.0% ████████████░░░░░░░░ acc=0.750 1.2k tok/s ETA 0:32
```
## Step 3: Read the results
After completion, `nel` prints a summary and writes artifacts:
```text
gsm8k
  problems=20 repeats=2
  pass@1: 0.7500 [0.5500, 0.9000]
  pass@2: 0.8500 [0.6500, 0.9500]
  tokens=12,400 speed=2.1/s

Output: ./eval_results/
  bundle: eval-20260225T143012Z-gsm8k.json
  results: results.jsonl
  trajectories: trajectories.jsonl
  runtime_stats: runtime_stats.json
  failure_analysis: failure_analysis.json
```
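The `pass@k` rows estimate the probability that at least one of `k` sampled attempts per problem is correct; the bracketed values are a confidence interval over problems. For reference, the sketch below computes the standard combinatorial pass@k estimator for a single problem; whether `nel` computes it exactly this way is an assumption here.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Standard unbiased pass@k estimator for one problem:
    probability that at least one of k draws (without replacement)
    from n attempts containing c correct attempts is correct."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws: guaranteed success
    return 1.0 - comb(n - c, k) / comb(n, k)

# With repeats=2, a problem solved in one of its two attempts:
print(pass_at_k(n=2, c=1, k=1))  # 0.5
print(pass_at_k(n=2, c=1, k=2))  # 1.0
```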
## Artifact reference
| File | Contents | Use case |
|---|---|---|
| `eval-<timestamp>-<bench>.json` | Full bundle: config, scores, CI, categories | Regression comparison, model cards |
| `results.jsonl` | Per-problem: prompt, response, reward, extracted answer | Debugging, error analysis |
| `trajectories.jsonl` | Full step records: timing, tokens, scoring details | RL pipeline input, audit |
| `runtime_stats.json` | Latency percentiles, token distributions, throughput | Performance analysis |
| `failure_analysis.json` | Categorized failures with exemplars | Model improvement |
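For quick error analysis you can scan `results.jsonl` directly. A minimal sketch, assuming one JSON object per line with fields named along the lines of the table above (the exact key names are an assumption; inspect a record to confirm):

```python
import json
from pathlib import Path

results_path = Path("eval_results/results.jsonl")  # path from Step 3

failures = []
with results_path.open() as f:
    for line in f:
        rec = json.loads(line)
        # Assumed keys: "reward" (per-problem score) and "extracted_answer".
        if rec.get("reward", 0.0) < 1.0:
            failures.append(rec)

print(f"{len(failures)} failing problems")
for rec in failures[:5]:
    print(rec.get("extracted_answer"), "|", str(rec.get("response", ""))[:80])
```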
## Step 4: Compare against a baseline
```bash
nel eval run --bench gsm8k --repeats 4 -o ./results/candidate
nel compare ./results/baseline ./results/candidate --strict
```
`nel compare` accepts directories or bundle files. It pairs results at the problem level, runs a McNemar exact test for regression detection, and generates an investigation report.
When scipy is installed (`pip install nemo-evaluator[stats]`), comparisons include McNemar significance testing and confidence intervals on effect sizes.
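For intuition, McNemar's exact test looks only at problems where the two runs disagree: if baseline and candidate were equivalent, "baseline right / candidate wrong" would be as likely as the reverse. The sketch below shows that paired test on per-problem pass/fail vectors; it illustrates the statistic, not `nel`'s internal code.

```python
from scipy.stats import binomtest

# Paired per-problem correctness (same problem order in both runs).
baseline  = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
candidate = [1, 0, 0, 1, 1, 1, 1, 0, 1, 0]

# Count discordant pairs in each direction.
b = sum(1 for x, y in zip(baseline, candidate) if x == 1 and y == 0)
c = sum(1 for x, y in zip(baseline, candidate) if x == 0 and y == 1)

# Exact McNemar test = two-sided binomial test on the discordant pairs.
p_value = binomtest(b, n=b + c, p=0.5).pvalue
print(f"baseline-only wins={b}, candidate-only wins={c}, p={p_value:.3f}")
```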
See Comparing Evaluation Runs for the full walkthrough.
## Step 5: Gate a release across multiple benchmarks
For release qualification, use `nel gate` with a policy file:
```yaml
# gate_policy.yaml
version: 1
defaults:
  tier: supporting
  metric: mean_reward
  max_drop: 0.015
benchmarks:
  mmlu_pro:
    tier: critical
    max_drop: 0.01
  gpqa:
    tier: critical
    max_drop: 0.01
```
```bash
nel gate ./results/baseline ./results/candidate --policy gate_policy.yaml --strict
```
The gate applies the per-benchmark thresholds and returns GO, NO-GO, or INCONCLUSIVE. With `--strict`, the exit code (0/1/2) can drive CI decisions directly.
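As an illustration, a CI step can branch on that exit code. A minimal sketch, assuming the 0/1/2 codes correspond to GO/NO-GO/INCONCLUSIVE in that order (the exact mapping is an assumption):

```python
import subprocess
import sys

proc = subprocess.run([
    "nel", "gate", "./results/baseline", "./results/candidate",
    "--policy", "gate_policy.yaml", "--strict",
])

# Assumed mapping: 0 = GO, 1 = NO-GO, 2 = INCONCLUSIVE.
verdict = {0: "GO", 1: "NO-GO", 2: "INCONCLUSIVE"}.get(proc.returncode, "UNKNOWN")
print(f"gate verdict: {verdict}")

sys.exit(proc.returncode)  # propagate so the CI job fails on NO-GO / INCONCLUSIVE
```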
See Implementing Quality Gates for the full walkthrough.
## Step 6: Use a config file
For reproducible multi-benchmark evaluations:
```yaml
# eval_config.yaml
services:
  nemotron:
    type: api
    url: https://integrate.api.nvidia.com/v1/chat/completions
    protocol: chat_completions
    model: nvidia/nemotron-3-super-120b-a12b
    api_key: ${NVIDIA_API_KEY}
benchmarks:
  - name: gsm8k
    repeats: 4
    solver:
      type: simple
      service: nemotron
      system_prompt: "Solve step by step. Put your final answer in \\boxed{}."
  - name: triviaqa
    repeats: 1
    max_problems: 100
    solver:
      type: simple
      service: nemotron
```
```bash
nel eval run eval_config.yaml
```
Each task gets its own output directory with the full artifact suite.
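To aggregate a multi-benchmark run, you can walk those per-task directories and load each bundle. A minimal sketch, assuming each task directory contains a single `eval-*.json` bundle with top-level benchmark and score fields (the bundle schema is an assumption; inspect one file to confirm):

```python
import json
from pathlib import Path

output_root = Path("eval_results")  # adjust to your output directory

for bundle_path in sorted(output_root.glob("*/eval-*.json")):
    bundle = json.loads(bundle_path.read_text())
    # Assumed keys; fall back to the directory name if absent.
    name = bundle.get("benchmark", bundle_path.parent.name)
    score = bundle.get("mean_reward")
    print(f"{name}: {score}")
```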
## Step 7: Resume a failed suite
If a benchmark fails mid-suite (e.g., a network error on benchmark 3/5), the remaining benchmarks still execute. Re-run with --resume to retry only the failed ones:
```bash
nel eval run eval_config.yaml --resume
```
Completed benchmarks are skipped. Failed benchmarks are retried.
## Next Steps
- Comparing Evaluation Runs – Deep dive into comparing runs and investigating regressions
- Implementing Quality Gates – Set up multi-benchmark release gates with policy files
- Write Your Own Benchmark (BYOB) – Write your own benchmark with `@benchmark` + `@scorer`
- Gym Integration – Serve benchmarks for Gym training
- Distributed Evaluation – Scale to thousands of problems
- Architecture – Understand how the system works