Scoring#
Overview#
Each benchmark defines its own scorer via the @scorer decorator. The scoring/ package provides reusable scorer implementations:
| Module | Purpose |
|---|---|
| scoring/judge.py | LLM-as-judge post-processing for samples flagged by needs_judge() |
| scoring/json_schema.py | Validation of structured model outputs against a JSON schema |
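In practice, a benchmark's @scorer function is usually a thin wrapper that delegates to one of the primitives below. A minimal sketch (the @scorer registration details are assumed here, not documented on this page):

from nemo_evaluator import scorer, exact_match, ScorerInput

@scorer
def my_benchmark_scorer(sample: ScorerInput) -> dict:
    # Benchmark-specific extraction or normalization would go here;
    # the final comparison is delegated to a reusable primitive.
    return exact_match(sample)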
Scoring Primitives#
These functions are the building blocks for @scorer functions. All are importable from the top-level package.
exact_match(sample)#
Normalized string comparison: lowercases, strips whitespace and punctuation, and removes articles (a/an/the).
from nemo_evaluator import exact_match, ScorerInput
s = ScorerInput(response=" Paris. ", target="paris")
exact_match(s) # {"correct": True}
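Because normalization strips punctuation and removes articles, superficially different strings can still match:

s = ScorerInput(response="The Eiffel Tower.", target="eiffel tower")
exact_match(s)  # {"correct": True}: the article and trailing period are normalized away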
multichoice_regex(sample)#
Extracts a letter (A-D by default) from “Answer: X” patterns.
from nemo_evaluator import multichoice_regex, ScorerInput
s = ScorerInput(response="The answer is B because...\nAnswer: B", target="B")
multichoice_regex(s) # {"correct": True, "extracted": "B"}
# Custom pattern for 10-choice (A-J):
multichoice_regex(s, pattern=r"(?i)Answer\s*:\s*([A-J])")
answer_line(sample)#
Extracts the text after “Answer:” and compares it to the target using math normalization.
from nemo_evaluator import answer_line, ScorerInput
s = ScorerInput(response="Step 1: ...\nAnswer: 42", target="42")
answer_line(s) # {"correct": True, "extracted": "42"}
numeric_match(sample)#
Extracts the last number in the response and compares it to the target.
from nemo_evaluator import numeric_match, ScorerInput
s = ScorerInput(response="The total is 20 + 22 = 42", target="42")
numeric_match(s) # {"correct": True, "extracted": "42"}
fuzzy_match(sample)#
Normalized substring containment. Supports multiple correct answers via metadata["correct_answers"].
from nemo_evaluator import fuzzy_match, ScorerInput
s = ScorerInput(response="The capital is Canberra.", target="Canberra",
metadata={"correct_answers": ["Canberra", "canberra"]})
fuzzy_match(s) # {"correct": True, "extracted": "The capital is Canberra."}
code_sandbox(sample)#
Runs code in a Docker container with network isolation, memory limits, and timeouts. Extracts code from markdown fences, concatenates with prompt code and test harness, and checks the exit code.
from nemo_evaluator import code_sandbox, ScorerInput
s = ScorerInput(
response="```python\ndef add(a, b):\n return a + b\n```",
target="add",
metadata={"_prompt": "def add(a, b):\n", "_test": "assert add(1, 2) == 3",
"entry_point": "add"},
)
code_sandbox(s) # {"correct": True, "extracted": "def add(a, b):\n return a + b"}
Requires Docker daemon access.
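Because correctness is derived from the exit code of the assembled program, a solution that fails the test harness scores as incorrect. Reusing the metadata above with a buggy implementation:

s_bad = ScorerInput(
    response="```python\ndef add(a, b):\n    return a - b\n```",
    target="add",
    metadata={"_prompt": "def add(a, b):\n", "_test": "assert add(1, 2) == 3",
              "entry_point": "add"},
)
code_sandbox(s_bad)  # {"correct": False, ...}: the assert fails, so the exit code is non-zero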
needs_judge(sample)#
Signals that this sample requires LLM-as-judge scoring. Returns {"correct": False, "needs_judge": True} so the eval loop’s judge post-processor handles it.
Used by SimpleQA and HealthBench.
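A benchmark that defers to the judge can return the primitive's result directly:

from nemo_evaluator import needs_judge, ScorerInput

s = ScorerInput(response="The capital of France is Paris.", target="Paris")
needs_judge(s)  # {"correct": False, "needs_judge": True}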
LLM-as-Judge Pipeline#
Benchmarks that use needs_judge() are scored in a post-processing step by scoring/judge.py. The judge pipeline:
1. Collects all samples flagged with needs_judge: True
2. Constructs judge prompts from the response and expected answer
3. Calls the judge model (configured via --judge-url / JudgeScoringConfig)
4. Parses the judge verdict and updates rewards
Configure the judge model:
nel eval run --bench simpleqa \
--model-url https://api.example.com/v1 --model-id my-model \
--judge-url https://api.example.com/v1 --judge-id gpt-4o
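For programmatic runs, the same settings map onto JudgeScoringConfig. A hedged sketch: the import path and field names below are assumptions, not documented API:

from nemo_evaluator.scoring import JudgeScoringConfig  # import path assumed

judge_config = JudgeScoringConfig(
    judge_url="https://api.example.com/v1",  # assumed field, mirrors --judge-url
    judge_id="gpt-4o",                       # assumed field, mirrors --judge-id
)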
JSON Schema Scoring#
scoring/json_schema.py validates structured model outputs against a JSON schema:
from nemo_evaluator.scoring import validate_json_schema
result = validate_json_schema(response_text, schema={"type": "object", "required": ["answer"]})
# {"valid": True, "extracted": {"answer": "42"}, "score": 1.0}
Metrics#
pass@k#
Standard Codex-style pass@k. Given n attempts per problem with c correct:
$$\text{pass@k} = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}$$
from nemo_evaluator.metrics import pass_at_k
pass_at_k(n=8, c=3, k=1)  # probability that at least one of 1 sampled attempt is correct
pass_at_k(n=8, c=3, k=4)  # probability that at least one of 4 sampled attempts is correct
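Worked through for the calls above: with n=8, c=3, k=1 the estimator gives 1 - C(5,1)/C(8,1) = 1 - 5/8 = 0.375, and with k=4 it gives 1 - C(5,4)/C(8,4) = 1 - 5/70 = 13/14 ≈ 0.929. A minimal reference implementation of the same estimator, not necessarily the package's own code:

import numpy as np

def pass_at_k_ref(n: int, c: int, k: int) -> float:
    # 1 - C(n-c, k) / C(n, k), evaluated as a numerically stable product
    if n - c < k:
        return 1.0  # every size-k sample must contain a correct attempt
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))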
Bootstrap Confidence Intervals#
95% CI via bootstrap resampling (10,000 iterations):
from nemo_evaluator.metrics import bootstrap_ci
ci = bootstrap_ci(scores)
print(f"pass@1: {ci.value:.4f} [{ci.ci_lower:.4f}, {ci.ci_upper:.4f}]")
Category Breakdown#
When problems include category metadata, per-category accuracy is computed automatically:
from nemo_evaluator.metrics.aggregation import category_breakdown
cats = category_breakdown(results, "category")
for c in cats:
print(f"{c.category}: {c.mean_reward:.3f} ({c.n_samples} samples)")
Failure Analysis#
The ArtifactCollector categorizes failures automatically:
| Category | Detection |
|---|---|
| timeout | Model error contains “timeout” |
| rate_limit | Model error contains “429” or “rate” |
| refusal | Response contains “I cannot”, “I’m sorry” |
| empty_response | Empty or whitespace-only response |
| format_error | Non-empty response but no answer extracted |
Output in failure_analysis.json:
{
"total_failures": 12,
"failure_rate": 0.06,
"categories": {
"refusal": {"count": 5, "rate": 0.025},
"format_error": {"count": 4, "rate": 0.02},
"timeout": {"count": 3, "rate": 0.015}
},
"exemplars": [
{"category": "refusal", "problem_idx": 42, "response_preview": "I'm sorry, I cannot..."}
]
}