Industry Benchmarks#
Evaluate with Published Datasets#
NeMo Platform provides a streamlined API for evaluating large language models with publicly available datasets, offering more than 130 industry benchmarks that you can run as evaluation jobs.
Benchmarks provide standardized methods for comparing model performance across different capabilities. These benchmarks are widely used in the research community and provide reliable, reproducible metrics for model assessment.
Refer to the Run a Benchmark Evaluation tutorial for details on using industry benchmarks and managing evaluation jobs.
- **Standard Datasets**: Most benchmarks include predefined datasets widely used in research.
- **Reproducible Metrics**: Use established methodologies to calculate metrics.
- **Community Standards**: Compare results across different models and research groups.
Discover Industry Benchmarks#
Discover industry benchmarks available to use for your evaluation job within the system workspace. List all industry benchmarks or filter by label category.
Note
The system workspace is a reserved workspace for NeMo Platform that contains ready-to-use benchmarks representing industry benchmarks with published datasets and metrics.
Note
Initialization: This example uses NeMoPlatform() with no arguments so the SDK reads your active CLI context (set by nmp auth login). The workspace="system" argument is passed per call to access the reserved system workspace. For the standard quickstart init, see Initializing the CLI and SDK.
```python
from nemo_platform import NeMoPlatform

client = NeMoPlatform()

benchmarks = client.evaluation.benchmarks.list(workspace="system")
print(f"{benchmarks.pagination.total_results} benchmarks")
for benchmark in benchmarks:
    print(f"{benchmark.name}: {benchmark.description}")
```
```python
# List benchmarks with pagination:
benchmarks = client.evaluation.benchmarks.list(
    workspace="system",
    page=2,
    page_size=100,
)
for benchmark in benchmarks:
    print(f"{benchmark.name}: {benchmark.description}")
```
```python
# Filter by evaluation category label
filtered_benchmarks = client.evaluation.benchmarks.list(
    workspace="system",
    extra_query={"search": {"data.labels.eval_category": {"$eq": "advanced_reasoning"}}},
)
print(filtered_benchmarks)
```
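The extra_query filter uses a MongoDB-style search expression. A small local helper (hypothetical, not part of the SDK) can build the same dict for any category label:

```python
def label_filter(field: str, value: str) -> dict:
    """Build a MongoDB-style equality filter on a label field.

    Illustrative helper only; it constructs the dict that the
    `extra_query` argument in the listing call above expects.
    """
    return {"search": {f"data.labels.{field}": {"$eq": value}}}

# Same filter as the example above:
query = label_filter("eval_category", "advanced_reasoning")
print(query)
# {'search': {'data.labels.eval_category': {'$eq': 'advanced_reasoning'}}}
```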
| Category Label | Description |
|---|---|
| `agentic` | Evaluate the performance of agent-based or multi-step reasoning models, especially in scenarios requiring planning, tool use, and iterative reasoning. |
| `advanced_reasoning` | Evaluate reasoning capabilities of large language models through complex tasks. |
| `code` | Evaluate code generation capabilities using functional correctness benchmarks that test synthesis of working programs. |
| `content_safety` | Evaluate model safety risks, including vulnerability to generating harmful, biased, or misleading content. |
| `instruction_following` | Evaluate the ability to follow explicit formatting and structural instructions. |
| `language_understanding` | Evaluate knowledge and reasoning across diverse subjects in different languages. |
| `math` | Evaluate mathematical reasoning abilities. |
| `question_answering` | Evaluate the ability to generate answers to questions. |
| `rag` | Evaluate the quality of RAG pipelines by measuring both retrieval and answer generation performance. |
| `retrieval` | Evaluate the quality of document retriever pipelines. |
Choosing a Benchmark Variant#
Many benchmarks offer multiple variants optimized for different model types:
| Variant | Endpoint | Description |
|---|---|---|
| `-instruct` | `/v1/chat/completions` | Zero-shot evaluation for instruction-tuned models |
| Base (no suffix) | `/v1/completions` | Few-shot evaluation for base models (requires `tokenizer`) |
| `-nemo` | `/v1/chat/completions` | Optimized prompts for NVIDIA NeMo models; often no judge required |
| `-cot` | `/v1/chat/completions` | Chain-of-thought prompting for improved reasoning accuracy |
Common Parameters#
| Parameter | Description |
|---|---|
| `parallelism` | Number of concurrent inference requests. Higher values increase throughput but can hit rate limits. |
| `limit_samples` | Evaluate only the first N samples. Useful for testing before running full evaluations. |
| `hf_token` | Reference to a secret containing your Hugging Face token for accessing gated datasets. |
| `tokenizer` | Hugging Face tokenizer ID, required for completions-based benchmarks. |
Tip
The model field in all benchmark examples below accepts either an inline model definition or a model reference string (for example, "my-workspace/my-model"). Refer to Model Configuration for details.
Advanced Reasoning#
Evaluate reasoning capabilities of large language models through complex tasks with datasets like GPQA, BIG-Bench Hard (BBH), or Multistep Soft Reasoning (MuSR).
Label:
eval_category.advanced_reasoning
Available Benchmarks#
To get the latest benchmarks available for your system, filter with the eval_category.advanced_reasoning label.
| Benchmark | Description | Required Params |
|---|---|---|
| `gpqa-diamond` | GPQA Diamond subset—198 graduate-level science questions | `hf_token` |
| `gpqa-extended` | GPQA Extended subset—546 questions in biology, physics, chemistry | `hf_token` |
| `gpqa-main` | GPQA Main subset—448 questions | `hf_token` |
| `gpqa-diamond-nemo` | GPQA Diamond with NeMo alignment template | `hf_token` |
| `gpqa-diamond-cot` | GPQA Diamond with chain-of-thought prompting | `hf_token` |
| `gpqa` | GPQA few-shot evaluation ¹ | `hf_token`, `tokenizer` |
| `bbh-instruct` | BIG-Bench Hard—23 challenging reasoning tasks | `hf_token` |
| `bbh` | BIG-Bench Hard ¹ | `hf_token`, `tokenizer` |
| `musr` | MuSR—multistep reasoning through narrative problems ¹ | `hf_token`, `tokenizer` |
¹ Completions-only: Requires /v1/completions endpoint and tokenizer parameter.
Examples#
```python
import os

from nemo_platform import NeMoPlatform
from nemo_platform.types.evaluation import SystemBenchmarkOnlineJobParam, EvaluationJobParams

client = NeMoPlatform(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
    workspace="my-workspace",  # Replace with your workspace or use "default"
)
```
```python
job = client.evaluation.benchmark_jobs.create(
    description="GPQA Diamond evaluation",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/gpqa-diamond",
        model={"url": "<your-nim-endpoint>/v1", "name": "nvidia/llama-3.3-nemotron-super-49b-v1"},
        params=EvaluationJobParams(parallelism=16),
        benchmark_params={"hf_token": "hf_token"},
    )
)
```

```python
job = client.evaluation.benchmark_jobs.create(
    description="GPQA Extended evaluation",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/gpqa-extended",
        model={"url": "<your-nim-endpoint>/v1", "name": "nvidia/llama-3.3-nemotron-super-49b-v1"},
        params=EvaluationJobParams(parallelism=16),
        benchmark_params={"hf_token": "hf_token"},
    )
)
```

```python
job = client.evaluation.benchmark_jobs.create(
    description="GPQA Main evaluation",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/gpqa-main",
        model={"url": "<your-nim-endpoint>/v1", "name": "nvidia/llama-3.3-nemotron-super-49b-v1"},
        params=EvaluationJobParams(parallelism=16),
        benchmark_params={"hf_token": "hf_token"},
    )
)
```

```python
job = client.evaluation.benchmark_jobs.create(
    description="GPQA Diamond evaluation with NeMo template",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/gpqa-diamond-nemo",
        model={"url": "<your-nim-endpoint>/v1", "name": "nvidia/llama-3.3-nemotron-super-49b-v1"},
        params=EvaluationJobParams(parallelism=16),
        benchmark_params={"hf_token": "hf_token"},
    )
)
```

```python
job = client.evaluation.benchmark_jobs.create(
    description="GPQA Diamond chain-of-thought evaluation",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/gpqa-diamond-cot",
        model={"url": "<your-nim-endpoint>/v1", "name": "nvidia/llama-3.3-nemotron-super-49b-v1"},
        params=EvaluationJobParams(parallelism=16),
        benchmark_params={"hf_token": "hf_token"},
    )
)
```
Requires a /v1/completions endpoint.
```python
job = client.evaluation.benchmark_jobs.create(
    description="GPQA few-shot evaluation",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/gpqa",
        model={"url": "<your-nim-endpoint>/v1/completions", "name": "<your-base-model>"},
        params=EvaluationJobParams(parallelism=16),
        benchmark_params={
            "hf_token": "hf_token",
            "tokenizer": "<your-model-tokenizer>",
        },
    )
)
```
```python
job = client.evaluation.benchmark_jobs.create(
    description="BIG-Bench Hard evaluation",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/bbh-instruct",
        model={"url": "<your-nim-endpoint>/v1", "name": "nvidia/llama-3.3-nemotron-super-49b-v1"},
        params=EvaluationJobParams(parallelism=16),
        benchmark_params={"hf_token": "hf_token"},
    )
)
```
Requires a /v1/completions endpoint.
```python
job = client.evaluation.benchmark_jobs.create(
    description="BIG-Bench Hard evaluation",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/bbh",
        model={"url": "<your-nim-endpoint>/v1/completions", "name": "<your-base-model>"},
        params=EvaluationJobParams(parallelism=16),
        benchmark_params={
            "hf_token": "hf_token",
            "tokenizer": "<your-model-tokenizer>",
        },
    )
)
```
Requires a /v1/completions endpoint.
```python
job = client.evaluation.benchmark_jobs.create(
    description="Multistep Soft Reasoning evaluation",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/musr",
        model={"url": "<your-nim-endpoint>/v1/completions", "name": "<your-base-model>"},
        params=EvaluationJobParams(parallelism=16),
        benchmark_params={
            "hf_token": "hf_token",
            "tokenizer": "<your-model-tokenizer>",
        },
    )
)
```
Note
Most benchmarks require a Hugging Face token (hf_token) to access gated datasets. Create this secret before running evaluations:
```python
import os

client.secrets.create(
    workspace=workspace,
    name="hf_token",
    data=os.getenv("HF_TOKEN", "<your Hugging Face token>")
)
```
Results#
All advanced reasoning benchmarks produce an accuracy score (0.0–1.0) measuring the proportion of correct answers:
- **GPQA**: Multiple-choice accuracy (random baseline = 25%)
- **BBH**: Exact match accuracy across 23 reasoning tasks
- **MuSR**: Accuracy on multistep reasoning narratives
```python
# Get results after the job completes
aggregate = client.evaluation.benchmark_jobs.results.aggregate_scores.download(
    job=job.name,
)

# Print accuracy
for score in aggregate.scores:
    print(f"{score.name}: {score.mean:.1%}")
```
For detailed results analysis, refer to Benchmark Results.
Instruction Following#
Evaluate a model’s ability to follow explicit formatting and structural instructions such as “include keyword x” or “use format y.”
Label:
instruction_following
Available Benchmarks#
| Benchmark | Description | Required Params |
|---|---|---|
| `ifeval` | IFEval—500 prompts testing adherence to verifiable instructions | `hf_token` |
Examples#
```python
import os

from nemo_platform import NeMoPlatform
from nemo_platform.types.evaluation import SystemBenchmarkOnlineJobParam, EvaluationJobParams

client = NeMoPlatform(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
    workspace="my-workspace",  # Replace with your workspace or use "default"
)
```
```python
job = client.evaluation.benchmark_jobs.create(
    description="IFEval instruction following evaluation",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/ifeval",
        model={"url": "<your-nim-endpoint>/v1", "name": "nvidia/llama-3.3-nemotron-super-49b-v1"},
        params=EvaluationJobParams(parallelism=16),
        benchmark_params={"hf_token": "hf_token"},
    )
)
```
Results#
IFEval produces multiple accuracy scores measuring instruction compliance:
- **Prompt-level accuracy**: Percentage of prompts where all instructions were followed
- **Instruction-level accuracy**: Percentage of individual instructions followed across all prompts
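The difference between the two accuracy levels can be illustrated with a small local computation (the data here is made up; the actual scoring is done by the benchmark harness):

```python
# Each prompt maps to a list of booleans: was each instruction followed?
results = {
    "prompt-1": [True, True],          # all instructions followed
    "prompt-2": [True, False, True],   # one instruction missed
    "prompt-3": [True],                # followed
}

# Prompt-level accuracy: fraction of prompts with every instruction followed.
prompt_level = sum(all(flags) for flags in results.values()) / len(results)

# Instruction-level accuracy: fraction of individual instructions followed.
all_flags = [f for flags in results.values() for f in flags]
instruction_level = sum(all_flags) / len(all_flags)

print(f"prompt-level: {prompt_level:.1%}")            # 66.7%
print(f"instruction-level: {instruction_level:.1%}")  # 83.3%
```

A single missed instruction fails the whole prompt at the prompt level, so prompt-level accuracy is always the stricter of the two.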
```python
aggregate = client.evaluation.benchmark_jobs.results.aggregate_scores.download(job=job.name)
for score in aggregate.scores:
    print(f"{score.name}: {score.mean:.1%}")
```
For detailed results analysis, refer to Benchmark Results.
Language Understanding#
Evaluate knowledge and reasoning across diverse subjects using MMLU (Massive Multitask Language Understanding) benchmarks covering 57 subjects across STEM, humanities, social sciences, and more.
Label:
language_understanding
Available Benchmarks#
| Benchmark | Description | Required Params |
|---|---|---|
| `mmlu` | MMLU—57 subjects, few-shot evaluation ¹ | `hf_token`, `tokenizer` |
| `mmlu-instruct` | MMLU zero-shot with single-letter response format | `hf_token` |
| `mmlu-pro` | MMLU-Pro—10 answer choices, more rigorous ¹ | `hf_token`, `tokenizer` |
| `mmlu-pro-instruct` | MMLU-Pro zero-shot with chat template | `hf_token` |
| `mmlu-redux` | MMLU-Redux—3,000 re-annotated questions ¹ | `hf_token`, `tokenizer` |
| `mmlu-redux-instruct` | MMLU-Redux zero-shot with chat template | `hf_token` |
| `wikilingua` | WikiLingua—cross-lingual summarization | `hf_token` |
| `mmlu-<lang>` | Global-MMLU in 30+ languages ² | `hf_token` |
¹ Completions-only: Requires /v1/completions endpoint and tokenizer parameter.
² Supported languages: am, ar, bn, cs, de, el, en, es, fa, fil, fr, ha, he, hi, id, ig, it, ja, ko, ky, lt, mg, ms, ne, nl, ny, pl, pt, ro, ru, si, sn, so, sr, sv, sw, te, tr, uk, vi, yo
Examples#
```python
import os

from nemo_platform import NeMoPlatform
from nemo_platform.types.evaluation import SystemBenchmarkOnlineJobParam, EvaluationJobParams

client = NeMoPlatform(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
    workspace="my-workspace",  # Replace with your workspace or use "default"
)
```
```python
job = client.evaluation.benchmark_jobs.create(
    description="MMLU zero-shot evaluation",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/mmlu-instruct",
        model={"url": "<your-nim-endpoint>/v1", "name": "nvidia/llama-3.3-nemotron-super-49b-v1"},
        params=EvaluationJobParams(parallelism=16),
        benchmark_params={"hf_token": "hf_token"},
    )
)
```
Requires a /v1/completions endpoint.
```python
job = client.evaluation.benchmark_jobs.create(
    description="MMLU few-shot evaluation",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/mmlu",
        model={"url": "<your-nim-endpoint>/v1/completions", "name": "<your-base-model>"},
        params=EvaluationJobParams(parallelism=16),
        benchmark_params={
            "hf_token": "hf_token",
            "tokenizer": "<your-model-tokenizer>",
        },
    )
)
```
```python
job = client.evaluation.benchmark_jobs.create(
    description="MMLU-Pro zero-shot evaluation",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/mmlu-pro-instruct",
        model={"url": "<your-nim-endpoint>/v1", "name": "nvidia/llama-3.3-nemotron-super-49b-v1"},
        params=EvaluationJobParams(parallelism=16),
        benchmark_params={"hf_token": "hf_token"},
    )
)
```
Requires a /v1/completions endpoint.
```python
job = client.evaluation.benchmark_jobs.create(
    description="MMLU-Pro few-shot evaluation",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/mmlu-pro",
        model={"url": "<your-nim-endpoint>/v1/completions", "name": "<your-base-model>"},
        params=EvaluationJobParams(parallelism=16),
        benchmark_params={
            "hf_token": "hf_token",
            "tokenizer": "<your-model-tokenizer>",
        },
    )
)
```
```python
job = client.evaluation.benchmark_jobs.create(
    description="MMLU-Redux zero-shot evaluation",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/mmlu-redux-instruct",
        model={"url": "<your-nim-endpoint>/v1", "name": "nvidia/llama-3.3-nemotron-super-49b-v1"},
        params=EvaluationJobParams(parallelism=16),
        benchmark_params={"hf_token": "hf_token"},
    )
)
```
```python
job = client.evaluation.benchmark_jobs.create(
    description="Global-MMLU Spanish evaluation",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/mmlu-es",
        model={"url": "<your-nim-endpoint>/v1", "name": "nvidia/llama-3.3-nemotron-super-49b-v1"},
        params=EvaluationJobParams(parallelism=16),
        benchmark_params={"hf_token": "hf_token"},
    )
)
```
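Global-MMLU variants appear to follow a per-language naming pattern, inferred here from the system/mmlu-es example above; verify names against the benchmark listing for your deployment. A small local sketch that builds the benchmark name for a supported language code:

```python
# Supported Global-MMLU language codes, copied from the benchmark table above.
SUPPORTED = {
    "am", "ar", "bn", "cs", "de", "el", "en", "es", "fa", "fil", "fr", "ha",
    "he", "hi", "id", "ig", "it", "ja", "ko", "ky", "lt", "mg", "ms", "ne",
    "nl", "ny", "pl", "pt", "ro", "ru", "si", "sn", "so", "sr", "sv", "sw",
    "te", "tr", "uk", "vi", "yo",
}

def global_mmlu_benchmark(lang: str) -> str:
    """Return the assumed system benchmark name for a Global-MMLU language."""
    if lang not in SUPPORTED:
        raise ValueError(f"unsupported Global-MMLU language: {lang!r}")
    return f"system/mmlu-{lang}"

print(global_mmlu_benchmark("es"))  # system/mmlu-es
```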
```python
job = client.evaluation.benchmark_jobs.create(
    description="WikiLingua cross-lingual summarization",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/wikilingua",
        model={"url": "<your-nim-endpoint>/v1", "name": "nvidia/llama-3.3-nemotron-super-49b-v1"},
        params=EvaluationJobParams(parallelism=16),
        benchmark_params={"hf_token": "hf_token"},
    )
)
```
Results#
Language understanding benchmarks produce accuracy scores:
- **MMLU/MMLU-Pro/MMLU-Redux**: Multiple-choice accuracy across subjects (random baseline = 25% for MMLU, 10% for MMLU-Pro)
- **WikiLingua**: ROUGE scores for summarization quality
```python
aggregate = client.evaluation.benchmark_jobs.results.aggregate_scores.download(job=job.name)
for score in aggregate.scores:
    print(f"{score.name}: {score.mean:.1%}")
```
For detailed results analysis, refer to Benchmark Results.
Math & Reasoning#
Evaluate mathematical reasoning abilities from grade school arithmetic to competition-level mathematics.
Label:
math
Available Benchmarks#
| Benchmark | Description | Required Params |
|---|---|---|
| `gsm8k` | GSM8K—1,319 grade school math problems ¹ | `hf_token`, `tokenizer` |
| `gsm8k-cot-instruct` | GSM8K with chain-of-thought zero-shot | `hf_token` |
| `mgsm` | MGSM—multilingual math (10 languages) ¹ | `hf_token`, `tokenizer` |
| `mgsm-cot` | MGSM with chain-of-thought prompting | `hf_token` |
| `aime-2024` | AIME 2024—competition math ² | `judge` |
| `aime-2025` | AIME 2025—competition math ² | `judge` |
| `aime-2024-nemo` | AIME 2024 with NeMo template | — |
| `aime-2025-nemo` | AIME 2025 with NeMo template | — |
| `math-test-500` | MATH test set (500 problems) ² | `judge` |
| `math-test-500-nemo` | MATH test with NeMo template | — |
|  | AIME 2024 (Artificial Analysis setup) ² | `judge` |
|  | MATH test (Artificial Analysis setup) ² | `judge` |
¹ Completions-only: Requires /v1/completions endpoint and tokenizer parameter.
² Judge required: Requires a judge model to evaluate free-form math responses.
Important
For math benchmarks requiring a judge, use a model with strong instruction-following capabilities (70B+ parameters recommended). Smaller models may produce malformed judge outputs.
Examples#
```python
import os

from nemo_platform import NeMoPlatform
from nemo_platform.types.evaluation import SystemBenchmarkOnlineJobParam, EvaluationJobParams

client = NeMoPlatform(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
    workspace="my-workspace",  # Replace with your workspace or use "default"
)
```
```python
job = client.evaluation.benchmark_jobs.create(
    description="GSM8K chain-of-thought evaluation",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/gsm8k-cot-instruct",
        model={"url": "<your-nim-endpoint>/v1", "name": "nvidia/llama-3.3-nemotron-super-49b-v1"},
        params=EvaluationJobParams(parallelism=16),
        benchmark_params={"hf_token": "hf_token"},
    )
)
```
Requires a /v1/completions endpoint.
```python
job = client.evaluation.benchmark_jobs.create(
    description="GSM8K few-shot evaluation",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/gsm8k",
        model={"url": "<your-nim-endpoint>/v1/completions", "name": "<your-base-model>"},
        params=EvaluationJobParams(parallelism=16),
        benchmark_params={
            "hf_token": "hf_token",
            "tokenizer": "<your-model-tokenizer>",
        },
    )
)
```
```python
job = client.evaluation.benchmark_jobs.create(
    description="MGSM multilingual math evaluation",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/mgsm-cot",
        model={"url": "<your-nim-endpoint>/v1", "name": "nvidia/llama-3.3-nemotron-super-49b-v1"},
        params=EvaluationJobParams(parallelism=16),
        benchmark_params={"hf_token": "hf_token"},
    )
)
```
No judge required with NeMo template.
```python
job = client.evaluation.benchmark_jobs.create(
    description="AIME 2025 competition math",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/aime-2025-nemo",
        model={"url": "<your-nim-endpoint>/v1", "name": "nvidia/llama-3.3-nemotron-super-49b-v1"},
        params=EvaluationJobParams(parallelism=16),
    )
)
```
Requires a judge model.
```python
job = client.evaluation.benchmark_jobs.create(
    description="AIME 2025 competition math with judge",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/aime-2025",
        model={"url": "<your-nim-endpoint>/v1", "name": "nvidia/llama-3.3-nemotron-super-49b-v1"},
        params=EvaluationJobParams(parallelism=16),
        benchmark_params={
            "judge": {
                "model": {
                    "url": "<your-nim-endpoint>/v1",
                    "name": "nvidia/llama-3.3-nemotron-super-49b-v1",
                }
            }
        },
    )
)
```
No judge required with NeMo template.
```python
job = client.evaluation.benchmark_jobs.create(
    description="MATH test set evaluation",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/math-test-500-nemo",
        model={"url": "<your-nim-endpoint>/v1", "name": "nvidia/llama-3.3-nemotron-super-49b-v1"},
        params=EvaluationJobParams(parallelism=16),
    )
)
```
Results#
Math benchmarks produce accuracy scores:
- **GSM8K/MGSM**: Exact match accuracy on final numerical answers
- **AIME/MATH**: Correctness as judged by the judge model (or exact match for NeMo variants)
```python
aggregate = client.evaluation.benchmark_jobs.results.aggregate_scores.download(job=job.name)
for score in aggregate.scores:
    print(f"{score.name}: {score.mean:.1%}")
```
For detailed results analysis, refer to Benchmark Results.
Content Safety#
Evaluate model safety risks including vulnerability to generate harmful, biased, or misleading content.
Label:
content_safety
Available Benchmarks#
| Benchmark | Description | Required Params |
|---|---|---|
| `aegis-v2` | AEGIS 2.0—12 hazard categories using Nemotron Safety Guard | `hf_token`, `judge` |
| `wildguard` | WildGuard—privacy, misinformation, harmful language | `hf_token`, `judge` |
Important
Safety benchmarks require specific judge models deployed with a /v1/completions endpoint (not chat/completions):
| Benchmark | Required Judge Model | Model ID |
|---|---|---|
| `aegis-v2` | Llama Nemotron Safety Guard V2 | `nvidia/llama-3.1-nemoguard-8b-content-safety` |
| `wildguard` | WildGuard | `allenai/wildguard` |
These benchmarks can take 1–3 hours to complete.
Examples#
```python
import os

from nemo_platform import NeMoPlatform
from nemo_platform.types.evaluation import SystemBenchmarkOnlineJobParam, EvaluationJobParams

client = NeMoPlatform(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
    workspace="my-workspace",  # Replace with your workspace or use "default"
)
```
Requires Llama Nemotron Safety Guard V2 judge.
```python
job = client.evaluation.benchmark_jobs.create(
    description="AEGIS-v2 content safety evaluation",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/aegis-v2",
        model={"url": "<your-nim-endpoint>/v1", "name": "nvidia/llama-3.3-nemotron-super-49b-v1"},
        params=EvaluationJobParams(parallelism=16),
        benchmark_params={
            "hf_token": "hf_token",
            "judge": {
                "model": {
                    "url": "<your-safety-guard-endpoint>/v1/completions",
                    "name": "nvidia/llama-3.1-nemoguard-8b-content-safety",
                }
            }
        },
    )
)
```
Requires WildGuard judge.
```python
job = client.evaluation.benchmark_jobs.create(
    description="WildGuard content safety evaluation",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/wildguard",
        model={"url": "<your-nim-endpoint>/v1", "name": "nvidia/llama-3.3-nemotron-super-49b-v1"},
        params=EvaluationJobParams(parallelism=16),
        benchmark_params={
            "hf_token": "hf_token",
            "judge": {
                "model": {
                    "url": "<your-wildguard-endpoint>/v1/completions",
                    "name": "allenai/wildguard",
                }
            }
        },
    )
)
```
Results#
Safety benchmarks produce category-level safety rates:
- **AEGIS-v2**: Safe/unsafe classification across 12 hazard categories (violence, hate, sexual content, and so on)
- **WildGuard**: Safe/unsafe classification for privacy, misinformation, harmful language, and malicious use
```python
aggregate = client.evaluation.benchmark_jobs.results.aggregate_scores.download(job=job.name)
for score in aggregate.scores:
    print(f"{score.name}: {score.mean:.1%}")
```
For detailed results analysis, refer to Benchmark Results.
Troubleshooting Content Safety Benchmarks#
Refer to Troubleshooting NeMo Evaluator for general troubleshooting steps for failed evaluation jobs.
This section covers common issues for the safety harness.
Hugging Face Error#
Evaluations with the safety harness require Hugging Face access to the respective dataset and model tokenizer. If your job fails with errors like the following, visit https://huggingface.co/, log in, and request access to the dataset or model.
```
datasets.exceptions.DatasetNotFoundError: Dataset 'allenai/wildguardmix' is a gated dataset on the Hub. Visit the dataset page at https://huggingface.co/datasets/allenai/wildguardmix to ask for access.
```

```
GatedRepoError: 403 Client Error.
Cannot access gated repo for url https://huggingface.co/<model>/resolve/main/tokenizer_config.json.
Your request to access model <model> is awaiting a review from the repo authors.
```
Incompatible Judge Model#
Using an unsupported judge model results in a job error. The aegis-v2 evaluation requires the Llama Nemotron Safety Guard V2 judge, and the wildguard evaluation requires the allenai/wildguard judge. A wrong judge model typically surfaces as a KeyError, as in the following output.
```
Metrics calculated
Evaluation Metrics
┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Safety Category ┃ Average Count ┃
┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ ERROR           │ 5.0           │
└─────────────────┴───────────────┘
...
Subprocess finished with return code: 0
{'ERROR': 5.0}
Traceback (most recent call last):
  ...
  File "/usr/local/lib/python3.10/site-packages/core_evals/safety_eval/__init__.py", line 14, in parse_output
    return parse_output(output_dir)
  File "/usr/local/lib/python3.10/site-packages/core_evals/safety_eval/output.py", line 16, in parse_output
    safety_rate = data['safe'] / sum(data.values())
KeyError: 'safe'
```
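The traceback arises because the harness computes the safety rate from the judge's safe/unsafe labels; with an incompatible judge, every response is labeled ERROR and the 'safe' key is missing. A minimal local reproduction:

```python
def safety_rate(counts: dict) -> float:
    # Mirrors the computation shown in the traceback above.
    return counts["safe"] / sum(counts.values())

print(safety_rate({"safe": 8, "unsafe": 2}))  # 0.8 with a compatible judge

try:
    safety_rate({"ERROR": 5.0})  # incompatible judge: no 'safe' key
except KeyError as err:
    print(f"KeyError: {err}")
```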
Unexpected Reasoning Traces#
Safety evaluations do not support reasoning traces and may result in the job error below.
```
ERROR There are at least 2 MUT (model under test) responses that start with <think>. Reasoning traces should not be evaluated. Exiting.
```
If the model or judge outputs reasoning traces like <think>reasoning context</think>answer, configure the job to include only the final answer after the reasoning end token (for example, </think>) by setting reasoning.end_token. Also set inference.max_tokens high enough for the model's chain of thought to reach the end token; otherwise the reasoning context cannot be properly omitted from evaluation.
Additionally, this error can occur when the model exceeds its token limit and the entire response is consumed by thinking. You can drop such unfinished responses by setting the reasoning.include_if_not_finished parameter.
```python
from nemo_platform.types.evaluation import EvaluationJobParams, InferenceParams, ReasoningParams

params = EvaluationJobParams(
    inference=InferenceParams(max_tokens=1024),
    reasoning=ReasoningParams(
        end_token="</think>",
        include_if_not_finished=False,
    )
)
```
Code#
Evaluate code generation capabilities using functional correctness benchmarks that test synthesis of working programs.
Label:
code
Available Benchmarks#
| Benchmark | Description | Required Params |
|---|---|---|
| `humaneval` | HumanEval—164 Python problems ¹ | — |
| `humaneval-instruct` | HumanEval for instruction-tuned models | — |
| `humanevalplus` | HumanEval+—80x more test cases ¹ | — |
| `mbpp` | MBPP—Python programming problems | — |
| `mbppplus` | MBPP+—35x more test cases | — |
| `mbppplus-nemo` | MBPP+ with NeMo template | — |
| `multiple-<lang>` | MultiPL-E—HumanEval in 20+ languages ¹ | — |
¹ Completions-only: Requires /v1/completions endpoint.
MultiPL-E languages: clj (Clojure), cpp (C++), cs (C#), d, elixir, go, hs (Haskell), java, jl (Julia), js (JavaScript), lua, ml (OCaml), php, pl (Perl), r, rb (Ruby), rkt (Racket), rs (Rust), scala, sh (Bash), swift
These benchmarks can take 1–5 hours to complete.
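MultiPL-E variants appear to follow a per-language naming pattern, inferred here from the multiple-js and multiple-rs examples in this section; verify names against the benchmark listing for your deployment. A small local sketch:

```python
# MultiPL-E language codes, copied from the list above.
MULTIPLE_LANGS = {
    "clj", "cpp", "cs", "d", "elixir", "go", "hs", "java", "jl", "js",
    "lua", "ml", "php", "pl", "r", "rb", "rkt", "rs", "scala", "sh", "swift",
}

def multiple_benchmark(lang: str) -> str:
    """Return the assumed system benchmark name for a MultiPL-E language."""
    if lang not in MULTIPLE_LANGS:
        raise ValueError(f"unsupported MultiPL-E language: {lang!r}")
    return f"system/multiple-{lang}"

print(multiple_benchmark("js"))  # system/multiple-js
```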
Examples#
```python
import os

from nemo_platform import NeMoPlatform
from nemo_platform.types.evaluation import SystemBenchmarkOnlineJobParam, EvaluationJobParams

client = NeMoPlatform(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
    workspace="my-workspace",  # Replace with your workspace or use "default"
)
```
```python
job = client.evaluation.benchmark_jobs.create(
    description="HumanEval code generation",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/humaneval-instruct",
        model={"url": "<your-nim-endpoint>/v1", "name": "nvidia/llama-3.3-nemotron-super-49b-v1"},
        params=EvaluationJobParams(parallelism=16),
    )
)
```
Requires a /v1/completions endpoint.
```python
job = client.evaluation.benchmark_jobs.create(
    description="HumanEval code generation",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/humaneval",
        model={"url": "<your-nim-endpoint>/v1/completions", "name": "<your-base-model>"},
        params=EvaluationJobParams(parallelism=16),
    )
)
```
Extended test suite with 80x more test cases.
```python
job = client.evaluation.benchmark_jobs.create(
    description="HumanEval+ code generation",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/humanevalplus",
        model={"url": "<your-nim-endpoint>/v1/completions", "name": "<your-base-model>"},
        params=EvaluationJobParams(parallelism=16),
    )
)
```
```python
job = client.evaluation.benchmark_jobs.create(
    description="MBPP+ Python programming",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/mbppplus-nemo",
        model={"url": "<your-nim-endpoint>/v1", "name": "nvidia/llama-3.3-nemotron-super-49b-v1"},
        params=EvaluationJobParams(parallelism=16),
    )
)
```

```python
job = client.evaluation.benchmark_jobs.create(
    description="MBPP+ Python programming",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/mbppplus",
        model={"url": "<your-nim-endpoint>/v1", "name": "nvidia/llama-3.3-nemotron-super-49b-v1"},
        params=EvaluationJobParams(parallelism=16),
    )
)
```
Requires a /v1/completions endpoint.
```python
job = client.evaluation.benchmark_jobs.create(
    description="MultiPL-E JavaScript",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/multiple-js",
        model={"url": "<your-nim-endpoint>/v1/completions", "name": "<your-base-model>"},
        params=EvaluationJobParams(parallelism=16),
    )
)
```
Requires a /v1/completions endpoint.
```python
job = client.evaluation.benchmark_jobs.create(
    description="MultiPL-E Rust",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/multiple-rs",
        model={"url": "<your-nim-endpoint>/v1/completions", "name": "<your-base-model>"},
        params=EvaluationJobParams(parallelism=16),
    )
)
```
Results#
Code benchmarks produce pass@k metrics measuring functional correctness:
- **pass@1**: Percentage of problems solved with one attempt
- **pass@10**: Percentage of problems solved within 10 attempts (when `n_samples` > 1)
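pass@k is commonly computed with the unbiased estimator from the original HumanEval work: given n samples per problem of which c pass, pass@k = 1 - C(n-c, k) / C(n, k). A local sketch of that formula:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn from n (with c correct) passes the tests."""
    if n - c < k:
        # Fewer than k failing samples: any draw of k must include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 samples per problem, 3 of them correct:
print(f"pass@1  = {pass_at_k(10, 3, 1):.3f}")   # 0.300
print(f"pass@10 = {pass_at_k(10, 3, 10):.3f}")  # 1.000
```

The per-problem estimates are then averaged across the benchmark to produce the reported score.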
```python
aggregate = client.evaluation.benchmark_jobs.results.aggregate_scores.download(job=job.name)
for score in aggregate.scores:
    print(f"{score.name}: {score.mean:.1%}")
```
For detailed results analysis, refer to Benchmark Results.
Tip
Want to experiment first? You can try these benchmarks using the open-source NeMo Evaluator SDK before deploying the platform. The SDK provides a lightweight way to test evaluation workflows locally.
Run Benchmark Job#
Create a workspace if you have not already.
```python
workspace = "my-workspace"
client.workspaces.create(name=workspace)
```
Create an evaluation job with a benchmark, supplying its required parameters and any optional parameters.
Note
For benchmarks that require a Hugging Face token or other API keys for external services, create the secret to be referenced by the job.
Most benchmarks require a Hugging Face token (hf_token) to access gated datasets. Create this secret before running evaluations:
```python
import os

client.secrets.create(
    workspace=workspace,
    name="hf_token",
    data=os.getenv("HF_TOKEN", "<your Hugging Face token>")
)
```
```python
from nemo_platform.types.evaluation import SystemBenchmarkOnlineJobParam, EvaluationJobParams

job = client.evaluation.benchmark_jobs.create(
    description="Example running system benchmark to evaluate my model's advanced reasoning capabilities.",
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/gpqa-diamond-cot",
        model={"url": "<your-nim-endpoint>/v1", "name": "nvidia/llama-3.3-nemotron-super-49b-v1"},
        params=EvaluationJobParams(parallelism=16),
        benchmark_params={"hf_token": "hf_token"},
    )
)
```
Job Management#
After creating a job, navigate to Benchmark Job Management to oversee its execution and monitor progress.