Evaluation Benchmarks#
A benchmark is a reusable evaluation suite: one or more metrics paired with a dataset. Instead of redefining metrics and data inputs for every run, you define the benchmark once and run it repeatedly.
Use benchmarks when you want to:
- Standardize model quality measurement across teams and releases
- Run consistent regression checks after model, prompt, or pipeline updates
- Compare multiple model versions using the same scoring criteria and dataset
- Package validated metrics with domain-specific test data for repeatable evaluation
NeMo Platform provides two types of benchmarks:
- Industry Benchmarks: Industry-standard academic benchmarks such as MMLU, HumanEval, and GSM8K for comparing model capabilities against published baselines
- Custom Benchmarks: User-defined evaluation suites that combine your choice of metrics with domain-specific datasets
Custom benchmarks are valuable for domain-specific evaluation where standard benchmarks might not capture the nuances of your application, such as legal document analysis, medical terminology accuracy, or enterprise-specific terminology adherence.
Industry Benchmarks vs Custom Benchmarks#
| Type | Use Case | Dataset | Metrics |
|---|---|---|---|
| Industry Benchmarks | Compare against published baselines, regression testing, model selection | Canonical datasets (fixed) | Standardized metrics |
| Custom Benchmarks | Domain-specific evaluation, production monitoring, task-specific assessment | Your evaluation data | Your choice of metrics |
Discover Industry Benchmarks#
Discover the industry benchmarks available for your evaluation jobs in the system workspace. List all industry benchmarks, or filter them by label category.
Note
The system workspace is a reserved NeMo Platform workspace that contains ready-to-use industry benchmarks backed by published datasets and metrics.
Note
Initialization: This example constructs NeMoPlatform() with no arguments so the SDK reads your active CLI context (set by nmp auth login). The workspace="system" argument is passed per call to access the reserved system workspace. For the standard quickstart initialization, see Initializing the CLI and SDK.
```python
from nemo_platform import NeMoPlatform

client = NeMoPlatform()

# List all industry benchmarks in the system workspace
benchmarks = client.evaluation.benchmarks.list(workspace="system")
print(f"{benchmarks.pagination.total_results} benchmarks")
for benchmark in benchmarks:
    print(f"{benchmark.name}: {benchmark.description}")
```
```python
# List benchmarks with pagination
benchmarks = client.evaluation.benchmarks.list(
    workspace="system",
    page=2,
    page_size=100,
)
for benchmark in benchmarks:
    print(f"{benchmark.name}: {benchmark.description}")
```
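To walk every page, you can derive the page count from the pagination.total_results value reported above. A minimal sketch; total_pages is a hypothetical helper, not part of the SDK:

```python
import math

def total_pages(total_results: int, page_size: int) -> int:
    """Number of pages needed to cover total_results items (illustrative helper)."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    return math.ceil(total_results / page_size)

# e.g., 250 benchmarks at 100 per page require 3 pages
print(total_pages(250, 100))  # 3
```

You could then loop `for page in range(1, total_pages(...) + 1)` and pass each page number to the list call shown above.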
```python
# Filter by evaluation category label
filtered_benchmarks = client.evaluation.benchmarks.list(
    workspace="system",
    extra_query={"search": {"data.labels.eval_category": {"$eq": "advanced_reasoning"}}},
)
print(filtered_benchmarks)
```
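The extra_query payload nests an exact-match condition under a search key. If you filter by several labels, a small builder keeps the nesting consistent. A sketch; label_filter is a hypothetical helper that only reproduces the dict shape shown above:

```python
def label_filter(label: str, value: str) -> dict:
    """Build an extra_query payload for an exact-match label filter (illustrative)."""
    return {"search": {f"data.labels.{label}": {"$eq": value}}}

query = label_filter("eval_category", "advanced_reasoning")
print(query)
# {'search': {'data.labels.eval_category': {'$eq': 'advanced_reasoning'}}}
```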
| Category Label | Description |
|---|---|
| | Evaluate the performance of agent-based or multi-step reasoning models, especially in scenarios requiring planning, tool use, and iterative reasoning. |
| advanced_reasoning | Evaluate reasoning capabilities of large language models through complex tasks. |
| | Evaluate code generation capabilities using functional correctness benchmarks that test synthesis of working programs. |
| | Evaluate model safety risks, including the propensity to generate harmful, biased, or misleading content. |
| | Evaluate the ability to follow explicit formatting and structural instructions. |
| | Evaluate knowledge and reasoning across diverse subjects in different languages. |
| | Evaluate mathematical reasoning abilities. |
| | Evaluate the ability to generate answers to questions. |
| | Evaluate the quality of RAG pipelines by measuring both retrieval and answer generation performance. |
| | Evaluate the quality of document retriever pipelines. |
Create Custom Benchmarks#
Create a custom benchmark by combining metrics with your dataset. Before creating a benchmark, you will need to create the metrics that define how to score your model’s outputs.
```python
benchmark = client.evaluation.benchmarks.create(
    workspace="my-workspace",
    name="my-qa-benchmark",
    description="Evaluates question-answering quality",
    metrics=["my-workspace/answer-relevancy", "my-workspace/faithfulness"],
    dataset="my-workspace/qa-test-dataset",
    labels={"my-label": "label-value"},  # optional user-defined labels to apply to the benchmark
)
```
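Note that metrics and datasets are referenced by workspace-qualified names of the form workspace/name. A sketch of how such a reference breaks down; split_ref is a hypothetical helper, not an SDK function:

```python
def split_ref(ref: str) -> tuple[str, str]:
    """Split a 'workspace/name' resource reference into its parts (illustrative)."""
    workspace, sep, name = ref.partition("/")
    if not sep or not workspace or not name:
        raise ValueError(f"expected 'workspace/name', got {ref!r}")
    return workspace, name

print(split_ref("my-workspace/answer-relevancy"))  # ('my-workspace', 'answer-relevancy')
```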
Refer to Manage Benchmarks for listing and managing custom benchmarks.
Run Benchmark Jobs#
Create a benchmark evaluation job to run the benchmark against your data.
Offline Job (Dataset Evaluation)#
Evaluate a pre-generated dataset:
```python
from nemo_platform.types.evaluation import BenchmarkOfflineJobParam

job = client.evaluation.benchmark_jobs.create(
    workspace="my-workspace",
    spec=BenchmarkOfflineJobParam(
        benchmark="my-workspace/my-qa-benchmark",
    ),
)
print(f"Job created: {job.name}")
```
Online Job (Model Evaluation)#
Evaluate a model directly by generating outputs during the benchmark:
```python
from nemo_platform.types.evaluation import BenchmarkOnlineJobParam

job = client.evaluation.benchmark_jobs.create(
    workspace="my-workspace",
    spec=BenchmarkOnlineJobParam(
        benchmark="system/mmlu-pro",
        model={
            "url": "<your-nim-url>/v1/completions",
            "name": "meta/llama-3.1-8b-instruct",
        },
    ),
)
print(f"Job created: {job.name}")
```
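The model url points at an OpenAI-compatible completions endpoint on your deployment. One way to assemble it from a base URL without doubling or dropping slashes; a sketch under stated assumptions: completions_url and the example host are illustrative, not part of the SDK:

```python
def completions_url(base_url: str) -> str:
    """Join a NIM base URL with the /v1/completions path (illustrative helper)."""
    return base_url.rstrip("/") + "/v1/completions"

# Works with or without a trailing slash on the base URL
print(completions_url("https://nim.example.com/"))  # https://nim.example.com/v1/completions
```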
Manage Benchmarks#
List, retrieve, and delete evaluation benchmarks using the Python SDK. You can discover industry benchmarks in the system workspace, list custom benchmarks in your workspace, retrieve detailed benchmark configurations, and delete custom benchmarks when no longer needed.
Refer to Manage Benchmarks for complete SDK examples including pagination, sorting, filtering, and extended response options.
Job Management#
After successfully creating a job, refer to Benchmark Job Management to oversee its execution and monitor progress.
Benchmark Categories#
- Compose a custom benchmark with a collection of metrics to evaluate tasks bespoke to your needs.
- Evaluate agent workflows, including tool calling, goal accuracy, topic adherence, and trajectory evaluation.
- Ready-to-use benchmarks for reasoning, code generation, safety, and language understanding with published datasets.