Evaluation Benchmarks#

A benchmark is a reusable evaluation suite: one or more metrics paired with a dataset. Instead of redefining metrics and data inputs for every run, you define the benchmark once and run it repeatedly.

Use benchmarks when you want to:

  • Standardize model quality measurement across teams and releases

  • Run consistent regression checks after model, prompt, or pipeline updates

  • Compare multiple model versions using the same scoring criteria and dataset

  • Package validated metrics with domain-specific test data for repeatable evaluation

NeMo Platform provides two types of benchmarks:

  • Industry Benchmarks: Industry-standard academic benchmarks such as MMLU, HumanEval, and GSM8K for comparing model capabilities against published baselines

  • Custom Benchmarks: User-defined evaluation suites that combine your choice of metrics with domain-specific datasets

Custom benchmarks are valuable for domain-specific evaluation where standard benchmarks might not capture the nuances of your application, such as legal document analysis, medical terminology accuracy, or enterprise-specific terminology adherence.

Industry Benchmarks vs Custom Benchmarks#

| Type | Use Case | Dataset | Metrics |
|---|---|---|---|
| Industry Benchmarks | Compare against published baselines, regression testing, model selection | Canonical datasets (fixed) | Standardized metrics |
| Custom Benchmarks | Domain-specific evaluation, production monitoring, task-specific assessment | Your evaluation data | Your choice of metrics |

Discover Industry Benchmarks#

Discover the industry benchmarks available for your evaluation jobs in the system workspace. List all industry benchmarks, or filter them by category label.

Note

The system workspace is a reserved workspace in NeMo Platform that contains ready-to-use industry benchmarks with published datasets and metrics.

Note

Initialization: This example uses NeMoPlatform() with no arguments so the SDK reads your active CLI context (set by nmp auth login). The workspace="system" argument is passed per call to access the reserved system workspace. For the standard quickstart init, see Initializing the CLI and SDK.

from nemo_platform import NeMoPlatform

client = NeMoPlatform()

benchmarks = client.evaluation.benchmarks.list(workspace="system")

print(f"{benchmarks.pagination.total_results} benchmarks")
for benchmark in benchmarks:
    print(f"{benchmark.name}: {benchmark.description}")

# List benchmarks with pagination:
benchmarks = client.evaluation.benchmarks.list(
    workspace="system",
    page=2,
    page_size=100
)
for benchmark in benchmarks:
    print(f"{benchmark.name}: {benchmark.description}")

# Filter by evaluation category label
filtered_benchmarks = client.evaluation.benchmarks.list(
    workspace="system",
    extra_query={"search": {"data.labels.eval_category": {"$eq": "advanced_reasoning"}}},
)
print(filtered_benchmarks)
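
The extra_query filter above uses a MongoDB-style search expression: it selects benchmarks whose data.labels.eval_category label equals the given value. The matching logic can be sketched locally as follows (illustrative only; the actual filtering happens server-side, and matches_eq is a hypothetical helper, not part of the SDK):

```python
# Illustrative sketch of how a MongoDB-style "$eq" expression selects
# items by a dotted label path. The real filtering is done by the server.
def matches_eq(item: dict, path: str, expected) -> bool:
    """Walk a dotted path like 'data.labels.eval_category' and compare."""
    value = item
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return False
        value = value[key]
    return value == expected

benchmarks = [
    {"name": "mmlu-pro", "data": {"labels": {"eval_category": "advanced_reasoning"}}},
    {"name": "humaneval", "data": {"labels": {"eval_category": "code"}}},
]
filtered = [b for b in benchmarks
            if matches_eq(b, "data.labels.eval_category", "advanced_reasoning")]
print([b["name"] for b in filtered])  # ['mmlu-pro']
```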

| Category Label | Description |
|---|---|
| agentic | Evaluate the performance of agent-based or multi-step reasoning models, especially in scenarios requiring planning, tool use, and iterative reasoning. |
| advanced_reasoning | Evaluate reasoning capabilities of large language models through complex tasks. |
| code | Evaluate code generation capabilities using functional correctness benchmarks that test synthesis of working programs. |
| content_safety | Evaluate model safety risks, including susceptibility to generating harmful, biased, or misleading content. |
| instruction_following | Evaluate the ability to follow explicit formatting and structural instructions. |
| language_understanding | Evaluate knowledge and reasoning across diverse subjects in different languages. |
| math | Evaluate mathematical reasoning abilities. |
| question_answering | Evaluate the ability to generate answers to questions. |
| rag | Evaluate the quality of RAG pipelines by measuring both retrieval and answer generation performance. |
| retrieval | Evaluate the quality of document retriever pipelines. |

Create Custom Benchmarks#

Create a custom benchmark by combining metrics with your dataset. Before creating a benchmark, you will need to create the metrics that define how to score your model’s outputs.
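
As a minimal sketch of that prerequisite step, assuming the SDK exposes a client.evaluation.metrics.create method that mirrors benchmarks.create (the method name and fields here are assumptions, not confirmed API; refer to the metrics documentation for the actual interface):

```python
from nemo_platform import NeMoPlatform

client = NeMoPlatform()

# Hypothetical call: assumes metrics.create mirrors benchmarks.create.
metric = client.evaluation.metrics.create(
    workspace="my-workspace",
    name="answer-relevancy",
    description="Scores how relevant each generated answer is to its question",
)
print(f"Metric created: {metric.name}")
```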

benchmark = client.evaluation.benchmarks.create(
    workspace="my-workspace",
    name="my-qa-benchmark",
    description="Evaluates question-answering quality",
    metrics=["my-workspace/answer-relevancy", "my-workspace/faithfulness"],
    dataset="my-workspace/qa-test-dataset",
    labels={"my-label": "label-value"}, # optional user-input labels to apply to the benchmark
)

Refer to Manage Benchmarks for listing and managing custom benchmarks.

Run Benchmark Jobs#

Create a benchmark evaluation job to run the benchmark against your data.

Offline Job (Dataset Evaluation)#

Evaluate a pre-generated dataset:

from nemo_platform.types.evaluation import BenchmarkOfflineJobParam

job = client.evaluation.benchmark_jobs.create(
    workspace="my-workspace",
    spec=BenchmarkOfflineJobParam(
        benchmark="my-workspace/my-qa-benchmark",
    )
)

print(f"Job created: {job.name}")

Online Job (Model Evaluation)#

Evaluate a model directly by generating outputs during the benchmark:

from nemo_platform.types.evaluation import BenchmarkOnlineJobParam

job = client.evaluation.benchmark_jobs.create(
    workspace="my-workspace",
    spec=BenchmarkOnlineJobParam(
        benchmark="system/mmlu-pro",
        model={
            "url": "<your-nim-url>/v1/completions",
            "name": "meta/llama-3.1-8b-instruct"
        }
    )
)

print(f"Job created: {job.name}")

Manage Benchmarks#

List, retrieve, and delete evaluation benchmarks using the Python SDK. You can discover industry benchmarks in the system workspace, list custom benchmarks in your workspace, retrieve detailed benchmark configurations, and delete custom benchmarks when no longer needed.
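
For example, following the list and create patterns shown earlier, retrieving and deleting a benchmark might look like the sketch below (the retrieve and delete method names follow the prose above, but their exact signatures are assumptions):

```python
from nemo_platform import NeMoPlatform

client = NeMoPlatform()

# Retrieve one benchmark's full configuration (assumed signature)
benchmark = client.evaluation.benchmarks.retrieve(
    workspace="my-workspace",
    name="my-qa-benchmark",
)
print(benchmark.description)

# Delete a custom benchmark that is no longer needed (assumed signature)
client.evaluation.benchmarks.delete(
    workspace="my-workspace",
    name="my-qa-benchmark",
)
```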

Refer to Manage Benchmarks for complete SDK examples including pagination, sorting, filtering, and extended response options.

Job Management#

After successfully creating a job, refer to Benchmark Job Management to oversee its execution and monitor progress.
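
Monitoring typically amounts to polling the job until it reaches a terminal status. Below is a generic, self-contained sketch of that loop; the status names and the fetch_status callable are assumptions, so substitute the actual job-retrieval call described in Benchmark Job Management:

```python
import time

# Assumed terminal states; check Benchmark Job Management for actual values.
TERMINAL_STATUSES = {"completed", "failed"}

def wait_for_job(fetch_status, poll_interval=1.0, max_polls=100):
    """Call fetch_status() repeatedly until it returns a terminal status."""
    for _ in range(max_polls):
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_interval)
    raise TimeoutError("job did not reach a terminal status")

# Usage with a stubbed status sequence; a real caller would fetch the
# job's status from the SDK instead of an iterator.
statuses = iter(["created", "running", "running", "completed"])
print(wait_for_job(lambda: next(statuses), poll_interval=0))  # completed
```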

Benchmark Categories#

  • Custom Benchmarks: Compose a custom benchmark with a collection of metrics to evaluate tasks bespoke to your needs.

  • Agentic Benchmarks: Evaluate agent workflows including tool calling, goal accuracy, topic adherence, and trajectory evaluation.

  • Industry Benchmarks: Ready-to-use benchmarks for reasoning, code generation, safety, and language understanding with published datasets.