Evaluation Benchmarks#

A benchmark is a reusable evaluation suite: one or more metrics paired with a dataset. Instead of redefining metrics and data inputs for every run, you define the benchmark once and run it repeatedly.

Use benchmarks when you want to:

  • Standardize model quality measurement across teams and releases

  • Run consistent regression checks after model, prompt, or pipeline updates

  • Compare multiple model versions using the same scoring criteria and dataset

  • Package validated metrics with domain-specific test data for repeatable evaluation

NeMo Platform provides two types of benchmarks:

  • Industry Benchmarks: Industry-standard academic benchmarks such as MMLU, HumanEval, and GSM8K for comparing model capabilities against published baselines

  • Custom Benchmarks: User-defined evaluation suites that combine your choice of metrics with domain-specific datasets

Custom benchmarks are valuable for domain-specific evaluation where standard benchmarks might not capture the nuances of your application, such as legal document analysis, medical terminology accuracy, or enterprise-specific terminology adherence.

Industry Benchmarks vs Custom Benchmarks#

| Type | Use Case | Dataset | Metrics |
|---|---|---|---|
| Industry Benchmarks | Compare against published baselines, regression testing, model selection | Canonical datasets (fixed) | Standardized metrics |
| Custom Benchmarks | Domain-specific evaluation, production monitoring, task-specific assessment | Your evaluation data | Your choice of metrics |

Discover Industry Benchmarks#

Discover the industry benchmarks available for your evaluation jobs in the system workspace. List all industry benchmarks, or filter them by category label.

Note

The system workspace is a reserved workspace in NeMo Platform that contains ready-to-use industry benchmarks with published datasets and metrics.

Note

Initialization: This example uses NeMoPlatform() with no arguments so the SDK reads your active CLI context (set by nmp auth login). The workspace="system" argument is passed per call to access the reserved system workspace. For the standard quickstart init, see Initializing the CLI and SDK.

from nemo_platform import NeMoPlatform

client = NeMoPlatform()

benchmarks = client.evaluation.benchmarks.list(workspace="system")

print(f"{benchmarks.pagination.total_results} benchmarks")
for benchmark in benchmarks:
    print(f"{benchmark.name}: {benchmark.description}")

# List benchmarks with pagination:
benchmarks = client.evaluation.benchmarks.list(
    workspace="system",
    page=2,
    page_size=100
)
for benchmark in benchmarks:
    print(f"{benchmark.name}: {benchmark.description}")

# Filter by evaluation category label
filtered_benchmarks = client.evaluation.benchmarks.list(
    workspace="system",
    extra_query={"search": {"data.labels.eval_category": {"$eq": "advanced_reasoning"}}},
)
print(filtered_benchmarks)
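
The extra_query filter above uses a MongoDB-style search expression: it selects benchmarks whose data.labels.eval_category label equals the given value. The matching logic can be sketched locally as follows (illustrative only; the actual filtering happens server-side, and matches_eq is a hypothetical helper, not part of the SDK):

```python
# Illustrative sketch of how a MongoDB-style "$eq" expression selects
# items by a dotted label path. The real filtering is done by the server.
def matches_eq(item: dict, path: str, expected) -> bool:
    """Walk a dotted path like 'data.labels.eval_category' and compare."""
    value = item
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return False
        value = value[key]
    return value == expected

benchmarks = [
    {"name": "mmlu-pro", "data": {"labels": {"eval_category": "advanced_reasoning"}}},
    {"name": "humaneval", "data": {"labels": {"eval_category": "code"}}},
]
filtered = [b for b in benchmarks
            if matches_eq(b, "data.labels.eval_category", "advanced_reasoning")]
print([b["name"] for b in filtered])  # ['mmlu-pro']
```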

| Category Label | Description |
|---|---|
| agentic | Evaluate the performance of agent-based or multi-step reasoning models, especially in scenarios requiring planning, tool use, and iterative reasoning. |
| advanced_reasoning | Evaluate reasoning capabilities of large language models through complex tasks. |
| code | Evaluate code generation capabilities using functional correctness benchmarks that test synthesis of working programs. |
| content_safety | Evaluate model safety risks, including susceptibility to generating harmful, biased, or misleading content. |
| instruction_following | Evaluate the ability to follow explicit formatting and structural instructions. |
| language_understanding | Evaluate knowledge and reasoning across diverse subjects in different languages. |
| math | Evaluate mathematical reasoning abilities. |
| question_answering | Evaluate the ability to generate answers to questions. |
| rag | Evaluate the quality of RAG pipelines by measuring both retrieval and answer generation performance. |
| retrieval | Evaluate the quality of document retriever pipelines. |

Create Custom Benchmarks#

Create a custom benchmark by combining metrics with your dataset. Before creating a benchmark, you will need to create the metrics that define how to score your model’s outputs.
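
As a minimal sketch of that prerequisite step, assuming the SDK exposes a client.evaluation.metrics.create method that mirrors benchmarks.create (the method name and fields here are assumptions, not confirmed API; refer to the metrics documentation for the actual interface):

```python
from nemo_platform import NeMoPlatform

client = NeMoPlatform()

# Hypothetical call: assumes metrics.create mirrors benchmarks.create.
metric = client.evaluation.metrics.create(
    workspace="my-workspace",
    name="answer-relevancy",
    description="Scores how relevant each generated answer is to its question",
)
print(f"Metric created: {metric.name}")
```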

benchmark = client.evaluation.benchmarks.create(
    workspace="my-workspace",
    name="my-qa-benchmark",
    description="Evaluates question-answering quality",
    metrics=["my-workspace/answer-relevancy", "my-workspace/faithfulness"],
    dataset="my-workspace/qa-test-dataset",
    labels={"my-label": "label-value"}, # optional user-input labels to apply to the benchmark
)

Refer to Manage Benchmarks for listing and managing custom benchmarks.

Run Benchmark Jobs#

Create a benchmark evaluation job to run the benchmark against your data.

Offline Job (Dataset Evaluation)#

Evaluate a pre-generated dataset:

from nemo_platform.types.evaluation import BenchmarkOfflineJobParam

job = client.evaluation.benchmark_jobs.create(
    workspace="my-workspace",
    spec=BenchmarkOfflineJobParam(
        benchmark="my-workspace/my-qa-benchmark",
    )
)

print(f"Job created: {job.name}")

Online Job (Model Evaluation)#

Evaluate a model directly by generating outputs during the benchmark:

from nemo_platform.types.evaluation import BenchmarkOnlineJobParam

job = client.evaluation.benchmark_jobs.create(
    workspace="my-workspace",
    spec=BenchmarkOnlineJobParam(
        benchmark="system/mmlu-pro",
        model={
            "url": "<your-nim-url>/v1/completions",
            "name": "meta/llama-3.1-8b-instruct"
        }
    )
)

print(f"Job created: {job.name}")

Manage Benchmarks#

List, retrieve, and delete evaluation benchmarks using the Python SDK. You can discover industry benchmarks in the system workspace, list custom benchmarks in your workspace, retrieve detailed benchmark configurations, and delete custom benchmarks when no longer needed.
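
For example, following the list and create patterns shown earlier, retrieving and deleting a benchmark might look like the sketch below (the retrieve and delete method names follow the prose above, but their exact signatures are assumptions):

```python
from nemo_platform import NeMoPlatform

client = NeMoPlatform()

# Retrieve one benchmark's full configuration (assumed signature)
benchmark = client.evaluation.benchmarks.retrieve(
    workspace="my-workspace",
    name="my-qa-benchmark",
)
print(benchmark.description)

# Delete a custom benchmark that is no longer needed (assumed signature)
client.evaluation.benchmarks.delete(
    workspace="my-workspace",
    name="my-qa-benchmark",
)
```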

Refer to Manage Benchmarks for complete SDK examples including pagination, sorting, filtering, and extended response options.

Job Management#

After successfully creating a job, refer to Benchmark Job Management to oversee its execution and monitor progress.
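
Monitoring typically amounts to polling the job until it reaches a terminal status. Below is a generic, self-contained sketch of that loop; the status names and the fetch_status callable are assumptions, so substitute the actual job-retrieval call described in Benchmark Job Management:

```python
import time

# Assumed terminal states; check Benchmark Job Management for actual values.
TERMINAL_STATUSES = {"completed", "failed"}

def wait_for_job(fetch_status, poll_interval=1.0, max_polls=100):
    """Call fetch_status() repeatedly until it returns a terminal status."""
    for _ in range(max_polls):
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_interval)
    raise TimeoutError("job did not reach a terminal status")

# Usage with a stubbed status sequence; a real caller would fetch the
# job's status from the SDK instead of an iterator.
statuses = iter(["created", "running", "running", "completed"])
print(wait_for_job(lambda: next(statuses), poll_interval=0))  # completed
```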

Benchmark Categories#

  • Custom Benchmarks: Compose a custom benchmark with a collection of metrics to evaluate tasks bespoke to your needs.

  • Agentic Benchmarks: Evaluate agent workflows including tool calling, goal accuracy, topic adherence, and trajectory evaluation.

  • Industry Benchmarks: Ready-to-use benchmarks for reasoning, code generation, safety, and language understanding with published datasets.