# Custom Benchmarks
Custom benchmarks allow you to create reusable evaluation suites tailored to your specific use case. A benchmark combines one or more metrics with a dataset, enabling consistent evaluation across multiple models or pipeline versions.
> **Note:** Custom benchmarks can only include custom metrics that you create in your workspace. System metrics (in the `system` workspace) cannot be included in custom benchmarks at this time. To use system metrics, refer to Industry Benchmarks.
## Prerequisites

Before creating a custom benchmark, ensure you have:

- A workspace created for your project
- One or more custom metrics defined in your workspace
- A dataset uploaded as a fileset
```python
import os

from nemo_platform import NeMoPlatform

client = NeMoPlatform(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
    workspace="default",
)
```
> **Tip:** Set `NMP_BASE_URL` to your NeMo Evaluator deployment endpoint. See Initializing the CLI and SDK for the full convention.
## Dataset Requirements

Your dataset must be compatible with all metrics in the benchmark. Each metric defines input templates (like `{{output}}` and `{{reference}}`) that map to columns in your dataset.
For offline evaluation, your dataset should contain pre-generated model outputs as a JSON array:
```json
[
  {"input": "What is the capital of France?", "output": "Paris", "reference": "Paris"},
  {"input": "What is 2+2?", "output": "4", "reference": "4"}
]
```
For online evaluation, your dataset contains inputs that will be sent to the model. The `prompt_template` you provide must reference columns in your dataset:

```json
[
  {"question": "What is the capital of France?", "expected_answer": "Paris"},
  {"question": "What is 2+2?", "expected_answer": "4"}
]
```
> **Note:** Upload your dataset to a fileset in any supported format. You can select specific files using a `#` fragment pattern (for example, `my-workspace/my-fileset#data.csv`). If no pattern is specified, all parsable files in the fileset are loaded.
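To make the reference format concrete, here is a small parsing helper. It is purely illustrative and not part of the SDK:

```python
# Illustrative only, not part of the SDK: split a fileset reference of the
# form "workspace/fileset#pattern" into its parts.
def parse_fileset_ref(ref: str):
    """Return (workspace, fileset, pattern); pattern is None when absent."""
    base, _, pattern = ref.partition("#")
    workspace, _, fileset = base.partition("/")
    return workspace, fileset, pattern or None
```

For example, `parse_fileset_ref("my-workspace/my-fileset#data.csv")` yields `("my-workspace", "my-fileset", "data.csv")`.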
> **Important:** Ensure your dataset columns match both:
>
> - The input templates defined in your metrics (for example, `{{output}}`, `{{reference}}`)
> - The `prompt_template` used in online evaluation jobs (for example, `{{question}}`)
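A quick local check can catch such mismatches before you upload. The sketch below is a standalone helper, not part of the SDK: it extracts `{{column}}` placeholders from a template and reports any columns missing from your rows:

```python
import re


def template_columns(template: str) -> set[str]:
    """Extract {{column}} placeholder names from a template string."""
    return set(re.findall(r"\{\{\s*(\w+)\s*\}\}", template))


def missing_columns(rows: list[dict], template: str) -> set[str]:
    """Return placeholder names that are absent from at least one row."""
    required = template_columns(template)
    missing = set()
    for row in rows:
        missing |= required - row.keys()
    return missing
```

For example, `missing_columns(rows, "{{question}}")` returns an empty set only when every row has a `question` key.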
## Create a Custom Benchmark

A benchmark requires:

- `name`: Unique identifier within the workspace
- `description`: Human-readable description of what the benchmark evaluates
- `metrics`: List of metric references in `workspace/metric-name` format
- `dataset`: Fileset reference in `workspace/fileset-name` format
```python
import os

from nemo_platform import NeMoPlatform

client = NeMoPlatform(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
    workspace="default",
)

benchmark = client.evaluation.benchmarks.create(
    name="customer-support-quality",
    description="Evaluates response quality for customer support conversations",
    metrics=[
        "my-workspace/answer-relevancy",
        "my-workspace/response-helpfulness",
    ],
    dataset="my-workspace/support-test-cases",
)
print(f"Created benchmark: {benchmark.name}")
```
Refer to Manage Benchmarks for listing and managing custom benchmarks.
## Run Benchmark Evaluation Jobs
After creating a benchmark, run evaluation jobs against it. There are two job types:
| Job Type | Use When | Dataset Contains |
|---|---|---|
| Offline | You have pre-generated model outputs to evaluate | Input, output, and reference columns |
| Online | You want to generate and evaluate responses in one job | Input and reference columns (model generates output) |
Choose offline evaluation for:

- Evaluating pre-generated model outputs (for example, from batch inference)
- Comparing multiple model versions on the same test set

Choose online evaluation for:

- Testing a model endpoint with live inference
- End-to-end evaluation that generates and scores outputs in a single job
### Offline Evaluation
Offline evaluation assesses pre-existing model outputs stored in your dataset. Use this when you have already generated responses and want to evaluate their quality.
```python
from nemo_platform.types.evaluation import BenchmarkOfflineJobParam

job = client.evaluation.benchmark_jobs.create(
    spec=BenchmarkOfflineJobParam(
        benchmark="my-workspace/customer-support-quality",
    ),
)
print(f"Job created: {job.name}")
print(f"Status: {job.status}")
```
#### With Execution Parameters
Control job execution with optional parameters:
```python
from nemo_platform.types.evaluation import (
    BenchmarkOfflineJobParam,
    EvaluationJobParamsParam,
)

job = client.evaluation.benchmark_jobs.create(
    spec=BenchmarkOfflineJobParam(
        benchmark="my-workspace/customer-support-quality",
        params=EvaluationJobParamsParam(
            parallelism=8,
            limit_samples=100,  # Evaluate only the first 100 samples
        ),
    ),
)
```
### Online Evaluation

Online evaluation generates model responses at runtime, then evaluates them against your metrics. Use this to evaluate a model's live performance.

The `model` field accepts either an inline model definition or a model reference. Refer to Model Configuration for details on both formats.
#### With Inline Model

```python
from nemo_platform.types.evaluation import (
    BenchmarkOnlineJobParam,
    ModelParam,
)

job = client.evaluation.benchmark_jobs.create(
    spec=BenchmarkOnlineJobParam(
        benchmark="my-workspace/customer-support-quality",
        model=ModelParam(
            url="https://integrate.api.nvidia.com/v1/chat/completions",
            name="meta/llama-3.1-8b-instruct",
            api_key_secret="nvidia-api-key",
        ),
        prompt_template="Answer the following customer question:\n\n{{input}}",
    ),
)
```
#### With Model Reference

```python
from nemo_platform.types.evaluation import BenchmarkOnlineJobParam

job = client.evaluation.benchmark_jobs.create(
    spec=BenchmarkOnlineJobParam(
        benchmark="my-workspace/customer-support-quality",
        model="my-workspace/llama-3-1-8b-instruct",
        prompt_template="Answer the following customer question:\n\n{{input}}",
    ),
)
```
## Job Management

After creating a job, refer to Benchmark Job Management to monitor its execution and progress.
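If you want to wait for a job programmatically, the polling loop can be factored into a small helper. The sketch below is illustrative, not an SDK utility: `get_status` is any zero-argument callable that returns the current status string (for example, a lambda wrapping `client.evaluation.benchmark_jobs.get_status`), and the non-terminal states are assumed to be those used elsewhere on this page:

```python
import time


def wait_for_job(get_status, poll_interval=10.0, timeout=3600.0, sleep=time.sleep):
    """Poll `get_status` until the job reaches a terminal state.

    Raises TimeoutError if the job is still non-terminal after `timeout`
    seconds; otherwise returns the final status string.
    """
    non_terminal = {"pending", "active", "created"}  # assumed state names
    waited = 0.0
    status = get_status()
    while status in non_terminal:
        if waited >= timeout:
            raise TimeoutError(f"job still {status!r} after {timeout}s")
        sleep(poll_interval)
        waited += poll_interval
        status = get_status()
    return status
```

Injecting `sleep` as a parameter keeps the helper trivially testable with a no-op sleep.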
## Retrieve Results
After the job completes, retrieve and analyze results. Refer to Benchmark Results for detailed examples of downloading aggregate scores, row-level scores, and analyzing results with Pandas.
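As a sketch of what that analysis might look like, suppose the row-level scores have been loaded into a list of dicts with `metric` and `score` keys (the actual schema in your deployment may differ); pandas can then summarize them per metric:

```python
import pandas as pd

# Hypothetical row-level results; the real schema returned by the
# platform may differ in your deployment.
rows = [
    {"metric": "exact-match", "score": 1.0},
    {"metric": "exact-match", "score": 0.0},
    {"metric": "f1-score", "score": 0.8},
    {"metric": "f1-score", "score": 0.6},
]

# Aggregate the per-row scores into a mean and sample count per metric.
df = pd.DataFrame(rows)
summary = df.groupby("metric")["score"].agg(["mean", "count"])
print(summary)
```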
## Complete Example
Here is a complete workflow for creating a benchmark and running an evaluation.
> **Note:** This example assumes you have already created the metrics (`exact-match`, `f1-score`) and uploaded your dataset (`qa-test-data`). Refer to Evaluation Metrics for how to create custom metrics.
```python
import os
import time

from nemo_platform import NeMoPlatform
from nemo_platform.types.evaluation import (
    BenchmarkOfflineJobParam,
    EvaluationJobParamsParam,
)

workspace = "default"
client = NeMoPlatform(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
    workspace=workspace,
)

# 1. Create a custom benchmark
benchmark = client.evaluation.benchmarks.create(
    name="qa-accuracy-benchmark",
    description="Measures answer accuracy for Q&A tasks",
    metrics=[f"{workspace}/exact-match", f"{workspace}/f1-score"],
    dataset=f"{workspace}/qa-test-data",
)
print(f"Created benchmark: {benchmark.name}")

# 2. Run an offline evaluation job
job = client.evaluation.benchmark_jobs.create(
    spec=BenchmarkOfflineJobParam(
        benchmark=f"{workspace}/{benchmark.name}",
        params=EvaluationJobParamsParam(parallelism=16),
    ),
)
print(f"Started job: {job.name}")

# 3. Wait for completion
status = client.evaluation.benchmark_jobs.get_status(name=job.name)
while status.status in ("pending", "active", "created"):
    print(f"Status: {status.status}")
    time.sleep(10)
    status = client.evaluation.benchmark_jobs.get_status(name=job.name)
print(f"Job completed: {status.status}")

# 4. Get results
if status.status == "completed":
    results = client.evaluation.benchmark_jobs.results.aggregate_scores.download(
        job=job.name,
    )
    print("Results:")
    print(results.to_json(indent=2))
```
## Troubleshooting

### Common Errors
**“Benchmark not found”**

- Verify the benchmark reference format: `workspace/benchmark-name`
- Ensure the benchmark was created in the correct workspace
- Check that the benchmark was not deleted

**“Metric not found”**

- Ensure all metrics referenced in the benchmark exist in your workspace
- Remember that system metrics (`system/...`) cannot be used in custom benchmarks
- Verify metric names match exactly (case-sensitive)

**“Fileset not found”**

- Verify the dataset fileset was uploaded to the correct workspace
- Check the fileset reference format: `workspace/fileset-name`
- Ensure at least one file was uploaded to the fileset

**Job status “error”**

- Check job logs using `client.evaluation.benchmark_jobs.get_logs(name=job.name)` for specific error messages
- Verify your dataset columns match the metric input templates
- For online jobs, verify the model endpoint is accessible and the API key is valid

**Dataset column mismatch**

- Ensure your dataset contains all columns referenced by your metrics
- For offline jobs: typically `input`, `output`, `reference`
- For online jobs: columns referenced in your `prompt_template`
### Debugging Tips

- **Start small:** Test with `limit_samples=10` to quickly identify issues
- **Check logs:** Always review job logs when a job fails
- **Validate dataset:** Ensure your dataset files are valid and columns/keys are consistent across rows
- **Test metrics first:** Run individual metric evaluations before combining them into a benchmark
For additional help, refer to Troubleshooting.