Benchmark Results#
When a benchmark job completes, the platform automatically creates a benchmark job result — a persistent, queryable entity that captures the outcome of that run. Each result is created exactly once per completed job and shares the job’s name.
A result contains:
- Per-metric results: for each metric in the benchmark, an entry with aggregate scores (mean, min, max, std dev, count, NaN count, etc.) computed across all dataset rows.
- References: the benchmark definition, dataset, and model used in the job, enabling cross-job filtering and comparison.
Results are stored independently of the job so you can list, filter, and compare them at any time without re-running evaluations. Use the SDK to:
- aggregate_scores: view aggregated statistics for each metric in the benchmark.
- row_scores: download scores computed per dataset row (JSONL).
- benchmark_job_results: list, filter, and compare aggregated results across different evaluation jobs.
Get Aggregate Scores#
Retrieve aggregate scores and statistics for all metrics in your benchmark evaluation:
aggregate = client.evaluation.benchmark_jobs.results.aggregate_scores.download(
    job="my-job-name",
)

for score in aggregate.scores:
    print(f"{score.name}: mean={score.mean} type={score.score_type}")
The benchmark aggregate payload contains:
- results[].scores[]: aggregated score objects across all metrics in the benchmark
- Range scores: score_type="range" with percentiles and histogram
- Rubric scores: score_type="rubric" with rubric distribution and mode category
- Common stats: count, nan_count, sum, mean, min, max, variance, std_dev
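Consumers of the payload typically branch on score_type. A minimal sketch of that dispatch, using plain dicts as illustrative stand-ins for the SDK's score models (the exact field names for percentiles and the mode category are assumptions based on the list above):

```python
# Illustrative sketch: dispatch on score_type. Dict shapes are assumptions,
# not the SDK's exact models.
def summarize(score: dict) -> str:
    common = f"{score['name']}: mean={score['mean']} (n={score['count']}, nan={score['nan_count']})"
    if score["score_type"] == "range":
        # Range scores additionally carry percentiles and a histogram.
        return common + f" p50={score['percentiles']['50']}"
    if score["score_type"] == "rubric":
        # Rubric scores carry a category distribution and a mode category.
        return common + f" mode={score['mode_category']}"
    return common

example = {
    "name": "exact-match",
    "score_type": "range",
    "mean": 0.6,
    "count": 5,
    "nan_count": 0,
    "percentiles": {"50": 1.0},
}
print(summarize(example))  # → exact-match: mean=0.6 (n=5, nan=0) p50=1.0
```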
Get Row-Level Scores#
Retrieve per-row scores for deeper analysis:
row_scores = client.evaluation.benchmark_jobs.results.row_scores.download(
    job="my-job-name",
)

# Iterate through rows (JSONL format)
for row in row_scores:
    item = row.item          # Original dataset row
    sample = row.sample      # Model output payload (often empty for offline jobs)
    metrics = row.metrics    # Dict[str, List[MetricScore]] keyed by metric reference
    requests = row.requests  # Request/response logs captured during evaluation
    for metric_ref, scores in metrics.items():
        for score in scores:
            print(f"{metric_ref}:{score.name}={score.value}")
Each JSONL row contains:
- item: The original dataset row
- sample: Model output payload
- metrics: Scores for each metric, keyed by metric reference
- requests: Captured request logs
List Results#
List all benchmark job results in a workspace:
for benchmark_job in client.evaluation.benchmark_job_results.list():
    for metric_result in benchmark_job.results:
        print(f"{benchmark_job.name} / {metric_result.metric}: {[s.name for s in metric_result.scores]}")
Unlike metric results, each benchmark result contains a results list — one entry per metric in the benchmark — where each entry holds the aggregate scores for that metric.
Filter by benchmark, dataset, or model using the filter parameter to compare scores across models for a given benchmark, or to isolate results from a specific dataset or model. Filter values expect references with the workspace/name format.
results = client.evaluation.benchmark_job_results.list(
    filter={"benchmark": "my-workspace/my-benchmark"},
)
Select which aggregate fields to return for each score in the payload:
results = client.evaluation.benchmark_job_results.list(
    aggregate_fields=["mean"],
)
Sort results:
# Sort by creation time (newest first):
results = client.evaluation.benchmark_job_results.list(sort="-created_at")
Use the search parameter to do partial-match queries across result fields. Searchable fields are name, benchmark, dataset, model, and created_at. Bare string values default to a $like (substring) match. Use the JSON operator syntax for more control:
| Operator | Meaning |
|---|---|
| `$eq` | Exact match |
| `$like` | Substring match |
| `$gt`, `$gte`, `$lt`, `$lte` | Numeric / date comparisons |
| `$in`, `$nin` | Match / exclude a list of values |
| `$and`, `$or` | Logical operators |
# Substring match on name:
results = client.evaluation.benchmark_job_results.list(search={"name": "llama"})
# Equivalent explicit form:
results = client.evaluation.benchmark_job_results.list(search={"name": {"$like": "llama"}})
# Results created after a specific date:
results = client.evaluation.benchmark_job_results.list(search={"created_at": {"$gt": "2025-01-01T00:00:00"}})
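Clauses can be combined with the logical operators. A sketch of building a combined query, assuming `$and` takes a list of clauses (an extrapolation from the operator table rather than a documented example):

```python
# Hypothetical combined query: substring match on name AND a date lower bound.
# The $and clause shape is an assumption extrapolated from the operator table.
search = {
    "$and": [
        {"name": {"$like": "llama"}},
        {"created_at": {"$gt": "2025-01-01T00:00:00"}},
    ]
}

# Passed through unchanged:
# results = client.evaluation.benchmark_job_results.list(search=search)
```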
Pagination#
list() returns a paginated response. Iterating directly over the response automatically fetches subsequent pages:
for result in client.evaluation.benchmark_job_results.list():
    print(result.name)
Control page size and page number explicitly with the page and page_size parameters:
page = client.evaluation.benchmark_job_results.list(page=1, page_size=25)
print(f"Page {page.pagination.page} of {page.pagination.total_pages}")
print(f"{page.pagination.total_results} total results")
for result in page.data:
    print(result.name)
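Iterating the response is equivalent to walking pages until `total_pages` is exhausted. A self-contained sketch of that loop, with a stub standing in for the SDK call (the stub's data and signature are illustrative):

```python
def fetch_page(page: int, page_size: int = 25) -> dict:
    """Stub standing in for client.evaluation.benchmark_job_results.list(...)."""
    all_names = [f"job-{i}" for i in range(60)]
    start = (page - 1) * page_size
    return {
        "data": all_names[start:start + page_size],
        "total_pages": -(-len(all_names) // page_size),  # ceiling division
    }

# Walk every page, accumulating results.
names = []
page_num = 1
while True:
    page = fetch_page(page_num)
    names.extend(page["data"])
    if page_num >= page["total_pages"]:
        break
    page_num += 1

print(len(names))  # → 60
```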
Manage Results#
Get a Specific Result#
Retrieve a single benchmark result by job name:
result = client.evaluation.benchmark_job_results.retrieve("my-job-name")
Delete a Result#
Delete a job result when it is redundant, misleading, or erroneous; for example, a result with a high NaN count across metrics may not be suitable for comparison. A deleted result no longer appears when listing and comparing results.
client.evaluation.benchmark_job_results.delete("my-job-name")
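One common cleanup pass is pruning results whose metrics are dominated by NaN scores. A hedged sketch, using plain dicts as stand-ins for the SDK's result and score models (field names mirror the aggregate stats listed earlier):

```python
NAN_THRESHOLD = 0.5  # prune results where over half the rows scored NaN

# Illustrative stand-ins for benchmark job result entries.
results = [
    {"name": "run-a", "scores": [{"count": 10, "nan_count": 1}]},
    {"name": "run-b", "scores": [{"count": 10, "nan_count": 8}]},
]

to_delete = [
    r["name"]
    for r in results
    if any(s["nan_count"] / s["count"] > NAN_THRESHOLD for s in r["scores"])
]
print(to_delete)  # → ['run-b']

# Then delete each flagged result:
# for name in to_delete:
#     client.evaluation.benchmark_job_results.delete(name)
```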
Parse Results#
The examples below require pandas.
uv add pandas
Using Pandas for Analysis#
BenchmarkJobResult.results is a list of per-metric entries, each with a metric reference and its own scores list. Load aggregate scores into a single DataFrame:
import pandas as pd

aggregate = client.evaluation.benchmark_job_results.retrieve("my-job-name")

rows = []
for metric_result in aggregate.results:
    for score in metric_result.scores:
        row = score.to_dict()
        row["metric_ref"] = metric_result.metric
        rows.append(row)

df_agg = pd.DataFrame(rows)
print(df_agg[["metric_ref", "name", "mean", "std_dev", "min", "max", "count"]])
Example output (range scores):
metric_ref name mean std_dev min max count
0 my-workspace/exact-match exact-match 0.6 0.49 0.0 1.0 5
1 my-workspace/string-check string-check 0.8 0.40 0.0 1.0 5
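With aggregate rows collected the same way from several jobs, a pivot puts jobs side by side per metric. A sketch with illustrative data (job names and values are made up):

```python
import pandas as pd

# Illustrative aggregate rows gathered from two benchmark job results.
rows = [
    {"job": "job-a", "metric_ref": "my-workspace/exact-match", "mean": 0.6},
    {"job": "job-a", "metric_ref": "my-workspace/string-check", "mean": 0.8},
    {"job": "job-b", "metric_ref": "my-workspace/exact-match", "mean": 0.7},
    {"job": "job-b", "metric_ref": "my-workspace/string-check", "mean": 0.75},
]
df = pd.DataFrame(rows)

# One row per metric, one column per job.
comparison = df.pivot(index="metric_ref", columns="job", values="mean")
print(comparison)
```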
Load row-level scores for detailed analysis:
import pandas as pd

row_scores = client.evaluation.benchmark_jobs.results.row_scores.download(
    job="my-job-name",
)

# Flatten row scores into a DataFrame
rows = []
for row in row_scores:
    flat_row = {**row.item}
    for metric_ref, scores in row.metrics.items():
        for score in scores:
            flat_row[f"{metric_ref}:{score.name}"] = score.value
    rows.append(flat_row)

df_rows = pd.DataFrame(rows)
print(df_rows)
Example output:
input output reference exact-match string-check
0 What is 2+2? The answer is 4 4 0.0 1.0
1 Capital of France? Paris is the capital Paris 0.0 1.0
2 Color of sky? Blue blue 1.0 0.0
3 Largest planet? Jupiter Jupiter 1.0 1.0
4 Water formula? H2O H2O 1.0 1.0
Identify low-scoring samples for review:
# Find rows where any score is below threshold
score_cols = df_rows.select_dtypes(include="number").columns
low_scores = df_rows[df_rows[score_cols].min(axis=1) < 0.7]
print(f"Found {len(low_scores)} samples needing review")
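To see which metric drags the results down overall, average each numeric score column. A sketch over data shaped like the example output above (column values are copied from that example):

```python
import pandas as pd

# Score columns mirroring the example row-level output above.
df_rows = pd.DataFrame({
    "exact-match": [0.0, 0.0, 1.0, 1.0, 1.0],
    "string-check": [1.0, 1.0, 0.0, 1.0, 1.0],
})

# Mean per metric, weakest first.
per_metric = df_rows.select_dtypes(include="number").mean().sort_values()
print(per_metric)
```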