Benchmark Results#

When a benchmark job completes, the platform automatically creates a benchmark job result — a persistent, queryable entity that captures the outcome of that run. Each result is created exactly once per completed job and shares the job’s name.

A result contains:

  • Per-metric results: for each metric in the benchmark, an entry with aggregate scores (mean, min, max, std dev, count, NaN count, etc.) computed across all dataset rows.

  • References: the benchmark definition, dataset, and model used in the job, enabling cross-job filtering and comparison.

Results are stored independently of the job, so you can list, filter, and compare them at any time without re-running evaluations. The SDK provides:

  • aggregate_scores: view aggregated statistics for each metric in the benchmark.

  • row_scores: download scores computed per dataset row (JSONL).

  • benchmark_job_results: list, filter, and compare aggregated results across different evaluation jobs.

Get Aggregate Scores#

Retrieve aggregate scores and statistics for all metrics in your benchmark evaluation:

aggregate = client.evaluation.benchmark_jobs.results.aggregate_scores.download(
    job="my-job-name",
)

for score in aggregate.scores:
    print(f"{score.name}: mean={score.mean} type={score.score_type}")

The benchmark aggregate payload contains:

  • results[].scores[]: aggregated score objects across all metrics in the benchmark

  • Range scores: score_type="range" with percentiles and histogram

  • Rubric scores: score_type="rubric" with rubric distribution and mode category

  • Common stats: count, nan_count, sum, mean, min, max, variance, std_dev
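As a sketch of how to consume these fields, the helper below formats one aggregated score entry according to its score_type, using plain dicts as stand-ins for the score objects. The percentiles and mode_category key names are illustrative assumptions; check the payload returned by your client for the exact attribute names.

```python
def summarize_score(score: dict) -> str:
    """Render one aggregated score entry according to its score_type."""
    common = (f"{score['name']}: mean={score['mean']} "
              f"(count={score['count']}, nan={score['nan_count']})")
    if score["score_type"] == "range":
        # Range scores additionally carry percentiles and a histogram.
        p50 = score.get("percentiles", {}).get("50")
        return f"{common} p50={p50}"
    if score["score_type"] == "rubric":
        # Rubric scores carry a category distribution and a mode category.
        return f"{common} mode={score.get('mode_category')}"
    return common

print(summarize_score({
    "name": "exact-match", "score_type": "range",
    "mean": 0.6, "count": 5, "nan_count": 0,
    "percentiles": {"50": 1.0},
}))
# exact-match: mean=0.6 (count=5, nan=0) p50=1.0
```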

Get Row-Level Scores#

Retrieve per-row scores for deeper analysis:

row_scores = client.evaluation.benchmark_jobs.results.row_scores.download(
    job="my-job-name",
)

# Iterate through rows (JSONL format)
for row in row_scores:
    item = row.item  # Original dataset row
    sample = row.sample  # Model output payload (often empty for offline jobs)
    metrics = row.metrics  # Dict[str, List[MetricScore]] keyed by metric reference
    requests = row.requests  # Request/response logs captured during evaluation

    for metric_ref, scores in metrics.items():
        for score in scores:
            print(f"{metric_ref}:{score.name}={score.value}")

Each JSONL row contains:

  • item: The original dataset row

  • sample: Model output payload

  • metrics: Scores for each metric, keyed by metric reference

  • requests: Captured request logs

List Results#

List all benchmark job results in a workspace:

for benchmark_job in client.evaluation.benchmark_job_results.list():
    for metric_result in benchmark_job.results:
        print(f"{benchmark_job.name} / {metric_result.metric}: {[s.name for s in metric_result.scores]}")

Unlike metric results, each benchmark result contains a results list — one entry per metric in the benchmark — where each entry holds the aggregate scores for that metric.

Filter by benchmark, dataset, or model using the filter parameter to compare scores across models for a given benchmark, or to isolate results from a specific dataset or model. Filter values expect references in the workspace/name format.

results = client.evaluation.benchmark_job_results.list(
    filter={"benchmark": "my-workspace/my-benchmark"},
)
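Having filtered to one benchmark, you can compare models by picking, for each metric, the run with the highest mean. A minimal sketch on plain dicts that mirror the result shape (assuming one score per metric entry for brevity; in practice, build runs from the objects returned by list()):

```python
from collections import defaultdict


def best_mean_per_metric(runs: list[dict]) -> dict[str, tuple[str, float]]:
    """For each metric, return (model, mean) for the best-scoring run."""
    best: dict[str, tuple[str, float]] = {}
    for run in runs:
        for entry in run["results"]:
            metric = entry["metric"]
            mean = entry["scores"][0]["mean"]
            if metric not in best or mean > best[metric][1]:
                best[metric] = (run["model"], mean)
    return best


runs = [
    {"model": "my-workspace/model-a",
     "results": [{"metric": "my-workspace/exact-match",
                  "scores": [{"mean": 0.6}]}]},
    {"model": "my-workspace/model-b",
     "results": [{"metric": "my-workspace/exact-match",
                  "scores": [{"mean": 0.8}]}]},
]
print(best_mean_per_metric(runs))
# {'my-workspace/exact-match': ('my-workspace/model-b', 0.8)}
```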

Select which aggregate fields to return for each score:

results = client.evaluation.benchmark_job_results.list(
    aggregate_fields=["mean"]
)

Sort results:

# Sort by creation time (newest first):
results = client.evaluation.benchmark_job_results.list(sort="-created_at")

Use the search parameter to do partial-match queries across result fields. Searchable fields are name, benchmark, dataset, model, and created_at. Bare string values default to a $like (substring) match. Use the JSON operator syntax for more control:

| Operator | Meaning |
|---|---|
| $eq | Exact match |
| $like | Substring match |
| $lt, $lte, $gt, $gte | Numeric / date comparisons |
| $in, $nin | Match / exclude a list of values |
| $and, $or, $not | Logical operators |

# Substring match on name:
results = client.evaluation.benchmark_job_results.list(search={"name": "llama"})

# Equivalent explicit form:
results = client.evaluation.benchmark_job_results.list(search={"name": {"$like": "llama"}})

# Results created after a specific date:
results = client.evaluation.benchmark_job_results.list(search={"created_at": {"$gt": "2025-01-01T00:00:00"}})
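The comparison operators can be illustrated with a small client-side evaluator. This is a sketch of the matching semantics only, not how the server implements search; the $and, $or, and $not combinators are omitted for brevity.

```python
def matches(value, cond) -> bool:
    """Evaluate a search condition against a single field value."""
    if not isinstance(cond, dict):
        cond = {"$like": cond}  # bare strings default to substring match
    for op, arg in cond.items():
        ok = {
            "$eq": lambda: value == arg,
            "$like": lambda: str(arg) in str(value),
            "$lt": lambda: value < arg,
            "$lte": lambda: value <= arg,
            "$gt": lambda: value > arg,
            "$gte": lambda: value >= arg,
            "$in": lambda: value in arg,
            "$nin": lambda: value not in arg,
        }[op]()
        if not ok:
            return False
    return True


print(matches("llama-3-8b", "llama"))                # True
print(matches("2025-03-01", {"$gt": "2025-01-01"}))  # True (ISO dates compare lexically)
print(matches("gpt", {"$in": ["llama", "mistral"]})) # False
```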

Pagination#

list() returns a paginated response. Iterating directly over the response automatically fetches subsequent pages:

for result in client.evaluation.benchmark_job_results.list():
    print(result.name)

Control page size and page number explicitly with the page and page_size parameters:

page = client.evaluation.benchmark_job_results.list(page=1, page_size=25)

print(f"Page {page.pagination.page} of {page.pagination.total_pages}")
print(f"{page.pagination.total_results} total results")

for result in page.data:
    print(result.name)

Manage Results#

Get a Specific Result#

Retrieve a single benchmark result by job name:

result = client.evaluation.benchmark_job_results.retrieve("my-job-name")

Delete a Result#

Deleting a result removes it from listing and comparison. Delete a result when it is redundant, misleading, or erroneous; for example, a result with a high NaN count across metrics may not be suitable for comparison.

client.evaluation.benchmark_job_results.delete("my-job-name")
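For example, you might flag results whose metrics produced too many NaN scores before deleting them. A sketch on plain dicts mirroring the result shape; it assumes count excludes NaN rows (so count + nan_count is the row total), which you should verify against your payloads.

```python
def is_unreliable(result: dict, max_nan_frac: float = 0.2) -> bool:
    """True if any metric's NaN fraction exceeds the threshold."""
    for entry in result["results"]:
        for score in entry["scores"]:
            total = score["count"] + score["nan_count"]
            if total and score["nan_count"] / total > max_nan_frac:
                return True
    return False


result = {
    "name": "my-job-name",
    "results": [{"metric": "my-workspace/exact-match",
                 "scores": [{"count": 3, "nan_count": 2}]}],
}
if is_unreliable(result):
    print(f"would delete {result['name']}")  # would delete my-job-name
```

In practice, replace the print with a call to client.evaluation.benchmark_job_results.delete(result["name"]).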

Parse Results#

The examples below require pandas.

uv add pandas

Using Pandas for Analysis#

BenchmarkJobResult.results is a list of per-metric entries, each with a metric reference and its own scores list. Load aggregate scores into a single DataFrame:

import pandas as pd

aggregate = client.evaluation.benchmark_job_results.retrieve("my-job-name")

rows = []
for metric_result in aggregate.results:
    for score in metric_result.scores:
        row = score.to_dict()
        row["metric_ref"] = metric_result.metric
        rows.append(row)

df_agg = pd.DataFrame(rows)
print(df_agg[["metric_ref", "name", "mean", "std_dev", "min", "max", "count"]])

Example output (range scores):

                  metric_ref          name  mean  std_dev  min  max  count
0   my-workspace/exact-match   exact-match   0.6     0.49  0.0  1.0      5
1  my-workspace/string-check  string-check   0.8     0.40  0.0  1.0      5

Load row-level scores for detailed analysis:

import pandas as pd

row_scores = client.evaluation.benchmark_jobs.results.row_scores.download(
    job="my-job-name",
)

# Flatten row scores into a DataFrame
rows = []
for row in row_scores:
    flat_row = {**row.item}
    for metric_ref, scores in row.metrics.items():
        for score in scores:
            flat_row[f"{metric_ref}:{score.name}"] = score.value
    rows.append(flat_row)

df_rows = pd.DataFrame(rows)
print(df_rows)

Example output:

                input                output reference  exact-match  string-check
0        What is 2+2?       The answer is 4         4          0.0           1.0
1  Capital of France?  Paris is the capital     Paris          0.0           1.0
2       Color of sky?                  Blue      blue          1.0           0.0
3     Largest planet?               Jupiter   Jupiter          1.0           1.0
4      Water formula?                   H2O       H2O          1.0           1.0

Identify low-scoring samples for review:

# Find rows where any score is below threshold
score_cols = df_rows.select_dtypes(include="number").columns
low_scores = df_rows[df_rows[score_cols].min(axis=1) < 0.7]
print(f"Found {len(low_scores)} samples needing review")