Metric Results#

When a metric job completes, the platform automatically creates a metric job result — a persistent, queryable entity that captures the outcome of that run. Each result is created exactly once per completed job and shares the job’s name.

A result contains:

  • Aggregate scores: statistics (mean, min, max, std dev, count, NaN count, percentiles, etc.) computed across all dataset rows for each score produced by the metric.

  • References: the metric definition, dataset, and model used in the job, enabling cross-job filtering and comparison.

Results are stored independently of the job so you can list, filter, and compare them at any time without re-running evaluations. Use the SDK to:

  • aggregate_scores: view aggregated statistics for each score produced by the metric on the dataset.

  • row_scores: download scores computed per dataset row (JSONL).

  • metric_job_results: list, filter, and compare aggregated results across different evaluation jobs.

Get Aggregate Scores#

Retrieve aggregate scores and statistics for the metric job:

aggregate = client.evaluation.metric_jobs.results.aggregate_scores.download(
    job="my-job-name",
)

for score in aggregate.scores:
    print(f"{score.name}: mean={score.mean} type={score.score_type}")

The metric aggregate payload contains:

  • scores[]: aggregated score objects

  • Range scores: score_type="range" with percentiles and histogram

  • Rubric scores: score_type="rubric" with rubric distribution and mode category

  • Common stats: count, nan_count, sum, mean, min, max, variance, std_dev
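Which fields are populated beyond the common stats depends on score_type. A minimal sketch that dispatches on the type when summarizing a score; field names beyond those listed above (such as the exact shape of percentiles) are assumptions, not confirmed API:

```python
# Sketch: dispatch on score_type when reading aggregate scores.
# The shapes of `percentiles` (dict keyed by percentile string) and
# `mode_category` are assumptions based on the field list above.

def summarize(score) -> str:
    """Build a one-line summary string for an aggregated score object."""
    base = f"{score.name}: mean={score.mean} count={score.count} nan={score.nan_count}"
    if score.score_type == "range":
        # Range scores additionally carry percentiles and a histogram.
        return base + f" p50={score.percentiles.get('50')}"
    if score.score_type == "rubric":
        # Rubric scores carry a category distribution and the mode category.
        return base + f" mode={score.mode_category}"
    return base
```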

Get Row-Level Scores#

Retrieve per-row scores for deeper analysis:

row_scores = client.evaluation.metric_jobs.results.row_scores.download(
    job="my-job-name",
)

for row in row_scores:
    item = row.item  # Original dataset row
    sample = row.sample  # Model output payload (often empty for offline jobs)
    metrics = row.metrics  # Dict[str, List[MetricScore]]
    requests = row.requests  # Request/response logs captured during evaluation

    for metric_key, scores in metrics.items():
        for score in scores:
            print(f"{metric_key}:{score.name}={score.value}")

Each JSONL row contains:

  • item: input row

  • sample: sample output payload

  • metrics: row score values keyed by metric key

  • requests: captured request logs

List Results#

List all metric job results in a workspace:

results = client.evaluation.metric_job_results.list()

for result in results:
    print(f"{result.name}: {[s.name for s in result.scores]}")

Filter by metric, dataset, or model using the filter parameter, for example to compare scores across models for a given metric, or to isolate results from a specific dataset. Filter values are references in workspace/name format.

results = client.evaluation.metric_job_results.list(
    filter={"metric": "my-workspace/my-metric"},
)
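Multiple filter keys can be supplied in one call. The build_result_filter helper below is a hypothetical convenience for assembling the filter dict; whether multiple keys combine with AND is an assumption, not stated above:

```python
def build_result_filter(metric=None, dataset=None, model=None):
    """Build a workspace/name filter dict, dropping unset keys."""
    refs = {"metric": metric, "dataset": dataset, "model": model}
    return {k: v for k, v in refs.items() if v is not None}

# Usage (assumes multiple keys narrow the results together):
# results = client.evaluation.metric_job_results.list(
#     filter=build_result_filter(
#         metric="my-workspace/my-metric",
#         dataset="my-workspace/my-dataset",
#     ),
# )
```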

Use aggregate_fields to select which aggregate statistics are returned for each score:

results = client.evaluation.metric_job_results.list(
    aggregate_fields=["mean"]
)

Sort results:

# Sort by creation time (newest first):
results = client.evaluation.metric_job_results.list(sort="-created_at")

Use the search parameter to do partial-match queries across result fields. Searchable fields are name, metric, dataset, model, and created_at. Bare string values default to a $like (substring) match. Use the JSON operator syntax for more control:

  • $eq: exact match

  • $like: substring match

  • $lt, $lte, $gt, $gte: numeric / date comparisons

  • $in, $nin: match / exclude a list of values

  • $and, $or, $not: logical operators

# Substring match on name:
results = client.evaluation.metric_job_results.list(search={"name": "llama"})

# Equivalent explicit form:
results = client.evaluation.metric_job_results.list(search={"name": {"$like": "llama"}})

# Results created after a specific date:
results = client.evaluation.metric_job_results.list(search={"created_at": {"$gt": "2025-01-01T00:00:00"}})
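The logical operators can compose the comparison operators above. A sketch of a composed search expression; the exact server-side semantics of nesting $and and $or this way are an assumption:

```python
# Sketch: results whose name contains "llama" OR "mistral",
# AND which were created on or after 2025-01-01.
search = {
    "$and": [
        {"$or": [
            {"name": {"$like": "llama"}},
            {"name": {"$like": "mistral"}},
        ]},
        {"created_at": {"$gte": "2025-01-01T00:00:00"}},
    ],
}
# results = client.evaluation.metric_job_results.list(search=search)
```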

Pagination#

list() returns a paginated response. Iterating directly over the response automatically fetches subsequent pages:

for result in client.evaluation.metric_job_results.list():
    print(result.name)

Control page size and page number explicitly with the page and page_size parameters:

page = client.evaluation.metric_job_results.list(page=1, page_size=25)

print(f"Page {page.pagination.page} of {page.pagination.total_pages}")
print(f"{page.pagination.total_results} total results")

for result in page.data:
    print(result.name)
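If you prefer explicit control over paging, you can walk pages yourself. The iter_all helper below is a hypothetical convenience built on the page/page_size parameters and the pagination fields shown above:

```python
def iter_all(list_fn, page_size=25):
    """Yield every result by fetching pages with explicit page numbers.

    `list_fn` is any callable accepting page/page_size and returning a
    page object with `.data` and `.pagination.total_pages`, e.g.
    client.evaluation.metric_job_results.list.
    """
    page_num = 1
    while True:
        page = list_fn(page=page_num, page_size=page_size)
        yield from page.data
        if page_num >= page.pagination.total_pages:
            break
        page_num += 1

# for result in iter_all(client.evaluation.metric_job_results.list):
#     print(result.name)
```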

Manage Results#

Get a Specific Result#

Retrieve a single result by name:

result = client.evaluation.metric_job_results.retrieve("my-job-name")

Delete a Result#

Delete a result when it is redundant, misleading, or erroneous: for example, a result with a high NaN count may be misleading. A deleted result no longer appears when listing and grouping results for high-level comparisons.

client.evaluation.metric_job_results.delete("my-job-name")
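Pruning misleading results can be automated by inspecting the aggregate stats. The is_mostly_nan helper below is a hypothetical sketch; it assumes count includes NaN rows, and the 50% threshold is an arbitrary choice:

```python
def is_mostly_nan(result, threshold: float = 0.5) -> bool:
    """True if any score on the result has more than `threshold` NaNs.

    Uses the nan_count / count fields from the aggregate stats; the
    assumption that `count` includes NaN rows is not confirmed above.
    """
    for score in result.scores:
        if score.count and score.nan_count / score.count > threshold:
            return True
    return False

# for result in client.evaluation.metric_job_results.list():
#     if is_mostly_nan(result):
#         client.evaluation.metric_job_results.delete(result.name)
```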

Parse Results#

The examples below require pandas.

uv add pandas

Aggregate Scores With Pandas#

import pandas as pd

aggregate = client.evaluation.metric_jobs.results.aggregate_scores.download(
    job="my-job-name",
)

df_agg = pd.DataFrame([score.to_dict() for score in aggregate.scores])
print(df_agg[["name", "score_type", "mean", "std_dev", "min", "max", "count"]])

Row Scores With Pandas#

import pandas as pd

row_scores = client.evaluation.metric_jobs.results.row_scores.download(
    job="my-job-name",
)

rows = []
for row in row_scores:
    flat_row = {**row.item}
    for metric_key, scores in row.metrics.items():
        for score in scores:
            flat_row[f"{metric_key}:{score.name}"] = score.value
    rows.append(flat_row)

df_rows = pd.DataFrame(rows)
print(df_rows)

Filter low-scoring rows:

score_cols = df_rows.select_dtypes(include="number").columns
low_scores = df_rows[df_rows[score_cols].min(axis=1) < 0.7]
print(f"Found {len(low_scores)} rows needing review")
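To compare results side by side, per-result mean scores can be tabulated into one frame. results_to_frame below is a hypothetical helper; it assumes each result exposes name, model, and scores with name/mean fields, per the payload described above:

```python
import pandas as pd

def results_to_frame(results) -> pd.DataFrame:
    """One row per result, one column per score mean."""
    records = []
    for result in results:
        record = {"result": result.name, "model": result.model}
        for score in result.scores:
            record[score.name] = score.mean
        records.append(record)
    return pd.DataFrame(records)

# df = results_to_frame(client.evaluation.metric_job_results.list(
#     filter={"metric": "my-workspace/my-metric"},
# ))
# print(df.sort_values("result"))
```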