Metric Results#
When a metric job completes, the platform automatically creates a metric job result — a persistent, queryable entity that captures the outcome of that run. Each result is created exactly once per completed job and shares the job’s name.
A result contains:
- **Aggregate scores**: statistics (mean, min, max, std dev, count, NaN count, percentiles, etc.) computed across all dataset rows for each score produced by the metric.
- **References**: the metric definition, dataset, and model used in the job, enabling cross-job filtering and comparison.
Results are stored independently of the job so you can list, filter, and compare them at any time without re-running evaluations. Use the SDK to:
- `aggregate_scores`: view aggregated statistics for each score produced by the metric on the dataset.
- `row_scores`: download scores computed per dataset row (JSONL).
- `metric_job_results`: list, filter, and compare aggregated results across different evaluation jobs.
Get Aggregate Scores#
Retrieve aggregate scores and statistics for the metric job:

```python
aggregate = client.evaluation.metric_jobs.results.aggregate_scores.download(
    job="my-job-name",
)

for score in aggregate.scores:
    print(f"{score.name}: mean={score.mean} type={score.score_type}")
```
The metric aggregate payload contains:

- `scores[]`: aggregated score objects
  - Range scores: `score_type="range"` with percentiles and histogram
  - Rubric scores: `score_type="rubric"` with rubric distribution and mode category
  - Common stats: `count`, `nan_count`, `sum`, `mean`, `min`, `max`, `variance`, `std_dev`
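The two score types can be handled differently when iterating. Below is a minimal sketch of a hypothetical helper that formats one aggregate score (as a plain dict) based on its `score_type`; the type-specific field names (`percentiles`, `mode_category`) follow the payload description above but the exact payload shape is an assumption.

```python
# Hypothetical helper: summarize one aggregate score dict by score_type.
# Field names beyond the common stats (percentiles, mode_category) are
# assumptions about the exact payload shape.
def summarize_score(score: dict) -> str:
    common = (
        f"{score['name']}: mean={score['mean']} "
        f"(count={score['count']}, nan={score['nan_count']})"
    )
    if score.get("score_type") == "range":
        # Range scores also carry percentiles and a histogram.
        p50 = score.get("percentiles", {}).get("50")
        return f"{common} p50={p50}"
    if score.get("score_type") == "rubric":
        # Rubric scores also carry a category distribution and mode.
        return f"{common} mode={score.get('mode_category')}"
    return common
```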
Get Row-Level Scores#
Retrieve per-row scores for deeper analysis:
```python
row_scores = client.evaluation.metric_jobs.results.row_scores.download(
    job="my-job-name",
)

for row in row_scores:
    item = row.item          # Original dataset row
    sample = row.sample      # Model output payload (often empty for offline jobs)
    metrics = row.metrics    # Dict[str, List[MetricScore]]
    requests = row.requests  # Request/response logs captured during evaluation

    for metric_key, scores in metrics.items():
        for score in scores:
            print(f"{metric_key}:{score.name}={score.value}")
```
Each JSONL row contains:
- `item`: input row
- `sample`: sample output payload
- `metrics`: row score values keyed by metric key
- `requests`: captured request logs
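The flattening loop used with pandas later on this page can also be expressed as a small standalone helper. This is a sketch over plain dicts; the `name`/`value` fields mirror the `MetricScore` attributes used in the loop above.

```python
# Hypothetical helper: flatten one JSONL row (item + metrics) into a single
# flat dict, keyed by "metric_key:score_name". Operates on plain dicts that
# mirror the row layout described above.
def flatten_row(item: dict, metrics: dict) -> dict:
    flat = dict(item)
    for metric_key, scores in metrics.items():
        for score in scores:
            flat[f"{metric_key}:{score['name']}"] = score["value"]
    return flat
```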
List Results#
List all metric job results in a workspace:
```python
results = client.evaluation.metric_job_results.list()

for result in results:
    print(f"{result.name}: {[s.name for s in result.scores]}")
```

Filter by metric, dataset, or model using the `filter` parameter to compare scores across models for a given metric, or to isolate results from a specific dataset or model. Filter values expect references in the `workspace/name` format.
```python
results = client.evaluation.metric_job_results.list(
    filter={"metric": "my-workspace/my-metric"},
)
```
Select which aggregate fields to return for each score with the `aggregate_fields` parameter:

```python
results = client.evaluation.metric_job_results.list(
    aggregate_fields=["mean"],
)
```
Sort results:
```python
# Sort by creation time (newest first):
results = client.evaluation.metric_job_results.list(sort="-created_at")
```
Use the `search` parameter to do partial-match queries across result fields. Searchable fields are `name`, `metric`, `dataset`, `model`, and `created_at`. Bare string values default to a `$like` (substring) match. Use the JSON operator syntax for more control:
| Operator | Meaning |
|---|---|
| `$eq` | Exact match |
| `$like` | Substring match |
| `$gt`, `$gte`, `$lt`, `$lte` | Numeric / date comparisons |
| `$in`, `$nin` | Match / exclude a list of values |
| `$and`, `$or` | Logical operators |
```python
# Substring match on name:
results = client.evaluation.metric_job_results.list(search={"name": "llama"})

# Equivalent explicit form:
results = client.evaluation.metric_job_results.list(search={"name": {"$like": "llama"}})

# Results created after a specific date:
results = client.evaluation.metric_job_results.list(search={"created_at": {"$gt": "2025-01-01T00:00:00"}})
```
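Several fields can plausibly be searched at once by giving each its own operator clause in a single payload. Both operators below appear in the examples above, but combining multiple fields in one `search` dict is an assumption about the API:

```python
# Sketch: one search payload combining a substring match with a date bound.
# Combining several fields in a single search dict is an assumption.
search = {
    "name": {"$like": "llama"},
    "created_at": {"$gt": "2025-01-01T00:00:00"},
}
# results = client.evaluation.metric_job_results.list(search=search)
```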
Pagination#
`list()` returns a paginated response. Iterating directly over the response automatically fetches subsequent pages:

```python
for result in client.evaluation.metric_job_results.list():
    print(result.name)
```
Control page size and page number explicitly with the `page` and `page_size` parameters:

```python
page = client.evaluation.metric_job_results.list(page=1, page_size=25)

print(f"Page {page.pagination.page} of {page.pagination.total_pages}")
print(f"{page.pagination.total_results} total results")

for result in page.data:
    print(result.name)
```
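If you prefer explicit page control, the auto-iteration behavior can be reproduced with a small generic helper. This is a sketch: `fetch` stands in for any list method (for example, `client.evaluation.metric_job_results.list`), and the `page`/`page_size`/`pagination` names follow the example above.

```python
# Sketch: iterate all pages manually. `fetch` is any callable accepting
# page/page_size parameters and returning a page with .data and
# .pagination.total_pages, matching the fields shown above.
def iter_all(fetch, page_size=25):
    page_num = 1
    while True:
        page = fetch(page=page_num, page_size=page_size)
        yield from page.data
        if page_num >= page.pagination.total_pages:
            break
        page_num += 1
```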
Manage Results#
Get a Specific Result#
Retrieve a single result by name:
```python
result = client.evaluation.metric_job_results.retrieve("my-job-name")
```
Delete a Result#
Delete a job result when it is redundant, misleading, or otherwise erroneous. For example, a result with a high NaN count can skew comparisons. A deleted result no longer appears when listing or grouping results for high-level comparisons.
```python
client.evaluation.metric_job_results.delete("my-job-name")
```
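One way to spot such results is to compare each score's `nan_count` against its total row count. Below is a minimal sketch over plain dicts mirroring the common stats fields (`count`, `nan_count`); treating `count` as excluding NaNs and the 20% threshold are both assumptions.

```python
# Hypothetical helper: flag scores whose NaN ratio exceeds a threshold, as
# candidates for deleting the parent result. Assumes `count` excludes NaN
# rows, so the total is count + nan_count.
def flag_misleading(scores: list, max_nan_ratio: float = 0.2) -> list:
    flagged = []
    for s in scores:
        total = s["count"] + s["nan_count"]
        if total and s["nan_count"] / total > max_nan_ratio:
            flagged.append(s["name"])
    return flagged
```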
Parse Results#
The examples below require pandas.
```shell
uv add pandas
```
Aggregate Scores With Pandas#
```python
import pandas as pd

aggregate = client.evaluation.metric_jobs.results.aggregate_scores.download(
    job="my-job-name",
)

df_agg = pd.DataFrame([score.to_dict() for score in aggregate.scores])
print(df_agg[["name", "score_type", "mean", "std_dev", "min", "max", "count"]])
```
Row Scores With Pandas#
```python
import pandas as pd

row_scores = client.evaluation.metric_jobs.results.row_scores.download(
    job="my-job-name",
)

rows = []
for row in row_scores:
    flat_row = {**row.item}
    for metric_key, scores in row.metrics.items():
        for score in scores:
            flat_row[f"{metric_key}:{score.name}"] = score.value
    rows.append(flat_row)

df_rows = pd.DataFrame(rows)
print(df_rows)
```
Filter low-scoring rows:
```python
score_cols = df_rows.select_dtypes(include="number").columns
low_scores = df_rows[df_rows[score_cols].min(axis=1) < 0.7]
print(f"Found {len(low_scores)} rows needing review")
```