Manage Benchmarks#

List, retrieve, and delete evaluation benchmarks using the NeMo Platform Python SDK.

import os
from nemo_platform import NeMoPlatform

client = NeMoPlatform(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
    workspace="default",
)

List Benchmarks#

List all available evaluation benchmarks in a workspace, including both industry benchmarks and custom user-defined benchmarks.

Basic Usage#

# List all benchmarks in the current workspace
benchmarks = client.evaluation.benchmarks.list()
for benchmark in benchmarks:
    print(f"{benchmark.name}: {benchmark.description}")

Pagination#

Control the number of results returned per page and navigate through multiple pages of results.

# Get the first page with 50 benchmarks per page
benchmarks = client.evaluation.benchmarks.list(
    page=1,
    page_size=50
)

# Iterate through the results on this page
for benchmark in benchmarks:
    print(f"{benchmark.name}: {benchmark.description}")

# Get the second page
benchmarks_page_2 = client.evaluation.benchmarks.list(
    page=2,
    page_size=50
)
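When the total number of benchmarks is unknown, you can keep requesting pages until an empty one comes back. The helper below is a minimal sketch of that pattern; the `iter_all_benchmarks` name is ours, and it assumes `list()` returns an empty sequence once the page number is past the last page:

```python
def iter_all_benchmarks(list_fn, page_size=50):
    """Yield every benchmark by requesting pages until one comes back empty.

    `list_fn` is any callable with the same signature as
    client.evaluation.benchmarks.list (pass that method directly).
    Assumes an empty sequence is returned past the last page.
    """
    page = 1
    while True:
        results = list_fn(page=page, page_size=page_size)
        if not results:
            break
        yield from results
        page += 1

# Usage (hypothetical):
# for benchmark in iter_all_benchmarks(client.evaluation.benchmarks.list):
#     print(benchmark.name)
```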

Sorting#

Sort benchmarks by different fields in ascending or descending order. Prefix a field with - for descending order.

# Sort by name (ascending)
benchmarks_by_name = client.evaluation.benchmarks.list(
    sort="name"
)

# Sort by creation date (most recent first)
benchmarks_recent = client.evaluation.benchmarks.list(
    sort="-created_at"
)

# Sort by update date (oldest first)
benchmarks_oldest_updated = client.evaluation.benchmarks.list(
    sort="updated_at"
)

Available sort fields:

  • name / -name: Sort by benchmark name

  • created_at / -created_at: Sort by creation timestamp

  • updated_at / -updated_at: Sort by last update timestamp

Extended Response#

Use extended_response=True to retrieve detailed benchmark information including datasets and metrics configuration.

# Get benchmarks with full details
benchmarks = client.evaluation.benchmarks.list(
    extended_response=True
)

Filter by Label#

You can add labels to custom benchmarks when creating them, then use those labels to filter the list results.

benchmarks = client.evaluation.benchmarks.list(
    extra_query={"search": {"data.labels.my-label": {"$eq": "my-label-value"}}}
)
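The `extra_query` payload is easy to mistype by hand. A small helper can build it; the `label_filter` name is ours, and the `$eq` search shape is taken directly from the example above:

```python
def label_filter(label, value):
    """Build an extra_query payload matching benchmarks whose `label`
    equals `value`, using the $eq search syntax shown above."""
    return {"search": {f"data.labels.{label}": {"$eq": value}}}

# Usage (hypothetical):
# benchmarks = client.evaluation.benchmarks.list(
#     extra_query=label_filter("my-label", "my-label-value")
# )
```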

Retrieve a Specific Benchmark#

Get detailed information about a specific benchmark by its name within the workspace set for your client.

# Retrieve a benchmark by name
benchmark = client.evaluation.benchmarks.retrieve(name="my-custom-benchmark")

Retrieve with Extended Response#

Use extended_response=True to get complete benchmark details including datasets and metrics.

# Retrieve benchmark with full configuration
benchmark = client.evaluation.benchmarks.retrieve(
    name="my-custom-benchmark",
    extended_response=True
)

Delete a Benchmark#

Delete a custom evaluation benchmark. Industry benchmarks in the system workspace cannot be deleted.

Warning

Deleting a benchmark is permanent and cannot be undone. Ensure the benchmark is not being used by any active evaluations before deletion.

client.evaluation.benchmarks.delete(name="my-custom-benchmark")
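For bulk cleanup, one common pattern is to list benchmarks matching a label and delete each match. The sketch below is an assumption-laden illustration, not part of the SDK: it takes any object exposing `list(extra_query=...)` and `delete(name=...)` with the shapes used on this page, and you should verify the matched names before deleting in real use:

```python
def delete_by_label(benchmarks_api, label, value):
    """Delete every custom benchmark whose `label` equals `value`.

    `benchmarks_api` is assumed to expose list(extra_query=...) and
    delete(name=...) as shown on this page (hypothetical duck typing).
    Returns the names that were deleted. Deletion is irreversible --
    double-check the filter before running this.
    """
    query = {"search": {f"data.labels.{label}": {"$eq": value}}}
    deleted = []
    for benchmark in benchmarks_api.list(extra_query=query):
        benchmarks_api.delete(name=benchmark.name)
        deleted.append(benchmark.name)
    return deleted

# Usage (hypothetical):
# delete_by_label(client.evaluation.benchmarks, "my-label", "my-label-value")
```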