Discover Industry Benchmarks

Discover the industry benchmarks in the system workspace that you can use in your evaluation jobs. You can list all industry benchmarks or filter them by category label.

Note

The system workspace is a reserved NeMo Platform workspace that contains ready-to-use industry benchmarks with published datasets and metrics.

Note

Initialization: This example calls NeMoPlatform() with no arguments, so the SDK reads your active CLI context (set by nmp auth login). Pass workspace="system" on each call to access the reserved system workspace. For the standard quickstart initialization, see Initializing the CLI and SDK.

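from nemo_platform import NeMoPlatform

# With no arguments, the client reads the active CLI context.
client = NeMoPlatform()

# List the benchmarks in the reserved system workspace.
benchmarks = client.evaluation.benchmarks.list(workspace="system")

# Print the total count, then each benchmark's name and description.
print(f"{benchmarks.pagination.total_results} benchmarks")
for benchmark in benchmarks:
    print(f"{benchmark.name}: {benchmark.description}")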

# Request a specific page of results (here, page 2 with up to 100 results per page)
benchmarks = client.evaluation.benchmarks.list(
    workspace="system",
    page=2,
    page_size=100,
)
for benchmark in benchmarks:
    print(f"{benchmark.name}: {benchmark.description}")

# Filter by evaluation category label
filtered_benchmarks = client.evaluation.benchmarks.list(
    workspace="system",
    extra_query={"search": {"data.labels.eval_category": {"$eq": "advanced_reasoning"}}},
)
for benchmark in filtered_benchmarks:
    print(f"{benchmark.name}: {benchmark.description}")

| Category Label | Description |
| --- | --- |
| `agentic` | Evaluate the performance of agent-based or multi-step reasoning models, especially in scenarios that require planning, tool use, and iterative reasoning. |
| `advanced_reasoning` | Evaluate the reasoning capabilities of large language models through complex tasks. |
| `code` | Evaluate code generation capabilities using functional correctness benchmarks that test the synthesis of working programs. |
| `content_safety` | Evaluate model safety risks, including susceptibility to generating harmful, biased, or misleading content. |
| `instruction_following` | Evaluate the ability to follow explicit formatting and structural instructions. |
| `language_understanding` | Evaluate knowledge and reasoning across diverse subjects in different languages. |
| `math` | Evaluate mathematical reasoning abilities. |
| `question_answering` | Evaluate the ability to generate answers to questions. |
| `rag` | Evaluate the quality of RAG pipelines by measuring both retrieval and answer generation performance. |
| `retrieval` | Evaluate the quality of document retriever pipelines. |
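
A typo in a category label can be hard to spot once it is inside a filter, so it can help to validate the label before building the query. The following is a minimal sketch: EVAL_CATEGORIES and category_filter are hypothetical helper names, not part of the SDK. The label set is copied from the table above, and the query shape mirrors the extra_query argument in the filter example.

```python
# Category labels copied from the table above.
EVAL_CATEGORIES = frozenset({
    "agentic",
    "advanced_reasoning",
    "code",
    "content_safety",
    "instruction_following",
    "language_understanding",
    "math",
    "question_answering",
    "rag",
    "retrieval",
})


def category_filter(category: str) -> dict:
    """Build the extra_query value for an eval_category filter.

    Hypothetical helper: raises ValueError for labels not in the table.
    """
    if category not in EVAL_CATEGORIES:
        valid = ", ".join(sorted(EVAL_CATEGORIES))
        raise ValueError(f"Unknown category {category!r}; expected one of: {valid}")
    return {"search": {"data.labels.eval_category": {"$eq": category}}}


if __name__ == "__main__":
    print(category_filter("math"))
```

The returned dictionary can then be passed as extra_query=category_filter("math") to client.evaluation.benchmarks.list(workspace="system", ...), in the same way as the advanced_reasoning filter shown earlier.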