Discover Industry Benchmarks
Discover the industry benchmarks in the system workspace that you can use in an evaluation job. You can list all of them or filter them by category label.
Note
The system workspace is a reserved NeMo Platform workspace that contains ready-to-use industry benchmarks with published datasets and metrics.
Note
Initialization: This example calls NeMoPlatform() with no arguments, so the SDK reads your active CLI context (set by nmp auth login). The workspace="system" argument is passed on each call to access the reserved system workspace. For the standard quickstart init, see Initializing the CLI and SDK.
from nemo_platform import NeMoPlatform

# With no arguments, the SDK reads the active CLI context.
client = NeMoPlatform()

# List all benchmarks in the system workspace
benchmarks = client.evaluation.benchmarks.list(workspace="system")
print(f"{benchmarks.pagination.total_results} benchmarks")
for benchmark in benchmarks:
    print(f"{benchmark.name}: {benchmark.description}")

# List benchmarks with pagination
benchmarks = client.evaluation.benchmarks.list(
    workspace="system",
    page=2,
    page_size=100,
)
for benchmark in benchmarks:
    print(f"{benchmark.name}: {benchmark.description}")

# Filter by evaluation category label
filtered_benchmarks = client.evaluation.benchmarks.list(
    workspace="system",
    extra_query={"search": {"data.labels.eval_category": {"$eq": "advanced_reasoning"}}},
)
print(filtered_benchmarks)
| Category Label | Description |
|---|---|
| | Evaluate the performance of agent-based or multi-step reasoning models, especially in scenarios requiring planning, tool use, and iterative reasoning. |
| | Evaluate reasoning capabilities of large language models through complex tasks. |
| | Evaluate code generation capabilities using functional correctness benchmarks that test synthesis of working programs. |
| | Evaluate model safety risks, including susceptibility to generating harmful, biased, or misleading content. |
| | Evaluate the ability to follow explicit formatting and structural instructions. |
| | Evaluate knowledge and reasoning across diverse subjects in different languages. |
| | Evaluate mathematical reasoning abilities. |
| | Evaluate the ability to generate answers to questions. |
| | Evaluate the quality of RAG pipelines by measuring both retrieval and answer generation performance. |
| | Evaluate the quality of document retriever pipelines. |
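Each category label can be used in the same extra_query filter shown in the listing example. The following is a small sketch of a helper that builds that filter for a given label. The function name category_filter is hypothetical and not part of the SDK; the payload shape is copied from the filter example, and "advanced_reasoning" is the only label value confirmed on this page.

```python
from typing import Any


def category_filter(eval_category: str) -> dict[str, Any]:
    """Build an extra_query payload that matches one eval_category label."""
    if not eval_category:
        raise ValueError("eval_category must be a non-empty string")
    # Same structure as the filter in the listing example.
    return {"search": {"data.labels.eval_category": {"$eq": eval_category}}}


query = category_filter("advanced_reasoning")
print(query)
```

You would then pass the result as extra_query=category_filter("advanced_reasoning") to client.evaluation.benchmarks.list(workspace="system", ...).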