Agentic Benchmarks#
Evaluate agent tool-calling capabilities using industry-standard benchmarks.
Prerequisites#
Before running agentic benchmarks, ensure you have:
- **Workspace**: A workspace must exist. All resources (metrics, secrets, jobs) are scoped to a workspace.
- **Model endpoint**: A deployed model endpoint to evaluate for tool-calling capabilities.
- **API key secrets** (for some benchmarks): Some BFCL benchmarks call external APIs and require API keys stored as secrets.
First, initialize the platform client:

```python
import os

from nemo_platform import NeMoPlatform

client = NeMoPlatform(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
    workspace="default",
)
```
Berkeley Function Calling Leaderboard (BFCL)#
The Berkeley Function Calling Leaderboard (BFCL) is an industry-standard benchmark for evaluating a language model's tool-calling capabilities. Use this evaluation type to run BFCL tool-calling tasks against your model endpoint.
View all available BFCL benchmarks (system/bfclv3-*) with a label filter:
```python
bfcl_system_benchmarks = client.evaluation.benchmarks.list(
    workspace="system",
    extra_query={"search[data.labels.eval_harness]": "bfcl"},
)
print(bfcl_system_benchmarks)
```
BFCL evaluation runs online, so it requires a live model endpoint.
Some BFCL benchmarks call external APIs that require API keys (for example, system/bfclv3-exec-* benchmarks). Create secrets for the API keys before referencing them in the job.
- **RapidAPI** (free tier; subscription required): Yahoo Finance, Real-Time Amazon Data, Urban Dictionary, COVID-19, Time Zone by Location
- **Direct APIs**: ExchangeRate-API, OMDb, Geocode
```python
client.secrets.create(
    workspace="default",  # same workspace the client was created with
    name="rapid_api_key_secret",
    data="<your RapidAPI key>",
)
```
Then reference the secret by name in the job's benchmark parameters:

```python
benchmark_params = {
    "rapid_api_key": "rapid_api_key_secret",
}
```
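For benchmarks that need several API keys, it can help to build the `benchmark_params` mapping in one place. The following helper is hypothetical (not part of the SDK): it reads each key from an environment variable, optionally stores it as a secret through a callable you supply (for example, a wrapper around `client.secrets.create` shown above), and returns the parameter-to-secret-name mapping.

```python
import os

def api_key_params(env_to_param, create_secret=None):
    """Hypothetical helper: read each API key from an environment variable,
    optionally store it as a secret via `create_secret`, and return the
    benchmark_params mapping of parameter name -> secret name."""
    params = {}
    for env_var, param_name in env_to_param.items():
        value = os.environ.get(env_var)
        if value is None:
            raise KeyError(f"environment variable {env_var} is not set")
        secret_name = f"{param_name}_secret"
        if create_secret is not None:
            create_secret(name=secret_name, data=value)
        params[param_name] = secret_name
    return params

# Example wiring (assumes the client from earlier):
# benchmark_params = api_key_params(
#     {"RAPID_API_KEY": "rapid_api_key"},
#     create_secret=lambda **kw: client.secrets.create(workspace="default", **kw),
# )
```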
Create the benchmark job:

```python
from nemo_platform.types.evaluation import SystemBenchmarkOnlineJobParam

job = client.evaluation.benchmark_jobs.create(
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/bfclv3-live-simple",
        model={"url": "<your-nim-endpoint>/v1", "name": "nvidia/llama-3.3-nemotron-super-49b-v1"},
        benchmark_params={},  # empty here; pass the API-key mapping for exec benchmarks
    )
)
```
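Benchmark jobs run asynchronously, so scripts typically poll until the job reaches a terminal state. The sketch below is generic: `wait_for_job` takes any zero-argument callable that returns the current status string, because the exact SDK accessor for job status is an assumption here (the commented usage line shows a hypothetical `retrieve` call, not a confirmed API).

```python
import time

def wait_for_job(fetch_status, timeout=3600.0, interval=30.0,
                 terminal=("completed", "failed", "cancelled")):
    """Call fetch_status() every `interval` seconds until it returns one of
    the `terminal` states; raise TimeoutError after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status in terminal:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still '{status}' after {timeout}s")
        time.sleep(interval)

# Hypothetical usage (exact status accessor may differ in the SDK):
# final = wait_for_job(lambda: client.evaluation.benchmark_jobs.retrieve(job.id).status)
```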
The benchmark reports the accuracy of tool-call predictions:

- Score name: `tool-calling-accuracy`
- Value range: 0.0–1.0
Example results for a completed job:

```json
{
  "scores": [
    {
      "name": "tool-calling-accuracy",
      "score_type": "range",
      "count": 1,
      "nan_count": 0,
      "sum": 1.0,
      "mean": 1.0,
      "min": 1.0,
      "max": 1.0,
      "std_dev": 0.0,
      "variance": 0.0
    }
  ]
}
```
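To consume the scores programmatically, the payload can be parsed with the standard library. A minimal sketch using the example payload above (the `payload` string here mirrors that sample; it is not fetched from the service):

```python
import json

# Results payload as shown above for a completed job.
payload = """
{
  "scores": [
    {"name": "tool-calling-accuracy", "score_type": "range",
     "count": 1, "nan_count": 0, "sum": 1.0, "mean": 1.0,
     "min": 1.0, "max": 1.0, "std_dev": 0.0, "variance": 0.0}
  ]
}
"""

# Index each score entry by name, then read the mean accuracy.
scores = {s["name"]: s for s in json.loads(payload)["scores"]}
accuracy = scores["tool-calling-accuracy"]["mean"]
print(f"tool-calling accuracy: {accuracy:.2f}")  # tool-calling accuracy: 1.00
```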
Job Management#
After creating a job, go to Benchmark Job Management to monitor its progress and manage its lifecycle.
See also
- Agentic Evaluation Metrics: Detailed metric documentation for evaluating agentic workflows
- Managing Secrets: Store API keys for external APIs
- Evaluation Results: Understanding and downloading results