Agentic Benchmarks#

Evaluate agent tool-calling capabilities using industry-standard benchmarks.

Prerequisites#

Before running agentic benchmarks, ensure you have:

  1. Workspace: A workspace must exist. All resources (metrics, secrets, jobs) are scoped to a workspace.

  2. Model endpoint: A deployed model to evaluate for tool-calling capabilities.

  3. API key secrets (for some benchmarks): Some BFCL benchmarks require external API keys.

import os
from nemo_platform import NeMoPlatform

client = NeMoPlatform(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
    workspace="default",
)

Berkeley Function Calling Leaderboard (BFCL)#

BFCL is an industry benchmark for evaluating language model tool-calling capabilities. Use this evaluation type to run tool-calling tasks from the Berkeley Function Calling Leaderboard against your model.

View all available BFCL benchmarks (system/bfclv3-*) with a label filter:

bfcl_system_benchmarks = client.evaluation.benchmarks.list(
    workspace="system",
    extra_query={"search[data.labels.eval_harness]": "bfcl"}
)
print(bfcl_system_benchmarks)

BFCL benchmarks run as online evaluations and require a deployed model endpoint.

Some BFCL benchmarks call external APIs that require API keys (for example, system/bfclv3-exec-* benchmarks). Create secrets for the API keys before referencing them in the job.
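To check which of the listed benchmarks need API key secrets, you can match their names against the system/bfclv3-exec-* pattern noted above. The benchmark names below are illustrative; a minimal sketch:

```python
from fnmatch import fnmatch

# Illustrative benchmark names, as returned by the listing call above
benchmark_names = [
    "system/bfclv3-live-simple",
    "system/bfclv3-exec-simple",
    "system/bfclv3-exec-parallel",
]

# exec benchmarks call external APIs and therefore need API key secrets
needs_api_keys = [n for n in benchmark_names if fnmatch(n, "system/bfclv3-exec-*")]
print(needs_api_keys)  # ['system/bfclv3-exec-simple', 'system/bfclv3-exec-parallel']
```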

client.secrets.create(
    workspace="default",
    name="rapid_api_key_secret",
    data="<your RapidAPI key>"
)

# Reference the secret by name in benchmark_params when running an exec benchmark
benchmark_params = {
    "rapid_api_key": "rapid_api_key_secret",
}
from nemo_platform.types.evaluation import SystemBenchmarkOnlineJobParam

job = client.evaluation.benchmark_jobs.create(
    spec=SystemBenchmarkOnlineJobParam(
        benchmark="system/bfclv3-live-simple",
        model={"url": "<your-nim-endpoint>/v1", "name": "nvidia/llama-3.3-nemotron-super-49b-v1"},
        # live benchmarks do not call external APIs, so no API key secrets
        # are needed; for exec benchmarks, pass a benchmark_params dict
        # that references your secret by name
        benchmark_params={},
    )
)

The job reports the accuracy of tool-call predictions:

  • Score name: tool-calling-accuracy

  • Value range: 0.0–1.0

Example results payload:

{
  "scores": [
    {
      "name": "tool-calling-accuracy",
      "score_type": "range",
      "count": 1,
      "nan_count": 0,
      "sum": 1.0,
      "mean": 1.0,
      "min": 1.0,
      "max": 1.0,
      "std_dev": 0.0,
      "variance": 0.0
    }
  ]
}
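The aggregate fields in the payload (count, mean, min, max, std_dev, variance) can be reproduced from per-sample scores. The sample values below are illustrative, and the use of population (rather than sample) variance is an assumption; a minimal sketch:

```python
import math

# Illustrative per-sample tool-calling-accuracy values (not from a real run)
samples = [1.0, 0.0, 1.0, 1.0]

count = len(samples)
mean = sum(samples) / count
variance = sum((s - mean) ** 2 for s in samples) / count  # population variance (assumption)
score = {
    "name": "tool-calling-accuracy",
    "score_type": "range",
    "count": count,
    "nan_count": 0,
    "sum": sum(samples),
    "mean": mean,
    "min": min(samples),
    "max": max(samples),
    "std_dev": math.sqrt(variance),
    "variance": variance,
}
print(score["mean"])  # 0.75
```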

Job Management#

After creating a job, go to Benchmark Job Management to monitor its progress and manage its lifecycle.
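If you prefer to wait on the job programmatically, the SDK's status call is not shown here, so the sketch below stubs it out with a placeholder callable; the polling-with-backoff pattern is what carries over:

```python
import time

def poll_until_done(get_status, timeout_s=600.0, interval_s=2.0, max_interval_s=30.0):
    """Poll a status callable until it reports a terminal state."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = get_status()
        if status in ("succeeded", "failed", "cancelled"):
            return status
        time.sleep(interval_s)
        interval_s = min(interval_s * 2, max_interval_s)  # exponential backoff
    raise TimeoutError("benchmark job did not finish in time")

# Stub standing in for a real job-status lookup (actual SDK call not shown above)
_states = iter(["pending", "running", "succeeded"])
result = poll_until_done(lambda: next(_states), interval_s=0.01)
print(result)  # succeeded
```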


See also