BFCL Evaluations#
BFCL is a benchmark for evaluating language model tool-calling capabilities. Use this evaluation type to benchmark tool-calling tasks using the Berkeley Function Calling Leaderboard or your own dataset.
Prerequisites#
Set up or select an existing model target. You can improve evaluation performance by setting config.params.parallelism to control the number of concurrent requests.

BFCL supports multiple benchmark versions: bfclv3, bfclv3_ast, bfclv2, and bfclv2_ast.
Custom Datasets
1. Upload your dataset to NeMo Data Store using the Hugging Face CLI or SDK (a combined sketch of this step and the next follows the tip below).
2. Register your dataset in NeMo Entity Store using the Dataset APIs.
3. Format your data according to the BFCL format requirements (native or OpenAI format).
Tip
For a complete dataset creation walkthrough, see the dataset management tutorials or follow the end-to-end evaluation example.
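The following is a minimal sketch of steps 1 and 2. It assumes a Hugging Face-compatible Data Store endpoint and a Dataset API path as shown; the hosts, tokens, repository names, and payload fields are placeholders, so check your deployment's API reference for the exact URLs and schemas.

```python
import requests
from huggingface_hub import HfApi

# All hosts, tokens, and names below are placeholders; substitute your own.
DATASTORE_ENDPOINT = "http://<data-store-host>/v1/hf"          # assumed HF-compatible endpoint
ENTITY_STORE_URL = "http://<entity-store-host>/v1/datasets"    # assumed Dataset API path
repo_id = "<my-namespace>/<my-custom-bfcl-dataset>"

# Step 1: upload the dataset files to NeMo Data Store via the Hugging Face SDK.
api = HfApi(endpoint=DATASTORE_ENDPOINT, token="<token>")
api.create_repo(repo_id=repo_id, repo_type="dataset", exist_ok=True)
api.upload_folder(repo_id=repo_id, repo_type="dataset", folder_path="./data_dir")

# Step 2: register the dataset in NeMo Entity Store so evaluations can reference it.
registration = {
    "name": "<my-custom-bfcl-dataset>",
    "namespace": "<my-namespace>",
    "files_url": f"hf://datasets/{repo_id}",  # same URL form used in the custom dataset config below
}
resp = requests.post(ENTITY_STORE_URL, json=registration, timeout=30)
resp.raise_for_status()
```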
API Keys
You need four API keys at minimum for BFCL executable test categories that call external APIs.
RapidAPI (a free-tier subscription to each listed API is required): Yahoo Finance, Real‑Time Amazon Data, Urban Dictionary, COVID‑19, Time Zone by Location
Direct APIs: ExchangeRate‑API, OMDb, Geocode
Example Configuration
{
  "type": "bfclv3",
  "params": {
    "extra": {
      "rapid_api_key": "<RAPID_API_KEY>",
      "exchangerate_api_key": "<EXCHANGERATE_API_KEY>",
      "omdb_api_key": "<OMDB_API_KEY>",
      "geocode_api_key": "<GEOCODE_API_KEY>"
    }
  }
}
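As a rough sketch of how a configuration like this might be registered, the snippet below POSTs the payload to an evaluation-config endpoint. The base URL, endpoint path, and the name/namespace fields (borrowed from the later examples on this page) are assumptions to adapt to your deployment.

```python
import requests

# Hypothetical evaluator host and endpoint path; adjust to your deployment.
EVALUATOR_CONFIG_URL = "http://<evaluator-host>/v1/evaluation/configs"

config = {
    "type": "bfclv3",
    "name": "my-bfcl-executable-config-1",   # illustrative name
    "namespace": "my-organization",          # illustrative namespace
    "params": {
        "extra": {
            "rapid_api_key": "<RAPID_API_KEY>",
            "exchangerate_api_key": "<EXCHANGERATE_API_KEY>",
            "omdb_api_key": "<OMDB_API_KEY>",
            "geocode_api_key": "<GEOCODE_API_KEY>",
        }
    },
}

response = requests.post(EVALUATOR_CONFIG_URL, json=config, timeout=30)
response.raise_for_status()
print(response.json())
```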
Supported Tasks#
Task Type | Description | Notes
---|---|---
simple | Single function call per test case | Most common academic/leaderboard category
parallel | Multiple function calls in parallel per test case |
multiple | Multiple function calls in sequence per test case |
all | Runs all available BFCL test categories | Requires API keys for executable categories
rest | REST API-based tool-calling tasks | Requires API keys
exec_* | Executable test categories (e.g., exec_multiple) | Requires API keys; use for categories that call external APIs
Academic#
You can evaluate different aspects of language model tool-calling by selecting a BFCL test category. To do this, set the type
field for your task to one of the supported categories. The configuration structure, data format, and result format are otherwise identical for all categories—whether you use the official (academic) dataset or a custom one.
Example configuration:

{
  "type": "bfclv3",
  "name": "my-bfcl-academic-config-1",
  "namespace": "my-organization",
  "params": {
    "limit_samples": 5,
    "parallelism": 5
  },
  "tasks": {
    "task1": {
      "type": "simple"
    }
  }
}
Example data record:

{
  "messages": [
    {"role": "user", "content": "What is the weather in Paris?"},
    {"role": "assistant", "content": "Calling weather API...", "tool_calls": [
      {"name": "get_weather", "args": {"location": "Paris"}}
    ]}
  ],
  "tool_calls": [
    {"name": "get_weather", "args": {"location": "Paris"}}
  ]
}
Example results:

{
  "tasks": {
    "task1": {
      "metrics": {
        "Rank": {
          "scores": {
            "Rank": {
              "value": 1.0
            }
          }
        },
        "Overall Acc": {
          "scores": {
            "Overall Acc": {
              "value": 1.23
            }
          }
        },
        "Latency Mean (s)": {
          "scores": {
            "Latency Mean (s)": {
              "value": 0.3
            }
          }
        },
        "Latency Standard Deviation (s)": {
          "scores": {
            "Latency Standard Deviation (s)": {
              "value": 0.19
            }
          }
        },
        "Latency 95th Percentile (s)": {
          "scores": {
            "Latency 95th Percentile (s)": {
              "value": 0.42
            }
          }
        },
        "Python Simple AST": {
          "scores": {
            "Python Simple AST": {
              "value": 100.0
            }
          }
        }
      }
    }
  }
}
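If you process results programmatically, a small helper like the one below (a sketch, assuming the payload has already been loaded into a dict with the shape shown above) can pull out individual metric values:

```python
def get_metric_value(results: dict, task: str, metric: str) -> float:
    """Return the value of a named metric for a task in a BFCL results payload."""
    return results["tasks"][task]["metrics"][metric]["scores"][metric]["value"]

# Example against the payload above:
#   get_metric_value(results, "task1", "Overall Acc")       -> 1.23
#   get_metric_value(results, "task1", "Latency Mean (s)")  -> 0.3
```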
Custom Dataset#
Example configuration:

{
  "type": "bfclv3",
  "name": "my-bfcl-custom-config-1",
  "namespace": "my-organization",
  "params": {
    "limit_samples": 5
  },
  "tasks": {
    "task1": {
      "type": "simple",
      "dataset": {
        "format": "native",
        "files_url": "hf://datasets/<my-namespace>/<my-custom-bfcl-dataset>"
      }
    }
  }
}
Example data record:

{
  "messages": [
    {"role": "user", "content": "Book a table for 2 at 7pm."},
    {"role": "assistant", "content": "Booking a table...", "tool_calls": [
      {"name": "book_table", "args": {"people": 2, "time": "7pm"}}
    ]}
  ],
  "tool_calls": [
    {"name": "book_table", "args": {"people": 2, "time": "7pm"}}
  ]
}
Example results:

{
  "tasks": {
    "task1": {
      "metrics": {
        "tool-calling-accuracy": {
          "scores": {
            "tool-calling-accuracy": {
              "value": 1.0
            }
          }
        }
      }
    }
  }
}
Metrics#
Metric Name | Description | Value Range | Usable Task Types
---|---|---|---
Rank | Ranking of the correct tool call among model outputs (lower is better). | 1 (best) and up | simple, parallel, multiple, all, exec_*
Overall Acc | Fraction of correct tool calls (overall accuracy). | 0.0 to 1.0 | simple, parallel, multiple, all, exec_*
Latency Mean (s) | Mean latency per test case in seconds. | ≥ 0.0 | simple, parallel, multiple, all, exec_*
Latency Standard Deviation (s) | Standard deviation of latency in seconds. | ≥ 0.0 | simple, parallel, multiple, all, exec_*
Latency 95th Percentile (s) | 95th percentile latency in seconds. | ≥ 0.0 | simple, parallel, multiple, all, exec_*
Python Simple AST | Score for Python Abstract Syntax Tree (AST) match. | 0.0 to 100.0 | simple, parallel, multiple, all, exec_*
tool-calling-accuracy | Fraction of correct tool calls (custom dataset accuracy). | 0.0 to 1.0 | simple, parallel, multiple, all, rest, exec_*
Each metric is reported under the metrics
object for the task, with its score(s) and value(s) provided in the scores
object. The set of metrics may vary depending on the task type and dataset.
Custom Dataset Format#
BFCL native datasets are organized as one file per test category, with ground truth answers in a separate directory:
data_dir/
├── BFCL_v3_<test_category>.json
└── possible_answer/
└── BFCL_v3_<test_category>.json
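As an illustration of this layout only, the sketch below writes one test-category file plus its ground-truth counterpart. The category name and record contents are hypothetical and simplified (they mirror the custom-dataset example above), so confirm the exact per-record schema against the BFCL format requirements before uploading.

```python
import json
from pathlib import Path

# Hypothetical category and simplified records, for illustration only.
category = "simple"
data_dir = Path("data_dir")
answer_dir = data_dir / "possible_answer"
answer_dir.mkdir(parents=True, exist_ok=True)

record = {
    "messages": [{"role": "user", "content": "Book a table for 2 at 7pm."}],
}
answer = {
    "tool_calls": [{"name": "book_table", "args": {"people": 2, "time": "7pm"}}],
}

# One file per test category, with ground truth under possible_answer/.
(data_dir / f"BFCL_v3_{category}.json").write_text(json.dumps(record) + "\n")
(answer_dir / f"BFCL_v3_{category}.json").write_text(json.dumps(answer) + "\n")
```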