BFCL Evaluation Type#
BFCL (Berkeley Function Calling Leaderboard) is a benchmark for evaluating the tool-calling capabilities of language models. Use this evaluation type to benchmark tool-calling tasks with the official BFCL dataset or your own custom dataset.
Prerequisites#
- Set up or select an existing compatible evaluation target.
- Review the custom data format requirements for BFCL.
Supported Tasks#
| Task Type | Description | Notes |
|---|---|---|
| simple | Single function call per test case | Most common academic/leaderboard category |
| parallel | Multiple function calls in parallel per test case | |
| multiple | Multiple function calls in sequence per test case | |
| all | Runs all available BFCL test categories | Requires API keys for executable categories |
| rest | REST API-based tool-calling tasks | Requires API keys |
| exec_* | Executable test categories (e.g., exec_multiple) | Requires API keys; use for categories that call external APIs |
Options#
Academic#
You can evaluate different aspects of language model tool-calling by selecting a BFCL test category. To do this, set the type field for your task to one of the supported categories. The configuration structure, data format, and result format are otherwise identical for all categories, whether you use the official (academic) dataset or a custom one.
Example configuration:
{
  "type": "bfcl",
  "name": "my-bfcl-academic-config-1",
  "namespace": "my-organization",
  "params": {
    "limit_samples": 5
  },
  "tasks": {
    "task1": {
      "type": "simple"
    }
  }
}
Example data record:
{
  "messages": [
    {"role": "user", "content": "What is the weather in Paris?"},
    {"role": "assistant", "content": "Calling weather API...", "tool_calls": [
      {"name": "get_weather", "args": {"location": "Paris"}}
    ]}
  ],
  "tool_calls": [
    {"name": "get_weather", "args": {"location": "Paris"}}
  ]
}
Example results:
{
  "tasks": {
    "task1": {
      "metrics": {
        "Rank": {
          "scores": {
            "Rank": {
              "value": 1.0
            }
          }
        },
        "Overall Acc": {
          "scores": {
            "Overall Acc": {
              "value": 1.23
            }
          }
        },
        "Latency Mean (s)": {
          "scores": {
            "Latency Mean (s)": {
              "value": 0.3
            }
          }
        },
        "Latency Standard Deviation (s)": {
          "scores": {
            "Latency Standard Deviation (s)": {
              "value": 0.19
            }
          }
        },
        "Latency 95th Percentile (s)": {
          "scores": {
            "Latency 95th Percentile (s)": {
              "value": 0.42
            }
          }
        },
        "Python Simple AST": {
          "scores": {
            "Python Simple AST": {
              "value": 100.0
            }
          }
        }
      }
    }
  }
}
Custom Dataset#
To evaluate against your own data, add a dataset entry to the task that references your dataset files. Example configuration:
{
  "type": "bfcl",
  "name": "my-bfcl-custom-config-1",
  "namespace": "my-organization",
  "params": {
    "limit_samples": 5
  },
  "tasks": {
    "task1": {
      "type": "simple",
      "dataset": {
        "format": "native",
        "files_url": "hf://datasets/<my-namespace>/<my-custom-bfcl-dataset>"
      }
    }
  }
}
Example data record:
{
  "messages": [
    {"role": "user", "content": "Book a table for 2 at 7pm."},
    {"role": "assistant", "content": "Booking a table...", "tool_calls": [
      {"name": "book_table", "args": {"people": 2, "time": "7pm"}}
    ]}
  ],
  "tool_calls": [
    {"name": "book_table", "args": {"people": 2, "time": "7pm"}}
  ]
}
Example results:
{
  "tasks": {
    "task1": {
      "metrics": {
        "tool-calling-accuracy": {
          "scores": {
            "tool-calling-accuracy": {
              "value": 1.0
            }
          }
        }
      }
    }
  }
}
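If you generate custom records programmatically, a quick sanity check before upload can catch malformed entries. The sketch below is illustrative only (the check_record helper and line-delimited input are assumptions, not part of the evaluation tooling); it verifies that each record carries the messages and tool_calls fields used in the example data record above.

```python
import json

REQUIRED_KEYS = {"messages", "tool_calls"}

def check_record(line: str) -> dict:
    """Parse one serialized record and verify the fields used in the example above."""
    record = json.loads(line)
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        raise ValueError(f"record is missing required keys: {sorted(missing)}")
    for call in record["tool_calls"]:
        if "name" not in call or "args" not in call:
            raise ValueError(f"malformed tool call: {call}")
    return record

# Example: validate the record from the custom dataset example above.
check_record(json.dumps({
    "messages": [
        {"role": "user", "content": "Book a table for 2 at 7pm."},
        {"role": "assistant", "content": "Booking a table...", "tool_calls": [
            {"name": "book_table", "args": {"people": 2, "time": "7pm"}}
        ]}
    ],
    "tool_calls": [
        {"name": "book_table", "args": {"people": 2, "time": "7pm"}}
    ]
}))
```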
Metrics#
| Metric Name | Description | Value Range | Usable Task Types |
|---|---|---|---|
| Rank | Ranking of the correct tool call among model outputs (lower is better). | 1 (best) and up | simple, parallel, multiple, all, exec_* |
| Overall Acc | Fraction of correct tool calls (overall accuracy). | 0.0 to 1.0 | simple, parallel, multiple, all, exec_* |
| Latency Mean (s) | Mean latency per test case in seconds. | ≥ 0.0 | simple, parallel, multiple, all, exec_* |
| Latency Standard Deviation (s) | Standard deviation of latency in seconds. | ≥ 0.0 | simple, parallel, multiple, all, exec_* |
| Latency 95th Percentile (s) | 95th percentile latency in seconds. | ≥ 0.0 | simple, parallel, multiple, all, exec_* |
| Python Simple AST | Score for Python Abstract Syntax Tree (AST) match. | 0.0 to 100.0 | simple, parallel, multiple, all, exec_* |
| tool-calling-accuracy | Fraction of correct tool calls (custom dataset accuracy). | 0.0 to 1.0 | simple, parallel, multiple, all, rest, exec_* |
Each metric is reported under the metrics object for the task, with its score(s) and value(s) provided in the scores object. The set of metrics may vary depending on the task type and dataset.
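For instance, a short script like the following (a sketch; the results file path is an assumption) walks the tasks, metrics, and scores levels shown in the result examples above and prints each metric value:

```python
import json

# Path to a downloaded results file; adjust to wherever your results are stored.
with open("results.json") as f:
    results = json.load(f)

# Walk the tasks -> metrics -> scores hierarchy used in the result examples.
for task_name, task in results["tasks"].items():
    for metric_name, metric in task["metrics"].items():
        for score_name, score in metric["scores"].items():
            print(f"{task_name} | {metric_name} | {score_name}: {score['value']}")
```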
Custom Dataset Format#
BFCL native datasets are organized as one file per test category, with ground truth answers in a separate directory:
data_dir/
├── BFCL_v3_<test_category>.json
└── possible_answer/
└── BFCL_v3_<test_category>.json
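As a starting point, the sketch below builds that layout for the simple test category. The record and ground-truth schemas, and the one-JSON-object-per-line file layout, are assumptions based on the examples on this page; adjust them to match your test category and data.

```python
import json
from pathlib import Path

# Create the directory layout shown above.
data_dir = Path("data_dir")
(data_dir / "possible_answer").mkdir(parents=True, exist_ok=True)

# One test record, mirroring the custom dataset example above (schema assumed).
record = {
    "messages": [
        {"role": "user", "content": "Book a table for 2 at 7pm."}
    ],
    "tool_calls": [
        {"name": "book_table", "args": {"people": 2, "time": "7pm"}}
    ]
}

# Ground-truth answer stored under possible_answer/ with the same file name.
answer = {
    "tool_calls": [
        {"name": "book_table", "args": {"people": 2, "time": "7pm"}}
    ]
}

# One file per test category; here, the "simple" category.
with open(data_dir / "BFCL_v3_simple.json", "w") as f:
    f.write(json.dumps(record) + "\n")
with open(data_dir / "possible_answer" / "BFCL_v3_simple.json", "w") as f:
    f.write(json.dumps(answer) + "\n")
```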