BFCL Evaluation Type#

BFCL (Berkeley Function Calling Leaderboard) is a benchmark for evaluating the tool-calling capabilities of language models. Use this evaluation type to benchmark tool-calling tasks against the official BFCL dataset or your own custom dataset.

Prerequisites#

  • Set up or select an existing model target

  • Review custom data format requirements for BFCL

Note

Performance Tuning: You can improve evaluation performance by setting config.params.parallelism to control the number of concurrent requests. If not set, the default value is 5.

Available Types: BFCL supports the benchmark versions bfclv3, bfclv3_ast, bfclv2, and bfclv2_ast.
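
For example, a minimal sketch that selects the bfclv3_ast benchmark version and raises the request concurrency above the default of 5 (the value 10 is illustrative):

{
  "type": "bfclv3_ast",
  "params": {
    "parallelism": 10
  }
}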


Supported Tasks#

| Task Type | Description | Notes |
|-----------|-------------|-------|
| simple | Single function call per test case | Most common academic/leaderboard category |
| parallel | Multiple function calls in parallel per test case | |
| multiple | Multiple function calls in sequence per test case | |
| all | Runs all available BFCL test categories | Requires API keys for executable categories |
| rest | REST API-based tool-calling tasks | Requires API keys |
| exec_* | Executable test categories (e.g., exec_multiple) | Requires API keys; use for categories that call external APIs |
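
Assuming the tasks object accepts multiple entries, a single configuration can run several categories side by side. A minimal sketch (task names are arbitrary):

{
  "type": "bfclv3",
  "tasks": {
    "task1": { "type": "simple" },
    "task2": { "type": "parallel" },
    "task3": { "type": "multiple" }
  }
}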

API Keys#

Four API keys are required for the BFCL executable test categories that call external APIs.

  • RapidAPI Key: https://rapidapi.com/hub

    • All of the Rapid APIs used here have a free tier. You must subscribe to each API provider to set up the executable test environment, but this is free of charge.

    • Yahoo Finance: https://rapidapi.com/sparior/api/yahoo-finance15

    • Real Time Amazon Data: https://rapidapi.com/letscrape-6bRBa3QguO5/api/real-time-amazon-data

    • Urban Dictionary: https://rapidapi.com/community/api/urban-dictionary

    • COVID-19: https://rapidapi.com/api-sports/api/covid-193

    • Timezone by Location: https://rapidapi.com/BertoldVdb/api/timezone-by-location

  • Exchange Rate API: https://www.exchangerate-api.com

  • OMDB API: http://www.omdbapi.com/apikey.aspx

  • Geocode API: https://geocode.maps.co/

Configure the API keys in the params.extra object of your evaluation configuration:

{
  "type": "bfclv3",
  "params": {
    "extra": {
      "rapid_api_key": "<RAPID_API_KEY>",
      "exchangerate_api_key": "<EXCHANGERATE_API_KEY>",
      "omdb_api_key": "<OMDB_API_KEY>",
      "geocode_api_key": "<GEOCODE_API_KEY>"
    }
  }
}

Options#

Academic#

You can evaluate different aspects of language model tool-calling by selecting a BFCL test category. To do this, set the type field of your task to one of the supported categories. The configuration structure, data format, and result format are otherwise identical for all categories, whether you use the official (academic) dataset or a custom one.

Example configuration:

{
    "type": "bfclv3",
    "name": "my-bfcl-academic-config-1",
    "namespace": "my-organization",
    "params": {
        "limit_samples": 5,
        "parallelism": 5
    },
    "tasks": {
        "task1": {
            "type": "simple"
        }
    }
}

Example data record:

{
  "messages": [
    {"role": "user", "content": "What is the weather in Paris?"},
    {"role": "assistant", "content": "Calling weather API...", "tool_calls": [
      {"name": "get_weather", "args": {"location": "Paris"}}
    ]}
  ],
  "tool_calls": [
    {"name": "get_weather", "args": {"location": "Paris"}}
  ]
}

Example results:

{
  "tasks": {
    "task1": {
      "metrics": {
        "Rank": {
          "scores": {
            "Rank": {
              "value": 1.0
            }
          }
        },
        "Overall Acc": {
          "scores": {
            "Overall Acc": {
              "value": 1.23
            }
          }
        },
        "Latency Mean (s)": {
          "scores": {
            "Latency Mean (s)": {
              "value": 0.3
            }
          }
        },
        "Latency Standard Deviation (s)": {
          "scores": {
            "Latency Standard Deviation (s)": {
              "value": 0.19
            }
          }
        },
        "Latency 95th Percentile (s)": {
          "scores": {
            "Latency 95th Percentile (s)": {
              "value": 0.42
            }
          }
        },
        "Python Simple AST": {
          "scores": {
            "Python Simple AST": {
              "value": 100.0
            }
          }
        }
      }
    }
  }
}

Custom Dataset#

Example configuration, pointing the task's dataset.files_url at a dataset hosted on Hugging Face:

{
    "type": "bfclv3",
    "name": "my-bfcl-custom-config-1",
    "namespace": "my-organization",
    "params": {
        "limit_samples": 5
    },
    "tasks": {
        "task1": {
            "type": "simple",
            "dataset": {
                "format": "native",
                "files_url": "hf://datasets/<my-namespace>/<my-custom-bfcl-dataset>"
            }
        }
    }
}

Example data record:

{
  "messages": [
    {"role": "user", "content": "Book a table for 2 at 7pm."},
    {"role": "assistant", "content": "Booking a table...", "tool_calls": [
      {"name": "book_table", "args": {"people": 2, "time": "7pm"}}
    ]}
  ],
  "tool_calls": [
    {"name": "book_table", "args": {"people": 2, "time": "7pm"}}
  ]
}

Example results:

{
  "tasks": {
    "task1": {
      "metrics": {
        "tool-calling-accuracy": {
          "scores": {
            "tool-calling-accuracy": {
              "value": 1.0
            }
          }
        }
      }
    }
  }
}

Metrics#

BFCL Supported Metrics by Task Type#

| Metric Name | Description | Value Range | Usable Task Types |
|-------------|-------------|-------------|-------------------|
| Rank | Ranking of the correct tool call among model outputs (lower is better). | 1 (best) and up | simple, parallel, multiple, all, exec_* |
| Overall Acc | Fraction of correct tool calls (overall accuracy). | 0.0 to 1.0 | simple, parallel, multiple, all, exec_* |
| Latency Mean (s) | Mean latency per test case in seconds. | ≥ 0.0 | simple, parallel, multiple, all, exec_* |
| Latency Standard Deviation (s) | Standard deviation of latency in seconds. | ≥ 0.0 | simple, parallel, multiple, all, exec_* |
| Latency 95th Percentile (s) | 95th percentile latency in seconds. | ≥ 0.0 | simple, parallel, multiple, all, exec_* |
| Python Simple AST | Score for Python Abstract Syntax Tree (AST) match. | 0.0 to 100.0 | simple, parallel, multiple, all, exec_* |
| tool-calling-accuracy | Fraction of correct tool calls (custom dataset accuracy). | 0.0 to 1.0 | simple, parallel, multiple, all, rest, exec_* |

Each metric is reported under the metrics object for the task, with its score(s) and value(s) provided in the scores object. The set of metrics may vary depending on the task type and dataset.

Custom Dataset Format#

BFCL native datasets are organized as one file per test category, with ground truth answers in a separate directory:

data_dir/
├── BFCL_v3_<test_category>.json
└── possible_answer/
    └── BFCL_v3_<test_category>.json
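
For reference, here is a sketch of matching entries in the two files, based on the upstream BFCL v3 data layout (the field contents are illustrative and vary by category; a test case and its ground truth are paired by id).

An entry in BFCL_v3_simple.json:

{"id": "simple_0", "question": [[{"role": "user", "content": "What is the weather in Paris?"}]], "function": [{"name": "get_weather", "description": "Get the current weather for a location.", "parameters": {"type": "dict", "properties": {"location": {"type": "string", "description": "City name."}}, "required": ["location"]}}]}

The matching entry in possible_answer/BFCL_v3_simple.json:

{"id": "simple_0", "ground_truth": [{"get_weather": {"location": ["Paris"]}}]}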