nemo_automodel.components.eval.tool_call_parser

Generic tool-call parser for evaluating agent SFT outputs.

Different chat templates wrap tool calls in different syntax:

  • Qwen / Hermes / FunctionGemma / Gemma 3 / GLM-4: <tool_call>{"name": ..., "arguments": ...}</tool_call>
  • Llama 3.1+: <|python_tag|>{"name": ..., "parameters": ...}<|eom_id|>
  • Mistral: [TOOL_CALLS][{...}, {...}]
  • Harmony / GPT-OSS: <|channel|>commentary to=functions.NAME<|message|>{...}

This parser tries each known wrapper, then falls back to a generic JSON object scan. It is intentionally permissive: malformed JSON, missing wrappers, or unknown formats degrade gracefully and never raise.

The companion function compute_sample_metrics compares parser output against ground-truth tool calls and produces 0/1 (or fractional) indicators that average cleanly across a dataset.

Module Contents

Classes

| Name | Description |
| --- | --- |
| ParsedToolCall | One tool call extracted from generated text. |

Functions

| Name | Description |
| --- | --- |
| _coerce_args | Normalize an arguments field to a dict. |
| _coerce_gt_args | Normalize a ground-truth arguments field to a dict. |
| _extract_balanced | Return the substring from text[start] (which must be opener) through its matching closer. |
| _from_call_object | Build a ParsedToolCall from a {"name": ..., "arguments": ...} dict. |
| _iter_balanced_json_objects | Yield substrings that look like balanced top-level JSON objects. |
| _parse_generic_json | Last-resort fallback: scan for any JSON object with a name field. |
| _parse_harmony_style | - |
| _parse_llama_style | - |
| _parse_mistral_style | - |
| _parse_qwen_style | - |
| _score_one_pair | Score a single (pred, gt) tool-call pair. pred may be None. |
| compute_sample_metrics | Compute per-sample tool-call metrics across all GT positions. |
| parse_tool_calls | Extract every tool call from a generated model response. |

Data

_HARMONY_ANCHOR_RE

_LLAMA_ANCHOR_RE

_MISTRAL_ANCHOR_RE

_QWEN_RE

logger

API

class nemo_automodel.components.eval.tool_call_parser.ParsedToolCall(
name: typing.Optional[str],
arguments: typing.Dict[str, typing.Any],
arguments_valid_json: bool,
raw: str
)
Dataclass

One tool call extracted from generated text.

arguments
Dict[str, Any]
arguments_valid_json
bool
name
Optional[str]
raw
str
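
To make the field layout concrete, here is a minimal stand-in that mirrors the four documented fields. It is an illustrative redefinition, not the library's class; the sample values (`get_weather`, `city`) are made up for the example.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class ParsedToolCall:
    """Illustrative stand-in with the same four fields as the documented class."""

    name: Optional[str]              # tool name, or None when it could not be recovered
    arguments: Dict[str, Any]        # arguments normalized to a dict
    arguments_valid_json: bool       # whether the source arguments were well-formed JSON
    raw: str                         # the text span the call was extracted from


# Hypothetical example values, chosen only to show how the fields relate.
call = ParsedToolCall(
    name="get_weather",
    arguments={"city": "Paris"},
    arguments_valid_json=True,
    raw='{"name": "get_weather", "arguments": {"city": "Paris"}}',
)
```
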
nemo_automodel.components.eval.tool_call_parser._coerce_args(
args_value: typing.Any
) -> typing.Tuple[typing.Dict[str, typing.Any], bool]

Normalize an arguments field to a dict.

Accepts a dict (passthrough) or a JSON-encoded string. Returns the parsed dict alongside a flag indicating whether the source was a well-formed JSON object.
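
The normalization described above can be sketched as follows. The function name `coerce_args_sketch` is hypothetical, and the handling of inputs that are neither a dict nor a JSON object string (returning an empty dict with the flag set to False) is an assumption, since the docstring does not spell it out.

```python
import json
from typing import Any, Dict, Tuple


def coerce_args_sketch(args_value: Any) -> Tuple[Dict[str, Any], bool]:
    """Return (arguments dict, flag for a well-formed JSON object source)."""
    if isinstance(args_value, dict):
        # Dicts pass through unchanged.
        return args_value, True
    if isinstance(args_value, str):
        try:
            parsed = json.loads(args_value)
        except json.JSONDecodeError:
            return {}, False
        if isinstance(parsed, dict):
            return parsed, True
    # Assumption: any other input (lists, numbers, None, JSON arrays)
    # collapses to an empty dict flagged as invalid.
    return {}, False
```

Returning the flag alongside the dict lets a caller keep scoring a sample even when the arguments were malformed, while still recording that they were.
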

nemo_automodel.components.eval.tool_call_parser._coerce_gt_args(
args_value: typing.Any
) -> typing.Dict[str, typing.Any]

Normalize a ground-truth arguments field to a dict.

nemo_automodel.components.eval.tool_call_parser._extract_balanced(
text: str,
start: int,
opener: str,
closer: str
) -> typing.Optional[str]

Return the substring from text[start] (which must be opener) through its matching closer, skipping over characters inside JSON strings.

Returns None if text[start] is not opener or the span is unbalanced.
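
One way to implement this contract is a depth counter that ignores delimiters inside double-quoted strings. The sketch below is written from the docstring alone; `extract_balanced_sketch` is a hypothetical name, and the escape handling is an assumption about what "JSON strings" implies.

```python
from typing import Optional


def extract_balanced_sketch(text: str, start: int, opener: str, closer: str) -> Optional[str]:
    """Return text[start:end] spanning a balanced opener/closer pair, or None."""
    if not 0 <= start < len(text) or text[start] != opener:
        return None
    depth = 0
    in_string = False
    escaped = False
    for i in range(start, len(text)):
        ch = text[i]
        if in_string:
            # Inside a string literal: only look for the closing quote.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch == opener:
            depth += 1
        elif ch == closer:
            depth -= 1
            if depth == 0:
                return text[start:i + 1]
    # Ran off the end of the text without closing the span.
    return None
```
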

nemo_automodel.components.eval.tool_call_parser._from_call_object(
obj: typing.Dict[str, typing.Any],
raw: str
) -> typing.Optional[nemo_automodel.components.eval.tool_call_parser.ParsedToolCall]

Build a ParsedToolCall from a {"name": ..., "arguments": ...} dict.

Llama 3.1 emits parameters instead of arguments; both are accepted. Returns None when name is missing or non-string.
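
A rough sketch of that behavior, using a local stand-in dataclass so the block runs on its own. The names are hypothetical, and two details are assumptions: "arguments" is preferred when both keys are present, and a call with neither key gets an empty dict that counts as valid.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class ParsedToolCall:
    """Local stand-in for the documented dataclass."""

    name: Optional[str]
    arguments: Dict[str, Any]
    arguments_valid_json: bool
    raw: str


def from_call_object_sketch(obj: Dict[str, Any], raw: str) -> Optional[ParsedToolCall]:
    name = obj.get("name")
    if not isinstance(name, str):
        # Missing or non-string name: not a usable tool call.
        return None
    # Assumption: prefer "arguments", fall back to the Llama-style "parameters".
    args_value = obj["arguments"] if "arguments" in obj else obj.get("parameters", {})
    valid = True
    if isinstance(args_value, str):
        try:
            args_value = json.loads(args_value)
        except json.JSONDecodeError:
            args_value, valid = {}, False
    if not isinstance(args_value, dict):
        args_value, valid = {}, False
    return ParsedToolCall(name, args_value, valid, raw)
```
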

nemo_automodel.components.eval.tool_call_parser._iter_balanced_json_objects(
text: str
) -> typing.Iterator[str]

Yield substrings that look like balanced top-level JSON objects.

Skips characters inside JSON string literals (so braces inside strings don’t unbalance the count). Designed for fallback extraction when no known wrapper matches.
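
The scan can be sketched as a single pass with a brace-depth counter. `iter_balanced_json_objects_sketch` is a hypothetical name; tracking string state only while inside an object (so that quotation marks in surrounding prose cannot hide a later object) is a design assumption of this sketch, not something the docstring states.

```python
from typing import Iterator


def iter_balanced_json_objects_sketch(text: str) -> Iterator[str]:
    """Yield each top-level {...} span whose braces balance."""
    depth = 0
    start = -1
    in_string = False
    escaped = False
    for i, ch in enumerate(text):
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"' and depth > 0:
            # Only treat quotes as string delimiters inside an object.
            in_string = True
        elif ch == "{":
            if depth == 0:
                start = i
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                yield text[start:i + 1]
```

The yielded substrings are only candidates: they still have to be passed through a JSON parser, because balanced braces do not guarantee valid JSON.
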

nemo_automodel.components.eval.tool_call_parser._parse_generic_json(
text: str
) -> typing.List[nemo_automodel.components.eval.tool_call_parser.ParsedToolCall]

Last-resort fallback: scan for any JSON object with a name field.
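
For comparison, the same idea can be expressed with the standard library's `json.JSONDecoder.raw_decode`, which decodes one JSON value starting at a given index. This is a swapped-in technique shown for illustration, not the module's implementation (which, per the listing above, has its own balanced-object iterator); `find_named_objects_sketch` is a hypothetical name.

```python
import json
from typing import Any, Dict, List


def find_named_objects_sketch(text: str) -> List[Dict[str, Any]]:
    """Collect decodable JSON objects that carry a string "name" field."""
    decoder = json.JSONDecoder()
    found: List[Dict[str, Any]] = []
    pos = text.find("{")
    while pos != -1:
        try:
            obj, end = decoder.raw_decode(text, pos)
        except json.JSONDecodeError:
            # Not valid JSON at this brace; try the next one.
            pos = text.find("{", pos + 1)
            continue
        if isinstance(obj, dict) and isinstance(obj.get("name"), str):
            found.append(obj)
        # Resume after the decoded object so nested objects are not revisited.
        pos = text.find("{", end)
    return found
```
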

nemo_automodel.components.eval.tool_call_parser._parse_harmony_style(
text: str
) -> typing.List[nemo_automodel.components.eval.tool_call_parser.ParsedToolCall]

nemo_automodel.components.eval.tool_call_parser._parse_llama_style(
text: str
) -> typing.List[nemo_automodel.components.eval.tool_call_parser.ParsedToolCall]

nemo_automodel.components.eval.tool_call_parser._parse_mistral_style(
text: str
) -> typing.List[nemo_automodel.components.eval.tool_call_parser.ParsedToolCall]

nemo_automodel.components.eval.tool_call_parser._parse_qwen_style(
text: str
) -> typing.List[nemo_automodel.components.eval.tool_call_parser.ParsedToolCall]

nemo_automodel.components.eval.tool_call_parser._score_one_pair(
pred: typing.Optional[nemo_automodel.components.eval.tool_call_parser.ParsedToolCall],
gt: typing.Dict[str, typing.Any]
) -> typing.Dict[str, float]

Score a single (pred, gt) tool-call pair. pred may be None.
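
A possible shape for the per-pair score, using the six metric keys documented for compute_sample_metrics. This is a sketch under several assumptions: predictions are plain dicts here rather than ParsedToolCall instances, the per-pair result uses the same keys as the per-sample result, and an empty key set counts as perfect recall or precision. The name `score_pair_sketch` is hypothetical.

```python
from typing import Any, Dict, Optional

METRIC_KEYS = (
    "has_call", "name_correct", "args_json_valid",
    "args_field_recall", "args_field_precision", "args_exact_match",
)


def score_pair_sketch(pred: Optional[Dict[str, Any]], gt: Dict[str, Any]) -> Dict[str, float]:
    if pred is None:
        # A missing prediction scores zero on every metric.
        return {key: 0.0 for key in METRIC_KEYS}
    pred_args = pred.get("arguments") or {}
    gt_args = gt.get("arguments") or {}
    shared = set(pred_args) & set(gt_args)
    # Assumption: with no keys to recall or predict, the ratio defaults to 1.0.
    recall = len(shared) / len(gt_args) if gt_args else 1.0
    precision = len(shared) / len(pred_args) if pred_args else 1.0
    return {
        "has_call": 1.0,
        "name_correct": float(pred.get("name") == gt.get("name")),
        "args_json_valid": float(bool(pred.get("arguments_valid_json"))),
        "args_field_recall": recall,
        "args_field_precision": precision,
        "args_exact_match": float(pred_args == gt_args),
    }
```
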

nemo_automodel.components.eval.tool_call_parser.compute_sample_metrics(
pred_calls: typing.List[nemo_automodel.components.eval.tool_call_parser.ParsedToolCall],
gt_calls: typing.List[typing.Dict[str, typing.Any]]
) -> typing.Dict[str, float]

Compute per-sample tool-call metrics across all GT positions.

Predicted calls are aligned positionally against the ground-truth list: pred_calls[i] is scored against gt_calls[i]. Missing predictions (i >= len(pred_calls)) contribute zeros across every metric for that position, so a model that emits only one of two parallel tool calls is correctly penalized on the missing call.

Extra predictions beyond len(gt_calls) are ignored. All values are in [0.0, 1.0] so callers can mean() across a dataset.

Returned keys:

  • has_call: prediction exists at this position.
  • name_correct: predicted call name equals GT name.
  • args_json_valid: prediction had valid JSON arguments.
  • args_field_recall: fraction of GT argument keys present in pred.
  • args_field_precision: fraction of pred argument keys present in GT.
  • args_exact_match: pred arguments dict equals GT arguments dict.

Parameters:

pred_calls
List[ParsedToolCall]

output of parse_tool_calls.

gt_calls
List[Dict[str, Any]]

ground-truth list of {"name": str, "arguments": dict|str}.
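
The positional alignment can be illustrated with a small sketch that takes the per-pair scorer as a parameter. Averaging the per-position scores and returning an empty dict for an empty ground-truth list are assumptions of this sketch; `align_and_average_sketch` and `name_only_scorer` are hypothetical helpers, and the scorer covers only two of the six keys to keep the example short.

```python
from typing import Any, Callable, Dict, List, Optional

Scorer = Callable[[Optional[Dict[str, Any]], Dict[str, Any]], Dict[str, float]]


def align_and_average_sketch(
    pred_calls: List[Dict[str, Any]],
    gt_calls: List[Dict[str, Any]],
    score_pair: Scorer,
) -> Dict[str, float]:
    per_position = []
    for i, gt in enumerate(gt_calls):
        # Predictions beyond len(gt_calls) are never visited; missing ones become None.
        pred = pred_calls[i] if i < len(pred_calls) else None
        per_position.append(score_pair(pred, gt))
    if not per_position:
        return {}  # assumption: nothing to score without ground truth
    keys = per_position[0].keys()
    return {k: sum(s[k] for s in per_position) / len(per_position) for k in keys}


def name_only_scorer(pred: Optional[Dict[str, Any]], gt: Dict[str, Any]) -> Dict[str, float]:
    """Reduced scorer used only for this demonstration."""
    if pred is None:
        return {"has_call": 0.0, "name_correct": 0.0}
    return {"has_call": 1.0, "name_correct": float(pred["name"] == gt["name"])}


# One prediction against two ground-truth calls: the second position scores zero.
half = align_and_average_sketch([{"name": "a"}], [{"name": "a"}, {"name": "b"}], name_only_scorer)
```

Under these assumptions, `half` holds 0.5 for both keys, which is the penalty for the missing second call described above.
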

nemo_automodel.components.eval.tool_call_parser.parse_tool_calls(
text: str
) -> typing.List[nemo_automodel.components.eval.tool_call_parser.ParsedToolCall]

Extract every tool call from a generated model response.

Wrappers are tried in order of specificity; the first wrapper that yields any match wins. If no wrapper matches, a generic JSON-object scan is used. Returns an empty list when no plausible tool call is present.

Parameters:

text
str

raw decoded text from model.generate().

Returns: List[ParsedToolCall]

Parsed tool calls in document order. Possibly empty.
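
The "first wrapper that yields any match wins" rule amounts to a short dispatch loop. The sketch below uses toy parsers that return labels instead of ParsedToolCall objects; all names are hypothetical, and the actual order in which the library tries its wrappers is not stated here, so the ordering shown is only an example.

```python
from typing import Callable, List, Sequence

Parser = Callable[[str], List[str]]


def first_match_sketch(text: str, parsers: Sequence[Parser], fallback: Parser) -> List[str]:
    """Return the result of the first parser that finds anything, else the fallback's."""
    for parser in parsers:
        calls = parser(text)
        if calls:
            return calls
    return fallback(text)


# Toy stand-ins: each one only checks for its wrapper marker.
def toy_tagged(text: str) -> List[str]:
    return ["tagged"] if "<tool_call>" in text else []


def toy_bracketed(text: str) -> List[str]:
    return ["bracketed"] if "[TOOL_CALLS]" in text else []


def toy_fallback(text: str) -> List[str]:
    return ["generic"] if "{" in text else []


wrappers = [toy_tagged, toy_bracketed]
```

Because the loop returns on the first non-empty result, text that mixes two wrapper styles is parsed with only the earlier one in the list.
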

nemo_automodel.components.eval.tool_call_parser._HARMONY_ANCHOR_RE = re.compile('<\\|channel\\|>\\s*commentary\\s+to=functions\\.(?P<name>[\\w\\-]+)[...
nemo_automodel.components.eval.tool_call_parser._LLAMA_ANCHOR_RE = re.compile('<\\|python_tag\\|>\\s*')
nemo_automodel.components.eval.tool_call_parser._MISTRAL_ANCHOR_RE = re.compile('\\[TOOL_CALLS\\]\\s*')
nemo_automodel.components.eval.tool_call_parser._QWEN_RE = re.compile('<tool_call>\\s*(.+?)\\s*</tool_call>', re.DOTALL)
nemo_automodel.components.eval.tool_call_parser.logger = logging.getLogger(__name__)
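
The three fully listed patterns can be tried directly with the standard `re` module. The patterns below are copied from the listing above; the sample texts and the way the anchor matches are used (taking `.end()` as the position where the JSON payload starts) are assumptions for illustration. The Harmony pattern is omitted because its listing is truncated.

```python
import json
import re

QWEN_RE = re.compile(r"<tool_call>\s*(.+?)\s*</tool_call>", re.DOTALL)
LLAMA_ANCHOR_RE = re.compile(r"<\|python_tag\|>\s*")
MISTRAL_ANCHOR_RE = re.compile(r"\[TOOL_CALLS\]\s*")

# Hypothetical model output with two Qwen-style calls.
qwen_text = (
    'Sure.\n<tool_call>\n{"name": "get_weather", "arguments": {"city": "Paris"}}\n</tool_call>\n'
    '<tool_call>{"name": "get_time", "arguments": {"tz": "UTC"}}</tool_call>'
)
# The single capture group holds the payload between the tags.
qwen_payloads = [json.loads(body) for body in QWEN_RE.findall(qwen_text)]
qwen_names = [payload["name"] for payload in qwen_payloads]

# Anchor patterns only locate the wrapper; the payload begins where the match ends.
llama_text = '<|python_tag|>{"name": "get_weather", "parameters": {"city": "Paris"}}<|eom_id|>'
llama_start = LLAMA_ANCHOR_RE.search(llama_text).end()

mistral_text = '[TOOL_CALLS] [{"name": "a", "arguments": {}}]'
mistral_start = MISTRAL_ANCHOR_RE.search(mistral_text).end()
```

The lazy `(.+?)` with `re.DOTALL` lets a payload span several lines while still stopping at the nearest closing tag, so consecutive calls are captured separately.
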