nemo_automodel.components.eval.tool_call_parser#

Generic tool-call parser for evaluating agent SFT outputs.

Different chat templates wrap tool calls in different syntax:

  • Qwen / Hermes / FunctionGemma / Gemma 3 / GLM-4: <tool_call>{"name": ..., "arguments": ...}</tool_call>

  • Llama 3.1+: <|python_tag|>{"name": ..., "parameters": ...}<|eom_id|>

  • Mistral: [TOOL_CALLS][{...}, {...}]

  • Harmony / GPT-OSS: <|channel|>commentary to=functions.NAME<|message|>{...}

This parser tries each known wrapper, then falls back to a generic JSON object scan. It is intentionally permissive: malformed JSON, missing wrappers, or unknown formats degrade gracefully and never raise.
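To make the anchored formats above concrete, here is a minimal sketch of how a payload could be pulled out of Llama-style and Harmony-style output. The patterns `LLAMA_ANCHOR` and `HARMONY_ANCHOR` and the helper names are hypothetical; the module's own regexes (`_LLAMA_ANCHOR_RE`, `_HARMONY_ANCHOR_RE`) are not shown on this page and may well differ.

```python
import json
import re
from typing import Any, List, Optional, Tuple

# Hypothetical anchor patterns, written from the wrapper shapes listed above.
LLAMA_ANCHOR = re.compile(r"<\|python_tag\|>\s*")
HARMONY_ANCHOR = re.compile(
    r"<\|channel\|>commentary to=functions\.([\w.\-]+)<\|message\|>\s*"
)

_DECODER = json.JSONDecoder()


def decode_after(text: str, pos: int) -> Optional[Any]:
    """Decode one JSON value starting at pos; return None instead of raising."""
    try:
        value, _ = _DECODER.raw_decode(text, pos)
    except json.JSONDecodeError:
        return None
    return value


def llama_calls(text: str) -> List[Tuple[str, Any]]:
    """Collect (name, parameters) pairs that follow a <|python_tag|> anchor."""
    calls = []
    for m in LLAMA_ANCHOR.finditer(text):
        obj = decode_after(text, m.end())
        if isinstance(obj, dict) and isinstance(obj.get("name"), str):
            calls.append((obj["name"], obj.get("parameters", {})))
    return calls


def harmony_calls(text: str) -> List[Tuple[str, Any]]:
    """Collect (name, arguments) pairs; the name comes from the anchor itself."""
    calls = []
    for m in HARMONY_ANCHOR.finditer(text):
        args = decode_after(text, m.end())
        if isinstance(args, dict):
            calls.append((m.group(1), args))
    return calls
```

Truncated or malformed payloads simply produce no call in this sketch, which mirrors the "never raise" behaviour described above.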

The companion function compute_sample_metrics compares parser output against ground-truth tool calls and produces 0/1 (or fractional) indicators that average cleanly across a dataset.

Module Contents#

Classes#

ParsedToolCall

One tool call extracted from generated text.

Functions#

_extract_balanced

Return the substring from text[start] (which must be opener) through its matching closer, skipping over chars inside JSON strings.

_coerce_args

Normalize an arguments field to a dict.

_from_call_object

Build a ParsedToolCall from a {"name": ..., "arguments": ...} dict.

_iter_balanced_json_objects

Yield substrings that look like balanced top-level JSON objects.

_parse_qwen_style

_parse_llama_style

_parse_mistral_style

_parse_harmony_style

_parse_generic_json

Last-resort fallback: scan for any JSON object with a name field.

parse_tool_calls

Extract every tool call from a generated model response.

_coerce_gt_args

Normalize a ground-truth arguments field to a dict.

_score_one_pair

Score a single (pred, gt) tool-call pair. pred may be None.

compute_sample_metrics

Compute per-sample tool-call metrics across all GT positions.

Data#

API#

nemo_automodel.components.eval.tool_call_parser.logger#

‘getLogger(…)’

class nemo_automodel.components.eval.tool_call_parser.ParsedToolCall[source]#

One tool call extracted from generated text.

Attributes:

  • name: function name if extracted, otherwise None.

  • arguments: parsed arguments dict; empty when JSON is invalid.

  • arguments_valid_json: True if arguments parsed cleanly.

  • raw: the source substring this was parsed from.

name: Optional[str]#

arguments: Dict[str, Any]#

arguments_valid_json: bool#

raw: str#

nemo_automodel.components.eval.tool_call_parser._HARMONY_ANCHOR_RE#

‘compile(…)’

nemo_automodel.components.eval.tool_call_parser._QWEN_RE#

‘compile(…)’

nemo_automodel.components.eval.tool_call_parser._LLAMA_ANCHOR_RE#

‘compile(…)’

nemo_automodel.components.eval.tool_call_parser._MISTRAL_ANCHOR_RE#

‘compile(…)’

nemo_automodel.components.eval.tool_call_parser._extract_balanced(
text: str,
start: int,
opener: str,
closer: str,
) Optional[str][source]#

Return the substring from text[start] (which must be opener) through its matching closer, skipping over chars inside JSON strings.

Returns None if text[start] is not opener or the span is unbalanced.
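The balanced-span behaviour described above can be sketched as follows. This is an illustrative reimplementation under the stated contract, not the module's source; the function name `extract_balanced_sketch` is hypothetical.

```python
from typing import Optional


def extract_balanced_sketch(
    text: str, start: int, opener: str, closer: str
) -> Optional[str]:
    """Return the balanced opener..closer span beginning at start, or None."""
    if start >= len(text) or text[start] != opener:
        return None
    depth = 0
    in_string = False
    escaped = False
    for i in range(start, len(text)):
        ch = text[i]
        if in_string:
            # Inside a JSON string: only an unescaped quote ends it.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch == opener:
            depth += 1
        elif ch == closer:
            depth -= 1
            if depth == 0:
                return text[start : i + 1]
    return None  # ran out of text before the span closed
```

Tracking the escape state matters because a payload such as `{"a": "\"}"}` contains both an escaped quote and a brace inside a string.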

nemo_automodel.components.eval.tool_call_parser._coerce_args(
args_value: Any,
) Tuple[Dict[str, Any], bool][source]#

Normalize an arguments field to a dict.

Accepts a dict (passthrough) or a JSON-encoded string. Returns the parsed dict alongside a flag indicating whether the source was a well-formed JSON object.
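A minimal sketch of that normalization is shown below. The name `coerce_args_sketch` is hypothetical, and the handling of inputs that are neither a dict nor a string (returning an empty dict with a False flag) is an assumption, since the page does not specify it.

```python
import json
from typing import Any, Dict, Tuple


def coerce_args_sketch(args_value: Any) -> Tuple[Dict[str, Any], bool]:
    """Return (arguments dict, flag for 'source was a well-formed JSON object')."""
    if isinstance(args_value, dict):
        return args_value, True  # passthrough
    if isinstance(args_value, str):
        try:
            parsed = json.loads(args_value)
        except json.JSONDecodeError:
            return {}, False
        if isinstance(parsed, dict):
            return parsed, True
    # Assumption: valid JSON that is not an object (e.g. a list), or any
    # other input type, is treated as invalid arguments.
    return {}, False
```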

nemo_automodel.components.eval.tool_call_parser._from_call_object(
obj: Dict[str, Any],
raw: str,
) Optional[nemo_automodel.components.eval.tool_call_parser.ParsedToolCall][source]#

Build a ParsedToolCall from a {"name": ..., "arguments": ...} dict.

Llama 3.1 emits parameters instead of arguments; both are accepted. Returns None when name is missing or non-string.
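The following sketch illustrates the arguments/parameters fallback and the None return described above. `ParsedToolCallSketch` is a hypothetical stand-in dataclass carrying the four documented fields; whether the real class is a dataclass, and what happens when neither key is present, are assumptions made for the example.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class ParsedToolCallSketch:
    """Stand-in with the same four fields as ParsedToolCall."""

    name: Optional[str]
    arguments: Dict[str, Any] = field(default_factory=dict)
    arguments_valid_json: bool = False
    raw: str = ""


def from_call_object_sketch(
    obj: Dict[str, Any], raw: str
) -> Optional[ParsedToolCallSketch]:
    name = obj.get("name")
    if not isinstance(name, str):
        return None  # missing or non-string name
    # Prefer "arguments"; fall back to the Llama-style "parameters" key.
    # Assumption: a call with neither key is treated as having empty args.
    args_value = obj.get("arguments", obj.get("parameters", {}))
    if isinstance(args_value, str):
        try:
            args_value = json.loads(args_value)
        except json.JSONDecodeError:
            args_value = None
    valid = isinstance(args_value, dict)
    return ParsedToolCallSketch(
        name=name,
        arguments=args_value if valid else {},
        arguments_valid_json=valid,
        raw=raw,
    )
```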

nemo_automodel.components.eval.tool_call_parser._iter_balanced_json_objects(text: str) Iterator[str][source]#

Yield substrings that look like balanced top-level JSON objects.

Skips characters inside JSON string literals (so braces inside strings don’t unbalance the count). Designed for fallback extraction when no known wrapper matches.
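One way to implement such a scan is sketched below; the generator name is hypothetical. The choice to track string state only while inside an object (so that stray quotes in surrounding prose cannot hide a later brace) is an assumption of this sketch, not something the page states.

```python
from typing import Iterator


def iter_balanced_json_objects_sketch(text: str) -> Iterator[str]:
    """Yield each balanced top-level {...} span, in document order."""
    depth = 0
    start = -1
    in_string = False
    escaped = False
    for i, ch in enumerate(text):
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"' and depth > 0:
            in_string = True  # quotes only matter inside an object
        elif ch == "{":
            if depth == 0:
                start = i
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                yield text[start : i + 1]
```

The yielded substrings only look like JSON objects; a caller would still need to pass each one through `json.loads` and discard those that fail.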

nemo_automodel.components.eval.tool_call_parser._parse_qwen_style(
text: str,
) List[nemo_automodel.components.eval.tool_call_parser.ParsedToolCall][source]#

nemo_automodel.components.eval.tool_call_parser._parse_llama_style(
text: str,
) List[nemo_automodel.components.eval.tool_call_parser.ParsedToolCall][source]#

nemo_automodel.components.eval.tool_call_parser._parse_mistral_style(
text: str,
) List[nemo_automodel.components.eval.tool_call_parser.ParsedToolCall][source]#

nemo_automodel.components.eval.tool_call_parser._parse_harmony_style(
text: str,
) List[nemo_automodel.components.eval.tool_call_parser.ParsedToolCall][source]#

nemo_automodel.components.eval.tool_call_parser._parse_generic_json(
text: str,
) List[nemo_automodel.components.eval.tool_call_parser.ParsedToolCall][source]#

Last-resort fallback: scan for any JSON object with a name field.

nemo_automodel.components.eval.tool_call_parser.parse_tool_calls(
text: str,
) List[nemo_automodel.components.eval.tool_call_parser.ParsedToolCall][source]#

Extract every tool call from a generated model response.

Wrappers are tried in order of specificity; the first wrapper that yields any match wins. If no wrapper matches, a generic JSON-object scan is used. Returns an empty list when no plausible tool call is present.

Parameters:

text – raw decoded text from model.generate().

Returns:

Parsed tool calls in document order. Possibly empty.
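The "first wrapper that yields any match wins, otherwise fall back" control flow can be sketched as below. Everything here is illustrative: `parse_tagged` and `parse_bare_json` are simplified stand-ins for the module's private parsers, calls are represented as plain dicts rather than ParsedToolCall, and the real wrapper order is not shown on this page.

```python
import json
import re
from typing import Callable, Dict, List

Call = Dict[str, object]


def parse_tagged(text: str) -> List[Call]:
    """Stand-in wrapper parser for <tool_call>{...}</tool_call>."""
    calls = []
    for m in re.finditer(r"<tool_call>(.*?)</tool_call>", text, re.DOTALL):
        try:
            obj = json.loads(m.group(1))
        except json.JSONDecodeError:
            continue  # malformed payload: skip, never raise
        if isinstance(obj, dict) and isinstance(obj.get("name"), str):
            calls.append(obj)
    return calls


def parse_bare_json(text: str) -> List[Call]:
    """Stand-in fallback: any decodable JSON object with a string "name"."""
    decoder = json.JSONDecoder()
    calls: List[Call] = []
    i = text.find("{")
    while i != -1:
        try:
            obj, end = decoder.raw_decode(text, i)
        except json.JSONDecodeError:
            end = i + 1  # not valid JSON here; resume after this brace
        else:
            if isinstance(obj, dict) and isinstance(obj.get("name"), str):
                calls.append(obj)
        i = text.find("{", end)
    return calls


# Hypothetical registry; a fuller version would list one parser per wrapper.
WRAPPER_PARSERS: List[Callable[[str], List[Call]]] = [parse_tagged]


def parse_tool_calls_sketch(text: str) -> List[Call]:
    for parser in WRAPPER_PARSERS:
        calls = parser(text)
        if calls:
            return calls  # first wrapper with any match wins
    return parse_bare_json(text)
```

One consequence of this design is that bare JSON appearing alongside a recognized wrapper is ignored: once a wrapper matches, the generic scan is never consulted.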

nemo_automodel.components.eval.tool_call_parser._coerce_gt_args(
args_value: Any,
) Dict[str, Any][source]#

Normalize a ground-truth arguments field to a dict.

nemo_automodel.components.eval.tool_call_parser._score_one_pair(
pred: Optional[nemo_automodel.components.eval.tool_call_parser.ParsedToolCall],
gt: Dict[str, Any],
) Dict[str, float][source]#

Score a single (pred, gt) tool-call pair. pred may be None.

nemo_automodel.components.eval.tool_call_parser.compute_sample_metrics(
pred_calls: List[nemo_automodel.components.eval.tool_call_parser.ParsedToolCall],
gt_calls: List[Dict[str, Any]],
) Dict[str, float][source]#

Compute per-sample tool-call metrics across all GT positions.

Predicted calls are aligned positionally against the ground-truth list: pred_calls[i] is scored against gt_calls[i]. Missing predictions (i >= len(pred_calls)) contribute zeros across every metric for that position, so a model that emits only one of two parallel tool calls is correctly penalized on the missing call.

Extra predictions beyond len(gt_calls) are ignored. All values are in [0.0, 1.0] so callers can mean() across a dataset.

Returned keys:

  • has_call: prediction exists at this position.

  • name_correct: predicted call name equals GT name.

  • args_json_valid: prediction had valid JSON arguments.

  • args_field_recall: fraction of GT argument keys present in pred.

  • args_field_precision: fraction of pred argument keys present in GT.

  • args_exact_match: pred arguments dict equals GT arguments dict.

Parameters:

  • pred_calls – output of parse_tool_calls.

  • gt_calls – ground-truth list of {"name": str, "arguments": dict|str}.
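The positional alignment and the six keys above can be illustrated with the sketch below. Several details are assumptions not confirmed by this page: per-position scores are averaged over GT positions, an empty key set counts as perfect recall or precision, an empty GT list yields all zeros, and GT arguments are already dicts (the module normalizes strings via _coerce_gt_args). Predictions are modelled as `(name, arguments, arguments_valid_json)` tuples in place of ParsedToolCall.

```python
from typing import Any, Dict, List, Optional, Tuple

METRIC_KEYS = (
    "has_call",
    "name_correct",
    "args_json_valid",
    "args_field_recall",
    "args_field_precision",
    "args_exact_match",
)

# Stand-in for ParsedToolCall: (name, arguments, arguments_valid_json).
Pred = Tuple[Optional[str], Dict[str, Any], bool]


def score_pair(pred: Optional[Pred], gt: Dict[str, Any]) -> Dict[str, float]:
    if pred is None:
        return {k: 0.0 for k in METRIC_KEYS}  # missing prediction: all zeros
    name, args, valid = pred
    gt_args = gt.get("arguments") or {}
    pred_keys, gt_keys = set(args), set(gt_args)
    shared = len(pred_keys & gt_keys)
    return {
        "has_call": 1.0,
        "name_correct": float(name == gt.get("name")),
        "args_json_valid": float(valid),
        # Assumption: an empty denominator counts as a perfect score.
        "args_field_recall": shared / len(gt_keys) if gt_keys else 1.0,
        "args_field_precision": shared / len(pred_keys) if pred_keys else 1.0,
        "args_exact_match": float(args == gt_args),
    }


def sample_metrics_sketch(
    preds: List[Pred], gts: List[Dict[str, Any]]
) -> Dict[str, float]:
    if not gts:
        return {k: 0.0 for k in METRIC_KEYS}  # arbitrary choice in this sketch
    rows = [
        score_pair(preds[i] if i < len(preds) else None, gt)
        for i, gt in enumerate(gts)  # extra predictions are never visited
    ]
    return {k: sum(r[k] for r in rows) / len(rows) for k in METRIC_KEYS}
```

With one correct prediction against two ground-truth calls, every metric in this sketch comes out at 0.5, which is the penalty for the missing parallel call described above.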