`nemo_automodel.components.eval.tool_call_parser`

Generic tool-call parser for evaluating agent SFT outputs.

Different chat templates wrap tool calls in different syntax:

Qwen / Hermes / FunctionGemma / Gemma 3 / GLM-4: `<tool_call>{"name": ..., "arguments": ...}</tool_call>`
Llama 3.1+: `<|python_tag|>{"name": ..., "parameters": ...}<|eom_id|>`
Mistral: `[TOOL_CALLS][{...}, {...}]`
Harmony / GPT-OSS: `<|channel|>commentary to=functions.NAME<|message|>{...}`
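As a rough illustration of how these wrappers can be told apart, the sketch below searches for one literal anchor per family. The names `WRAPPER_ANCHORS` and `detect_wrapper` are hypothetical, and the patterns are assumptions built from the syntax listed above; the module's actual regexes (`_QWEN_RE`, `_LLAMA_ANCHOR_RE`, and so on) are not shown on this page and may differ.

```python
import re
from typing import Optional

# Hypothetical anchors derived from the wrapper syntax listed above.
# These are NOT the module's real patterns.
WRAPPER_ANCHORS = {
    "qwen": re.compile(r"<tool_call>"),
    "llama": re.compile(r"<\|python_tag\|>"),
    "mistral": re.compile(r"\[TOOL_CALLS\]"),
    "harmony": re.compile(r"<\|channel\|>commentary to=functions\."),
}


def detect_wrapper(text: str) -> Optional[str]:
    """Return the label of the first anchor found in `text`, or None."""
    for label, pattern in WRAPPER_ANCHORS.items():
        if pattern.search(text):
            return label
    return None
```

A call such as `detect_wrapper('[TOOL_CALLS][{"name": "f", "arguments": {}}]')` returns `"mistral"`, while plain prose returns `None`.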

This parser tries each known wrapper, then falls back to a generic JSON object scan. It is intentionally permissive: malformed JSON, missing wrappers, or unknown formats degrade gracefully and never raise.

The companion `compute_sample_metrics` compares parser output against ground-truth tool calls and produces 0/1 (or fractional) indicators that average cleanly across a dataset.

Module Contents

Classes

`ParsedToolCall`	One tool call extracted from generated text.

Functions

`_extract_balanced`	Return the substring from `text[start]` (which must be `opener`) through its matching `closer`, skipping over chars inside JSON strings.
`_coerce_args`	Normalize an `arguments` field to a dict.
`_from_call_object`	Build a :class:`ParsedToolCall` from a `{"name": ..., "arguments": ...}` dict.
`_iter_balanced_json_objects`	Yield substrings that look like balanced top-level JSON objects.
`_parse_qwen_style`
`_parse_llama_style`
`_parse_mistral_style`
`_parse_harmony_style`
`_parse_generic_json`	Last-resort fallback: scan for any JSON object with a `name` field.
`parse_tool_calls`	Extract every tool call from a generated model response.
`_coerce_gt_args`	Normalize a ground-truth `arguments` field to a dict.
`_score_one_pair`	Score a single (pred, gt) tool-call pair. `pred` may be `None`.
`compute_sample_metrics`	Compute per-sample tool-call metrics across all GT positions.

Data

`logger`
`_HARMONY_ANCHOR_RE`
`_QWEN_RE`
`_LLAMA_ANCHOR_RE`
`_MISTRAL_ANCHOR_RE`

API

nemo_automodel.components.eval.tool_call_parser.logger: ‘getLogger(…)’

class nemo_automodel.components.eval.tool_call_parser.ParsedToolCall

One tool call extracted from generated text.

name: Optional[str] – function name if extracted, otherwise `None`.
arguments: Dict[str, Any] – parsed arguments dict; empty when the JSON is invalid.
arguments_valid_json: bool – `True` if the arguments parsed cleanly.
raw: str – the source substring this call was parsed from.

nemo_automodel.components.eval.tool_call_parser._HARMONY_ANCHOR_RE: ‘compile(…)’

nemo_automodel.components.eval.tool_call_parser._QWEN_RE: ‘compile(…)’

nemo_automodel.components.eval.tool_call_parser._LLAMA_ANCHOR_RE: ‘compile(…)’

nemo_automodel.components.eval.tool_call_parser._MISTRAL_ANCHOR_RE: ‘compile(…)’

nemo_automodel.components.eval.tool_call_parser._extract_balanced(text: str, start: int, opener: str, closer: str) → Optional[str]

Return the substring from `text[start]` (which must be `opener`) through its matching `closer`, skipping over characters inside JSON strings.

Returns `None` if `text[start]` is not `opener` or the span is unbalanced.
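The behavior described above can be sketched as a single pass with a depth counter and a small string-state tracker. This is a hypothetical re-implementation for illustration only (`extract_balanced_sketch` is not a library name), and the handling of backslash escapes is an assumption about what "skipping over characters inside JSON strings" involves.

```python
from typing import Optional


def extract_balanced_sketch(text: str, start: int, opener: str, closer: str) -> Optional[str]:
    """Illustrative stand-in for `_extract_balanced`; not the library's code."""
    if start >= len(text) or text[start] != opener:
        return None
    depth = 0
    in_string = False
    escaped = False
    for i in range(start, len(text)):
        ch = text[i]
        if in_string:
            # Inside a JSON string: only an unescaped quote ends it.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch == opener:
            depth += 1
        elif ch == closer:
            depth -= 1
            if depth == 0:
                return text[start : i + 1]
    # Ran out of text before the opener was closed.
    return None
```

With this sketch, a closing brace inside a string literal such as `"}"` does not end the span, and a truncated object yields `None`.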

nemo_automodel.components.eval.tool_call_parser._coerce_args(args_value: Any) → Tuple[Dict[str, Any], bool]

Normalize an `arguments` field to a dict.

Accepts a dict (passthrough) or a JSON-encoded string. Returns the parsed dict alongside a flag indicating whether the source was a well-formed JSON object.
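A minimal sketch of this normalization follows. The function name `coerce_args_sketch` is hypothetical, and returning `({}, False)` for inputs that are neither a dict nor a string is an assumption; the page only documents the dict and JSON-string cases.

```python
import json
from typing import Any, Dict, Tuple


def coerce_args_sketch(args_value: Any) -> Tuple[Dict[str, Any], bool]:
    """Illustrative stand-in for `_coerce_args`; not the library's code."""
    if isinstance(args_value, dict):
        # Already a dict: pass it through unchanged.
        return args_value, True
    if isinstance(args_value, str):
        try:
            parsed = json.loads(args_value)
        except json.JSONDecodeError:
            return {}, False
        if isinstance(parsed, dict):
            return parsed, True
    # Assumption: valid JSON that is not an object (e.g. a list), and any
    # other input type, counts as invalid and yields an empty dict.
    return {}, False
```

Returning the flag separately from the dict lets a metric such as `args_json_valid` distinguish "the model emitted no arguments" from "the model emitted arguments that did not parse".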

nemo_automodel.components.eval.tool_call_parser._from_call_object(obj: Dict[str, Any], raw: str) → Optional[nemo_automodel.components.eval.tool_call_parser.ParsedToolCall]

Build a `ParsedToolCall` from a `{"name": ..., "arguments": ...}` dict.

Llama 3.1 emits `parameters` instead of `arguments`; both are accepted. Returns `None` when `name` is missing or non-string.
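The sketch below shows one way to accept either key and reject objects without a string `name`. `ToolCallSketch` and `from_call_object_sketch` are hypothetical names; preferring `arguments` over `parameters` when both are present, and treating a missing arguments field as an empty valid dict, are assumptions not stated on this page.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class ToolCallSketch:
    """Hypothetical mirror of the documented ParsedToolCall fields."""
    name: Optional[str]
    arguments: Dict[str, Any] = field(default_factory=dict)
    arguments_valid_json: bool = False
    raw: str = ""


def from_call_object_sketch(obj: Dict[str, Any], raw: str) -> Optional[ToolCallSketch]:
    name = obj.get("name")
    if not isinstance(name, str):
        return None
    # Assumption: "arguments" wins over "parameters"; a missing field is {}.
    args_value = obj["arguments"] if "arguments" in obj else obj.get("parameters", {})
    if isinstance(args_value, str):
        try:
            args_value = json.loads(args_value)
        except json.JSONDecodeError:
            args_value = None
    valid = isinstance(args_value, dict)
    return ToolCallSketch(
        name=name,
        arguments=args_value if valid else {},
        arguments_valid_json=valid,
        raw=raw,
    )
```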

nemo_automodel.components.eval.tool_call_parser._iter_balanced_json_objects(text: str) → Iterator[str]

Yield substrings that look like balanced top-level JSON objects.

Skips characters inside JSON string literals (so braces inside strings don't unbalance the count). Designed for fallback extraction when no known wrapper matches.
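A generator with a brace-depth counter is one way to realize this scan. The sketch is hypothetical (`iter_balanced_json_objects_sketch` is not a library name), and it makes one assumption beyond the description above: quotes are only tracked while inside an object, so a stray quotation mark in surrounding prose does not hide a later object.

```python
from typing import Iterator


def iter_balanced_json_objects_sketch(text: str) -> Iterator[str]:
    """Illustrative stand-in for `_iter_balanced_json_objects`."""
    depth = 0
    start = -1
    in_string = False
    escaped = False
    for i, ch in enumerate(text):
        if in_string:
            # Braces inside a string literal must not change the depth.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"' and depth > 0:
            in_string = True
        elif ch == "{":
            if depth == 0:
                start = i
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                yield text[start : i + 1]
```

Each yielded substring is only a candidate: a caller would still pass it through `json.loads` and check for a `name` field before treating it as a tool call.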

nemo_automodel.components.eval.tool_call_parser._parse_qwen_style(text: str) → List[nemo_automodel.components.eval.tool_call_parser.ParsedToolCall]

nemo_automodel.components.eval.tool_call_parser._parse_llama_style(text: str) → List[nemo_automodel.components.eval.tool_call_parser.ParsedToolCall]

nemo_automodel.components.eval.tool_call_parser._parse_mistral_style(text: str) → List[nemo_automodel.components.eval.tool_call_parser.ParsedToolCall]

nemo_automodel.components.eval.tool_call_parser._parse_harmony_style(text: str) → List[nemo_automodel.components.eval.tool_call_parser.ParsedToolCall]

nemo_automodel.components.eval.tool_call_parser._parse_generic_json(text: str) → List[nemo_automodel.components.eval.tool_call_parser.ParsedToolCall]

Last-resort fallback: scan for any JSON object with a `name` field.

nemo_automodel.components.eval.tool_call_parser.parse_tool_calls(text: str) → List[nemo_automodel.components.eval.tool_call_parser.ParsedToolCall]

Extract every tool call from a generated model response.

Wrappers are tried in order of specificity; the first wrapper that yields any match wins. If no wrapper matches, a generic JSON-object scan is used. Returns an empty list when no plausible tool call is present.

Parameters: text – raw decoded text from `model.generate()`.
Returns: Parsed tool calls in document order; possibly empty.
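The "first wrapper that yields any match wins" rule can be sketched as a loop over an ordered list of parser functions. Everything here is a simplified, hypothetical stand-in: `parse_tagged` handles only the `<tool_call>` wrapper, `parse_whole_text_json` is a much cruder fallback than a real object scan, and the order in `PARSERS` is an assumption, since this page does not list the module's actual order.

```python
import json
import re
from typing import Any, Callable, Dict, List

_TOOL_CALL_TAG_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def _as_call(payload: str) -> List[Dict[str, Any]]:
    """Return [obj] if payload is a JSON object with a string name, else []."""
    try:
        obj = json.loads(payload)
    except json.JSONDecodeError:
        return []  # malformed JSON degrades to "no call" instead of raising
    if isinstance(obj, dict) and isinstance(obj.get("name"), str):
        return [obj]
    return []


def parse_tagged(text: str) -> List[Dict[str, Any]]:
    calls: List[Dict[str, Any]] = []
    for match in _TOOL_CALL_TAG_RE.finditer(text):
        calls.extend(_as_call(match.group(1)))
    return calls


def parse_whole_text_json(text: str) -> List[Dict[str, Any]]:
    # Crude fallback: treat the entire response as one JSON object.
    return _as_call(text.strip())


# Assumed order: wrapper-specific parsers first, generic fallback last.
PARSERS: List[Callable[[str], List[Dict[str, Any]]]] = [parse_tagged, parse_whole_text_json]


def parse_first_match(text: str) -> List[Dict[str, Any]]:
    for parser in PARSERS:
        calls = parser(text)
        if calls:
            return calls
    return []
```

Stopping at the first parser that returns anything avoids double-counting a call that a wrapper-specific parser and the generic fallback would both find.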

nemo_automodel.components.eval.tool_call_parser._coerce_gt_args(args_value: Any) → Dict[str, Any]

Normalize a ground-truth `arguments` field to a dict.

nemo_automodel.components.eval.tool_call_parser._score_one_pair(pred: Optional[nemo_automodel.components.eval.tool_call_parser.ParsedToolCall], gt: Dict[str, Any]) → Dict[str, float]

Score a single (pred, gt) tool-call pair. `pred` may be `None`.

nemo_automodel.components.eval.tool_call_parser.compute_sample_metrics(pred_calls: List[nemo_automodel.components.eval.tool_call_parser.ParsedToolCall], gt_calls: List[Dict[str, Any]]) → Dict[str, float]

Compute per-sample tool-call metrics across all GT positions.

Predicted calls are aligned positionally against the ground-truth list: `pred_calls[i]` is scored against `gt_calls[i]`. A missing prediction (`i >= len(pred_calls)`) contributes zero to every metric at that position, so a model that emits only one of two parallel tool calls is penalized for the missing call.

Extra predictions beyond `len(gt_calls)` are ignored. All values lie in [0.0, 1.0], so callers can take the mean across a dataset.

Returned keys:

has_call: prediction exists at this position.
name_correct: predicted call name equals GT name.
args_json_valid: prediction had valid JSON arguments.
args_field_recall: fraction of GT argument keys present in pred.
args_field_precision: fraction of pred argument keys present in GT.
args_exact_match: pred arguments dict equals GT arguments dict.

Parameters:

pred_calls – output of `parse_tool_calls`.
gt_calls – ground-truth list of `{"name": str, "arguments": dict|str}`.
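The positional alignment and the six keys can be sketched as follows. This is a hypothetical illustration, not the library's implementation: `PredCall`, `score_pair`, and `sample_metrics_sketch` are made-up names, ground-truth arguments are assumed to be dicts already, per-position scores are assumed to be averaged over `len(gt_calls)`, and an empty key set is assumed to score 1.0 for recall or precision.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

METRIC_KEYS = (
    "has_call", "name_correct", "args_json_valid",
    "args_field_recall", "args_field_precision", "args_exact_match",
)


@dataclass
class PredCall:
    """Hypothetical stand-in for a parsed prediction."""
    name: Optional[str]
    arguments: Dict[str, Any] = field(default_factory=dict)
    arguments_valid_json: bool = True


def score_pair(pred: Optional[PredCall], gt: Dict[str, Any]) -> Dict[str, float]:
    if pred is None:
        # Missing prediction: zero on every metric for this position.
        return {key: 0.0 for key in METRIC_KEYS}
    gt_args = gt.get("arguments") or {}  # assumption: already a dict
    gt_keys, pred_keys = set(gt_args), set(pred.arguments)
    shared = gt_keys & pred_keys
    return {
        "has_call": 1.0,
        "name_correct": float(pred.name == gt.get("name")),
        "args_json_valid": float(pred.arguments_valid_json),
        # Assumption: an empty key set counts as a perfect score.
        "args_field_recall": len(shared) / len(gt_keys) if gt_keys else 1.0,
        "args_field_precision": len(shared) / len(pred_keys) if pred_keys else 1.0,
        "args_exact_match": float(pred.arguments == gt_args),
    }


def sample_metrics_sketch(pred_calls: List[PredCall], gt_calls: List[Dict[str, Any]]) -> Dict[str, float]:
    totals = {key: 0.0 for key in METRIC_KEYS}
    for i, gt in enumerate(gt_calls):
        # Extra predictions past len(gt_calls) are never visited.
        pred = pred_calls[i] if i < len(pred_calls) else None
        for key, value in score_pair(pred, gt).items():
            totals[key] += value
    n = max(len(gt_calls), 1)  # assumption: avoid dividing by zero
    return {key: value / n for key, value in totals.items()}
```

For two ground-truth calls and a single prediction that names the first call correctly but supplies one of its two arguments, this sketch gives `has_call` 0.5, `args_field_recall` 0.25, and `args_exact_match` 0.0.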
