nemo_automodel.components.eval.tool_call_parser

Generic tool-call parser for evaluating agent SFT outputs.

Different chat templates wrap tool calls in different syntax:

Qwen / Hermes / FunctionGemma / Gemma 3 / GLM-4: <tool_call>{"name": ..., "arguments": ...}</tool_call>
Llama 3.1+: <|python_tag|>{"name": ..., "parameters": ...}<|eom_id|>
Mistral: [TOOL_CALLS][{...}, {...}]
Harmony / GPT-OSS: <|channel|>commentary to=functions.NAME<|message|>{...}

This parser tries each known wrapper, then falls back to a generic JSON object scan. It is intentionally permissive: malformed JSON, missing wrappers, or unknown formats degrade gracefully and never raise.

The companion `compute_sample_metrics` compares parser output against ground-truth tool calls and produces 0/1 (or fractional) indicators that average cleanly across a dataset.

Module Contents

Classes

Name	Description
`ParsedToolCall`	One tool call extracted from generated text.

Functions

Name	Description
`_coerce_args`	Normalize an `arguments` field to a dict.
`_coerce_gt_args`	Normalize a ground-truth `arguments` field to a dict.
`_extract_balanced`	Return the substring from `text[start]` (which must be `opener`) through its matching `closer`.
`_from_call_object`	Build a `ParsedToolCall` from a `{"name": ..., "arguments": ...}` dict.
`_iter_balanced_json_objects`	Yield substrings that look like balanced top-level JSON objects.
`_parse_generic_json`	Last-resort fallback: scan for any JSON object with a `name` field.
`_parse_harmony_style`	-
`_parse_llama_style`	-
`_parse_mistral_style`	-
`_parse_qwen_style`	-
`_score_one_pair`	Score a single (pred, gt) tool-call pair. `pred` may be `None`.
`compute_sample_metrics`	Compute per-sample tool-call metrics across all GT positions.
`parse_tool_calls`	Extract every tool call from a generated model response.

Data

API

class nemo_automodel.components.eval.tool_call_parser.ParsedToolCall(
    name: typing.Optional[str],
    arguments: typing.Dict[str, typing.Any],
    arguments_valid_json: bool,
    raw: str
)

Dataclass

One tool call extracted from generated text.

arguments

Dict[str, Any]

arguments_valid_json

bool

name

Optional[str]

raw

str
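As a rough illustration of the record shape, the sketch below defines a stand-in dataclass with the four documented fields and builds one instance. `ParsedToolCallSketch` and the sample values are hypothetical; only the field names and types come from the signature above.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class ParsedToolCallSketch:
    """Hypothetical stand-in mirroring the documented fields."""

    name: Optional[str]
    arguments: Dict[str, Any]
    arguments_valid_json: bool
    raw: str


# Sample values are made up for illustration.
example_call = ParsedToolCallSketch(
    name="get_weather",
    arguments={"city": "Paris"},
    arguments_valid_json=True,
    raw='{"name": "get_weather", "arguments": {"city": "Paris"}}',
)
```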

nemo_automodel.components.eval.tool_call_parser._coerce_args(
    args_value: typing.Any
) -> typing.Tuple[typing.Dict[str, typing.Any], bool]

Normalize an `arguments` field to a dict.

Accepts a dict (passthrough) or a JSON-encoded string. Returns the parsed dict alongside a flag indicating whether the source was a well-formed JSON object.
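The contract described above might look roughly like the following. This is a minimal sketch, not the module's implementation: `coerce_args_sketch` is a hypothetical name, and returning an empty dict with a `False` flag for anything that is not a JSON object is an assumption.

```python
import json
from typing import Any, Dict, Tuple


def coerce_args_sketch(args_value: Any) -> Tuple[Dict[str, Any], bool]:
    """Return (arguments dict, whether the source was a well-formed JSON object)."""
    if isinstance(args_value, dict):
        # Dicts pass through unchanged.
        return args_value, True
    if isinstance(args_value, str):
        try:
            parsed = json.loads(args_value)
        except ValueError:  # json.JSONDecodeError subclasses ValueError
            return {}, False
        if isinstance(parsed, dict):
            return parsed, True
    # Assumption: lists, numbers, None, etc. degrade to an empty dict.
    return {}, False
```

Under these assumptions a truncated string such as `'{"a": '` yields `({}, False)` instead of raising, which matches the permissive behaviour the module overview describes.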

nemo_automodel.components.eval.tool_call_parser._coerce_gt_args(
    args_value: typing.Any
) -> typing.Dict[str, typing.Any]

Normalize a ground-truth `arguments` field to a dict.

nemo_automodel.components.eval.tool_call_parser._extract_balanced(
    text: str,
    start: int,
    opener: str,
    closer: str
) -> typing.Optional[str]

Return the substring from `text[start]` (which must be `opener`) through its matching `closer`, skipping over characters inside JSON strings.

Returns `None` if `text[start]` is not `opener` or the span is unbalanced.
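One way to implement this kind of string-aware bracket matching is sketched below. `extract_balanced_sketch` is a hypothetical name, and the sketch assumes `opener` and `closer` are single characters and that only double-quoted JSON strings need to be skipped.

```python
from typing import Optional


def extract_balanced_sketch(
    text: str, start: int, opener: str, closer: str
) -> Optional[str]:
    """Return text[start:end] spanning a balanced opener/closer pair, or None."""
    if start < 0 or start >= len(text) or text[start] != opener:
        return None
    depth = 0
    in_string = False
    escaped = False
    for i in range(start, len(text)):
        ch = text[i]
        if in_string:
            # Inside a JSON string: only look for the closing quote.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch == opener:
            depth += 1
        elif ch == closer:
            depth -= 1
            if depth == 0:
                return text[start : i + 1]
    # Ran out of text before the span closed.
    return None
```

The same routine covers both `{`/`}` objects and `[`/`]` arrays, which is presumably why the opener and closer are parameters.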

nemo_automodel.components.eval.tool_call_parser._from_call_object(
    obj: typing.Dict[str, typing.Any],
    raw: str
) -> typing.Optional[nemo_automodel.components.eval.tool_call_parser.ParsedToolCall]

Build a `ParsedToolCall` from a `{"name": ..., "arguments": ...}` dict.

Llama 3.1 emits `parameters` instead of `arguments`; both are accepted. Returns `None` when `name` is missing or non-string.
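A sketch of that normalization step follows. The names `CallSketch` and `from_call_object_sketch` are hypothetical, and two details are assumptions rather than documented behaviour: `arguments` takes precedence when both keys are present, and a call with neither key is treated as having empty, valid arguments.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class CallSketch:  # hypothetical stand-in for ParsedToolCall
    name: Optional[str]
    arguments: Dict[str, Any]
    arguments_valid_json: bool
    raw: str


def from_call_object_sketch(obj: Dict[str, Any], raw: str) -> Optional[CallSketch]:
    name = obj.get("name")
    if not isinstance(name, str):
        return None  # missing or non-string name
    # Assumption: prefer "arguments", fall back to Llama-style "parameters".
    args_value = obj["arguments"] if "arguments" in obj else obj.get("parameters", {})
    valid = True
    if isinstance(args_value, str):
        # Some templates emit arguments as a JSON-encoded string.
        try:
            args_value = json.loads(args_value)
        except ValueError:
            args_value, valid = {}, False
    if not isinstance(args_value, dict):
        args_value, valid = {}, False
    return CallSketch(name=name, arguments=args_value, arguments_valid_json=valid, raw=raw)
```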

nemo_automodel.components.eval.tool_call_parser._iter_balanced_json_objects(
    text: str
) -> typing.Iterator[str]

Yield substrings that look like balanced top-level JSON objects.

Skips characters inside JSON string literals (so braces inside strings don’t unbalance the count). Designed for fallback extraction when no known wrapper matches.
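The scan described above could be written as a single pass with a depth counter, as in this sketch. `iter_balanced_json_objects_sketch` is a hypothetical name, and tracking string state only while inside an object (so that stray quotes in surrounding prose do not hide later objects) is an assumption about the intended behaviour.

```python
from typing import Iterator


def iter_balanced_json_objects_sketch(text: str) -> Iterator[str]:
    """Yield each top-level {...} span whose braces balance."""
    depth = 0
    start = -1
    in_string = False
    escaped = False
    for i, ch in enumerate(text):
        if in_string:
            # Braces inside a string literal are ignored.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"' and depth > 0:
            in_string = True
        elif ch == "{":
            if depth == 0:
                start = i
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                yield text[start : i + 1]
```

The yielded substrings only look like JSON objects; a caller would still need to run them through `json.loads` and discard the ones that fail.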

nemo_automodel.components.eval.tool_call_parser._parse_generic_json(
    text: str
) -> typing.List[nemo_automodel.components.eval.tool_call_parser.ParsedToolCall]

Last-resort fallback: scan for any JSON object with a `name` field.
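For comparison, a similar fallback can be sketched with the standard library's `json.JSONDecoder.raw_decode` instead of a hand-written brace scanner. This is a swapped-in technique, not the module's approach: it is stricter, since an object that is not valid JSON is skipped entirely. `parse_generic_json_sketch` is a hypothetical name and it returns plain dicts, not `ParsedToolCall` instances.

```python
import json
from typing import Any, Dict, List


def parse_generic_json_sketch(text: str) -> List[Dict[str, Any]]:
    """Collect top-level JSON objects in `text` that carry a string `name`."""
    decoder = json.JSONDecoder()
    found: List[Dict[str, Any]] = []
    pos = text.find("{")
    while pos != -1:
        try:
            obj, end = decoder.raw_decode(text, pos)
        except ValueError:
            # Not valid JSON at this brace; try the next one.
            pos = text.find("{", pos + 1)
            continue
        if isinstance(obj, dict) and isinstance(obj.get("name"), str):
            found.append(obj)
        # Resume after the decoded object so nested objects are not revisited.
        pos = text.find("{", end)
    return found
```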

nemo_automodel.components.eval.tool_call_parser._parse_harmony_style(
    text: str
) -> typing.List[nemo_automodel.components.eval.tool_call_parser.ParsedToolCall]

nemo_automodel.components.eval.tool_call_parser._parse_llama_style(
    text: str
) -> typing.List[nemo_automodel.components.eval.tool_call_parser.ParsedToolCall]

nemo_automodel.components.eval.tool_call_parser._parse_mistral_style(
    text: str
) -> typing.List[nemo_automodel.components.eval.tool_call_parser.ParsedToolCall]

nemo_automodel.components.eval.tool_call_parser._parse_qwen_style(
    text: str
) -> typing.List[nemo_automodel.components.eval.tool_call_parser.ParsedToolCall]

nemo_automodel.components.eval.tool_call_parser._score_one_pair(
    pred: typing.Optional[nemo_automodel.components.eval.tool_call_parser.ParsedToolCall],
    gt: typing.Dict[str, typing.Any]
) -> typing.Dict[str, float]

Score a single (pred, gt) tool-call pair. `pred` may be `None`.

nemo_automodel.components.eval.tool_call_parser.compute_sample_metrics(
    pred_calls: typing.List[nemo_automodel.components.eval.tool_call_parser.ParsedToolCall],
    gt_calls: typing.List[typing.Dict[str, typing.Any]]
) -> typing.Dict[str, float]

Compute per-sample tool-call metrics across all GT positions.

Predicted calls are aligned positionally against the ground-truth list: `pred_calls[i]` is scored against `gt_calls[i]`. Missing predictions (`i >= len(pred_calls)`) contribute zeros across every metric for that position, so a model that emits only one of two parallel tool calls is correctly penalized on the missing call.

Extra predictions beyond `len(gt_calls)` are ignored. All values are in [0.0, 1.0] so callers can `mean()` across a dataset.

Returned keys:

`has_call`: prediction exists at this position.
`name_correct`: predicted call name equals GT name.
`args_json_valid`: prediction had valid JSON arguments.
`args_field_recall`: fraction of GT argument keys present in pred.
`args_field_precision`: fraction of pred argument keys present in GT.
`args_exact_match`: pred arguments dict equals GT arguments dict.

Parameters:

pred_calls

List[ParsedToolCall]

Output of `parse_tool_calls`.

gt_calls

List[Dict[str, Any]]

Ground-truth list of `{"name": str, "arguments": dict|str}` items.
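The positional alignment and the six keys can be illustrated with the sketch below. It is not the module's code. Predictions are plain dicts standing in for `ParsedToolCall`, ground-truth arguments are assumed to already be dicts, and three behaviours are assumptions: per-position scores are averaged over GT positions, an empty key set gives a recall or precision of 1.0, and an empty `gt_calls` list returns all zeros.

```python
from typing import Any, Dict, List, Optional

KEYS = (
    "has_call", "name_correct", "args_json_valid",
    "args_field_recall", "args_field_precision", "args_exact_match",
)


def score_pair_sketch(pred: Optional[Dict[str, Any]], gt: Dict[str, Any]) -> Dict[str, float]:
    if pred is None:
        # Missing prediction: zeros across every metric.
        return {k: 0.0 for k in KEYS}
    pred_args, gt_args = pred["arguments"], gt["arguments"]
    shared = len(set(pred_args) & set(gt_args))
    return {
        "has_call": 1.0,
        "name_correct": float(pred["name"] == gt["name"]),
        "args_json_valid": float(pred["arguments_valid_json"]),
        "args_field_recall": shared / len(gt_args) if gt_args else 1.0,
        "args_field_precision": shared / len(pred_args) if pred_args else 1.0,
        "args_exact_match": float(pred_args == gt_args),
    }


def sample_metrics_sketch(
    pred_calls: List[Dict[str, Any]], gt_calls: List[Dict[str, Any]]
) -> Dict[str, float]:
    if not gt_calls:
        return {k: 0.0 for k in KEYS}  # assumption: nothing to score
    rows = [
        # Extra predictions past len(gt_calls) are never indexed.
        score_pair_sketch(pred_calls[i] if i < len(pred_calls) else None, gt)
        for i, gt in enumerate(gt_calls)
    ]
    return {k: sum(row[k] for row in rows) / len(rows) for k in KEYS}
```

With two ground-truth calls and a single prediction that matches the first call's name but supplies only one of its two argument keys, this sketch gives `has_call` 0.5 and `args_field_recall` 0.25, which shows how the missing second call pulls every metric down.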

nemo_automodel.components.eval.tool_call_parser.parse_tool_calls(
    text: str
) -> typing.List[nemo_automodel.components.eval.tool_call_parser.ParsedToolCall]

Extract every tool call from a generated model response.

Wrappers are tried in order of specificity; the first wrapper that yields any match wins. If no wrapper matches, a generic JSON-object scan is used. Returns an empty list when no plausible tool call is present.

Parameters:

text

str

Raw decoded text from `model.generate()`.

Returns: List[ParsedToolCall]

Parsed tool calls in document order. Possibly empty.
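The "first wrapper that yields any match wins" rule can be sketched as a loop over parser functions. Everything here is illustrative: the function names are hypothetical, only the Qwen-style and Llama-style wrappers are covered, the order in which they are tried is an assumption, results are plain dicts, and the fallback uses `json.JSONDecoder.raw_decode` as a stand-in for the module's generic scan.

```python
import json
import re
from typing import Any, Dict, List

Call = Dict[str, Any]
QWEN_RE = re.compile(r"<tool_call>\s*(.+?)\s*</tool_call>", re.DOTALL)
LLAMA_ANCHOR_RE = re.compile(r"<\|python_tag\|>\s*")
DECODER = json.JSONDecoder()


def _keep_if_call(obj: Any) -> List[Call]:
    """Keep only dicts that carry a string `name`."""
    return [obj] if isinstance(obj, dict) and isinstance(obj.get("name"), str) else []


def parse_qwen_sketch(text: str) -> List[Call]:
    calls: List[Call] = []
    for match in QWEN_RE.finditer(text):
        try:
            calls.extend(_keep_if_call(json.loads(match.group(1))))
        except ValueError:
            continue  # malformed payload: skip, never raise
    return calls


def parse_llama_sketch(text: str) -> List[Call]:
    calls: List[Call] = []
    for match in LLAMA_ANCHOR_RE.finditer(text):
        try:
            obj, _ = DECODER.raw_decode(text, match.end())
        except ValueError:
            continue
        calls.extend(_keep_if_call(obj))
    return calls


def parse_generic_sketch(text: str) -> List[Call]:
    calls: List[Call] = []
    pos = text.find("{")
    while pos != -1:
        try:
            obj, end = DECODER.raw_decode(text, pos)
        except ValueError:
            pos = text.find("{", pos + 1)
            continue
        calls.extend(_keep_if_call(obj))
        pos = text.find("{", end)
    return calls


def parse_tool_calls_sketch(text: str) -> List[Call]:
    # Assumed order; the first parser returning anything wins.
    for parser in (parse_qwen_sketch, parse_llama_sketch):
        calls = parser(text)
        if calls:
            return calls
    return parse_generic_sketch(text)
```

Stopping at the first wrapper that matches avoids double-counting a call that the generic scan would also find inside the wrapper's payload.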

nemo_automodel.components.eval.tool_call_parser._HARMONY_ANCHOR_RE = re.compile('<\\|channel\\|>\\s*commentary\\s+to=functions\\.(?P<name>[\\w\\-]+)[...

nemo_automodel.components.eval.tool_call_parser._LLAMA_ANCHOR_RE = re.compile('<\\|python_tag\\|>\\s*')

nemo_automodel.components.eval.tool_call_parser._MISTRAL_ANCHOR_RE = re.compile('\\[TOOL_CALLS\\]\\s*')
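The anchor patterns are shown above in `repr` form, so each `\\` is a single backslash in the actual pattern. As a small usage sketch, the Mistral anchor can locate the start of the call array, which is then decoded with the standard library's `json.JSONDecoder.raw_decode`. The sample text is made up, and decoding with `raw_decode` is an assumption; the module may extract the array differently.

```python
import json
import re

# Same pattern as documented above, written as a raw string.
MISTRAL_ANCHOR_RE = re.compile(r"\[TOOL_CALLS\]\s*")

sample = '[TOOL_CALLS] [{"name": "a", "arguments": {"x": 1}}, {"name": "b", "arguments": {}}]'
anchor = MISTRAL_ANCHOR_RE.search(sample)

# The anchor consumes trailing whitespace, so anchor.end() points at "[".
payload, end_index = json.JSONDecoder().raw_decode(sample, anchor.end())
call_names = [item["name"] for item in payload]
```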

nemo_automodel.components.eval.tool_call_parser._QWEN_RE = re.compile('<tool_call>\\s*(.+?)\\s*</tool_call>', re.DOTALL)

nemo_automodel.components.eval.tool_call_parser.logger = logging.getLogger(__name__)