nemo_automodel.components.eval.tool_call_parser
Generic tool-call parser for evaluating agent SFT outputs.
Different chat templates wrap tool calls in different syntax:
- Qwen / Hermes / FunctionGemma / Gemma 3 / GLM-4: <tool_call>{"name": ..., "arguments": ...}</tool_call>
- Llama 3.1+: <|python_tag|>{"name": ..., "parameters": ...}<|eom_id|>
- Mistral: [TOOL_CALLS][{...}, {...}]
- Harmony / GPT-OSS: <|channel|>commentary to=functions.NAME<|message|>{...}
This parser tries each known wrapper, then falls back to a generic JSON object scan. It is intentionally permissive: malformed JSON, missing wrappers, or unknown formats degrade gracefully and never raise.
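To illustrate how two of these wrappers differ, here is a minimal sketch, not the module's actual code. The regular expressions and the helper names (`safe_loads`, `from_tool_call_tags`, `from_harmony`) are assumptions made for this example. In the tag style the call name sits inside the JSON payload, while in the Harmony style it sits in the header and the JSON carries only the arguments. Malformed payloads are skipped rather than raised.

```python
import json
import re

# Hypothetical patterns; the module's real patterns are not shown on this page.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)
HARMONY_RE = re.compile(
    r"<\|channel\|>commentary to=functions\.([\w.\-]+).*?<\|message\|>\s*",
    re.DOTALL,
)


def safe_loads(payload):
    """Parse JSON without raising; malformed input yields None."""
    try:
        return json.loads(payload)
    except (json.JSONDecodeError, TypeError):
        return None


def from_tool_call_tags(text):
    """Tag style: name and arguments both live in the JSON payload."""
    calls = []
    for payload in TOOL_CALL_RE.findall(text):
        obj = safe_loads(payload)
        if isinstance(obj, dict):
            calls.append(obj)
    return calls


def from_harmony(text):
    """Harmony style: name comes from the header, JSON holds the arguments."""
    calls = []
    decoder = json.JSONDecoder()
    for match in HARMONY_RE.finditer(text):
        try:
            # raw_decode parses one JSON value starting at the given index.
            args, _ = decoder.raw_decode(text, match.end())
        except json.JSONDecodeError:
            continue  # degrade gracefully instead of raising
        if isinstance(args, dict):
            calls.append({"name": match.group(1), "arguments": args})
    return calls
```

Both helpers return plain dicts of the same shape, so any later scoring step does not need to know which wrapper produced them.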
The companion function compute_sample_metrics compares parser output
against ground-truth tool calls and produces 0/1 (or fractional)
indicators that average cleanly across a dataset.
Module Contents
Classes
Functions
Data
API
One tool call extracted from generated text.
Normalize an arguments field to a dict.
Accepts a dict (passthrough) or a JSON-encoded string. Returns the parsed dict alongside a flag indicating whether the source was a well-formed JSON object.
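A minimal sketch of this normalization follows. The function name `normalize_arguments` and the choice to return an empty dict on failure are assumptions for illustration; the page only states that a dict and a validity flag are returned.

```python
import json


def normalize_arguments(value):
    """Return (args_dict, was_valid) for a dict or a JSON-encoded string."""
    if isinstance(value, dict):
        return value, True  # passthrough
    if isinstance(value, str):
        try:
            parsed = json.loads(value)
        except json.JSONDecodeError:
            return {}, False  # assumed fallback: empty dict, flagged invalid
        if isinstance(parsed, dict):
            return parsed, True
    # Valid JSON that is not an object (list, number, ...) or any other type.
    return {}, False
```

Returning the flag separately lets a metric such as args_json_valid distinguish "no arguments" from "arguments that failed to parse".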
Normalize a ground-truth arguments field to a dict.
Return the substring from text[start] (which must be opener)
through its matching closer, skipping over chars inside JSON strings.
Returns None if text[start] is not opener or the span is
unbalanced.
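The behavior described above could be sketched as follows. The name `balanced_span` and its default arguments are hypothetical; only the contract (return the matched span, or None when the opener is absent or the span is unbalanced) comes from the description.

```python
def balanced_span(text, start, opener="{", closer="}"):
    """Return text[start:end] covering opener through its matching closer."""
    if start < 0 or start >= len(text) or text[start] != opener:
        return None
    depth = 0
    in_string = False
    escaped = False
    for i in range(start, len(text)):
        ch = text[i]
        if in_string:
            # Inside a JSON string: only track escapes and the closing quote.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch == opener:
            depth += 1
        elif ch == closer:
            depth -= 1
            if depth == 0:
                return text[start : i + 1]
    return None  # ran out of text before the span closed
```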
Build a ParsedToolCall from a {"name": ..., "arguments": ...} dict.
Llama 3.1 emits parameters instead of arguments; both are accepted.
Returns None when name is missing or non-string.
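As a rough sketch of this construction step: the `ToolCallSketch` dataclass below is a hypothetical stand-in, since the fields of ParsedToolCall are not listed on this page. For brevity it treats non-dict arguments as empty, which the real builder may handle differently.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCallSketch:
    """Hypothetical stand-in for ParsedToolCall; real fields may differ."""

    name: str
    arguments: dict = field(default_factory=dict)


def call_from_dict(obj):
    """Build a call from a dict, accepting 'arguments' or 'parameters'."""
    name = obj.get("name") if isinstance(obj, dict) else None
    if not isinstance(name, str):
        return None  # missing or non-string name
    # Llama 3.1 style uses 'parameters'; prefer 'arguments' when present.
    raw = obj.get("arguments", obj.get("parameters", {}))
    arguments = raw if isinstance(raw, dict) else {}
    return ToolCallSketch(name=name, arguments=arguments)
```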
Yield substrings that look like balanced top-level JSON objects.
Skips characters inside JSON string literals (so braces inside strings don’t unbalance the count). Designed for fallback extraction when no known wrapper matches.
Last-resort fallback: scan for any JSON object with a name field.
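The object scan and the name-based fallback might look roughly like this. Both function names are hypothetical, and tracking string literals only while inside an object (so stray quotes in surrounding prose are ignored) is a design choice of this sketch, not something the page confirms.

```python
import json


def iter_json_objects(text):
    """Yield substrings that look like balanced top-level {...} objects."""
    depth = 0
    start = None
    in_string = False
    escaped = False
    for i, ch in enumerate(text):
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"' and depth > 0:
            in_string = True  # braces inside this string will be skipped
        elif ch == "{":
            if depth == 0:
                start = i
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                yield text[start : i + 1]


def fallback_calls(text):
    """Keep candidates that parse as JSON objects with a string 'name'."""
    found = []
    for candidate in iter_json_objects(text):
        try:
            obj = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict) and isinstance(obj.get("name"), str):
            found.append(obj)
    return found
```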
Score a single (pred, gt) tool-call pair. pred may be None.
Compute per-sample tool-call metrics across all GT positions.
Predicted calls are aligned positionally against the ground-truth
list: pred_calls[i] is scored against gt_calls[i]. Missing
predictions (i >= len(pred_calls)) contribute zeros across every
metric for that position, so a model that emits only one of two
parallel tool calls is correctly penalized on the missing call.
Extra predictions beyond len(gt_calls) are ignored. All values
are in [0.0, 1.0] so callers can mean() across a dataset.
Returned keys:
- has_call: prediction exists at this position.
- name_correct: predicted call name equals GT name.
- args_json_valid: prediction had valid JSON arguments.
- args_field_recall: fraction of GT argument keys present in pred.
- args_field_precision: fraction of pred argument keys present in GT.
- args_exact_match: pred arguments dict equals GT arguments dict.
Parameters:
pred_calls: output of parse_tool_calls.
gt_calls: ground-truth list of {"name": str, "arguments": dict|str}.
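The positional alignment can be sketched as below. This is an illustration under several assumptions, not the library's implementation: calls are plain dicts, an optional `args_valid` key stands in for the validity flag, per-position scores are averaged over GT positions, an empty key set scores 1.0 for recall or precision, and an empty ground-truth list yields zeros.

```python
KEYS = (
    "has_call",
    "name_correct",
    "args_json_valid",
    "args_field_recall",
    "args_field_precision",
    "args_exact_match",
)


def score_pair(pred, gt):
    """Score one (pred, gt) pair; a missing prediction scores all zeros."""
    if pred is None:
        return dict.fromkeys(KEYS, 0.0)
    pred_args, gt_args = pred["arguments"], gt["arguments"]
    shared = set(pred_args) & set(gt_args)
    return {
        "has_call": 1.0,
        "name_correct": float(pred["name"] == gt["name"]),
        "args_json_valid": float(pred.get("args_valid", True)),
        # Assumed convention: nothing to recall or emit counts as perfect.
        "args_field_recall": len(shared) / len(gt_args) if gt_args else 1.0,
        "args_field_precision": len(shared) / len(pred_args) if pred_args else 1.0,
        "args_exact_match": float(pred_args == gt_args),
    }


def sample_metrics(pred_calls, gt_calls):
    """Average per-position scores over the ground-truth positions."""
    if not gt_calls:
        return dict.fromkeys(KEYS, 0.0)  # assumed behavior for empty GT
    rows = [
        score_pair(pred_calls[i] if i < len(pred_calls) else None, gt)
        for i, gt in enumerate(gt_calls)
    ]
    return {key: sum(row[key] for row in rows) / len(rows) for key in KEYS}
```

With two ground-truth calls and only the first predicted, every metric is at most 0.5, which is the penalty for the missing parallel call described above.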
Extract every tool call from a generated model response.
Wrappers are tried in order of specificity; the first wrapper that yields any match wins. If no wrapper matches, a generic JSON-object scan is used. Returns an empty list when no plausible tool call is present.
Parameters:
raw decoded text from model.generate().
Returns: List[ParsedToolCall]
Parsed tool calls in document order. Possibly empty.
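The "first wrapper that yields any match wins" rule can be sketched as a small dispatcher. All names here (`parse_with_fallback`, `tagged`, `bare_object`) are hypothetical, and `bare_object` is a deliberately simplified stand-in for the generic JSON-object scan.

```python
import json
import re


def parse_with_fallback(text, wrapper_extractors, fallback):
    """Return calls from the first extractor that finds any, else fallback."""
    for extract in wrapper_extractors:
        calls = extract(text)
        if calls:
            return calls  # first matching wrapper wins; later ones are skipped
    return fallback(text)


def tagged(text):
    """Example extractor for <tool_call>...</tool_call> payloads."""
    out = []
    for payload in re.findall(r"<tool_call>(.*?)</tool_call>", text, re.DOTALL):
        try:
            out.append(json.loads(payload))
        except json.JSONDecodeError:
            pass  # skip malformed payloads instead of raising
    return out


def bare_object(text):
    """Stand-in fallback: treat the whole stripped text as one JSON object."""
    try:
        obj = json.loads(text.strip())
    except json.JSONDecodeError:
        return []
    return [obj] if isinstance(obj, dict) and "name" in obj else []
```

Passing the extractors as a list keeps the priority order explicit, so supporting a new chat template means adding one function at the right position.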