nemo_automodel.components.eval.tool_call_evaluator#
Generation-based evaluator for tool-call accuracy during agent SFT.
The loss-only validation that ships with the training recipe cannot
distinguish "loss going down because the model learned the format" from
"loss going down because the model is overfitting the response style
while emitting wrong tool names". This evaluator closes that gap by
running model.generate() on held-out prompts that terminate right
before an assistant tool-call turn, parsing the generated text with
nemo_automodel.components.eval.tool_call_parser, and comparing against
the ground-truth tool calls extracted from the dataset.
The evaluator is intentionally framework-agnostic: it operates on any
HuggingFace-style model with a .generate() method and a tokenizer
that supports apply_chat_template(..., tools=...). Distributed
sharding and all-reduce of metrics are left to the caller (the training
recipe), which already has the dist environment in hand.
Module Contents#
Classes#
ToolCallAccuracyEvaluator | Generation-based tool-call accuracy evaluator for agent SFT.
Data#
API#
- nemo_automodel.components.eval.tool_call_evaluator.logger#
'getLogger(…)'
- nemo_automodel.components.eval.tool_call_evaluator._METRIC_KEYS#
('has_call', 'name_correct', 'args_json_valid', 'args_field_recall', 'args_field_precision', 'args_e…
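The key names above suggest per-sample scores of roughly the following shape. This is only an illustrative sketch under assumed definitions, not the module's implementation: the helper name `score_tool_call`, the layout of a parsed call (a `name` plus a JSON string under `arguments`), and the exact meaning given to each metric are all assumptions. The truncated key is deliberately left out.

```python
import json
from typing import Any, Dict, Optional


def score_tool_call(pred: Optional[Dict[str, Any]], gold: Dict[str, Any]) -> Dict[str, float]:
    """Score one predicted tool call against one ground-truth call (hypothetical helper)."""
    scores = {
        "has_call": 0.0,
        "name_correct": 0.0,
        "args_json_valid": 0.0,
        "args_field_recall": 0.0,
        "args_field_precision": 0.0,
    }
    if pred is None:
        # Nothing was parsed out of the generation: every score stays at 0.
        return scores
    scores["has_call"] = 1.0
    scores["name_correct"] = float(pred.get("name") == gold["name"])

    # Assumed convention: arguments arrive as a JSON-encoded object string.
    try:
        pred_args = json.loads(pred.get("arguments", ""))
    except (TypeError, ValueError):
        return scores
    if not isinstance(pred_args, dict):
        return scores
    scores["args_json_valid"] = 1.0

    gold_args = json.loads(gold["arguments"])
    # A field counts as matched when both the key and the value agree.
    matched = sum(1 for k, v in pred_args.items() if k in gold_args and gold_args[k] == v)
    if gold_args:
        scores["args_field_recall"] = matched / len(gold_args)
    if pred_args:
        scores["args_field_precision"] = matched / len(pred_args)
    return scores
```

Under these assumed definitions, recall penalizes missing ground-truth fields while precision penalizes extra or wrong fields, so the two together separate "forgot an argument" from "invented an argument".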
- class nemo_automodel.components.eval.tool_call_evaluator.ToolCallAccuracyEvaluator(
- *,
- dataset_name: Optional[str] = None,
- path: Optional[Union[str, List[str]]] = None,
- split: str = 'train',
- limit_dataset_samples: Optional[int] = None,
- max_eval_samples: Optional[int] = None,
- max_new_tokens: int = 256,
- max_prompt_tokens: Optional[int] = None,
- do_sample: bool = False,
- metric_prefix: str = 'tool_call',
- sample_shard: Optional[tuple] = None,
- raise_on_cuda_oom: bool = True,
- run_on_fsdp2: bool = False,
Generation-based tool-call accuracy evaluator for agent SFT.
The evaluator lazily loads a list of eval samples (one per assistant tool-call position in the source dataset). On each call to evaluate, it renders each sample's prompt_messages and tools through the tokenizer's chat template, generates a continuation, parses any tool calls out of the generated text, and aggregates per-sample metrics into a corpus-level dict.
Constructor args (all keyword-only):
dataset_name: HF Hub dataset id to load eval samples from. Mutually exclusive with path.
path: Local JSON/JSONL file (or list of files) to load eval samples from. Mutually exclusive with dataset_name.
split: Dataset split (only used with dataset_name).
limit_dataset_samples: Cap on dialogues read before expansion.
max_eval_samples: Cap on total expanded eval samples.
max_new_tokens: Generation budget per sample.
max_prompt_tokens: If set, prompts longer than this many tokens are skipped (logged once). Prevents OOM on degenerate samples.
do_sample: Generation sampling toggle. Defaults to greedy decoding for reproducibility across validation checkpoints.
metric_prefix: Prefix applied to all returned metric keys.
sample_shard: Optional (rank, world_size) tuple. When set, only every world_size-th sample starting at rank is processed; the caller is responsible for all-reducing the returned _count and weighted-summed metrics.
Initialization
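The strided selection described for sample_shard can be sketched as a plain slice. This is a minimal illustration, not the evaluator's own code; the helper name `shard_samples` and the range check are assumptions.

```python
from typing import List, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


def shard_samples(samples: Sequence[T], sample_shard: Optional[Tuple[int, int]]) -> List[T]:
    """Return the samples owned by one rank (hypothetical helper)."""
    if sample_shard is None:
        # No sharding requested: this process scores everything.
        return list(samples)
    rank, world_size = sample_shard
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} out of range for world_size {world_size}")
    # Every world_size-th sample, starting at index `rank`.
    return list(samples[rank::world_size])
```

A strided split keeps the per-rank sample counts within one of each other and needs no communication, but ranks can still end up with different counts, which is why the combined metrics have to be weighted by the per-rank count.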
- _cleanup_cuda() None#
- _greedy_generate_manual(
- model,
- input_ids: torch.Tensor,
- attention_mask: torch.Tensor,
- max_new_tokens: int,
- eos_token_id: Optional[int],
Greedy decode using only model.forward().
Several Automodel custom model classes (notably Qwen2ForCausalLM) inherit from HFCheckpointingMixin + Qwen2PreTrainedModel but not from transformers.generation.GenerationMixin, so the FSDP-wrapped instance has no .generate() method. We fall back to a minimal token-by-token greedy decode that only requires the forward pass to return logits. No KV cache is used, so the cost is O(L * (P + L)) per sample, where P is the prompt length and L is max_new_tokens. That is fine for the small eval budgets used here (default 256 tokens).
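The fallback described above can be illustrated with a cache-free greedy loop. To stay runnable without torch, this sketch uses plain Python lists instead of tensors and a toy stand-in for the model; `greedy_decode`, `toy_forward`, and the choice to include the EOS token in the returned list are assumptions, not the signature or behavior of `_greedy_generate_manual`.

```python
from typing import Callable, List, Optional, Sequence


def greedy_decode(
    forward: Callable[[Sequence[int]], List[List[float]]],
    input_ids: Sequence[int],
    max_new_tokens: int,
    eos_token_id: Optional[int] = None,
) -> List[int]:
    """Greedy decoding without a KV cache; returns only the newly generated tokens."""
    ids = list(input_ids)
    generated: List[int] = []
    for _ in range(max_new_tokens):
        # The whole sequence is re-run on every step: this is the no-cache cost.
        logits = forward(ids)
        last = logits[-1]
        # Argmax over the vocabulary at the final position.
        next_id = max(range(len(last)), key=last.__getitem__)
        generated.append(next_id)
        ids.append(next_id)
        if eos_token_id is not None and next_id == eos_token_id:
            break
    return generated


def toy_forward(ids: Sequence[int], vocab_size: int = 5) -> List[List[float]]:
    """Stand-in model: at each position, favour (token + 1) mod vocab_size."""
    return [
        [1.0 if v == (tok + 1) % vocab_size else 0.0 for v in range(vocab_size)]
        for tok in ids
    ]
```

Step i of the loop feeds P + i tokens through the model, so the total number of tokens processed is L * P + L * (L - 1) / 2, which matches the O(L * (P + L)) bound quoted above.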
- _load_samples() List[Dict[str, Any]]#
- _iter_my_samples() List[Dict[str, Any]]#
- _render_prompt_ids(
- tokenizer,
- sample: Dict[str, Any],
- skip_reasons: Optional[Dict[str, int]] = None,
Render one eval sample's prompt through apply_chat_template.
We deliberately split the chat-template render (tokenize=False) from the tokenization step: some templates / transformers versions return a list of token strings under tokenize=True, which then crashes torch.tensor(..., dtype=long) downstream. Going through text first sidesteps that and matches the canonical HF usage shown in the model cards.
Returns None if the template raises (e.g. doesn't accept the tools kwarg) or if the prompt exceeds max_prompt_tokens.
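The render-then-tokenize split might look like the sketch below. It is an approximation under stated assumptions, not the method's actual body: the function name, the skip-reason labels, the use of add_generation_prompt=True, and add_special_tokens=False are all guesses, and `WhitespaceTokenizer` is a stand-in that only mimics the two HuggingFace tokenizer calls used here.

```python
from typing import Any, Dict, List, Optional


def render_prompt_ids(
    tokenizer: Any,
    sample: Dict[str, Any],
    max_prompt_tokens: Optional[int] = None,
    skip_reasons: Optional[Dict[str, int]] = None,
) -> Optional[List[int]]:
    """Render to text first, then tokenize; return None when the sample is skipped."""

    def skip(reason: str) -> None:
        # Tally why a sample was dropped (reason labels are hypothetical).
        if skip_reasons is not None:
            skip_reasons[reason] = skip_reasons.get(reason, 0) + 1

    try:
        # Step 1: chat template to plain text, never straight to token ids.
        text = tokenizer.apply_chat_template(
            sample["prompt_messages"],
            tools=sample.get("tools"),
            tokenize=False,
            add_generation_prompt=True,
        )
    except Exception:
        skip("template_error")
        return None

    # Step 2: ordinary tokenization of the rendered text.
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    if max_prompt_tokens is not None and len(ids) > max_prompt_tokens:
        skip("prompt_too_long")
        return None
    return list(ids)


class WhitespaceTokenizer:
    """Stand-in for a HuggingFace tokenizer, only for this demonstration."""

    def apply_chat_template(self, messages, tools=None, tokenize=True, add_generation_prompt=False):
        text = " ".join(f"{m['role']}: {m['content']}" for m in messages)
        return text + (" assistant:" if add_generation_prompt else "")

    def __call__(self, text, add_special_tokens=True):
        # One fake id per whitespace-separated piece.
        return {"input_ids": list(range(len(text.split())))}
```

Because the token ids always come from the second call, their type does not depend on what a particular template or transformers version returns under tokenize=True.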
- evaluate(model, tokenizer) Dict[str, float]#
Run generation-based tool-call evaluation against model.
Caller is expected to have placed the model in eval mode and on the appropriate device. The evaluator infers the device from the first model parameter so it works with FSDP, DDP, or single-GPU layouts without explicit configuration.
- Parameters:
model – a HuggingFace causal-LM with a .generate() method.
tokenizer – tokenizer paired with model; must have a chat template that supports the tools kwarg.
- Returns:
Dict of metric name -> float. All metric values are means in [0.0, 1.0] except _count, which is the number of samples actually scored on this rank.
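Since each rank returns means over its own samples plus a count, a caller that shards the eval set has to recombine them as a count-weighted mean. The sketch below shows only that arithmetic on plain dicts. The helper name `combine_rank_metrics` and the key strings used in the usage example are assumptions; in a real distributed run the weighted sums and the total count would be exchanged with a collective such as torch.distributed.all_reduce rather than gathered into a Python list.

```python
from typing import Dict, List


def combine_rank_metrics(per_rank: List[Dict[str, float]], count_key: str) -> Dict[str, float]:
    """Count-weighted mean of per-rank metric dicts (hypothetical helper)."""
    total = sum(m[count_key] for m in per_rank)
    combined: Dict[str, float] = {count_key: float(total)}
    if total == 0:
        # No rank scored any sample, so there is nothing to average.
        return combined
    keys = {k for m in per_rank for k in m if k != count_key}
    for key in keys:
        # mean * count recovers each rank's sum for this metric.
        weighted_sum = sum(m.get(key, 0.0) * m[count_key] for m in per_rank)
        combined[key] = weighted_sum / total
    return combined


# Usage with assumed key names: one sample on rank 0, three on rank 1.
rank0 = {"tool_call/name_correct": 1.0, "tool_call/_count": 1}
rank1 = {"tool_call/name_correct": 0.5, "tool_call/_count": 3}
merged = combine_rank_metrics([rank0, rank1], count_key="tool_call/_count")
```

Averaging the per-rank means directly would give 0.75 here, whereas the count-weighted result is 0.625, which is why the count is returned alongside the metrics.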